* [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation
@ 2014-01-17 14:14 Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
                   ` (29 more replies)
  0 siblings, 30 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This patch series adds code to the block layer that allows performing
I/O requests in smaller granularities than required by the host backend
(most importantly, O_DIRECT restrictions). It achieves this for reads
by rounding the request to host-side block boundary, and for writes by
performing a read-modify-write cycle (and serialising requests
touching the same block so that the RMW doesn't write back stale data).
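
As a rough illustration of the rounding described above (a minimal sketch
only; round_to_alignment() and needs_rmw() are hypothetical names that do
not appear in this series, and the alignment is assumed to be a power of
two):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: widen [offset, offset + bytes) to the smallest
 * enclosing range whose start and end are multiples of 'align'. */
static void round_to_alignment(uint64_t offset, uint64_t bytes, uint64_t align,
                               uint64_t *aligned_offset,
                               uint64_t *aligned_bytes)
{
    uint64_t start, end;

    /* The mask arithmetic below only works for powers of two */
    assert(align != 0 && (align & (align - 1)) == 0);

    start = offset & ~(align - 1);
    end = (offset + bytes + align - 1) & ~(align - 1);

    *aligned_offset = start;
    *aligned_bytes = end - start;
}

/* Hypothetical helper: nonzero if the request does not start and end on a
 * block boundary, so that a write would need a read-modify-write cycle. */
static int needs_rmw(uint64_t offset, uint64_t bytes, uint64_t align)
{
    return ((offset | bytes) & (align - 1)) != 0;
}
```

For example, a 512 byte guest request at offset 512 on a backend that
requires 4k alignment is widened to the range [0, 4096). A read can simply
fetch the whole block and return the requested part; a write has to read
the block first, patch in the 512 bytes and write the block back.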

Originally I intended to reuse a lot of code from Paolo's previous
patch series, however as I tried to integrate pread/pwrite, which
already do a very similar thing (except for considering concurrency),
and because I wanted to implement zero-copy, most of this series ended
up being new code.

Zero-copy is possible in a common case because while XFS defaults to a
4k sector size and therefore 4k on-disk O_DIRECT alignment for 512E
disks, it still only has a 512 byte memory alignment requirement.
(Unfortunately the XFS_IOC_DIOINFO ioctl claims 4k even for memory, but
we know that the value is wrong and can probe it.)


Changes in v2 -> v3:
- Fixed I/O throttling bypass by converting to byte granularity [Wenchao]
- Made 'bytes' argument to tracked_request_overlaps() unsigned [Max]
- Fixed a corruption bug that came from using outdated RMW buffers after
  waiting for another request and added some assertions to check the
  assumptions [Peter]
- Fixed bytes vs. sectors error in zero-after-EOF code of
  bdrv_co_do_preadv [Max]
- Removed orphaned prototype in block.h [Max]
- A qemu-iotests case and some infrastructure to support it

Changes in v1 -> v2:
- Fixed overlap_bytes calculation in mark_request_serialising()
- Fixed wait_serialising_requests() deadlock
- iscsi: Set bs->request_alignment [Peter]
- iscsi: Query block limits only in iscsi_open() when no other requests
  are in flight, and in iscsi_refresh_limits() copy the stored values
  into bs->bl [Peter]

Changes in RFC -> v1:
- Moved opt_mem_alignment into BlockLimits [Paolo]
- Changed BlockLimits in turn to work a bit more like the
  .bdrv_opt_mem_align() callback of the RFC; allows updating the
  BlockLimits later when the chain changes or bdrv_reopen() toggles
  O_DIRECT
- Fixed a typo in a commit message [Eric]


Kevin Wolf (26):
  block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  block: Inherit opt_transfer_length
  block: Update BlockLimits when they might have changed
  qemu_memalign: Allow small alignments
  block: Detect unaligned length in bdrv_qiov_is_aligned()
  block: Don't use guest sector size for qemu_blockalign()
  block: Introduce bdrv_aligned_preadv()
  block: Introduce bdrv_co_do_preadv()
  block: Introduce bdrv_aligned_pwritev()
  block: write: Handle COR dependency after I/O throttling
  block: Introduce bdrv_co_do_pwritev()
  block: Switch BdrvTrackedRequest to byte granularity
  block: Allow waiting for overlapping requests between begin/end
  block: Make zero-after-EOF work with larger alignment
  block: Generalise and optimise COR serialisation
  block: Make overlap range for serialisation dynamic
  block: Allow wait_serialising_requests() at any point
  block: Align requests in bdrv_co_do_pwritev()
  block: Assert serialisation assumptions in pwritev
  block: Change coroutine wrapper to byte granularity
  block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
  blkdebug: Make required alignment configurable
  qemu-io: New command 'sleep'
  qemu-iotests: Test pwritev RMW logic
  block: Switch bdrv_io_limits_intercept() to byte granularity

Paolo Bonzini (3):
  block: rename buffer_alignment to guest_block_size
  raw: Probe required direct I/O alignment
  iscsi: Set bs->request_alignment

 block.c                    | 644 +++++++++++++++++++++++++++++++--------------
 block/backup.c             |   7 +-
 block/blkdebug.c           |  24 ++
 block/iscsi.c              |  47 ++--
 block/qcow2.c              |  11 +-
 block/qed.c                |  11 +-
 block/raw-posix.c          | 102 +++++--
 block/raw-win32.c          |  41 +++
 block/stream.c             |   2 +
 block/vmdk.c               |  22 +-
 hw/block/virtio-blk.c      |   2 +-
 hw/ide/core.c              |   2 +-
 hw/scsi/scsi-disk.c        |   2 +-
 hw/scsi/scsi-generic.c     |   2 +-
 include/block/block.h      |  15 +-
 include/block/block_int.h  |  27 +-
 qemu-io-cmds.c             |  42 +++
 tests/qemu-iotests/077     | 278 +++++++++++++++++++
 tests/qemu-iotests/077.out | 202 ++++++++++++++
 tests/qemu-iotests/group   |   1 +
 util/oslib-posix.c         |   5 +
 21 files changed, 1234 insertions(+), 255 deletions(-)
 create mode 100755 tests/qemu-iotests/077
 create mode 100644 tests/qemu-iotests/077.out

-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 22:39   ` Benoît Canet
  2014-01-21 12:49   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length Kevin Wolf
                   ` (28 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This function separates filling the BlockLimits from bdrv_open(), which
allows calling it from other operations that may change the limits
(e.g. modifications to the backing file chain or bdrv_reopen()).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 19 +++++++++++++++++++
 block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
 block/qcow2.c             | 11 ++++++++++-
 block/qed.c               | 11 ++++++++++-
 block/vmdk.c              | 22 ++++++++++++++++++----
 include/block/block_int.h |  2 ++
 6 files changed, 88 insertions(+), 23 deletions(-)

diff --git a/block.c b/block.c
index 64e7d22..99e69da 100644
--- a/block.c
+++ b/block.c
@@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
     return ret;
 }
 
+static int bdrv_refresh_limits(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+
+    memset(&bs->bl, 0, sizeof(bs->bl));
+
+    if (drv && drv->bdrv_refresh_limits) {
+        return drv->bdrv_refresh_limits(bs);
+    }
+
+    return 0;
+}
+
 /*
  * Create a uniquely-named empty temporary file.
  * Return 0 upon success, otherwise a negative errno value.
@@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
         goto free_and_fail;
     }
 
+    bdrv_refresh_limits(bs);
+
 #ifndef _WIN32
     if (bs->is_temporary) {
         assert(bs->filename[0] != '\0');
@@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     }
     pstrcpy(bs->backing_file, sizeof(bs->backing_file),
             bs->backing_hd->file->filename);
+
+    /* Recalculate the BlockLimits with the backing file */
+    bdrv_refresh_limits(bs);
+
     return 0;
 }
 
diff --git a/block/iscsi.c b/block/iscsi.c
index c0ea0c4..3202dc5 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1265,23 +1265,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
                sizeof(struct scsi_inquiry_block_limits));
         scsi_free_scsi_task(task);
         task = NULL;
-
-        if (iscsilun->bl.max_unmap < 0xffffffff) {
-            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
-                                                 iscsilun);
-        }
-        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
-                                                   iscsilun);
-
-        if (iscsilun->bl.max_ws_len < 0xffffffff) {
-            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
-                                                      iscsilun);
-        }
-        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
-                                                        iscsilun);
-
-        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
-                                                     iscsilun);
     }
 
 #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
@@ -1326,6 +1309,34 @@ static void iscsi_close(BlockDriverState *bs)
     memset(iscsilun, 0, sizeof(IscsiLun));
 }
 
+static int iscsi_refresh_limits(BlockDriverState *bs)
+{
+    IscsiLun *iscsilun = bs->opaque;
+
+    /* We don't actually refresh here, but just return data queried in
+     * iscsi_open(): iscsi targets don't change their limits. */
+    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
+        if (iscsilun->bl.max_unmap < 0xffffffff) {
+            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
+                                                 iscsilun);
+        }
+        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
+                                                   iscsilun);
+
+        if (iscsilun->bl.max_ws_len < 0xffffffff) {
+            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
+                                                      iscsilun);
+        }
+        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
+                                                        iscsilun);
+
+        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
+                                                     iscsilun);
+    }
+
+    return 0;
+}
+
 static int iscsi_truncate(BlockDriverState *bs, int64_t offset)
 {
     IscsiLun *iscsilun = bs->opaque;
@@ -1438,6 +1449,7 @@ static BlockDriver bdrv_iscsi = {
     .bdrv_getlength  = iscsi_getlength,
     .bdrv_get_info   = iscsi_get_info,
     .bdrv_truncate   = iscsi_truncate,
+    .bdrv_refresh_limits = iscsi_refresh_limits,
 
 #if defined(LIBISCSI_FEATURE_IOVECTOR)
     .bdrv_co_get_block_status = iscsi_co_get_block_status,
diff --git a/block/qcow2.c b/block/qcow2.c
index 8ec9db1..6562994 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -718,7 +718,6 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     qemu_opts_del(opts);
-    bs->bl.write_zeroes_alignment = s->cluster_sectors;
 
     if (s->use_lazy_refcounts && s->qcow_version < 3) {
         error_setg(errp, "Lazy refcounts require a qcow2 image with at least "
@@ -751,6 +750,15 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     return ret;
 }
 
+static int qcow2_refresh_limits(BlockDriverState *bs)
+{
+    BDRVQcowState *s = bs->opaque;
+
+    bs->bl.write_zeroes_alignment = s->cluster_sectors;
+
+    return 0;
+}
+
 static int qcow2_set_key(BlockDriverState *bs, const char *key)
 {
     BDRVQcowState *s = bs->opaque;
@@ -2268,6 +2276,7 @@ static BlockDriver bdrv_qcow2 = {
 
     .bdrv_change_backing_file   = qcow2_change_backing_file,
 
+    .bdrv_refresh_limits        = qcow2_refresh_limits,
     .bdrv_invalidate_cache      = qcow2_invalidate_cache,
 
     .create_options = qcow2_create_options,
diff --git a/block/qed.c b/block/qed.c
index 450a1fa..d35bd1c 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -495,7 +495,6 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
-    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
     s->need_check_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                             qed_need_check_timer_cb, s);
 
@@ -507,6 +506,15 @@ out:
     return ret;
 }
 
+static int bdrv_qed_refresh_limits(BlockDriverState *bs)
+{
+    BDRVQEDState *s = bs->opaque;
+
+    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
+
+    return 0;
+}
+
 /* We have nothing to do for QED reopen, stubs just return
  * success */
 static int bdrv_qed_reopen_prepare(BDRVReopenState *state,
@@ -1616,6 +1624,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
     .bdrv_get_info            = bdrv_qed_get_info,
+    .bdrv_refresh_limits      = bdrv_qed_refresh_limits,
     .bdrv_change_backing_file = bdrv_qed_change_backing_file,
     .bdrv_invalidate_cache    = bdrv_qed_invalidate_cache,
     .bdrv_check               = bdrv_qed_check,
diff --git a/block/vmdk.c b/block/vmdk.c
index c6b60b4..e26df97 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -428,10 +428,6 @@ static int vmdk_add_extent(BlockDriverState *bs,
     extent->l2_size = l2_size;
     extent->cluster_sectors = flat ? sectors : cluster_sectors;
 
-    if (!flat) {
-        bs->bl.write_zeroes_alignment =
-            MAX(bs->bl.write_zeroes_alignment, cluster_sectors);
-    }
     if (s->num_extents > 1) {
         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
     } else {
@@ -891,6 +887,23 @@ fail:
     return ret;
 }
 
+
+static int vmdk_refresh_limits(BlockDriverState *bs)
+{
+    BDRVVmdkState *s = bs->opaque;
+    int i;
+
+    for (i = 0; i < s->num_extents; i++) {
+        if (!s->extents[i].flat) {
+            bs->bl.write_zeroes_alignment =
+                MAX(bs->bl.write_zeroes_alignment,
+                    s->extents[i].cluster_sectors);
+        }
+    }
+
+    return 0;
+}
+
 static int get_whole_cluster(BlockDriverState *bs,
                 VmdkExtent *extent,
                 uint64_t cluster_offset,
@@ -2002,6 +2015,7 @@ static BlockDriver bdrv_vmdk = {
     .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
     .bdrv_has_zero_init           = vmdk_has_zero_init,
     .bdrv_get_specific_info       = vmdk_get_specific_info,
+    .bdrv_refresh_limits          = vmdk_refresh_limits,
 
     .create_options               = vmdk_create_options,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 2772f2f..3dfe724 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -226,6 +226,8 @@ struct BlockDriver {
     int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
     bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
 
+    int (*bdrv_refresh_limits)(BlockDriverState *bs);
+
     /*
      * Returns 1 if newly created images are guaranteed to contain only
      * zeros, 0 otherwise.
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 22:42   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed Kevin Wolf
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

When there is a format driver on top of the backend, it's not guaranteed
that exposing the backend's opt_transfer_length for the format driver
results in optimal requests (because of fragmentation etc.), but it can't
make things worse, so let's just do it.
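
The inheritance rule can be sketched on a simplified tree as follows. This
is an illustration under stated assumptions, not the code of this patch:
Node, driver_opt_len and refresh_limits() are made-up stand-ins for
BlockDriverState, the driver's .bdrv_refresh_limits() callback and
bdrv_refresh_limits().

```c
#include <stddef.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Simplified stand-in for BlockDriverState; field names are illustrative */
typedef struct Node {
    struct Node *file;          /* protocol layer below a format driver */
    struct Node *backing;       /* backing file, if any */
    int driver_opt_len;         /* value reported by the driver, 0 = none */
    int opt_transfer_length;    /* resulting limit, in sectors */
} Node;

static void refresh_limits(Node *n)
{
    /* Start from scratch, like the memset() of bs->bl */
    n->opt_transfer_length = 0;

    /* Take the limit of the children as a default */
    if (n->file) {
        refresh_limits(n->file);
        n->opt_transfer_length = n->file->opt_transfer_length;
    }
    if (n->backing) {
        refresh_limits(n->backing);
        n->opt_transfer_length = MAX(n->opt_transfer_length,
                                     n->backing->opt_transfer_length);
    }

    /* Then let the driver override it, if it reports a value */
    if (n->driver_opt_len) {
        n->opt_transfer_length = n->driver_opt_len;
    }
}
```

With a protocol layer reporting 8 sectors and a backing file reporting 16,
a format driver without its own preference ends up with 16, the larger of
the two.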

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 99e69da..20f85cb 100644
--- a/block.c
+++ b/block.c
@@ -485,7 +485,25 @@ static int bdrv_refresh_limits(BlockDriverState *bs)
 
     memset(&bs->bl, 0, sizeof(bs->bl));
 
-    if (drv && drv->bdrv_refresh_limits) {
+    if (!drv) {
+        return 0;
+    }
+
+    /* Take some limits from the children as a default */
+    if (bs->file) {
+        bdrv_refresh_limits(bs->file);
+        bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
+    }
+
+    if (bs->backing_hd) {
+        bdrv_refresh_limits(bs->backing_hd);
+        bs->bl.opt_transfer_length =
+            MAX(bs->bl.opt_transfer_length,
+                bs->backing_hd->bl.opt_transfer_length);
+    }
+
+    /* Then let the driver override it */
+    if (drv->bdrv_refresh_limits) {
         return drv->bdrv_refresh_limits(bs);
     }
 
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 22:47   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments Kevin Wolf
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

When reopening with different flags, or when backing files disappear
from the chain, the limits may change. Make sure they get updated in
these cases.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c               | 5 ++++-
 block/stream.c        | 2 ++
 include/block/block.h | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 20f85cb..9a8562d 100644
--- a/block.c
+++ b/block.c
@@ -479,7 +479,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
     return ret;
 }
 
-static int bdrv_refresh_limits(BlockDriverState *bs)
+int bdrv_refresh_limits(BlockDriverState *bs)
 {
     BlockDriver *drv = bs->drv;
 
@@ -1464,6 +1464,8 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
     reopen_state->bs->enable_write_cache = !!(reopen_state->flags &
                                               BDRV_O_CACHE_WB);
     reopen_state->bs->read_only = !(reopen_state->flags & BDRV_O_RDWR);
+
+    bdrv_refresh_limits(reopen_state->bs);
 }
 
 /*
@@ -2261,6 +2263,7 @@ int bdrv_drop_intermediate(BlockDriverState *active, BlockDriverState *top,
     }
     new_top_bs->backing_hd = base_bs;
 
+    bdrv_refresh_limits(new_top_bs);
 
     QSIMPLEQ_FOREACH_SAFE(intermediate_state, &states_to_delete, entry, next) {
         /* so that bdrv_close() does not recursively close the chain */
diff --git a/block/stream.c b/block/stream.c
index 46bec7d..dd0b4ac 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -75,6 +75,8 @@ static void close_unused_images(BlockDriverState *top, BlockDriverState *base,
         unused->backing_hd = NULL;
         bdrv_unref(unused);
     }
+
+    bdrv_refresh_limits(top);
 }
 
 static void coroutine_fn stream_run(void *opaque)
diff --git a/include/block/block.h b/include/block/block.h
index 36efaea..64fb319 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -249,6 +249,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset);
 int64_t bdrv_getlength(BlockDriverState *bs);
 int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
 void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
+int bdrv_refresh_limits(BlockDriverState *bs);
 int bdrv_commit(BlockDriverState *bs);
 int bdrv_commit_all(void);
 int bdrv_change_backing_file(BlockDriverState *bs,
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (2 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 22:49   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 05/29] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

The functions used by qemu_memalign() require an alignment that is at
least sizeof(void*). Adjust it if it is too small.
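
For background: posix_memalign() requires the alignment to be a power of
two and a multiple of sizeof(void *), and fails with EINVAL otherwise. A
standalone sketch of the clamping, assuming a POSIX host (alloc_aligned()
is a hypothetical name used only for this illustration, it is not
qemu_memalign() itself):

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: raise too small alignments to sizeof(void *)
 * before calling posix_memalign(). Returns NULL on failure. */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *ptr = NULL;

    /* Callers are still expected to pass a power of two */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    if (alignment < sizeof(void *)) {
        alignment = sizeof(void *);
    }

    if (posix_memalign(&ptr, alignment, size) != 0) {
        return NULL;
    }
    return ptr;
}
```

With this, a request for an alignment of 1 no longer fails; it simply
returns pointer-aligned memory, which satisfies the smaller alignment as
well.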

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 util/oslib-posix.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e00a44c..54f8932 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -85,6 +85,11 @@ void *qemu_oom_check(void *ptr)
 void *qemu_memalign(size_t alignment, size_t size)
 {
     void *ptr;
+
+    if (alignment < sizeof(void*)) {
+        alignment = sizeof(void*);
+    }
+
 #if defined(_POSIX_C_SOURCE) && !defined(__sun__)
     int ret;
     ret = posix_memalign(&ptr, alignment, size);
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 05/29] block: Detect unaligned length in bdrv_qiov_is_aligned()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (3 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 06/29] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

For an O_DIRECT request to succeed, it's not only necessary that all
base addresses in the qiov are aligned, but also that each length in it
is aligned.
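
The same check can be shown outside of QEMU on a plain struct iovec array.
This is only a sketch: iov_is_aligned() is a hypothetical name, and the
alignment is passed in explicitly instead of being taken from the
BlockDriverState.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical standalone variant of the check: every element must have
 * both an aligned base address and an aligned length. */
static bool iov_is_aligned(const struct iovec *iov, int niov, size_t alignment)
{
    int i;

    assert(alignment > 0);

    for (i = 0; i < niov; i++) {
        if ((uintptr_t)iov[i].iov_base % alignment) {
            return false;
        }
        if (iov[i].iov_len % alignment) {
            return false;
        }
    }

    return true;
}
```

A vector whose only element starts at a 512 byte aligned address but is
100 bytes long passes the base address test alone, yet is rejected once
the length is checked too.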

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block.c b/block.c
index 9a8562d..bf9ac5f 100644
--- a/block.c
+++ b/block.c
@@ -4566,6 +4566,9 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
         if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
             return false;
         }
+        if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+            return false;
+        }
     }
 
     return true;
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 06/29] block: Don't use guest sector size for qemu_blockalign()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (4 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 05/29] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size Kevin Wolf
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

bs->buffer_alignment is set by the device emulation and contains the
logical block size of the guest device. This isn't something that the
block layer should know, and even less something to use for determining
the right alignment of buffers to be used for the host.

The new BlockLimits field opt_mem_alignment tells the qemu block layer
the optimal alignment to be used so that no bounce buffer must be used
in the driver.

This patch may change the buffer alignment from 4k to 512 for all
callers that used qemu_blockalign() with the top-level image format
BlockDriverState. The value was never propagated to other levels in the
tree, so in particular raw-posix never required anything else than 512.

While on disks with 4k sectors direct I/O requires a 4k alignment,
memory may still be okay when aligned to 512 byte boundaries. This is
what must have happened in practice, because otherwise this would
already have failed earlier. Therefore I don't expect regressions even
with this intermediate state. Later, raw-posix can implement the hook
and expose a different memory alignment requirement.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 23 ++++++++++++++++++++---
 include/block/block.h     |  3 +++
 include/block/block_int.h |  3 +++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index bf9ac5f..04c409a 100644
--- a/block.c
+++ b/block.c
@@ -214,6 +214,16 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
     qemu_co_queue_next(&bs->throttled_reqs[is_write]);
 }
 
+size_t bdrv_opt_mem_align(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        /* 4k should be on the safe side */
+        return 4096;
+    }
+
+    return bs->bl.opt_mem_alignment;
+}
+
 /* check if the path starts with "<protocol>:" */
 static int path_has_protocol(const char *path)
 {
@@ -493,6 +503,9 @@ int bdrv_refresh_limits(BlockDriverState *bs)
     if (bs->file) {
         bdrv_refresh_limits(bs->file);
         bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
+        bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
+    } else {
+        bs->bl.opt_mem_alignment = 512;
     }
 
     if (bs->backing_hd) {
@@ -500,6 +513,9 @@ int bdrv_refresh_limits(BlockDriverState *bs)
         bs->bl.opt_transfer_length =
             MAX(bs->bl.opt_transfer_length,
                 bs->backing_hd->bl.opt_transfer_length);
+        bs->bl.opt_mem_alignment =
+            MAX(bs->bl.opt_mem_alignment,
+                bs->backing_hd->bl.opt_mem_alignment);
     }
 
     /* Then let the driver override it */
@@ -4552,7 +4568,7 @@ void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
-    return qemu_memalign((bs && bs->buffer_alignment) ? bs->buffer_alignment : 512, size);
+    return qemu_memalign(bdrv_opt_mem_align(bs), size);
 }
 
 /*
@@ -4561,12 +4577,13 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size)
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
 {
     int i;
+    size_t alignment = bdrv_opt_mem_align(bs);
 
     for (i = 0; i < qiov->niov; i++) {
-        if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
+        if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
             return false;
         }
-        if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+        if (qiov->iov[i].iov_len % alignment) {
             return false;
         }
     }
diff --git a/include/block/block.h b/include/block/block.h
index 64fb319..cf63ee2 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -419,6 +419,9 @@ void bdrv_img_create(const char *filename, const char *fmt,
                      char *options, uint64_t img_size, int flags,
                      Error **errp, bool quiet);
 
+/* Returns the alignment in bytes that is required so that no bounce buffer
+ * is required throughout the stack */
+size_t bdrv_opt_mem_align(BlockDriverState *bs);
 void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 3dfe724..a602e14 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -252,6 +252,9 @@ typedef struct BlockLimits {
 
     /* optimal transfer length in sectors */
     int opt_transfer_length;
+
+    /* memory alignment so that no bounce buffer is needed */
+    size_t opt_mem_alignment;
 } BlockLimits;
 
 /*
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (5 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 06/29] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-21 12:54   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment Kevin Wolf
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

From: Paolo Bonzini <pbonzini@redhat.com>

The alignment field is now set to the value that is promised to the
guest, rather than required by the host.  The next patches will make
QEMU aware of the host-provided values, so make this clear.

The alignment is also not about memory buffers, but about the sectors on
the disk; change the documentation of the field accordingly.

At this point, the field is set by the device emulation, but completely
ignored by the block layer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 10 +++++-----
 hw/block/virtio-blk.c     |  2 +-
 hw/ide/core.c             |  2 +-
 hw/scsi/scsi-disk.c       |  2 +-
 hw/scsi/scsi-generic.c    |  2 +-
 include/block/block.h     |  2 +-
 include/block/block_int.h |  4 ++--
 7 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 04c409a..b738abe 100644
--- a/block.c
+++ b/block.c
@@ -812,7 +812,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
     }
 
     bs->open_flags = flags;
-    bs->buffer_alignment = 512;
+    bs->guest_block_size = 512;
     bs->zero_beyond_eof = true;
     open_flags = bdrv_open_flags(bs, flags);
     bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -1648,7 +1648,7 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->dev_ops            = bs_src->dev_ops;
     bs_dest->dev_opaque         = bs_src->dev_opaque;
     bs_dest->dev                = bs_src->dev;
-    bs_dest->buffer_alignment   = bs_src->buffer_alignment;
+    bs_dest->guest_block_size   = bs_src->guest_block_size;
     bs_dest->copy_on_read       = bs_src->copy_on_read;
 
     bs_dest->enable_write_cache = bs_src->enable_write_cache;
@@ -1800,7 +1800,7 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev)
     bs->dev = NULL;
     bs->dev_ops = NULL;
     bs->dev_opaque = NULL;
-    bs->buffer_alignment = 512;
+    bs->guest_block_size = 512;
 }
 
 /* TODO change to return DeviceState * when all users are qdevified */
@@ -4561,9 +4561,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
     return NULL;
 }
 
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
 {
-    bs->buffer_alignment = align;
+    bs->guest_block_size = align;
 }
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 19d0961..8a568e5 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -731,7 +731,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
                     virtio_blk_save, virtio_blk_load, s);
     bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
-    bdrv_set_buffer_alignment(s->bs, s->conf->logical_block_size);
+    bdrv_set_guest_block_size(s->bs, s->conf->logical_block_size);
 
     bdrv_iostatus_enable(s->bs);
 
diff --git a/hw/ide/core.c b/hw/ide/core.c
index e1f4c33..036cd4a 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2103,7 +2103,7 @@ int ide_init_drive(IDEState *s, BlockDriverState *bs, IDEDriveKind kind,
     s->smart_selftest_count = 0;
     if (kind == IDE_CD) {
         bdrv_set_dev_ops(bs, &ide_cd_block_ops, s);
-        bdrv_set_buffer_alignment(bs, 2048);
+        bdrv_set_guest_block_size(bs, 2048);
     } else {
         if (!bdrv_is_inserted(s->bs)) {
             error_report("Device needs media, but drive is empty");
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index bce617c..6491091 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2254,7 +2254,7 @@ static int scsi_initfn(SCSIDevice *dev)
     } else {
         bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s);
     }
-    bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize);
+    bdrv_set_guest_block_size(s->qdev.conf.bs, s->qdev.blocksize);
 
     bdrv_iostatus_enable(s->qdev.conf.bs);
     add_boot_device_path(s->qdev.conf.bootindex, &dev->qdev, NULL);
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 8f195be..f08b64e 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -210,7 +210,7 @@ static void scsi_read_complete(void * opaque, int ret)
             s->blocksize = ldl_be_p(&r->buf[8]);
             s->max_lba = ldq_be_p(&r->buf[0]);
         }
-        bdrv_set_buffer_alignment(s->conf.bs, s->blocksize);
+        bdrv_set_guest_block_size(s->conf.bs, s->blocksize);
 
         scsi_req_data(&r->req, len);
         if (!r->req.io_canceled) {
diff --git a/include/block/block.h b/include/block/block.h
index cf63ee2..05252d5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -422,7 +422,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
 /* Returns the alignment in bytes that is required so that no bounce buffer
  * is required throughout the stack */
 size_t bdrv_opt_mem_align(BlockDriverState *bs);
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a602e14..36c068f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,8 +319,8 @@ struct BlockDriverState {
     /* Whether produces zeros when read beyond eof */
     bool zero_beyond_eof;
 
-    /* the memory alignment required for the buffers handled by this driver */
-    int buffer_alignment;
+    /* the block size for which the guest device expects atomicity */
+    int guest_block_size;
 
     /* do we need to tell the quest if we have a volatile write cache? */
     int enable_write_cache;
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (6 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-21 13:03   ` Benoît Canet
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv() Kevin Wolf
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

From: Paolo Bonzini <pbonzini@redhat.com>

Add a bs->request_alignment field that contains the required
offset/length alignment for I/O requests and fill it in the raw block
drivers. Use ioctls if possible, else see what alignment it takes for
O_DIRECT to succeed.

While at it, also expose the memory alignment requirements, which may be
(and in practice are) different from the disk alignment requirements.
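
The fallback probing described above can be sketched in isolation as
follows. This is only an illustration, not the code in this patch:
probe_fn, probe_alignment() and fake_device_accepts() are hypothetical
names, and the callback stands in for an O_DIRECT pread() attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if an I/O attempt using the given
 * alignment succeeds (for example an O_DIRECT pread() of that length). */
typedef bool (*probe_fn)(size_t align, void *opaque);

/*
 * Try power-of-two alignments from min_align up to max_align and return
 * the first one that the callback accepts, or 0 if none of them works.
 */
static size_t probe_alignment(probe_fn try_io, void *opaque,
                              size_t min_align, size_t max_align)
{
    size_t align;

    /* A zero start value would never grow when shifted */
    assert(min_align != 0);

    for (align = min_align; align <= max_align; align <<= 1) {
        if (try_io(align, opaque)) {
            return align;
        }
    }
    return 0;
}

/* Stand-in for a real device: accepts any multiple of a required size. */
static bool fake_device_accepts(size_t align, void *opaque)
{
    size_t required = *(size_t *)opaque;

    return align % required == 0;
}
```

With a fake device that requires 4096 bytes, probing from 512 to 4096
settles on 4096. A return value of 0 means that the caller has to pick a
default itself.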

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   |   3 ++
 block/raw-posix.c         | 102 ++++++++++++++++++++++++++++++++++++++--------
 block/raw-win32.c         |  41 +++++++++++++++++++
 include/block/block_int.h |   3 ++
 4 files changed, 132 insertions(+), 17 deletions(-)

diff --git a/block.c b/block.c
index b738abe..25ae896 100644
--- a/block.c
+++ b/block.c
@@ -813,6 +813,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
 
     bs->open_flags = flags;
     bs->guest_block_size = 512;
+    bs->request_alignment = 512;
     bs->zero_beyond_eof = true;
     open_flags = bdrv_open_flags(bs, flags);
     bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -881,6 +882,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
     }
 
     bdrv_refresh_limits(bs);
+    assert(bdrv_opt_mem_align(bs) != 0);
+    assert(bs->request_alignment != 0);
 
 #ifndef _WIN32
     if (bs->is_temporary) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 0676037..126a634 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -127,6 +127,8 @@ typedef struct BDRVRawState {
     int fd;
     int type;
     int open_flags;
+    size_t buf_align;
+
 #if defined(__linux__)
     /* linux floppy specific */
     int64_t fd_open_time;
@@ -213,6 +215,76 @@ static int raw_normalize_devicepath(const char **filename)
 }
 #endif
 
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+    unsigned int sector_size;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !(s->open_flags & O_DIRECT)) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    /* Try a few ioctls to get the right size */
+    bs->request_alignment = 0;
+    s->buf_align = 0;
+
+#ifdef BLKSSZGET
+    if (ioctl(s->fd, BLKSSZGET, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef DKIOCGETBLOCKSIZE
+    if (ioctl(s->fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef DIOCGSECTORSIZE
+    if (ioctl(s->fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        struct dioattr da;
+        if (xfsctl(NULL, s->fd, XFS_IOC_DIOINFO, &da) >= 0) {
+            bs->request_alignment = da.d_miniosz;
+            /* The kernel returns wrong information for d_mem */
+            /* s->buf_align = da.d_mem; */
+        }
+    }
+#endif
+
+    /* If we could not get the sizes so far, we can only guess them */
+    if (!s->buf_align) {
+        size_t align;
+        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
+        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+            if (pread(s->fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
+                s->buf_align = align;
+                break;
+            }
+        }
+        qemu_vfree(buf);
+    }
+
+    if (!bs->request_alignment) {
+        size_t align;
+        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
+        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+            if (pread(s->fd, buf, align, 0) >= 0) {
+                bs->request_alignment = align;
+                break;
+            }
+        }
+        qemu_vfree(buf);
+    }
+}
+
 static void raw_parse_flags(int bdrv_flags, int *open_flags)
 {
     assert(open_flags != NULL);
@@ -463,7 +535,6 @@ static int raw_reopen_prepare(BDRVReopenState *state,
     return ret;
 }
 
-
 static void raw_reopen_commit(BDRVReopenState *state)
 {
     BDRVRawReopenState *raw_s = state->opaque;
@@ -499,23 +570,15 @@ static void raw_reopen_abort(BDRVReopenState *state)
     state->opaque = NULL;
 }
 
+static int raw_refresh_limits(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
 
-/* XXX: use host sector size if necessary with:
-#ifdef DIOCGSECTORSIZE
-        {
-            unsigned int sectorsize = 512;
-            if (!ioctl(fd, DIOCGSECTORSIZE, &sectorsize) &&
-                sectorsize > bufsize)
-                bufsize = sectorsize;
-        }
-#endif
-#ifdef CONFIG_COCOA
-        uint32_t blockSize = 512;
-        if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
-            bufsize = blockSize;
-        }
-#endif
-*/
+    raw_probe_alignment(bs);
+    bs->bl.opt_mem_alignment = s->buf_align;
+
+    return 0;
+}
 
 static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
 {
@@ -1363,6 +1426,7 @@ static BlockDriver bdrv_file = {
     .bdrv_aio_writev = raw_aio_writev,
     .bdrv_aio_flush = raw_aio_flush,
     .bdrv_aio_discard = raw_aio_discard,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
@@ -1740,6 +1804,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_aio_writev	= raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_aio_discard   = hdev_aio_discard,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength	= raw_getlength,
@@ -1871,6 +1936,7 @@ static BlockDriver bdrv_host_floppy = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
@@ -1981,6 +2047,7 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
@@ -2110,6 +2177,7 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
diff --git a/block/raw-win32.c b/block/raw-win32.c
index ce314fd..beb7f23 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -202,6 +202,35 @@ static int set_sparse(int fd)
 				 NULL, 0, NULL, 0, &returned, NULL);
 }
 
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    DWORD sectorsPerCluster, freeClusters, totalClusters, count;
+    DISK_GEOMETRY_EX dg;
+    BOOL status;
+
+    if (s->type == FTYPE_CD) {
+        bs->request_alignment = 2048;
+        return;
+    }
+    if (s->type == FTYPE_HARDDISK) {
+        status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
+                                 NULL, 0, &dg, sizeof(dg), &count, NULL);
+        if (status != 0) {
+            bs->request_alignment = dg.Geometry.BytesPerSector;
+            return;
+        }
+        /* try GetDiskFreeSpace too */
+    }
+
+    if (s->drive_path[0]) {
+        GetDiskFreeSpace(s->drive_path, &sectorsPerCluster,
+                         &dg.Geometry.BytesPerSector,
+                         &freeClusters, &totalClusters);
+        bs->request_alignment = dg.Geometry.BytesPerSector;
+    }
+}
+
 static void raw_parse_flags(int flags, int *access_flags, DWORD *overlapped)
 {
     assert(access_flags != NULL);
@@ -269,6 +298,17 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    if (filename[0] && filename[1] == ':') {
+        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
+    } else if (filename[0] == '\\' && filename[1] == '\\') {
+        s->drive_path[0] = 0;
+    } else {
+        /* Relative path.  */
+        char buf[MAX_PATH];
+        GetCurrentDirectory(MAX_PATH, buf);
+        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
+    }
+
     s->hfile = CreateFile(filename, access_flags,
                           FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, overlapped, NULL);
@@ -293,6 +333,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
         s->aio = aio;
     }
 
+    raw_probe_alignment(bs);
     ret = 0;
 fail:
     qemu_opts_del(opts);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 36c068f..fb1d355 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,6 +319,9 @@ struct BlockDriverState {
     /* Whether produces zeros when read beyond eof */
     bool zero_beyond_eof;
 
+    /* Alignment requirement for offset/length of I/O requests */
+    unsigned int request_alignment;
+
     /* the block size for which the guest device expects atomicity */
     int guest_block_size;
 
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (7 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment Kevin Wolf
@ 2014-01-17 14:14 ` Kevin Wolf
  2014-01-21 13:13   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv() Kevin Wolf
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This splits bdrv_co_do_readv() into two parts: the part that needs to
happen before the request is modified to match the backend alignment, and
the part that needs to be executed afterwards and passes the request to
the BlockDriver.
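
As a rough sketch of the contract of the aligned part, the byte-based
request is asserted to be sector aligned and then converted back to
sector units for the driver. SECTOR_BITS, struct sector_range and
bytes_to_sectors() are names made up for this example; block.c uses
BDRV_SECTOR_BITS and open-codes the conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match BDRV_SECTOR_BITS (512 byte sectors) */
#define SECTOR_BITS 9
#define SECTOR_SIZE (1ULL << SECTOR_BITS)

/* Hypothetical helper type, only used in this example */
struct sector_range {
    int64_t sector_num;
    unsigned int nb_sectors;
};

/*
 * Convert an already aligned byte request into sector units. Misaligned
 * input is a caller bug, so it is caught by assertions rather than
 * reported as an error.
 */
static struct sector_range bytes_to_sectors(int64_t offset,
                                            unsigned int bytes)
{
    struct sector_range r;

    assert(offset >= 0);
    assert((offset & (SECTOR_SIZE - 1)) == 0);
    assert((bytes & (SECTOR_SIZE - 1)) == 0);

    r.sector_num = offset >> SECTOR_BITS;
    r.nb_sectors = bytes >> SECTOR_BITS;
    return r;
}
```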

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 61 +++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index 25ae896..0c2baf5 100644
--- a/block.c
+++ b/block.c
@@ -2705,26 +2705,24 @@ err:
 }
 
 /*
- * Handle a read request in coroutine context
+ * Forwards an already correctly aligned request to the BlockDriver. This
+ * handles copy on read and zeroing after EOF; any other features must be
+ * implemented by the caller.
  */
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
-    BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     BdrvTrackedRequest req;
     int ret;
 
-    if (!drv) {
-        return -ENOMEDIUM;
-    }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
-        return -EIO;
-    }
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
 
-    if (bs->copy_on_read) {
-        flags |= BDRV_REQ_COPY_ON_READ;
-    }
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
         bs->copy_on_read_in_flight++;
     }
@@ -2733,11 +2731,6 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
     }
 
-    /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, false);
-    }
-
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -2754,6 +2747,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         }
     }
 
+    /* Forward the request to the BlockDriver */
     if (!(bs->zero_beyond_eof && bs->growable)) {
         ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
     } else {
@@ -2794,6 +2788,37 @@ out:
     return ret;
 }
 
+/*
+ * Handle a read request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    BlockDriver *drv = bs->drv;
+    int ret;
+
+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+        return -EIO;
+    }
+
+    if (bs->copy_on_read) {
+        flags |= BDRV_REQ_COPY_ON_READ;
+    }
+
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_intercept(bs, nb_sectors, false);
+    }
+
+    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+    return ret;
+}
+
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (8 preceding siblings ...)
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv() Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:59   ` Max Reitz
  2014-01-21 13:29   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
                   ` (19 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Similar to bdrv_pread(), which aligns byte-aligned requests to 512 byte
sectors, bdrv_co_do_preadv() takes a byte-aligned request and aligns it
to the alignment specified in bs->request_alignment.
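
The padding arithmetic can be illustrated on its own, leaving the
QEMUIOVector handling aside. The sketch below assumes a power-of-two
alignment; struct aligned_req and align_read() are hypothetical names
that exist only in this example.

```c
#include <assert.h>
#include <stdint.h>

/* Result of widening a request to the backend alignment */
struct aligned_req {
    int64_t offset;   /* start, rounded down to the alignment */
    uint64_t bytes;   /* total length including head and tail padding */
    uint64_t head;    /* padding bytes in front of the original data */
    uint64_t tail;    /* padding bytes behind the original data */
};

static struct aligned_req align_read(int64_t offset, uint64_t bytes,
                                     uint64_t align)
{
    struct aligned_req r = { .offset = offset, .bytes = bytes };

    assert(offset >= 0);
    /* The mask arithmetic only works for powers of two */
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Head: distance from the previous alignment boundary */
    r.head = (uint64_t)offset & (align - 1);
    r.offset = (int64_t)((uint64_t)offset & ~(align - 1));
    r.bytes += r.head;

    /* Tail: the start is aligned now, so checking the length suffices */
    if (r.bytes & (align - 1)) {
        r.tail = align - (r.bytes & (align - 1));
        r.bytes += r.tail;
    }
    return r;
}
```

For example, a 3000 byte read at offset 1000 with a 4096 byte alignment
becomes a single 4096 byte read at offset 0, with 1000 bytes of head
padding and 96 bytes of tail padding.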

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 0c2baf5..f378a10 100644
--- a/block.c
+++ b/block.c
@@ -2791,17 +2791,23 @@ out:
 /*
  * Handle a read request in coroutine context
  */
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
+    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+    uint8_t *head_buf = NULL;
+    uint8_t *tail_buf = NULL;
+    QEMUIOVector local_qiov;
+    bool use_local_qiov = false;
     int ret;
 
     if (!drv) {
         return -ENOMEDIUM;
     }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+    if (bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
@@ -2811,14 +2817,60 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, false);
+        /* TODO Switch to byte granularity */
+        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
+    }
+
+    /* Align read if necessary by padding qiov */
+    if (offset & (align - 1)) {
+        head_buf = qemu_blockalign(bs, align);
+        qemu_iovec_init(&local_qiov, qiov->niov + 2);
+        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+        use_local_qiov = true;
+
+        bytes += offset & (align - 1);
+        offset = offset & ~(align - 1);
+    }
+
+    if ((offset + bytes) & (align - 1)) {
+        if (!use_local_qiov) {
+            qemu_iovec_init(&local_qiov, qiov->niov + 1);
+            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+            use_local_qiov = true;
+        }
+        tail_buf = qemu_blockalign(bs, align);
+        qemu_iovec_add(&local_qiov, tail_buf,
+                       align - ((offset + bytes) & (align - 1)));
+
+        bytes = ROUND_UP(bytes, align);
+    }
+
+    ret = bdrv_aligned_preadv(bs, offset, bytes,
+                              use_local_qiov ? &local_qiov : qiov,
+                              flags);
+
+    if (use_local_qiov) {
+        qemu_iovec_destroy(&local_qiov);
+        qemu_vfree(head_buf);
+        qemu_vfree(tail_buf);
     }
 
-    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
-                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
     return ret;
 }
 
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
+        return -EINVAL;
+    }
+
+    return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (9 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv() Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-21 13:31   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling Kevin Wolf
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This splits bdrv_co_do_writev() into two parts: the part that needs to
happen before the request is modified to match the backend alignment, and
the part that needs to be executed afterwards and passes the request to
the BlockDriver.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 41 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index f378a10..70b72f0 100644
--- a/block.c
+++ b/block.c
@@ -2964,34 +2964,20 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 }
 
 /*
- * Handle a write request in coroutine context
+ * Forwards an already correctly aligned write request to the BlockDriver.
  */
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
-    BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     BdrvTrackedRequest req;
     int ret;
 
-    if (!bs->drv) {
-        return -ENOMEDIUM;
-    }
-    if (bs->read_only) {
-        return -EACCES;
-    }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
-        return -EIO;
-    }
-
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
-    }
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
 
-    /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, true);
-    }
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
@@ -3023,6 +3009,40 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     return ret;
 }
 
+/*
+ * Handle a write request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    int ret;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+    if (bs->read_only) {
+        return -EACCES;
+    }
+    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+        return -EIO;
+    }
+
+    if (bs->copy_on_read_in_flight) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_intercept(bs, nb_sectors, true);
+    }
+
+    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+
+    return ret;
+}
+
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (10 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-21 13:33   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Waiting for all COR requests to complete first and only then calling the
throttling function means that throttling can delay the write, and that
we still need to wait for a COR request even if it was issued only after
the throttled write request.  Handle the COR dependency after I/O
throttling instead.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 70b72f0..fe55217 100644
--- a/block.c
+++ b/block.c
@@ -2979,6 +2979,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
+    if (bs->copy_on_read_in_flight) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
@@ -3028,10 +3032,6 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         return -EIO;
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
-    }
-
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
         bdrv_io_limits_intercept(bs, nb_sectors, true);
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (11 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-18  0:00   ` Max Reitz
  2014-01-21 13:36   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
                   ` (16 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This is going to become the bdrv_co_do_preadv() equivalent for writes.
In this patch, however, only a function taking byte offsets is created;
it doesn't align anything yet.
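
The sector-based wrapper has to make sure that the byte count fits into
the new argument type before shifting. A minimal sketch of such a check
follows; sectors_to_bytes() is a hypothetical helper, and the extra
range check on sector_num is an assumption of this example rather than
something this patch does.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match BDRV_SECTOR_BITS (512 byte sectors) */
#define SECTOR_BITS 9

/*
 * Convert a sector-based request to byte units. Returns 0 on success and
 * -EINVAL if the result would not fit into the output types.
 */
static int sectors_to_bytes(int64_t sector_num, int nb_sectors,
                            int64_t *offset, unsigned int *bytes)
{
    assert(offset != NULL && bytes != NULL);

    if (nb_sectors < 0 || nb_sectors > (INT_MAX >> SECTOR_BITS)) {
        return -EINVAL;
    }
    /* Not part of the patch: also reject offsets that would overflow */
    if (sector_num < 0 || sector_num > (INT64_MAX >> SECTOR_BITS)) {
        return -EINVAL;
    }

    *offset = sector_num * (int64_t)(1 << SECTOR_BITS);
    *bytes = (unsigned int)nb_sectors << SECTOR_BITS;
    return 0;
}
```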

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index fe55217..328f592 100644
--- a/block.c
+++ b/block.c
@@ -3016,8 +3016,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
 /*
  * Handle a write request in coroutine context
  */
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
     int ret;
@@ -3028,21 +3028,33 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     if (bs->read_only) {
         return -EACCES;
     }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+    if (bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, true);
+        /* TODO Switch to byte granularity */
+        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
     }
 
-    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
-                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
 
     return ret;
 }
 
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    if (nb_sectors < 0 || nb_sectors > (INT_MAX >> BDRV_SECTOR_BITS)) {
+        return -EINVAL;
+    }
+
+    return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (12 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:19   ` Max Reitz
  2014-01-21 13:49   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
                   ` (15 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 52 +++++++++++++++++++++++++++++++----------------
 block/backup.c            |  7 ++++++-
 include/block/block_int.h |  4 ++--
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index 328f592..85f28ab 100644
--- a/block.c
+++ b/block.c
@@ -2037,13 +2037,13 @@ static void tracked_request_end(BdrvTrackedRequest *req)
  */
 static void tracked_request_begin(BdrvTrackedRequest *req,
                                   BlockDriverState *bs,
-                                  int64_t sector_num,
-                                  int nb_sectors, bool is_write)
+                                  int64_t offset,
+                                  unsigned int bytes, bool is_write)
 {
     *req = (BdrvTrackedRequest){
         .bs = bs,
-        .sector_num = sector_num,
-        .nb_sectors = nb_sectors,
+        .offset = offset,
+        .bytes = bytes,
         .is_write = is_write,
         .co = qemu_coroutine_self(),
     };
@@ -2074,25 +2074,43 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
     }
 }
 
+static void round_bytes_to_clusters(BlockDriverState *bs,
+                                    int64_t offset, unsigned int bytes,
+                                    int64_t *cluster_offset,
+                                    unsigned int *cluster_bytes)
+{
+    BlockDriverInfo bdi;
+
+    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+        *cluster_offset = offset;
+        *cluster_bytes = bytes;
+    } else {
+        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
+        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
+                                       bdi.cluster_size);
+    }
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
-                                     int64_t sector_num, int nb_sectors) {
+                                     int64_t offset, unsigned int bytes)
+{
     /*        aaaa   bbbb */
-    if (sector_num >= req->sector_num + req->nb_sectors) {
+    if (offset >= req->offset + req->bytes) {
         return false;
     }
     /* bbbb   aaaa        */
-    if (req->sector_num >= sector_num + nb_sectors) {
+    if (req->offset >= offset + bytes) {
         return false;
     }
     return true;
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors)
+        int64_t offset, unsigned int bytes)
 {
     BdrvTrackedRequest *req;
-    int64_t cluster_sector_num;
-    int cluster_nb_sectors;
+    int64_t cluster_offset;
+    unsigned int cluster_bytes;
     bool retry;
 
     /* If we touch the same cluster it counts as an overlap.  This guarantees
@@ -2101,14 +2119,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
      * CoR read and write operations are atomic and guest writes cannot
      * interleave between them.
      */
-    bdrv_round_to_clusters(bs, sector_num, nb_sectors,
-                           &cluster_sector_num, &cluster_nb_sectors);
+    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
 
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (tracked_request_overlaps(req, cluster_sector_num,
-                                         cluster_nb_sectors)) {
+            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
                  * never happen since it means deadlock.
@@ -2728,10 +2744,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+        wait_for_overlapping_requests(bs, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+    tracked_request_begin(&req, bs, offset, bytes, false);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
@@ -2980,10 +2996,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+        wait_for_overlapping_requests(bs, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+    tracked_request_begin(&req, bs, offset, bytes, true);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
 
diff --git a/block/backup.c b/block/backup.c
index 0198514..15a2e55 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -181,8 +181,13 @@ static int coroutine_fn backup_before_write_notify(
         void *opaque)
 {
     BdrvTrackedRequest *req = opaque;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
 
-    return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    return backup_do_cow(req->bs, sector_num, nb_sectors, NULL);
 }
 
 static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index fb1d355..97a4d23 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -57,8 +57,8 @@
 
 typedef struct BdrvTrackedRequest {
     BlockDriverState *bs;
-    int64_t sector_num;
-    int nb_sectors;
+    int64_t offset;
+    unsigned int bytes;
     bool is_write;
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (13 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-22 19:46   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment Kevin Wolf
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Previously, it was not possible to use wait_for_overlapping_requests()
between tracked_request_begin()/end() because it would wait for itself.

Ignore the current request in the overlap check and run more of the
bdrv_co_do_preadv/pwritev code with a BdrvTrackedRequest present.
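
As an illustration only (not part of this patch), the effect of ignoring
the current request can be sketched with a simplified model. The Req type
and find_conflict() below are hypothetical stand-ins for
BdrvTrackedRequest and the QLIST_FOREACH loop; only the interval test
mirrors tracked_request_overlaps().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for BdrvTrackedRequest. */
typedef struct Req {
    int64_t offset;
    unsigned int bytes;
} Req;

/* Same interval test as tracked_request_overlaps() in the patch. */
static bool req_overlaps(const Req *req, int64_t offset, unsigned int bytes)
{
    /*        aaaa   bbbb */
    if (offset >= req->offset + req->bytes) {
        return false;
    }
    /* bbbb   aaaa        */
    if (req->offset >= offset + bytes) {
        return false;
    }
    return true;
}

/*
 * Return the first tracked request overlapping 'self', or NULL if there
 * is none. 'self' is part of the list, so it must be skipped; otherwise
 * every request would find itself and wait forever.
 */
static const Req *find_conflict(const Req *reqs, size_t n, const Req *self)
{
    assert(self != NULL);

    for (size_t i = 0; i < n; i++) {
        if (&reqs[i] == self) {
            continue;
        }
        if (req_overlaps(&reqs[i], self->offset, self->bytes)) {
            return &reqs[i];
        }
    }
    return NULL;
}
```

In this model a request that is already in the list can safely run the
check on its own range, which is what moving tracked_request_begin() in
front of the aligned functions relies on.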

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index 85f28ab..18dcd43 100644
--- a/block.c
+++ b/block.c
@@ -2106,7 +2106,7 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        int64_t offset, unsigned int bytes)
+        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
 {
     BdrvTrackedRequest *req;
     int64_t cluster_offset;
@@ -2124,6 +2124,9 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
+            if (req == self) {
+                continue;
+            }
             if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
@@ -2726,10 +2729,10 @@ err:
  * implemented by the caller.
  */
 static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
-    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+    QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
-    BdrvTrackedRequest req;
     int ret;
 
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2744,11 +2747,9 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, offset, bytes);
+        wait_for_overlapping_requests(bs, req, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, offset, bytes, false);
-
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
 
@@ -2795,8 +2796,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
 out:
-    tracked_request_end(&req);
-
     if (flags & BDRV_REQ_COPY_ON_READ) {
         bs->copy_on_read_in_flight--;
     }
@@ -2812,6 +2811,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
+
     /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
     uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
     uint8_t *head_buf = NULL;
@@ -2862,9 +2863,11 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
         bytes = ROUND_UP(bytes, align);
     }
 
-    ret = bdrv_aligned_preadv(bs, offset, bytes,
+    tracked_request_begin(&req, bs, offset, bytes, false);
+    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
+    tracked_request_end(&req);
 
     if (use_local_qiov) {
         qemu_iovec_destroy(&local_qiov);
@@ -2983,10 +2986,10 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
  * Forwards an already correctly aligned write request to the BlockDriver.
  */
 static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
-    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+    QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
-    BdrvTrackedRequest req;
     int ret;
 
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2996,12 +2999,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, offset, bytes);
+        wait_for_overlapping_requests(bs, req, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, offset, bytes, true);
-
-    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
+    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
     if (ret < 0) {
         /* Do nothing, write notifier decided to fail this request */
@@ -3024,8 +3025,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
         bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
     }
 
-    tracked_request_end(&req);
-
     return ret;
 }
 
@@ -3036,6 +3035,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
+    BdrvTrackedRequest req;
     int ret;
 
     if (!bs->drv) {
@@ -3054,7 +3054,9 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
     }
 
-    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
+    tracked_request_begin(&req, bs, offset, bytes, true);
+    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+    tracked_request_end(&req);
 
     return ret;
 }
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (14 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:21   ` Max Reitz
  2014-01-22 19:50   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation Kevin Wolf
                   ` (13 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Odd file sizes could make bdrv_aligned_preadv() shorten the request in
non-aligned ways. Fix it by rounding to the required alignment instead
of 512 bytes.
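
A minimal sketch of the rounding, for illustration only: the helper name
max_readable_sectors() and the SECTOR_* constants are assumptions made
for this example, and unlike the patch it returns early for reads that
start at or after EOF instead of relying on MAX(0, ...).

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9
#define SECTOR_SIZE ((int64_t)1 << SECTOR_BITS)

/* Round n up to the next multiple of d (d > 0, n >= 0). */
static int64_t round_up(int64_t n, int64_t d)
{
    return (n + d - 1) / d * d;
}

/*
 * Upper bound on the number of sectors handed to the driver for a read
 * starting at sector_num on an image of 'len' bytes. The bound is
 * rounded up to the request alignment so that a shortened request is
 * still a multiple of 'align' bytes.
 */
static int64_t max_readable_sectors(int64_t len, int64_t sector_num,
                                    int64_t align)
{
    int64_t total_sectors = (len + SECTOR_SIZE - 1) / SECTOR_SIZE;
    int64_t remaining = total_sectors - sector_num;

    assert(align >= SECTOR_SIZE && align % SECTOR_SIZE == 0);

    if (remaining <= 0) {
        return 0;   /* read starts at or after EOF */
    }
    return round_up(remaining, align >> SECTOR_BITS);
}
```

For a 5000 byte image (10 sectors) and a read starting at sector 8, a
512 byte alignment gives a bound of 2 sectors, while a 4k alignment
gives 8 sectors, so the driver still sees a 4k-aligned request.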

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 18dcd43..d7156ce 100644
--- a/block.c
+++ b/block.c
@@ -2730,7 +2730,7 @@ err:
  */
 static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
-    QEMUIOVector *qiov, int flags)
+    int64_t align, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     int ret;
@@ -2778,7 +2778,8 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
         }
 
         total_sectors = DIV_ROUND_UP(len, BDRV_SECTOR_SIZE);
-        max_nb_sectors = MAX(0, total_sectors - sector_num);
+        max_nb_sectors = MAX(0, ROUND_UP(total_sectors - sector_num,
+                                         align >> BDRV_SECTOR_BITS));
         if (max_nb_sectors > 0) {
             ret = drv->bdrv_co_readv(bs, sector_num,
                                      MIN(nb_sectors, max_nb_sectors), qiov);
@@ -2864,7 +2865,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     }
 
     tracked_request_begin(&req, bs, offset, bytes, false);
-    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
+    ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
     tracked_request_end(&req);
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (15 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-22 20:00   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic Kevin Wolf
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Change the API so that specific requests can be marked as serialising.
Overlaps are then only checked between two requests if at least one of
them is serialising.

This means that during a Copy on Read operation, not all requests
overlapping other requests are serialised any more, but only those that
actually overlap with the specific COR request.

Also remove COR from function and variable names because this
functionality can be useful in other contexts.
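
The bookkeeping can be sketched as follows. This is an illustrative
model only: Bs, Req, mark_serialising(), request_end() and
needs_overlap_check() are hypothetical simplifications of
BlockDriverState, BdrvTrackedRequest and the functions touched by this
patch, with the actual range comparison and waiting left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the QEMU structures. */
typedef struct Bs {
    unsigned int serialising_in_flight;
} Bs;

typedef struct Req {
    Bs *bs;
    bool serialising;
} Req;

/* Idempotent: a request is counted at most once. */
static void mark_serialising(Req *req)
{
    if (!req->serialising) {
        req->bs->serialising_in_flight++;
        req->serialising = true;
    }
}

/* Drop the request from the in-flight count when it completes. */
static void request_end(Req *req)
{
    assert(!req->serialising || req->bs->serialising_in_flight > 0);
    if (req->serialising) {
        req->bs->serialising_in_flight--;
    }
}

/*
 * Decide whether 'self' has to compare its range against 'req' at all.
 * Two ordinary requests never wait for each other, and the whole check
 * is skipped while no serialising request is in flight.
 */
static bool needs_overlap_check(const Req *self, const Req *req)
{
    if (!self->bs->serialising_in_flight) {
        return false;
    }
    if (req == self) {
        return false;
    }
    return req->serialising || self->serialising;
}
```

The counter gives the common case (no serialising request at all) a
cheap early exit before the request list is walked.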

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 48 ++++++++++++++++++++++++++++-------------------
 include/block/block_int.h |  5 +++--
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index d7156ce..efa8979 100644
--- a/block.c
+++ b/block.c
@@ -2028,6 +2028,10 @@ int bdrv_commit_all(void)
  */
 static void tracked_request_end(BdrvTrackedRequest *req)
 {
+    if (req->serialising) {
+        req->bs->serialising_in_flight--;
+    }
+
     QLIST_REMOVE(req, list);
     qemu_co_queue_restart_all(&req->wait_queue);
 }
@@ -2042,10 +2046,11 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
 {
     *req = (BdrvTrackedRequest){
         .bs = bs,
-        .offset = offset,
-        .bytes = bytes,
-        .is_write = is_write,
-        .co = qemu_coroutine_self(),
+        .offset         = offset,
+        .bytes          = bytes,
+        .is_write       = is_write,
+        .co             = qemu_coroutine_self(),
+        .serialising    = false,
     };
 
     qemu_co_queue_init(&req->wait_queue);
@@ -2053,6 +2058,14 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+static void mark_request_serialising(BdrvTrackedRequest *req)
+{
+    if (!req->serialising) {
+        req->bs->serialising_in_flight++;
+        req->serialising = true;
+    }
+}
+
 /**
  * Round a region to cluster boundaries
  */
@@ -2105,26 +2118,31 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
-static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
+static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
+    BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     int64_t cluster_offset;
     unsigned int cluster_bytes;
     bool retry;
 
+    if (!bs->serialising_in_flight) {
+        return;
+    }
+
     /* If we touch the same cluster it counts as an overlap.  This guarantees
      * that allocating writes will be serialized and not race with each other
      * for the same cluster.  For example, in copy-on-read it ensures that the
      * CoR read and write operations are atomic and guest writes cannot
      * interleave between them.
      */
-    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+    round_bytes_to_clusters(bs, self->offset, self->bytes,
+                            &cluster_offset, &cluster_bytes);
 
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (req == self) {
+            if (req == self || (!req->serialising && !self->serialising)) {
                 continue;
             }
             if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
@@ -2743,12 +2761,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
-        bs->copy_on_read_in_flight++;
+        mark_request_serialising(req);
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, req, offset, bytes);
-    }
+    wait_serialising_requests(req);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
@@ -2797,10 +2813,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
 out:
-    if (flags & BDRV_REQ_COPY_ON_READ) {
-        bs->copy_on_read_in_flight--;
-    }
-
     return ret;
 }
 
@@ -2999,9 +3011,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, req, offset, bytes);
-    }
+    wait_serialising_requests(req);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 97a4d23..d8443df 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -60,6 +60,7 @@ typedef struct BdrvTrackedRequest {
     int64_t offset;
     unsigned int bytes;
     bool is_write;
+    bool serialising;
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
@@ -296,8 +297,8 @@ struct BlockDriverState {
     /* Callback before write request is processed */
     NotifierWithReturnList before_write_notifiers;
 
-    /* number of in-flight copy-on-read requests */
-    unsigned int copy_on_read_in_flight;
+    /* number of in-flight serialising requests */
+    unsigned int serialising_in_flight;
 
     /* I/O throttling */
     ThrottleState throttle_state;
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (16 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-22 20:15   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point Kevin Wolf
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Copy on Read wants to serialise with all requests touching the same
cluster, so wait_serialising_requests() rounded to cluster boundaries.
Other users like alignment RMW will have different requirements, though
(requests touching the same sector), so make it dynamic.
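
To make the arithmetic concrete, here is a small standalone sketch of
how the overlap range grows with the alignment. Req and
widen_overlap_range() are hypothetical names for this example; the
calculation follows mark_request_serialising() but assumes that align
is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for BdrvTrackedRequest. */
typedef struct Req {
    int64_t offset;
    unsigned int bytes;
    int64_t overlap_offset;     /* starts out equal to offset */
    unsigned int overlap_bytes; /* starts out equal to bytes */
} Req;

/*
 * Extend the overlap range of 'req' to 'align' boundaries. The range
 * only ever grows, so calling this again with a smaller alignment
 * leaves it unchanged.
 */
static void widen_overlap_range(Req *req, int64_t align)
{
    int64_t overlap_offset, overlap_end;
    unsigned int overlap_bytes;

    assert(align > 0 && (align & (align - 1)) == 0);

    overlap_offset = req->offset & ~(align - 1);
    overlap_end = (req->offset + req->bytes + align - 1) & ~(align - 1);
    overlap_bytes = (unsigned int)(overlap_end - overlap_offset);

    if (overlap_offset < req->overlap_offset) {
        req->overlap_offset = overlap_offset;
    }
    if (overlap_bytes > req->overlap_bytes) {
        req->overlap_bytes = overlap_bytes;
    }
}
```

A request for 1000 bytes at offset 5000 gets the range [4608, 6144)
with a 512 byte alignment and [4096, 8192) with a 4k cluster size.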

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 53 ++++++++++++++++++++++++-----------------------
 include/block/block_int.h |  4 ++++
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index efa8979..e72966a 100644
--- a/block.c
+++ b/block.c
@@ -2051,6 +2051,8 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
         .is_write       = is_write,
         .co             = qemu_coroutine_self(),
         .serialising    = false,
+        .overlap_offset = offset,
+        .overlap_bytes  = bytes,
     };
 
     qemu_co_queue_init(&req->wait_queue);
@@ -2058,12 +2060,19 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
-static void mark_request_serialising(BdrvTrackedRequest *req)
+static void mark_request_serialising(BdrvTrackedRequest *req, size_t align)
 {
+    int64_t overlap_offset = req->offset & ~(align - 1);
+    int overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
+                      - overlap_offset;
+
     if (!req->serialising) {
         req->bs->serialising_in_flight++;
         req->serialising = true;
     }
+
+    req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
+    req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
 }
 
 /**
@@ -2087,20 +2096,16 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
     }
 }
 
-static void round_bytes_to_clusters(BlockDriverState *bs,
-                                    int64_t offset, unsigned int bytes,
-                                    int64_t *cluster_offset,
-                                    unsigned int *cluster_bytes)
+static int bdrv_get_cluster_size(BlockDriverState *bs)
 {
     BlockDriverInfo bdi;
+    int ret;
 
-    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
-        *cluster_offset = offset;
-        *cluster_bytes = bytes;
+    ret = bdrv_get_info(bs, &bdi);
+    if (ret < 0 || bdi.cluster_size == 0) {
+        return bs->request_alignment;
     } else {
-        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
-        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
-                                       bdi.cluster_size);
+        return bdi.cluster_size;
     }
 }
 
@@ -2108,11 +2113,11 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t offset, unsigned int bytes)
 {
     /*        aaaa   bbbb */
-    if (offset >= req->offset + req->bytes) {
+    if (offset >= req->overlap_offset + req->overlap_bytes) {
         return false;
     }
     /* bbbb   aaaa        */
-    if (req->offset >= offset + bytes) {
+    if (req->overlap_offset >= offset + bytes) {
         return false;
     }
     return true;
@@ -2122,30 +2127,21 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
-    int64_t cluster_offset;
-    unsigned int cluster_bytes;
     bool retry;
 
     if (!bs->serialising_in_flight) {
         return;
     }
 
-    /* If we touch the same cluster it counts as an overlap.  This guarantees
-     * that allocating writes will be serialized and not race with each other
-     * for the same cluster.  For example, in copy-on-read it ensures that the
-     * CoR read and write operations are atomic and guest writes cannot
-     * interleave between them.
-     */
-    round_bytes_to_clusters(bs, self->offset, self->bytes,
-                            &cluster_offset, &cluster_bytes);
-
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
             if (req == self || (!req->serialising && !self->serialising)) {
                 continue;
             }
-            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
+            if (tracked_request_overlaps(req, self->overlap_offset,
+                                         self->overlap_bytes))
+            {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
                  * never happen since it means deadlock.
@@ -2761,7 +2757,12 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
-        mark_request_serialising(req);
+        /* If we touch the same cluster it counts as an overlap.  This
+         * guarantees that allocating writes will be serialized and not race
+         * with each other for the same cluster.  For example, in copy-on-read
+         * it ensures that the CoR read and write operations are atomic and
+         * guest writes cannot interleave between them. */
+        mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
     wait_serialising_requests(req);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index d8443df..ccd2c68 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -60,7 +60,11 @@ typedef struct BdrvTrackedRequest {
     int64_t offset;
     unsigned int bytes;
     bool is_write;
+
     bool serialising;
+    int64_t overlap_offset;
+    unsigned int overlap_bytes;
+
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (17 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-22 20:21   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

We can only have a single wait_serialising_requests() call per request
because otherwise we can run into deadlocks where requests are waiting
for each other. The same is true when wait_serialising_requests() is not
at the very beginning of a request, so that other requests can be issued
between the start of the tracking and wait_serialising_requests().

Fix this by changing wait_serialising_requests() to ignore requests that
are already (directly or indirectly) waiting for the calling request.
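
The rule can be modelled without coroutines. The sketch below is an
illustration under simplifying assumptions: Req, begin_wait() and
end_wait() are hypothetical, and the actual blocking done by
qemu_co_queue_wait() is replaced by a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for BdrvTrackedRequest. */
typedef struct Req Req;
struct Req {
    Req *waiting_for;   /* request this one is blocked on, or NULL */
};

/*
 * Called when 'self' has found an overlapping request 'req'. Returns
 * true if 'self' should block on 'req' (and records the dependency),
 * or false if 'req' is itself waiting and must be skipped to avoid a
 * wait cycle.
 */
static bool begin_wait(Req *self, Req *req)
{
    assert(self != req);

    if (req->waiting_for) {
        return false;
    }
    self->waiting_for = req;
    return true;
}

/* Called after 'self' has been woken up again. */
static void end_wait(Req *self)
{
    self->waiting_for = NULL;
}
```

With two overlapping requests A and B, A blocks on B first; when B then
reaches its own check it sees that A is already waiting and carries on,
so the two never end up waiting for each other.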

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 13 ++++++++++---
 include/block/block_int.h |  2 ++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index e72966a..55e8c69 100644
--- a/block.c
+++ b/block.c
@@ -2148,9 +2148,16 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                  */
                 assert(qemu_coroutine_self() != req->co);
 
-                qemu_co_queue_wait(&req->wait_queue);
-                retry = true;
-                break;
+                /* If the request is already (indirectly) waiting for us, or
+                 * will wait for us as soon as it wakes up, then just go on
+                 * (instead of producing a deadlock in the former case). */
+                if (!req->waiting_for) {
+                    self->waiting_for = req;
+                    qemu_co_queue_wait(&req->wait_queue);
+                    self->waiting_for = NULL;
+                    retry = true;
+                    break;
+                }
             }
         }
     } while (retry);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ccd2c68..fdf0e0b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -68,6 +68,8 @@ typedef struct BdrvTrackedRequest {
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
+
+    struct BdrvTrackedRequest *waiting_for;
 } BdrvTrackedRequest;
 
 struct BlockDriver {
-- 
1.8.1.4


* [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev()
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (18 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-22 20:29   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev Kevin Wolf
                   ` (9 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

This patch changes bdrv_co_do_pwritev() to actually be what its name
promises. If requests aren't properly aligned, it performs a RMW.

Requests touching the same block are serialised against the RMW request.
Further optimisation of this is possible by differentiating types of
requests (concurrent reads should actually be okay here).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
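Note for readers: the padding arithmetic that the RMW path relies on can be
sketched in isolation as follows. This is only an illustration, not code from
the patch; PaddedRequest and pad_write_request() are hypothetical names, and
the sketch assumes that align is a power of two, as the masking in the patch
does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of the patch. */
typedef struct PaddedRequest {
    uint64_t offset;    /* request start, rounded down to the alignment */
    uint64_t bytes;     /* request length after head and tail padding */
    uint64_t head_pad;  /* bytes in front of the data, from the head read */
    uint64_t tail_pad;  /* bytes behind the data, from the tail read */
} PaddedRequest;

/* Widen [offset, offset + bytes) so that both ends are multiples of align. */
static PaddedRequest pad_write_request(uint64_t offset, uint64_t bytes,
                                       uint64_t align)
{
    PaddedRequest r;
    uint64_t end;

    /* Assumption: align is a non-zero power of two */
    assert(align > 0 && (align & (align - 1)) == 0);

    r.head_pad = offset & (align - 1);
    r.offset = offset & ~(align - 1);
    r.bytes = bytes + r.head_pad;

    end = r.offset + r.bytes;
    r.tail_pad = (end & (align - 1)) ? align - (end & (align - 1)) : 0;
    r.bytes += r.tail_pad;

    return r;
}
```

For example, a 0x200 byte write at offset 0x200 with a 4k alignment would
become a single 4k write at offset 0, with 0x200 bytes of head padding and
0xc00 bytes of tail padding.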
 block.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 55e8c69..859e1aa 100644
--- a/block.c
+++ b/block.c
@@ -3055,6 +3055,12 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     BdrvRequestFlags flags)
 {
     BdrvTrackedRequest req;
+    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+    uint8_t *head_buf = NULL;
+    uint8_t *tail_buf = NULL;
+    QEMUIOVector local_qiov;
+    bool use_local_qiov = false;
     int ret;
 
     if (!bs->drv) {
@@ -3073,10 +3079,88 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
     }
 
+    /*
+     * Align write if necessary by performing a read-modify-write cycle.
+     * Pad qiov with the read parts and be sure to have a tracked request not
+     * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
+     */
     tracked_request_begin(&req, bs, offset, bytes, true);
-    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+
+    if (offset & (align - 1)) {
+        QEMUIOVector head_qiov;
+        struct iovec head_iov;
+
+        mark_request_serialising(&req, align);
+        wait_serialising_requests(&req);
+
+        head_buf = qemu_blockalign(bs, align);
+        head_iov = (struct iovec) {
+            .iov_base   = head_buf,
+            .iov_len    = align,
+        };
+        qemu_iovec_init_external(&head_qiov, &head_iov, 1);
+
+        ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
+                                  align, &head_qiov, 0);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        qemu_iovec_init(&local_qiov, qiov->niov + 2);
+        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+        use_local_qiov = true;
+
+        bytes += offset & (align - 1);
+        offset = offset & ~(align - 1);
+    }
+
+    if ((offset + bytes) & (align - 1)) {
+        QEMUIOVector tail_qiov;
+        struct iovec tail_iov;
+        size_t tail_bytes;
+
+        mark_request_serialising(&req, align);
+        wait_serialising_requests(&req);
+
+        tail_buf = qemu_blockalign(bs, align);
+        tail_iov = (struct iovec) {
+            .iov_base   = tail_buf,
+            .iov_len    = align,
+        };
+        qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
+
+        ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
+                                  align, &tail_qiov, 0);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        if (!use_local_qiov) {
+            qemu_iovec_init(&local_qiov, qiov->niov + 1);
+            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+            use_local_qiov = true;
+        }
+
+        tail_bytes = (offset + bytes) & (align - 1);
+        qemu_iovec_add(&local_qiov, tail_buf + tail_bytes, align - tail_bytes);
+
+        bytes = ROUND_UP(bytes, align);
+    }
+
+    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes,
+                               use_local_qiov ? &local_qiov : qiov,
+                               flags);
+
+fail:
     tracked_request_end(&req);
 
+    if (use_local_qiov) {
+        qemu_iovec_destroy(&local_qiov);
+    }
+    qemu_vfree(head_buf);
+    qemu_vfree(tail_buf);
+
     return ret;
 }
 
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (19 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:42   ` Max Reitz
  2014-01-24 16:09   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 22/29] block: Change coroutine wrapper to byte granularity Kevin Wolf
                   ` (8 subsequent siblings)
  29 siblings, 2 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

If a request calls wait_serialising_requests() and actually has to wait
in this function (i.e. a coroutine yield), other requests can run and
previously read data (like the head or tail buffer) could become
outdated. In this case, we would have to restart from the beginning to
read in the updated data.

However, we're lucky and don't actually need to do that: A request can
only wait in the first call of wait_serialising_requests() because we
mark it as serialising before that call, so any later requests would
wait. So as we don't wait in practice, we don't have to reload the data.

This is an important assumption that must not be broken, or data
corruption will happen. Document it with some assertions.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
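Note for readers: the "only the first call can wait" argument can be modelled
with a toy that has no coroutines at all. Everything below is hypothetical
(ToyRequest, toy_wait_serialising()); a coroutine yield is simulated by simply
marking the conflicting request as completed, and overlap checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a tracked request; not the real BdrvTrackedRequest. */
typedef struct ToyRequest {
    bool in_flight;
    bool serialising;
} ToyRequest;

/*
 * Simplified model of wait_serialising_requests(): two requests conflict
 * if at least one of them is serialising. "Waiting" is simulated by
 * completing the other request. Returns true if we had to wait at all.
 */
static bool toy_wait_serialising(ToyRequest *self, ToyRequest *others,
                                 size_t n)
{
    bool waited = false;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!others[i].in_flight) {
            continue;
        }
        if (!self->serialising && !others[i].serialising) {
            continue;
        }
        others[i].in_flight = false;    /* stands in for a yield */
        waited = true;
    }

    return waited;
}
```

In this model a second call for the same request returns false as long as no
new conflicting request was started in between, which is the condition the
new assertions rely on.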
 block.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 859e1aa..53d9bd5 100644
--- a/block.c
+++ b/block.c
@@ -2123,14 +2123,15 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
-static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     bool retry;
+    bool waited = false;
 
     if (!bs->serialising_in_flight) {
-        return;
+        return false;
     }
 
     do {
@@ -2156,11 +2157,14 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                     qemu_co_queue_wait(&req->wait_queue);
                     self->waiting_for = NULL;
                     retry = true;
+                    waited = true;
                     break;
                 }
             }
         }
     } while (retry);
+
+    return waited;
 }
 
 /*
@@ -3011,6 +3015,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
+    bool waited;
     int ret;
 
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -3019,7 +3024,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
-    wait_serialising_requests(req);
+    waited = wait_serialising_requests(req);
+    assert(!waited || !req->serialising);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
@@ -3119,9 +3125,11 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         QEMUIOVector tail_qiov;
         struct iovec tail_iov;
         size_t tail_bytes;
+        bool waited;
 
         mark_request_serialising(&req, align);
-        wait_serialising_requests(&req);
+        waited = wait_serialising_requests(&req);
+        assert(!waited || !use_local_qiov);
 
         tail_buf = qemu_blockalign(bs, align);
         tail_iov = (struct iovec) {
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 22/29] block: Change coroutine wrapper to byte granularity
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (20 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 23/29] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 48 ++++++++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/block.c b/block.c
index 53d9bd5..2a7d1b3 100644
--- a/block.c
+++ b/block.c
@@ -69,11 +69,11 @@ static int coroutine_fn bdrv_co_readv_em(BlockDriverState *bs,
 static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
                                          int64_t sector_num, int nb_sectors,
                                          QEMUIOVector *iov);
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
@@ -2374,8 +2374,7 @@ static int bdrv_check_request(BlockDriverState *bs, int64_t sector_num,
 
 typedef struct RwCo {
     BlockDriverState *bs;
-    int64_t sector_num;
-    int nb_sectors;
+    int64_t offset;
     QEMUIOVector *qiov;
     bool is_write;
     int ret;
@@ -2387,34 +2386,32 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
     RwCo *rwco = opaque;
 
     if (!rwco->is_write) {
-        rwco->ret = bdrv_co_do_readv(rwco->bs, rwco->sector_num,
-                                     rwco->nb_sectors, rwco->qiov,
-                                     rwco->flags);
-    } else {
-        rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
-                                      rwco->nb_sectors, rwco->qiov,
+        rwco->ret = bdrv_co_do_preadv(rwco->bs, rwco->offset,
+                                      rwco->qiov->size, rwco->qiov,
                                       rwco->flags);
+    } else {
+        rwco->ret = bdrv_co_do_pwritev(rwco->bs, rwco->offset,
+                                       rwco->qiov->size, rwco->qiov,
+                                       rwco->flags);
     }
 }
 
 /*
  * Process a vectored synchronous request using coroutines
  */
-static int bdrv_rwv_co(BlockDriverState *bs, int64_t sector_num,
-                       QEMUIOVector *qiov, bool is_write,
-                       BdrvRequestFlags flags)
+static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
+                        QEMUIOVector *qiov, bool is_write,
+                        BdrvRequestFlags flags)
 {
     Coroutine *co;
     RwCo rwco = {
         .bs = bs,
-        .sector_num = sector_num,
-        .nb_sectors = qiov->size >> BDRV_SECTOR_BITS,
+        .offset = offset,
         .qiov = qiov,
         .is_write = is_write,
         .ret = NOT_DONE,
         .flags = flags,
     };
-    assert((qiov->size & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     /**
      * In sync call context, when the vcpu is blocked, this throttling timer
@@ -2453,7 +2450,8 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
     };
 
     qemu_iovec_init_external(&qiov, &iov, 1);
-    return bdrv_rwv_co(bs, sector_num, &qiov, is_write, flags);
+    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS,
+                        &qiov, is_write, flags);
 }
 
 /* return < 0 if error. See bdrv_write() for the return codes */
@@ -2491,7 +2489,7 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
 
 int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
 {
-    return bdrv_rwv_co(bs, sector_num, qiov, true, 0);
+    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
 }
 
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
@@ -4611,9 +4609,15 @@ int bdrv_flush(BlockDriverState *bs)
     return rwco.ret;
 }
 
+typedef struct DiscardCo {
+    BlockDriverState *bs;
+    int64_t sector_num;
+    int nb_sectors;
+    int ret;
+} DiscardCo;
 static void coroutine_fn bdrv_discard_co_entry(void *opaque)
 {
-    RwCo *rwco = opaque;
+    DiscardCo *rwco = opaque;
 
     rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
 }
@@ -4697,7 +4701,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
 {
     Coroutine *co;
-    RwCo rwco = {
+    DiscardCo rwco = {
         .bs = bs,
         .sector_num = sector_num,
         .nb_sectors = nb_sectors,
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 23/29] block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (21 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 22/29] block: Change coroutine wrapper to byte granularity Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() " Kevin Wolf
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_preadv().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
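Note for readers: what "alignment adjustment" means for a read can be shown
with a small in-memory model. This is a sketch under assumptions, not the
bdrv_co_do_preadv() implementation: it uses a bounce buffer for brevity, and
toy_aligned_read(), toy_pread(), TOY_ALIGN and TOY_DISK_SIZE are made-up
names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TOY_ALIGN = 8, TOY_DISK_SIZE = 32 };

/* Hypothetical backend that rejects anything but aligned requests. */
static int toy_aligned_read(const uint8_t *disk, uint64_t offset,
                            uint8_t *buf, uint64_t bytes)
{
    if (((offset | bytes) & (TOY_ALIGN - 1)) != 0) {
        return -1;
    }
    if (offset > TOY_DISK_SIZE || bytes > TOY_DISK_SIZE - offset) {
        return -1;
    }
    memcpy(buf, disk + offset, bytes);
    return 0;
}

/* Byte-granularity read: round the request out, then copy the middle. */
static int toy_pread(const uint8_t *disk, uint64_t offset,
                     uint8_t *buf, int bytes)
{
    uint8_t bounce[TOY_DISK_SIZE];
    uint64_t start, end;

    if (bytes < 0) {
        return -1;
    }
    if (offset > TOY_DISK_SIZE || (uint64_t)bytes > TOY_DISK_SIZE - offset) {
        return -1;
    }

    start = offset & ~(uint64_t)(TOY_ALIGN - 1);
    end = (offset + bytes + TOY_ALIGN - 1) & ~(uint64_t)(TOY_ALIGN - 1);

    if (toy_aligned_read(disk, start, bounce, end - start) < 0) {
        return -1;
    }
    memcpy(buf, bounce + (offset - start), bytes);

    return bytes;
}
```

Like bdrv_pread() after this patch, the wrapper returns the number of bytes
on success and a negative value on error.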
 block.c | 49 +++++++++++++------------------------------------
 1 file changed, 13 insertions(+), 36 deletions(-)

diff --git a/block.c b/block.c
index 2a7d1b3..4dc4711 100644
--- a/block.c
+++ b/block.c
@@ -2541,49 +2541,26 @@ int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags)
     }
 }
 
-int bdrv_pread(BlockDriverState *bs, int64_t offset,
-               void *buf, int count1)
+int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
 {
-    uint8_t tmp_buf[BDRV_SECTOR_SIZE];
-    int len, nb_sectors, count;
-    int64_t sector_num;
+    QEMUIOVector qiov;
+    struct iovec iov = {
+        .iov_base = (void *)buf,
+        .iov_len = bytes,
+    };
     int ret;
 
-    count = count1;
-    /* first read to align to sector start */
-    len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
-    if (len > count)
-        len = count;
-    sector_num = offset >> BDRV_SECTOR_BITS;
-    if (len > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        memcpy(buf, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)), len);
-        count -= len;
-        if (count == 0)
-            return count1;
-        sector_num++;
-        buf += len;
+    if (bytes < 0) {
+        return -EINVAL;
     }
 
-    /* read the sectors "in place" */
-    nb_sectors = count >> BDRV_SECTOR_BITS;
-    if (nb_sectors > 0) {
-        if ((ret = bdrv_read(bs, sector_num, buf, nb_sectors)) < 0)
-            return ret;
-        sector_num += nb_sectors;
-        len = nb_sectors << BDRV_SECTOR_BITS;
-        buf += len;
-        count -= len;
+    qemu_iovec_init_external(&qiov, &iov, 1);
+    ret = bdrv_prwv_co(bs, offset, &qiov, false, 0);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* add data from the last sector */
-    if (count > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        memcpy(buf, tmp_buf, count);
-    }
-    return count1;
+    return bytes;
 }
 
 int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (22 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 23/29] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:43   ` Max Reitz
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment Kevin Wolf
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_pwritev().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c               | 64 ++++++++-------------------------------------------
 include/block/block.h |  1 -
 2 files changed, 9 insertions(+), 56 deletions(-)

diff --git a/block.c b/block.c
index 4dc4711..812b1b2 100644
--- a/block.c
+++ b/block.c
@@ -2487,11 +2487,6 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
     return bdrv_rw_co(bs, sector_num, (uint8_t *)buf, nb_sectors, true, 0);
 }
 
-int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
-{
-    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
-}
-
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
                       int nb_sectors, BdrvRequestFlags flags)
 {
@@ -2565,70 +2560,29 @@ int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
 
 int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
 {
-    uint8_t tmp_buf[BDRV_SECTOR_SIZE];
-    int len, nb_sectors, count;
-    int64_t sector_num;
     int ret;
 
-    count = qiov->size;
-
-    /* first write to align to sector start */
-    len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
-    if (len > count)
-        len = count;
-    sector_num = offset >> BDRV_SECTOR_BITS;
-    if (len > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        qemu_iovec_to_buf(qiov, 0, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)),
-                          len);
-        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        count -= len;
-        if (count == 0)
-            return qiov->size;
-        sector_num++;
-    }
-
-    /* write the sectors "in place" */
-    nb_sectors = count >> BDRV_SECTOR_BITS;
-    if (nb_sectors > 0) {
-        QEMUIOVector qiov_inplace;
-
-        qemu_iovec_init(&qiov_inplace, qiov->niov);
-        qemu_iovec_concat(&qiov_inplace, qiov, len,
-                          nb_sectors << BDRV_SECTOR_BITS);
-        ret = bdrv_writev(bs, sector_num, &qiov_inplace);
-        qemu_iovec_destroy(&qiov_inplace);
-        if (ret < 0) {
-            return ret;
-        }
-
-        sector_num += nb_sectors;
-        len = nb_sectors << BDRV_SECTOR_BITS;
-        count -= len;
+    ret = bdrv_prwv_co(bs, offset, qiov, true, 0);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* add data from the last sector */
-    if (count > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        qemu_iovec_to_buf(qiov, qiov->size - count, tmp_buf, count);
-        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-    }
     return qiov->size;
 }
 
 int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
-                const void *buf, int count1)
+                const void *buf, int bytes)
 {
     QEMUIOVector qiov;
     struct iovec iov = {
         .iov_base   = (void *) buf,
-        .iov_len    = count1,
+        .iov_len    = bytes,
     };
 
+    if (bytes < 0) {
+        return -EINVAL;
+    }
+
     qemu_iovec_init_external(&qiov, &iov, 1);
     return bdrv_pwritev(bs, offset, &qiov);
 }
diff --git a/include/block/block.h b/include/block/block.h
index 05252d5..7e40ccc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -220,7 +220,6 @@ BlockDriverAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs, int64_t sector_num
                                         int nb_sectors, BdrvRequestFlags flags,
                                         BlockDriverCompletionFunc *cb, void *opaque);
 int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags);
-int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov);
 int bdrv_pread(BlockDriverState *bs, int64_t offset,
                void *buf, int count);
 int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (23 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() " Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-24 16:29   ` Benoît Canet
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable Kevin Wolf
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

From: Paolo Bonzini <pbonzini@redhat.com>

The iSCSI backend already gets the block size from the READ CAPACITY
command it sends.  Save it so that the generic block layer gets it
too.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/iscsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index 3202dc5..23c3fc4 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1217,6 +1217,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
         goto out;
     }
     bs->total_sectors = sector_lun2qemu(iscsilun->num_blocks, iscsilun);
+    bs->request_alignment = iscsilun->block_size;
 
     /* Medium changer or tape. We dont have any emulation for this so this must
      * be sg ioctl compatible. We force it to be sg, otherwise qemu will try
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (24 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:50   ` Max Reitz
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep' Kevin Wolf
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

The new 'align' option of blkdebug can be used in order to emulate
backends with a required 4k alignment on hosts which only really require
512 byte alignment.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
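Note for readers: the validity check applied to the 'align' option, taken on
its own, accepts non-zero powers of two below INT_MAX. The function name
toy_alignment_is_valid() is hypothetical; the condition is the one used in
blkdebug_open() below.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Same condition as in the patch, wrapped in a hypothetical helper. */
static bool toy_alignment_is_valid(uint64_t align)
{
    /* x & (x - 1) clears the lowest set bit; zero means a single bit */
    return align > 0 && align < INT_MAX && !(align & (align - 1));
}
```

So 512 and 4k pass, while 0, 3 or 4095 would be rejected with "Invalid
alignment".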
 block/blkdebug.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index ebc5f13..dc4ba46 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -346,6 +346,11 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_STRING,
             .help = "[internal use only, will be removed]",
         },
+        {
+            .name = "align",
+            .type = QEMU_OPT_SIZE,
+            .help = "Required alignment in bytes",
+        },
         { /* end of list */ }
     },
 };
@@ -357,6 +362,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
     QemuOpts *opts;
     Error *local_err = NULL;
     const char *filename, *config;
+    uint64_t align;
     int ret;
 
     opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
@@ -394,6 +400,16 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
+    /* Set request alignment */
+    align = qemu_opt_get_size(opts, "align", bs->request_alignment);
+    if (align > 0 && align < INT_MAX && !(align & (align - 1))) {
+        bs->request_alignment = align;
+    } else {
+        error_setg(errp, "Invalid alignment");
+        ret = -EINVAL;
+        goto fail;
+    }
+
     ret = 0;
 fail:
     qemu_opts_del(opts);
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep'
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (25 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:55   ` Max Reitz
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic Kevin Wolf
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

There is no easy way to check that a request correctly waits for a
different request. With a sleep command we can at least approximate it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
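Note for readers: the argument parsing in sleep_f() can be tried out
separately from the timer code. toy_parse_ms() is a hypothetical wrapper
around the same strtol() check; it returns -1 where sleep_f() prints the
error message.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns the delay in milliseconds, or -1 if arg is not acceptable. */
static long toy_parse_ms(const char *arg)
{
    char *endptr;
    long ms;

    /* Base 0 accepts decimal, octal (0...) and hexadecimal (0x...) */
    ms = strtol(arg, &endptr, 0);
    if (ms < 0 || *endptr != '\0') {
        return -1;
    }

    return ms;
}
```

With this check, "100" and "0x10" are accepted, "-5" and "12abc" are not. An
empty argument is parsed as 0, and overflow is not detected; a stricter
variant could additionally compare endptr against arg and check errno for
ERANGE.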
 qemu-io-cmds.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 85e4982..978a3a0 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -12,6 +12,7 @@
 #include "block/block_int.h"
 #include "block/qapi.h"
 #include "qemu/main-loop.h"
+#include "qemu/timer.h"
 
 #define CMD_NOFILE_OK   0x01
 
@@ -2038,6 +2039,46 @@ static const cmdinfo_t abort_cmd = {
        .oneline        = "simulate a program crash using abort(3)",
 };
 
+static void sleep_cb(void *opaque)
+{
+    bool *expired = opaque;
+    *expired = true;
+}
+
+static int sleep_f(BlockDriverState *bs, int argc, char **argv)
+{
+    char *endptr;
+    long ms;
+    struct QEMUTimer *timer;
+    bool expired = false;
+
+    ms = strtol(argv[1], &endptr, 0);
+    if (ms < 0 || *endptr != '\0') {
+        printf("%s is not a valid number\n", argv[1]);
+        return 0;
+    }
+
+    timer = timer_new_ns(QEMU_CLOCK_HOST, sleep_cb, &expired);
+    timer_mod(timer, qemu_clock_get_ns(QEMU_CLOCK_HOST) + SCALE_MS * ms);
+
+    while (!expired) {
+        main_loop_wait(false);
+    }
+
+    timer_free(timer);
+
+    return 0;
+}
+
+static const cmdinfo_t sleep_cmd = {
+       .name           = "sleep",
+       .argmin         = 1,
+       .argmax         = 1,
+       .cfunc          = sleep_f,
+       .flags          = CMD_NOFILE_OK,
+       .oneline        = "waits for the given value in milliseconds",
+};
+
 static void help_oneline(const char *cmd, const cmdinfo_t *ct)
 {
     if (cmd) {
@@ -2151,4 +2192,5 @@ static void __attribute((constructor)) init_qemuio_commands(void)
     qemuio_add_command(&resume_cmd);
     qemuio_add_command(&wait_break_cmd);
     qemuio_add_command(&abort_cmd);
+    qemuio_add_command(&sleep_cmd);
 }
-- 
1.8.1.4

* [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (26 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep' Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-18 16:01   ` Max Reitz
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity Kevin Wolf
  2014-01-22 20:30 ` [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Christian Borntraeger
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                    |   7 ++
 block/blkdebug.c           |   8 ++
 include/block/block.h      |   8 ++
 tests/qemu-iotests/077     | 278 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/077.out | 202 ++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 6 files changed, 504 insertions(+)
 create mode 100755 tests/qemu-iotests/077
 create mode 100644 tests/qemu-iotests/077.out

diff --git a/block.c b/block.c
index 812b1b2..12af7fb 100644
--- a/block.c
+++ b/block.c
@@ -2961,10 +2961,13 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     if (ret < 0) {
         /* Do nothing, write notifier decided to fail this request */
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_ZERO);
         ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, flags);
     } else {
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV);
         ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
     }
+    BLKDBG_EVENT(bs, BLKDBG_PWRITEV_DONE);
 
     if (ret == 0 && !bs->enable_write_cache) {
         ret = bdrv_co_flush(bs);
@@ -3035,11 +3038,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         };
         qemu_iovec_init_external(&head_qiov, &head_iov, 1);
 
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_HEAD);
         ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
                                   align, &head_qiov, 0);
         if (ret < 0) {
             goto fail;
         }
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_HEAD);
 
         qemu_iovec_init(&local_qiov, qiov->niov + 2);
         qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
@@ -3067,11 +3072,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         };
         qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
 
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_TAIL);
         ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
                                   align, &tail_qiov, 0);
         if (ret < 0) {
             goto fail;
         }
+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
 
         if (!use_local_qiov) {
             qemu_iovec_init(&local_qiov, qiov->niov + 1);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index dc4ba46..6657671 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -186,6 +186,14 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
 
     [BLKDBG_FLUSH_TO_OS]                    = "flush_to_os",
     [BLKDBG_FLUSH_TO_DISK]                  = "flush_to_disk",
+
+    [BLKDBG_PWRITEV_RMW_HEAD]               = "pwritev_rmw.head",
+    [BLKDBG_PWRITEV_RMW_AFTER_HEAD]         = "pwritev_rmw.after_head",
+    [BLKDBG_PWRITEV_RMW_TAIL]               = "pwritev_rmw.tail",
+    [BLKDBG_PWRITEV_RMW_AFTER_TAIL]         = "pwritev_rmw.after_tail",
+    [BLKDBG_PWRITEV]                        = "pwritev",
+    [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
+    [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
 };
 
 static int get_event_by_name(const char *name, BlkDebugEvent *event)
diff --git a/include/block/block.h b/include/block/block.h
index 7e40ccc..127fc2f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -518,6 +518,14 @@ typedef enum {
     BLKDBG_FLUSH_TO_OS,
     BLKDBG_FLUSH_TO_DISK,
 
+    BLKDBG_PWRITEV_RMW_HEAD,
+    BLKDBG_PWRITEV_RMW_AFTER_HEAD,
+    BLKDBG_PWRITEV_RMW_TAIL,
+    BLKDBG_PWRITEV_RMW_AFTER_TAIL,
+    BLKDBG_PWRITEV,
+    BLKDBG_PWRITEV_ZERO,
+    BLKDBG_PWRITEV_DONE,
+
     BLKDBG_EVENT_MAX,
 } BlkDebugEvent;
 
diff --git a/tests/qemu-iotests/077 b/tests/qemu-iotests/077
new file mode 100755
index 0000000..58bfc8f
--- /dev/null
+++ b/tests/qemu-iotests/077
@@ -0,0 +1,278 @@
+#!/bin/bash
+#
+# Test concurrent pread/pwrite
+#
+# Copyright (C) 2014 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=kwolf@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt generic
+_supported_proto generic
+_supported_os Linux
+
+CLUSTER_SIZE=4k
+size=128M
+
+_make_test_img $size
+
+echo
+echo "== Some concurrent requests involving RMW =="
+
+function test_io()
+{
+echo "open -o file.align=4k blkdebug::$TEST_IMG"
+# A simple RMW request
+cat  <<EOF
+aio_write -P 10 0x200 0x200
+aio_flush
+EOF
+
+# Sequential RMW requests on the same physical sector
+off=0x1000
+for ev in "head" "after_head" "tail" "after_tail"; do
+cat  <<EOF
+break pwritev_rmw.$ev A
+aio_write -P 10 $((off + 0x200)) 0x200
+wait_break A
+aio_write -P 11 $((off + 0x400)) 0x200
+sleep 100
+resume A
+aio_flush
+EOF
+off=$((off + 0x1000))
+done
+
+# Chained dependencies
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0x5000 0x200
+wait_break A
+aio_write -P 11 0x5200 0x200
+aio_write -P 12 0x5400 0x200
+aio_write -P 13 0x5600 0x200
+aio_write -P 14 0x5800 0x200
+aio_write -P 15 0x5a00 0x200
+aio_write -P 16 0x5c00 0x200
+aio_write -P 17 0x5e00 0x200
+sleep 100
+resume A
+aio_flush
+EOF
+
+# Overlapping multiple requests
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0x6000 0x200
+wait_break A
+break pwritev_rmw.after_head B
+aio_write -P 10 0x7e00 0x200
+wait_break B
+aio_write -P 11 0x6800 0x1000
+resume A
+sleep 100
+resume B
+aio_flush
+EOF
+
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0x8000 0x200
+wait_break A
+break pwritev_rmw.after_head B
+aio_write -P 10 0x9e00 0x200
+wait_break B
+aio_write -P 11 0x8800 0x1000
+resume B
+sleep 100
+resume A
+aio_flush
+EOF
+
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0xa000 0x200
+wait_break A
+aio_write -P 11 0xa800 0x1000
+break pwritev_rmw.after_head B
+aio_write -P 10 0xbe00 0x200
+wait_break B
+resume A
+sleep 100
+resume B
+aio_flush
+EOF
+
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0xc000 0x200
+wait_break A
+aio_write -P 11 0xc800 0x1000
+break pwritev_rmw.after_head B
+aio_write -P 10 0xde00 0x200
+wait_break B
+resume B
+sleep 100
+resume A
+aio_flush
+EOF
+
+# Only RMW for the tail part
+cat  <<EOF
+break pwritev_rmw.after_tail A
+aio_write -P 10 0xe000 0x1800
+wait_break A
+aio_write -P 11 0xf000 0xc00
+sleep 100
+resume A
+aio_flush
+EOF
+
+cat  <<EOF
+break pwritev A
+aio_write -P 10 0x10000 0x800
+wait_break A
+break pwritev_rmw.after_tail B
+aio_write -P 11 0x10000 0x400
+break pwritev_done C
+resume A
+wait_break C
+resume C
+sleep 100
+wait_break B
+resume B
+aio_flush
+EOF
+
+cat  <<EOF
+break pwritev A
+aio_write -P 10 0x11000 0x800
+wait_break A
+aio_write -P 11 0x11000 0x1000
+sleep 100
+resume A
+aio_flush
+EOF
+}
+
+test_io | $QEMU_IO  | _filter_qemu_io | \
+    sed -e 's,[0-9/]* bytes at offset [0-9]*,XXX/XXX bytes at offset XXX,g' \
+        -e 's/^[0-9]* \(bytes\|KiB\)/XXX bytes/' \
+        -e '/Suspended/d'
+
+echo
+echo "== Verify image content =="
+
+function verify_io()
+{
+    # A simple RMW request
+    echo read -P 0       0 0x200
+    echo read -P 10  0x200 0x200
+    echo read -P 0   0x400 0xc00
+
+    # Sequential RMW requests on the same physical sector
+    echo read -P 0  0x1000 0x200
+    echo read -P 10 0x1200 0x200
+    echo read -P 11 0x1400 0x200
+    echo read -P 0  0x1600 0xa00
+
+    echo read -P 0  0x2000 0x200
+    echo read -P 10 0x2200 0x200
+    echo read -P 11 0x2400 0x200
+    echo read -P 0  0x2600 0xa00
+
+    echo read -P 0  0x3000 0x200
+    echo read -P 10 0x3200 0x200
+    echo read -P 11 0x3400 0x200
+    echo read -P 0  0x3600 0xa00
+
+    echo read -P 0  0x4000 0x200
+    echo read -P 10 0x4200 0x200
+    echo read -P 11 0x4400 0x200
+    echo read -P 0  0x4600 0xa00
+
+    # Chained dependencies
+    echo read -P 10 0x5000 0x200
+    echo read -P 11 0x5200 0x200
+    echo read -P 12 0x5400 0x200
+    echo read -P 13 0x5600 0x200
+    echo read -P 14 0x5800 0x200
+    echo read -P 15 0x5a00 0x200
+    echo read -P 16 0x5c00 0x200
+    echo read -P 17 0x5e00 0x200
+
+    # Overlapping multiple requests
+    echo read -P 10 0x6000 0x200
+    echo read -P  0 0x6200 0x600
+    echo read -P 11 0x6800 0x1000
+    echo read -P  0 0x7800 0x600
+    echo read -P 10 0x7e00 0x200
+
+    echo read -P 10 0x8000 0x200
+    echo read -P  0 0x8200 0x600
+    echo read -P 11 0x8800 0x1000
+    echo read -P  0 0x9800 0x600
+    echo read -P 10 0x9e00 0x200
+
+    echo read -P 10 0xa000 0x200
+    echo read -P  0 0xa200 0x600
+    echo read -P 11 0xa800 0x1000
+    echo read -P  0 0xb800 0x600
+    echo read -P 10 0xbe00 0x200
+
+    echo read -P 10 0xc000 0x200
+    echo read -P  0 0xc200 0x600
+    echo read -P 11 0xc800 0x1000
+    echo read -P  0 0xd800 0x600
+    echo read -P 10 0xde00 0x200
+
+    # Only RMW for the tail part
+    echo read -P 10 0xe000 0x200
+    echo read -P 11 0xf800 0x400
+    echo read -P  0 0xfc00 0x400
+
+    echo read -P 11 0x10000 0x400
+    echo read -P 10 0x10400 0x400
+
+    echo read -P 11 0x11800 0x800
+}
+
+verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
+
+_check_test_img
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/077.out b/tests/qemu-iotests/077.out
new file mode 100644
index 0000000..169315a
--- /dev/null
+++ b/tests/qemu-iotests/077.out
@@ -0,0 +1,202 @@
+QA output created by 077
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
+
+== Some concurrent requests involving RMW ==
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'B'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'B'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'B'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'B'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+blkdebug: Resuming request 'C'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'B'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== Verify image content ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 3072/3072 bytes at offset 1024
+3 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 4096
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 4608
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 5120
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2560/2560 bytes at offset 5632
+2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 8192
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 8704
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 9216
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2560/2560 bytes at offset 9728
+2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 12288
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 12800
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 13312
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2560/2560 bytes at offset 13824
+2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 16384
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 16896
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 17408
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2560/2560 bytes at offset 17920
+2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 20480
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 20992
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 21504
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 22016
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 22528
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 23040
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 23552
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 24064
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 24576
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 25088
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 26624
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 30720
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 32256
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 32768
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 33280
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 34816
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 38912
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 40448
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 40960
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 41472
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 43008
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 47104
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 48640
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 49152
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 49664
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 51200
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1536/1536 bytes at offset 55296
+1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 56832
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 57344
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 63488
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 64512
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 65536
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 1024/1024 bytes at offset 66560
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 2048/2048 bytes at offset 71680
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index cc750c9..420396f 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -79,3 +79,4 @@
 070 rw auto
 073 rw auto
 074 rw auto
+077 rw auto
-- 
1.8.1.4
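
For readers following the offsets used in test 077: the head and tail reads in the hunk above are derived from the request offset with plain mask arithmetic (offset & ~(align - 1), offset & (align - 1), and the same for offset + bytes). The following is a minimal standalone sketch of that arithmetic only. The names rmw_span and rmw_compute are hypothetical and do not exist in QEMU, and a power-of-two alignment is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not part of QEMU: describes which aligned
 * blocks would have to be read before an unaligned write is issued. */
struct rmw_span {
    bool     need_head;   /* request does not start on a block boundary */
    bool     need_tail;   /* request does not end on a block boundary */
    uint64_t head_offset; /* offset & ~(align - 1) */
    uint64_t head_bytes;  /* bytes preserved in front of the request */
    uint64_t tail_offset; /* (offset + bytes) & ~(align - 1) */
    uint64_t tail_bytes;  /* bytes preserved behind the request */
};

/* align must be a power of two, e.g. 4096 for file.align=4k. */
static struct rmw_span rmw_compute(uint64_t offset, uint64_t bytes,
                                   uint64_t align)
{
    struct rmw_span s;
    uint64_t end = offset + bytes;

    assert(align != 0 && (align & (align - 1)) == 0);

    s.head_offset = offset & ~(align - 1);
    s.head_bytes  = offset & (align - 1);
    s.need_head   = s.head_bytes != 0;

    s.tail_offset = end & ~(align - 1);
    s.need_tail   = (end & (align - 1)) != 0;
    s.tail_bytes  = s.need_tail ? align - (end & (align - 1)) : 0;

    return s;
}
```

With a 4k alignment, the first test request (aio_write -P 10 0x200 0x200) needs both a head and a tail read on block 0, while the "Only RMW for the tail part" request (aio_write -P 10 0xe000 0x1800) starts aligned and only needs the block at 0xf000.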


* [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (27 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic Kevin Wolf
@ 2014-01-17 14:15 ` Kevin Wolf
  2014-01-17 23:59   ` Max Reitz
  2014-01-22 20:30 ` [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Christian Borntraeger
  29 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-17 14:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, mreitz, stefanha, pbonzini, xiawenc

Request sizes used to be rounded down to a sector boundary, so the
sub-sector part of a request could bypass the I/O limit. Now all
requests are accounted for with their exact byte size.

Reported-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 12af7fb..d3e3291 100644
--- a/block.c
+++ b/block.c
@@ -188,7 +188,7 @@ void bdrv_io_limits_enable(BlockDriverState *bs)
  * @is_write:   is the IO a write
  */
 static void bdrv_io_limits_intercept(BlockDriverState *bs,
-                                     int nb_sectors,
+                                     unsigned int bytes,
                                      bool is_write)
 {
     /* does this io must wait */
@@ -201,9 +201,8 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
     }
 
     /* the IO will be executed, do the accounting */
-    throttle_account(&bs->throttle_state,
-                     is_write,
-                     nb_sectors * BDRV_SECTOR_SIZE);
+    throttle_account(&bs->throttle_state, is_write, bytes);
+
 
     /* if the next request must wait -> do nothing */
     if (throttle_schedule_timer(&bs->throttle_state, is_write)) {
@@ -2788,8 +2787,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        /* TODO Switch to byte granularity */
-        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
+        bdrv_io_limits_intercept(bs, bytes, false);
     }
 
     /* Align read if necessary by padding qiov */
@@ -3013,8 +3011,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        /* TODO Switch to byte granularity */
-        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
+        bdrv_io_limits_intercept(bs, bytes, true);
     }
 
     /*
-- 
1.8.1.4
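
The bypass described in the commit message follows from integer truncation: bytes >> BDRV_SECTOR_BITS drops any sub-sector remainder before the value is multiplied back by the sector size. Below is a minimal standalone sketch of the two accounting formulas. The function names are hypothetical and only the two constants mirror QEMU's 512-byte sector definition; the throttling code itself is not reproduced.

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_BITS 9
#define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS)

/* Old accounting: callers passed bytes >> BDRV_SECTOR_BITS and the
 * throttle was charged nb_sectors * BDRV_SECTOR_SIZE. */
static uint64_t accounted_old(unsigned int bytes)
{
    uint64_t nb_sectors = bytes >> BDRV_SECTOR_BITS;

    return nb_sectors * BDRV_SECTOR_SIZE;
}

/* New accounting: the exact byte count is charged. */
static uint64_t accounted_new(unsigned int bytes)
{
    return bytes;
}

/* Bytes that escaped accounting under the old formula (hypothetical
 * helper for illustration); always less than one sector. */
static uint64_t unaccounted_bytes(unsigned int bytes)
{
    uint64_t lost = accounted_new(bytes) - accounted_old(bytes);

    assert(lost < BDRV_SECTOR_SIZE);
    return lost;
}
```

Under these assumptions a 511-byte request was charged nothing by the old formula, and a 1023-byte request was charged for only 512 bytes.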


* Re: [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
@ 2014-01-17 22:39   ` Benoît Canet
  2014-01-20  9:31     ` Kevin Wolf
  2014-01-21 12:49   ` Benoît Canet
  1 sibling, 1 reply; 70+ messages in thread
From: Benoît Canet @ 2014-01-17 22:39 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:51 (+0100), Kevin Wolf wrote:
> This function separates filling the BlockLimits from bdrv_open(), which
> allows calling it from other operations which may change the limits
> (e.g. modifications to the backing file chain or bdrv_reopen).
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 19 +++++++++++++++++++
>  block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
>  block/qcow2.c             | 11 ++++++++++-
>  block/qed.c               | 11 ++++++++++-
>  block/vmdk.c              | 22 ++++++++++++++++++----
>  include/block/block_int.h |  2 ++
>  6 files changed, 88 insertions(+), 23 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 64e7d22..99e69da 100644
> --- a/block.c
> +++ b/block.c
> @@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
>      return ret;
>  }
>  
> +static int bdrv_refresh_limits(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +
> +    memset(&bs->bl, 0, sizeof(bs->bl));
> +
> +    if (drv && drv->bdrv_refresh_limits) {
> +        return drv->bdrv_refresh_limits(bs);
> +    }
> +
> +    return 0;
> +}
> +
>  /*
>   * Create a uniquely-named empty temporary file.
>   * Return 0 upon success, otherwise a negative errno value.
> @@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>          goto free_and_fail;
>      }
>  
> +    bdrv_refresh_limits(bs);
> +
>  #ifndef _WIN32
>      if (bs->is_temporary) {
>          assert(bs->filename[0] != '\0');
> @@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
>      }
>      pstrcpy(bs->backing_file, sizeof(bs->backing_file),
>              bs->backing_hd->file->filename);
> +
> +    /* Recalculate the BlockLimits with the backing file */
> +    bdrv_refresh_limits(bs);
> +
>      return 0;
>  }
>  
> diff --git a/block/iscsi.c b/block/iscsi.c
> index c0ea0c4..3202dc5 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -1265,23 +1265,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>                 sizeof(struct scsi_inquiry_block_limits));
>          scsi_free_scsi_task(task);
>          task = NULL;
> -
> -        if (iscsilun->bl.max_unmap < 0xffffffff) {
> -            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> -                                                 iscsilun);
> -        }
> -        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> -                                                   iscsilun);
> -
> -        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> -            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> -                                                      iscsilun);
> -        }
> -        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> -                                                        iscsilun);
> -
> -        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> -                                                     iscsilun);
>      }
>  
>  #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
> @@ -1326,6 +1309,34 @@ static void iscsi_close(BlockDriverState *bs)
>      memset(iscsilun, 0, sizeof(IscsiLun));
>  }
>  
> +static int iscsi_refresh_limits(BlockDriverState *bs)
> +{
> +    IscsiLun *iscsilun = bs->opaque;
> +
> +    /* We don't actually refresh here, but just return data queried in
> +     * iscsi_open(): iscsi targets don't change their limits. */
> +    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
> +        if (iscsilun->bl.max_unmap < 0xffffffff) {
> +            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> +                                                 iscsilun);
> +        }
> +        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> +                                                   iscsilun);
> +
> +        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> +            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> +                                                      iscsilun);
> +        }
> +        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> +                                                        iscsilun);
> +
> +        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> +                                                     iscsilun);
Why is bl.opt_transfer_length filled only inside the test for unmap and write same?
Is there a relationship?

> +    }
> +
> +    return 0;
> +}
> +
>  static int iscsi_truncate(BlockDriverState *bs, int64_t offset)
>  {
>      IscsiLun *iscsilun = bs->opaque;
> @@ -1438,6 +1449,7 @@ static BlockDriver bdrv_iscsi = {
>      .bdrv_getlength  = iscsi_getlength,
>      .bdrv_get_info   = iscsi_get_info,
>      .bdrv_truncate   = iscsi_truncate,
> +    .bdrv_refresh_limits = iscsi_refresh_limits,
>  
>  #if defined(LIBISCSI_FEATURE_IOVECTOR)
>      .bdrv_co_get_block_status = iscsi_co_get_block_status,
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8ec9db1..6562994 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -718,7 +718,6 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      qemu_opts_del(opts);
> -    bs->bl.write_zeroes_alignment = s->cluster_sectors;
>  
>      if (s->use_lazy_refcounts && s->qcow_version < 3) {
>          error_setg(errp, "Lazy refcounts require a qcow2 image with at least "
> @@ -751,6 +750,15 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
>      return ret;
>  }
>  
> +static int qcow2_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +
> +    bs->bl.write_zeroes_alignment = s->cluster_sectors;
> +
> +    return 0;
> +}
> +
>  static int qcow2_set_key(BlockDriverState *bs, const char *key)
>  {
>      BDRVQcowState *s = bs->opaque;
> @@ -2268,6 +2276,7 @@ static BlockDriver bdrv_qcow2 = {
>  
>      .bdrv_change_backing_file   = qcow2_change_backing_file,
>  
> +    .bdrv_refresh_limits        = qcow2_refresh_limits,
>      .bdrv_invalidate_cache      = qcow2_invalidate_cache,
>  
>      .create_options = qcow2_create_options,
> diff --git a/block/qed.c b/block/qed.c
> index 450a1fa..d35bd1c 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -495,7 +495,6 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
>          }
>      }
>  
> -    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
>      s->need_check_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>                                              qed_need_check_timer_cb, s);
>  
> @@ -507,6 +506,15 @@ out:
>      return ret;
>  }
>  
> +static int bdrv_qed_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +
> +    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
> +
> +    return 0;
> +}
> +
>  /* We have nothing to do for QED reopen, stubs just return
>   * success */
>  static int bdrv_qed_reopen_prepare(BDRVReopenState *state,
> @@ -1616,6 +1624,7 @@ static BlockDriver bdrv_qed = {
>      .bdrv_truncate            = bdrv_qed_truncate,
>      .bdrv_getlength           = bdrv_qed_getlength,
>      .bdrv_get_info            = bdrv_qed_get_info,
> +    .bdrv_refresh_limits      = bdrv_qed_refresh_limits,
>      .bdrv_change_backing_file = bdrv_qed_change_backing_file,
>      .bdrv_invalidate_cache    = bdrv_qed_invalidate_cache,
>      .bdrv_check               = bdrv_qed_check,
> diff --git a/block/vmdk.c b/block/vmdk.c
> index c6b60b4..e26df97 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -428,10 +428,6 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>  
> -    if (!flat) {
> -        bs->bl.write_zeroes_alignment =
> -            MAX(bs->bl.write_zeroes_alignment, cluster_sectors);
> -    }
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
>      } else {
> @@ -891,6 +887,23 @@ fail:
>      return ret;
>  }
>  
> +
> +static int vmdk_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVVmdkState *s = bs->opaque;
> +    int i;
> +
> +    for (i = 0; i < s->num_extents; i++) {
> +        if (!s->extents[i].flat) {
> +            bs->bl.write_zeroes_alignment =
> +                MAX(bs->bl.write_zeroes_alignment,
> +                    s->extents[i].cluster_sectors);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  static int get_whole_cluster(BlockDriverState *bs,
>                  VmdkExtent *extent,
>                  uint64_t cluster_offset,
> @@ -2002,6 +2015,7 @@ static BlockDriver bdrv_vmdk = {
>      .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
>      .bdrv_has_zero_init           = vmdk_has_zero_init,
>      .bdrv_get_specific_info       = vmdk_get_specific_info,
> +    .bdrv_refresh_limits          = vmdk_refresh_limits,
>  
>      .create_options               = vmdk_create_options,
>  };
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 2772f2f..3dfe724 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -226,6 +226,8 @@ struct BlockDriver {
>      int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
>      bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
>  
> +    int (*bdrv_refresh_limits)(BlockDriverState *bs);
> +
>      /*
>       * Returns 1 if newly created images are guaranteed to contain only
>       * zeros, 0 otherwise.
> -- 
> 1.8.1.4
> 
> 


* Re: [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length Kevin Wolf
@ 2014-01-17 22:42   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-17 22:42 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:52 (+0100), Kevin Wolf wrote:
> When there is a format driver on top of the backend, it's not guaranteed
> that exposing the opt_transfer_length for the format driver results in
> the optimal requests (because of fragmentation etc.), but it can't make
> things worse, so let's just do it.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index 99e69da..20f85cb 100644
> --- a/block.c
> +++ b/block.c
> @@ -485,7 +485,25 @@ static int bdrv_refresh_limits(BlockDriverState *bs)
>  
>      memset(&bs->bl, 0, sizeof(bs->bl));
>  
> -    if (drv && drv->bdrv_refresh_limits) {
> +    if (!drv) {
> +        return 0;
> +    }
> +
> +    /* Take some limits from the children as a default */
> +    if (bs->file) {
> +        bdrv_refresh_limits(bs->file);
> +        bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
> +    }
> +
> +    if (bs->backing_hd) {
> +        bdrv_refresh_limits(bs->backing_hd);
> +        bs->bl.opt_transfer_length =
> +            MAX(bs->bl.opt_transfer_length,
> +                bs->backing_hd->bl.opt_transfer_length);
> +    }
> +
> +    /* Then let the driver override it */
> +    if (drv->bdrv_refresh_limits) {
>          return drv->bdrv_refresh_limits(bs);
>      }

Reviewed-by: Benoît Canet <benoit@irqsave.net>
>  
> -- 
> 1.8.1.4
> 
> 
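
The inheritance rule in this patch can be modelled in isolation: start from zeroed limits, take the value of the protocol child, raise it to the backing file's value if that is larger, then let the driver override the result. The following is a minimal sketch with hypothetical node and limits types standing in for BlockDriverState and BlockLimits. It is not the QEMU code, and the driver_opt field and leaf_refresh callback exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical stand-in for BlockLimits. */
struct limits {
    int opt_transfer_length;
};

/* Hypothetical stand-in for BlockDriverState. */
struct node {
    struct node *file;               /* protocol child, may be NULL */
    struct node *backing;            /* backing file, may be NULL */
    int (*refresh)(struct node *n);  /* optional driver override */
    int driver_opt;                  /* value the example driver reports */
    struct limits bl;
};

static int node_refresh_limits(struct node *n)
{
    assert(n != NULL);
    memset(&n->bl, 0, sizeof(n->bl));

    /* Take some limits from the children as a default */
    if (n->file) {
        node_refresh_limits(n->file);
        n->bl.opt_transfer_length = n->file->bl.opt_transfer_length;
    }
    if (n->backing) {
        node_refresh_limits(n->backing);
        n->bl.opt_transfer_length =
            MAX(n->bl.opt_transfer_length,
                n->backing->bl.opt_transfer_length);
    }

    /* Then let the driver override it */
    if (n->refresh) {
        return n->refresh(n);
    }
    return 0;
}

/* Example driver callback: a leaf node reporting a fixed value. */
static int leaf_refresh(struct node *n)
{
    n->bl.opt_transfer_length = n->driver_opt;
    return 0;
}
```

In this model, a format node without its own callback ends up with the larger of the two children's values, and dropping the backing file followed by another refresh falls back to the protocol child's value.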


* Re: [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed Kevin Wolf
@ 2014-01-17 22:47   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-17 22:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:53 (+0100), Kevin Wolf wrote:
> When reopening with different flags, or when backing files disappear
> from the chain, the limits may change. Make sure they get updated in
> these cases.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c               | 5 ++++-
>  block/stream.c        | 2 ++
>  include/block/block.h | 1 +
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index 20f85cb..9a8562d 100644
> --- a/block.c
> +++ b/block.c
> @@ -479,7 +479,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
>      return ret;
>  }
>  
> -static int bdrv_refresh_limits(BlockDriverState *bs)
> +int bdrv_refresh_limits(BlockDriverState *bs)
>  {
>      BlockDriver *drv = bs->drv;
>  
> @@ -1464,6 +1464,8 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
>      reopen_state->bs->enable_write_cache = !!(reopen_state->flags &
>                                                BDRV_O_CACHE_WB);
>      reopen_state->bs->read_only = !(reopen_state->flags & BDRV_O_RDWR);
> +
> +    bdrv_refresh_limits(reopen_state->bs);
>  }
>  
>  /*
> @@ -2261,6 +2263,7 @@ int bdrv_drop_intermediate(BlockDriverState *active, BlockDriverState *top,
>      }
>      new_top_bs->backing_hd = base_bs;
>  
> +    bdrv_refresh_limits(new_top_bs);
>  
>      QSIMPLEQ_FOREACH_SAFE(intermediate_state, &states_to_delete, entry, next) {
>          /* so that bdrv_close() does not recursively close the chain */
> diff --git a/block/stream.c b/block/stream.c
> index 46bec7d..dd0b4ac 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -75,6 +75,8 @@ static void close_unused_images(BlockDriverState *top, BlockDriverState *base,
>          unused->backing_hd = NULL;
>          bdrv_unref(unused);
>      }
> +
> +    bdrv_refresh_limits(top);
>  }
>  
>  static void coroutine_fn stream_run(void *opaque)
> diff --git a/include/block/block.h b/include/block/block.h
> index 36efaea..64fb319 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -249,6 +249,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset);
>  int64_t bdrv_getlength(BlockDriverState *bs);
>  int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
>  void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
> +int bdrv_refresh_limits(BlockDriverState *bs);
>  int bdrv_commit(BlockDriverState *bs);
>  int bdrv_commit_all(void);
>  int bdrv_change_backing_file(BlockDriverState *bs,
> -- 
> 1.8.1.4
> 
Reviewed-by: Benoît Canet <benoit@irqsave.net>

> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments Kevin Wolf
@ 2014-01-17 22:49   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-17 22:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:54 (+0100), Kevin Wolf wrote:
> The functions used by qemu_memalign() require an alignment that is at
> least sizeof(void*). Adjust it if it is too small.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  util/oslib-posix.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index e00a44c..54f8932 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -85,6 +85,11 @@ void *qemu_oom_check(void *ptr)
>  void *qemu_memalign(size_t alignment, size_t size)
>  {
>      void *ptr;
> +
> +    if (alignment < sizeof(void*)) {
> +        alignment = sizeof(void*);
> +    }
> +
>  #if defined(_POSIX_C_SOURCE) && !defined(__sun__)
>      int ret;
>      ret = posix_memalign(&ptr, alignment, size);
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoît Canet <benoit@irqsave.net>
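
For readers following the thread, the clamping in the hunk above can be
sketched outside of QEMU as follows. This is only an illustration:
alloc_aligned() is a hypothetical name, not a QEMU function, and the
assertion on the alignment is an added assumption that the patch itself
does not make.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for qemu_memalign(); not the QEMU implementation. */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *ptr = NULL;

    /* Assumption of this sketch: callers pass a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* posix_memalign() requires a multiple of sizeof(void *), so raise
     * smaller alignments to that minimum, as the patch does. */
    if (alignment < sizeof(void *)) {
        alignment = sizeof(void *);
    }

    if (posix_memalign(&ptr, alignment, size) != 0) {
        return NULL;
    }
    return ptr;
}
```

With this, a request such as alloc_aligned(1, 64) is served with pointer-size
alignment instead of failing with EINVAL inside posix_memalign().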

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
@ 2014-01-17 23:19   ` Max Reitz
  2014-01-21 13:49   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:19 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 52 +++++++++++++++++++++++++++++++----------------
>   block/backup.c            |  7 ++++++-
>   include/block/block_int.h |  4 ++--
>   3 files changed, 42 insertions(+), 21 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment Kevin Wolf
@ 2014-01-17 23:21   ` Max Reitz
  2014-01-22 19:50   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:21 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Odd file sizes could make bdrv_aligned_preadv() shorten the request in
> non-aligned ways. Fix it by rounding to the required alignment instead
> of 512 bytes.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev Kevin Wolf
@ 2014-01-17 23:42   ` Max Reitz
  2014-01-24 16:09   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:42 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> If a request calls wait_serialising_requests() and actually has to wait
> in this function (i.e. a coroutine yield), other requests can run and
> previously read data (like the head or tail buffer) could become
> outdated. In this case, we would have to restart from the beginning to
> read in the updated data.
>
> However, we're lucky and don't actually need to do that: A request can
> only wait in the first call of wait_serialising_requests() because we
> mark it as serialising before that call, so any later requests would
> wait. So as we don't wait in practice, we don't have to reload the data.
>
> This is an important assumption that must not be broken, or data
> corruption will happen. Document it with some assertions.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 16 ++++++++++++----
>   1 file changed, 12 insertions(+), 4 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() " Kevin Wolf
@ 2014-01-17 23:43   ` Max Reitz
  0 siblings, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:43 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Instead of implementing the alignment adjustment here, use the now
> existing functionality of bdrv_co_do_pwritev().
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c               | 64 ++++++++-------------------------------------------
>   include/block/block.h |  1 -
>   2 files changed, 9 insertions(+), 56 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable Kevin Wolf
@ 2014-01-17 23:50   ` Max Reitz
  0 siblings, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:50 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> The new 'align' option of blkdebug can be used in order to emulate
> backends with a required 4k alignment on hosts which only really require
> 512 byte alignment.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/blkdebug.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)

The patch itself is okay, but will have to be rebased on the 
blkdebug/blkverify series – specifically, we'll need to add this new 
option to qapi-schema.json (BlockdevOptionsBlkdebug).

> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index ebc5f13..dc4ba46 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -346,6 +346,11 @@ static QemuOptsList runtime_opts = {
>               .type = QEMU_OPT_STRING,
>               .help = "[internal use only, will be removed]",
>           },
> +        {
> +            .name = "align",
> +            .type = QEMU_OPT_SIZE,
> +            .help = "Required alignment in bytes",
> +        },
>           { /* end of list */ }
>       },
>   };
> @@ -357,6 +362,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>       QemuOpts *opts;
>       Error *local_err = NULL;
>       const char *filename, *config;
> +    uint64_t align;
>       int ret;
>   
>       opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> @@ -394,6 +400,16 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
>           goto fail;
>       }
>   
> +    /* Set request alignment */
> +    align = qemu_opt_get_size(opts, "align", bs->request_alignment);
> +    if (align > 0 && align < INT_MAX && !(align & (align - 1))) {

Hm, I like that test to check whether align is a power of two. ;-)

Max

> +        bs->request_alignment = align;
> +    } else {
> +        error_setg(errp, "Invalid alignment");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
>       ret = 0;
>   fail:
>       qemu_opts_del(opts);
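
The power-of-two test mentioned above works because a power of two has
exactly one bit set, so clearing the lowest set bit with align & (align - 1)
leaves zero. A minimal standalone sketch of the same condition follows;
align_is_valid() and round_down() are hypothetical helper names used only
for illustration and do not exist in blkdebug.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Same condition as in the quoted hunk: non-zero, below INT_MAX, and a
 * power of two (exactly one bit set). */
static bool align_is_valid(uint64_t align)
{
    return align > 0 && align < INT_MAX && !(align & (align - 1));
}

/* Why the power-of-two property matters: it makes masking with
 * ~(align - 1) equivalent to rounding down to a multiple of align. */
static uint64_t round_down(uint64_t offset, uint64_t align)
{
    assert(align_is_valid(align));
    return offset & ~(align - 1);
}
```

For example, 512 and 4096 pass the check, while 0, 3 and 4097 are rejected.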

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep'
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep' Kevin Wolf
@ 2014-01-17 23:55   ` Max Reitz
  2014-01-20  9:58     ` Kevin Wolf
  0 siblings, 1 reply; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:55 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> There is no easy way to check that a request correctly waits for a
> different request. With a sleep command we can at least approximate it.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   qemu-io-cmds.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
>
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 85e4982..978a3a0 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -12,6 +12,7 @@
>   #include "block/block_int.h"
>   #include "block/qapi.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/timer.h"
>   
>   #define CMD_NOFILE_OK   0x01
>   
> @@ -2038,6 +2039,46 @@ static const cmdinfo_t abort_cmd = {
>          .oneline        = "simulate a program crash using abort(3)",
>   };
>   
> +static void sleep_cb(void *opaque)
> +{
> +    bool *expired = opaque;
> +    *expired = true;
> +}
> +
> +static int sleep_f(BlockDriverState *bs, int argc, char **argv)
> +{
> +    char *endptr;
> +    long ms;
> +    struct QEMUTimer *timer;
> +    bool expired = false;
> +
> +    ms = strtol(argv[1], &endptr, 0);
> +    if (ms < 0 || *endptr != '\0') {
> +        printf("%s is not a valid number\n", argv[1]);
> +        return 0;
> +    }
> +
> +    timer = timer_new_ns(QEMU_CLOCK_HOST, sleep_cb, &expired);
> +    timer_mod(timer, qemu_clock_get_ns(QEMU_CLOCK_HOST) + SCALE_MS * ms);
> +
> +    while (!expired) {

I'm not sure whether compilers may optimize away accesses to variables 
whose address has been passed to another function, but shouldn't 
expired be marked volatile just to be safe?

Max

> +        main_loop_wait(false);
> +    }
> +
> +    timer_free(timer);
> +
> +    return 0;
> +}
> +
> +static const cmdinfo_t sleep_cmd = {
> +       .name           = "sleep",
> +       .argmin         = 1,
> +       .argmax         = 1,
> +       .cfunc          = sleep_f,
> +       .flags          = CMD_NOFILE_OK,
> +       .oneline        = "waits for the given value in milliseconds",
> +};
> +
>   static void help_oneline(const char *cmd, const cmdinfo_t *ct)
>   {
>       if (cmd) {
> @@ -2151,4 +2192,5 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>       qemuio_add_command(&resume_cmd);
>       qemuio_add_command(&wait_break_cmd);
>       qemuio_add_command(&abort_cmd);
> +    qemuio_add_command(&sleep_cmd);
>   }

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity Kevin Wolf
@ 2014-01-17 23:59   ` Max Reitz
  0 siblings, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:59 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Request sizes used to be rounded down to the next sector boundary,
> which allowed requests to bypass the I/O limit. Now all requests are
> accounted for with their exact byte size.
>
> Reported-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 13 +++++--------
>   1 file changed, 5 insertions(+), 8 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv() Kevin Wolf
@ 2014-01-17 23:59   ` Max Reitz
  2014-01-21 13:29   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-17 23:59 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Similar to bdrv_pread(), which aligns byte-aligned requests to 512 byte
> sectors, bdrv_co_do_preadv() takes a byte-aligned request and aligns it
> to the alignment specified in bs->request_alignment.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 58 insertions(+), 6 deletions(-)

Together with patch 29:

Reviewed-by: Max Reitz <mreitz@redhat.com>
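
As a rough illustration of what aligning a byte-granularity request to
bs->request_alignment means, the arithmetic can be sketched like this. The
AlignedRange type and align_request() are hypothetical names for this
example only; the actual patch also has to deal with buffers and padding,
which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the enlarged range that is actually read. */
typedef struct {
    uint64_t offset;
    uint64_t bytes;
} AlignedRange;

/* Round the start down and the end up to a multiple of align, which is
 * assumed to be a non-zero power of two. */
static AlignedRange align_request(uint64_t offset, uint64_t bytes,
                                  uint64_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);

    uint64_t start = offset & ~(align - 1);
    uint64_t end = (offset + bytes + align - 1) & ~(align - 1);

    return (AlignedRange){ .offset = start, .bytes = end - start };
}
```

With a 4k alignment, a 512-byte request at offset 0x200 becomes a 4k request
at offset 0, and a request that is already aligned is left unchanged.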

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-18  0:00   ` Max Reitz
  2014-01-21 13:36   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Max Reitz @ 2014-01-18  0:00 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> This is going to become the bdrv_co_do_preadv() equivalent for writes.
> In this patch, however, just a function taking byte offsets is created;
> it doesn't align anything yet.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 24 ++++++++++++++++++------
>   1 file changed, 18 insertions(+), 6 deletions(-)

Together with patch 29:

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic Kevin Wolf
@ 2014-01-18 16:01   ` Max Reitz
  2014-01-20  9:44     ` Kevin Wolf
  0 siblings, 1 reply; 70+ messages in thread
From: Max Reitz @ 2014-01-18 16:01 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha

On 17.01.2014 15:15, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                    |   7 ++
>   block/blkdebug.c           |   8 ++
>   include/block/block.h      |   8 ++
>   tests/qemu-iotests/077     | 278 +++++++++++++++++++++++++++++++++++++++++++++
>   tests/qemu-iotests/077.out | 202 ++++++++++++++++++++++++++++++++
>   tests/qemu-iotests/group   |   1 +
>   6 files changed, 504 insertions(+)
>   create mode 100755 tests/qemu-iotests/077
>   create mode 100644 tests/qemu-iotests/077.out
>
> diff --git a/block.c b/block.c
> index 812b1b2..12af7fb 100644
> --- a/block.c
> +++ b/block.c
> @@ -2961,10 +2961,13 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>       if (ret < 0) {
>           /* Do nothing, write notifier decided to fail this request */
>       } else if (flags & BDRV_REQ_ZERO_WRITE) {
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_ZERO);
>           ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, flags);
>       } else {
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV);
>           ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
>       }
> +    BLKDBG_EVENT(bs, BLKDBG_PWRITEV_DONE);
>   
>       if (ret == 0 && !bs->enable_write_cache) {
>           ret = bdrv_co_flush(bs);
> @@ -3035,11 +3038,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>           };
>           qemu_iovec_init_external(&head_qiov, &head_iov, 1);
>   
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_HEAD);
>           ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
>                                     align, &head_qiov, 0);
>           if (ret < 0) {
>               goto fail;
>           }
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_HEAD);
>   
>           qemu_iovec_init(&local_qiov, qiov->niov + 2);
>           qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
> @@ -3067,11 +3072,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>           };
>           qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
>   
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_TAIL);
>           ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
>                                     align, &tail_qiov, 0);
>           if (ret < 0) {
>               goto fail;
>           }
> +        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
>   
>           if (!use_local_qiov) {
>               qemu_iovec_init(&local_qiov, qiov->niov + 1);
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index dc4ba46..6657671 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -186,6 +186,14 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>   
>       [BLKDBG_FLUSH_TO_OS]                    = "flush_to_os",
>       [BLKDBG_FLUSH_TO_DISK]                  = "flush_to_disk",
> +
> +    [BLKDBG_PWRITEV_RMW_HEAD]               = "pwritev_rmw.head",
> +    [BLKDBG_PWRITEV_RMW_AFTER_HEAD]         = "pwritev_rmw.after_head",
> +    [BLKDBG_PWRITEV_RMW_TAIL]               = "pwritev_rmw.tail",
> +    [BLKDBG_PWRITEV_RMW_AFTER_TAIL]         = "pwritev_rmw.after_tail",
> +    [BLKDBG_PWRITEV]                        = "pwritev",
> +    [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
> +    [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>   };
>   
>   static int get_event_by_name(const char *name, BlkDebugEvent *event)
> diff --git a/include/block/block.h b/include/block/block.h
> index 7e40ccc..127fc2f 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -518,6 +518,14 @@ typedef enum {
>       BLKDBG_FLUSH_TO_OS,
>       BLKDBG_FLUSH_TO_DISK,
>   
> +    BLKDBG_PWRITEV_RMW_HEAD,
> +    BLKDBG_PWRITEV_RMW_AFTER_HEAD,
> +    BLKDBG_PWRITEV_RMW_TAIL,
> +    BLKDBG_PWRITEV_RMW_AFTER_TAIL,
> +    BLKDBG_PWRITEV,
> +    BLKDBG_PWRITEV_ZERO,
> +    BLKDBG_PWRITEV_DONE,
> +
>       BLKDBG_EVENT_MAX,
>   } BlkDebugEvent;
>   
> diff --git a/tests/qemu-iotests/077 b/tests/qemu-iotests/077
> new file mode 100755
> index 0000000..58bfc8f
> --- /dev/null
> +++ b/tests/qemu-iotests/077
> @@ -0,0 +1,278 @@
> +#!/bin/bash
> +#
> +# Test concurrent pread/pwrite
> +#
> +# Copyright (C) 2014 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +#
> +
> +# creator
> +owner=kwolf@redhat.com
> +
> +seq=`basename $0`
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +
> +_cleanup()
> +{
> +	_cleanup_test_img
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common.rc
> +. ./common.filter
> +
> +_supported_fmt generic
> +_supported_proto generic
> +_supported_os Linux
> +
> +CLUSTER_SIZE=4k
> +size=128M
> +
> +_make_test_img $size
> +
> +echo
> +echo "== Some concurrent requests involving RMW =="
> +
> +function test_io()
> +{
> +echo "open -o file.align=4k blkdebug::$TEST_IMG"
> +# A simple RMW request
> +cat  <<EOF
> +aio_write -P 10 0x200 0x200
> +aio_flush
> +EOF
> +
> +# Sequential RMW requests on the same physical sector
> +off=0x1000
> +for ev in "head" "after_head" "tail" "after_tail"; do
> +cat  <<EOF
> +break pwritev_rmw.$ev A
> +aio_write -P 10 $((off + 0x200)) 0x200
> +wait_break A
> +aio_write -P 11 $((off + 0x400)) 0x200
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +off=$((off + 0x1000))
> +done
> +
> +# Chained dependencies
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0x5000 0x200
> +wait_break A
> +aio_write -P 11 0x5200 0x200
> +aio_write -P 12 0x5400 0x200
> +aio_write -P 13 0x5600 0x200
> +aio_write -P 14 0x5800 0x200
> +aio_write -P 15 0x5a00 0x200
> +aio_write -P 16 0x5c00 0x200
> +aio_write -P 17 0x5e00 0x200
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +
> +# Overlapping multiple requests
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0x6000 0x200
> +wait_break A
> +break pwritev_rmw.after_head B
> +aio_write -P 10 0x7e00 0x200
> +wait_break B
> +aio_write -P 11 0x6800 0x1000
> +resume A
> +sleep 100
> +resume B
> +aio_flush
> +EOF
> +
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0x8000 0x200
> +wait_break A
> +break pwritev_rmw.after_head B
> +aio_write -P 10 0x9e00 0x200
> +wait_break B
> +aio_write -P 11 0x8800 0x1000
> +resume B
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0xa000 0x200
> +wait_break A
> +aio_write -P 11 0xa800 0x1000
> +break pwritev_rmw.after_head B
> +aio_write -P 10 0xbe00 0x200
> +wait_break B
> +resume A
> +sleep 100
> +resume B
> +aio_flush
> +EOF
> +
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0xc000 0x200
> +wait_break A
> +aio_write -P 11 0xc800 0x1000
> +break pwritev_rmw.after_head B
> +aio_write -P 10 0xde00 0x200
> +wait_break B
> +resume B
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +
> +# Only RMW for the tail part
> +cat  <<EOF
> +break pwritev_rmw.after_tail A
> +aio_write -P 10 0xe000 0x1800
> +wait_break A
> +aio_write -P 11 0xf000 0xc00
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +
> +cat  <<EOF
> +break pwritev A
> +aio_write -P 10 0x10000 0x800
> +wait_break A
> +break pwritev_rmw.after_tail B
> +aio_write -P 11 0x10000 0x400
> +break pwritev_done C
> +resume A
> +wait_break C
> +resume C
> +sleep 100
> +wait_break B
> +resume B
> +aio_flush
> +EOF
> +
> +cat  <<EOF
> +break pwritev A
> +aio_write -P 10 0x11000 0x800
> +wait_break A
> +aio_write -P 11 0x11000 0x1000
> +sleep 100
> +resume A
> +aio_flush
> +EOF
> +}
> +
> +test_io | $QEMU_IO  | _filter_qemu_io | \
> +    sed -e 's,[0-9/]* bytes at offset [0-9]*,XXX/XXX bytes at offset XXX,g' \
> +        -e 's/^[0-9]* \(bytes\|KiB\)/XXX bytes/' \
> +        -e '/Suspended/d'
> +
> +echo
> +echo "== Verify image content =="
> +
> +function verify_io()
> +{
> +    # A simple RMW request
> +    echo read -P 0       0 0x200
> +    echo read -P 10  0x200 0x200
> +    echo read -P 0   0x400 0xc00
> +
> +    # Sequential RMW requests on the same physical sector
> +    echo read -P 0  0x1000 0x200
> +    echo read -P 10 0x1200 0x200
> +    echo read -P 11 0x1400 0x200
> +    echo read -P 0  0x1600 0xa00
> +
> +    echo read -P 0  0x2000 0x200
> +    echo read -P 10 0x2200 0x200
> +    echo read -P 11 0x2400 0x200
> +    echo read -P 0  0x2600 0xa00
> +
> +    echo read -P 0  0x3000 0x200
> +    echo read -P 10 0x3200 0x200
> +    echo read -P 11 0x3400 0x200
> +    echo read -P 0  0x3600 0xa00
> +
> +    echo read -P 0  0x4000 0x200
> +    echo read -P 10 0x4200 0x200
> +    echo read -P 11 0x4400 0x200
> +    echo read -P 0  0x4600 0xa00
> +
> +    # Chained dependencies
> +    echo read -P 10 0x5000 0x200
> +    echo read -P 11 0x5200 0x200
> +    echo read -P 12 0x5400 0x200
> +    echo read -P 13 0x5600 0x200
> +    echo read -P 14 0x5800 0x200
> +    echo read -P 15 0x5a00 0x200
> +    echo read -P 16 0x5c00 0x200
> +    echo read -P 17 0x5e00 0x200
> +
> +    # Overlapping multiple requests
> +    echo read -P 10 0x6000 0x200
> +    echo read -P  0 0x6200 0x600
> +    echo read -P 11 0x6800 0x1000
> +    echo read -P  0 0x7800 0x600
> +    echo read -P 10 0x7e00 0x200
> +
> +    echo read -P 10 0x8000 0x200
> +    echo read -P  0 0x8200 0x600
> +    echo read -P 11 0x8800 0x1000
> +    echo read -P  0 0x9800 0x600
> +    echo read -P 10 0x9e00 0x200
> +
> +    echo read -P 10 0xa000 0x200
> +    echo read -P  0 0xa200 0x600
> +    echo read -P 11 0xa800 0x1000
> +    echo read -P  0 0xb800 0x600
> +    echo read -P 10 0xbe00 0x200
> +
> +    echo read -P 10 0xc000 0x200
> +    echo read -P  0 0xc200 0x600
> +    echo read -P 11 0xc800 0x1000
> +    echo read -P  0 0xd800 0x600
> +    echo read -P 10 0xde00 0x200
> +
> +    # Only RMW for the tail part
> +    echo read -P 10 0xe000 0x200
> +    echo read -P 11 0xf800 0x400

I'd propose:

read -P 10 0xe000 0x1000
read -P 11 0xf000 0xc00

instead. That way, the reads would cover the overlapping section as well. 
But since that section is covered by the following reads just as well:


Reviewed-by: Max Reitz <mreitz@redhat.com>

> +    echo read -P  0 0xfc00 0x400
> +
> +    echo read -P 11 0x10000 0x400
> +    echo read -P 10 0x10400 0x400
> +
> +    echo read -P 11 0x11800 0x800
> +}
> +
> +verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
> +
> +_check_test_img
> +
> +# success, all done
> +echo "*** done"
> +rm -f $seq.full
> +status=0
> diff --git a/tests/qemu-iotests/077.out b/tests/qemu-iotests/077.out
> new file mode 100644
> index 0000000..169315a
> --- /dev/null
> +++ b/tests/qemu-iotests/077.out
> @@ -0,0 +1,202 @@
> +QA output created by 077
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
> +
> +== Some concurrent requests involving RMW ==
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'B'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'B'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'B'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'B'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +blkdebug: Resuming request 'C'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'B'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +blkdebug: Resuming request 'A'
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote XXX/XXX bytes at offset XXX
> +XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +
> +== Verify image content ==
> +read 512/512 bytes at offset 0
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 512
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 3072/3072 bytes at offset 1024
> +3 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 4096
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 4608
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 5120
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 2560/2560 bytes at offset 5632
> +2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 8192
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 8704
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 9216
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 2560/2560 bytes at offset 9728
> +2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 12288
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 12800
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 13312
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 2560/2560 bytes at offset 13824
> +2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 16384
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 16896
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 17408
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 2560/2560 bytes at offset 17920
> +2.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 20480
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 20992
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 21504
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 22016
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 22528
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 23040
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 23552
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 24064
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 24576
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 25088
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 4096/4096 bytes at offset 26624
> +4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 30720
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 32256
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 32768
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 33280
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 4096/4096 bytes at offset 34816
> +4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 38912
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 40448
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 40960
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 41472
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 4096/4096 bytes at offset 43008
> +4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 47104
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 48640
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 49152
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 49664
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 4096/4096 bytes at offset 51200
> +4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1536/1536 bytes at offset 55296
> +1.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 56832
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 512/512 bytes at offset 57344
> +512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 63488
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 64512
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 65536
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 1024/1024 bytes at offset 66560
> +1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 2048/2048 bytes at offset 71680
> +2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +No errors were found on the image.
> +*** done
> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
> index cc750c9..420396f 100644
> --- a/tests/qemu-iotests/group
> +++ b/tests/qemu-iotests/group
> @@ -79,3 +79,4 @@
>   070 rw auto
>   073 rw auto
>   074 rw auto
> +077 rw auto

* Re: [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2014-01-17 22:39   ` Benoît Canet
@ 2014-01-20  9:31     ` Kevin Wolf
  2014-01-20  9:49       ` Peter Lieven
  0 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-20  9:31 UTC (permalink / raw)
  To: Benoît Canet; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On 17.01.2014 at 23:39, Benoît Canet wrote:
> On Friday 17 Jan 2014 at 15:14:51 (+0100), Kevin Wolf wrote:
> > This function separates filling the BlockLimits from bdrv_open(), which
> > allows calling it from other operations that may change the limits
> > (e.g. modifications to the backing file chain or bdrv_reopen).
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > Reviewed-by: Max Reitz <mreitz@redhat.com>
> > ---
> >  block.c                   | 19 +++++++++++++++++++
> >  block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
> >  block/qcow2.c             | 11 ++++++++++-
> >  block/qed.c               | 11 ++++++++++-
> >  block/vmdk.c              | 22 ++++++++++++++++++----
> >  include/block/block_int.h |  2 ++
> >  6 files changed, 88 insertions(+), 23 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 64e7d22..99e69da 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
> >      return ret;
> >  }
> >  
> > +static int bdrv_refresh_limits(BlockDriverState *bs)
> > +{
> > +    BlockDriver *drv = bs->drv;
> > +
> > +    memset(&bs->bl, 0, sizeof(bs->bl));
> > +
> > +    if (drv && drv->bdrv_refresh_limits) {
> > +        return drv->bdrv_refresh_limits(bs);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  /*
> >   * Create a uniquely-named empty temporary file.
> >   * Return 0 upon success, otherwise a negative errno value.
> > @@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
> >          goto free_and_fail;
> >      }
> >  
> > +    bdrv_refresh_limits(bs);
> > +
> >  #ifndef _WIN32
> >      if (bs->is_temporary) {
> >          assert(bs->filename[0] != '\0');
> > @@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
> >      }
> >      pstrcpy(bs->backing_file, sizeof(bs->backing_file),
> >              bs->backing_hd->file->filename);
> > +
> > +    /* Recalculate the BlockLimits with the backing file */
> > +    bdrv_refresh_limits(bs);
> > +
> >      return 0;
> >  }
> >  
> > diff --git a/block/iscsi.c b/block/iscsi.c
> > index c0ea0c4..3202dc5 100644
> > --- a/block/iscsi.c
> > +++ b/block/iscsi.c
> > @@ -1265,23 +1265,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
> >                 sizeof(struct scsi_inquiry_block_limits));
> >          scsi_free_scsi_task(task);
> >          task = NULL;
> > -
> > -        if (iscsilun->bl.max_unmap < 0xffffffff) {
> > -            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> > -                                                 iscsilun);
> > -        }
> > -        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> > -                                                   iscsilun);
> > -
> > -        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> > -            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> > -                                                      iscsilun);
> > -        }
> > -        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> > -                                                        iscsilun);
> > -
> > -        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> > -                                                     iscsilun);
> >      }
> >  
> >  #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
> > @@ -1326,6 +1309,34 @@ static void iscsi_close(BlockDriverState *bs)
> >      memset(iscsilun, 0, sizeof(IscsiLun));
> >  }
> >  
> > +static int iscsi_refresh_limits(BlockDriverState *bs)
> > +{
> > +    IscsiLun *iscsilun = bs->opaque;
> > +
> > +    /* We don't actually refresh here, but just return data queried in
> > +     * iscsi_open(): iscsi targets don't change their limits. */
> > +    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
> > +        if (iscsilun->bl.max_unmap < 0xffffffff) {
> > +            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> > +                                                 iscsilun);
> > +        }
> > +        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> > +                                                   iscsilun);
> > +
> > +        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> > +            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> > +                                                      iscsilun);
> > +        }
> > +        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> > +                                                        iscsilun);
> > +
> > +        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> > +                                                     iscsilun);
> Why is bl.opt_transfer_length filled only inside the test for unmap and write same?
> Is there a relationship?

Good question. The only thing I can say is that I'm moving existing code
here, so this aspect doesn't change. I suspect this needs a fix on top
of this series. Peter?

Kevin

* Re: [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic
  2014-01-18 16:01   ` Max Reitz
@ 2014-01-20  9:44     ` Kevin Wolf
  0 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-20  9:44 UTC (permalink / raw)
  To: Max Reitz; +Cc: pbonzini, pl, qemu-devel, stefanha, xiawenc

On 18.01.2014 at 17:01, Max Reitz wrote:
> On 17.01.2014 15:15, Kevin Wolf wrote:
> >Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >---
> >  block.c                    |   7 ++
> >  block/blkdebug.c           |   8 ++
> >  include/block/block.h      |   8 ++
> >  tests/qemu-iotests/077     | 278 +++++++++++++++++++++++++++++++++++++++++++++
> >  tests/qemu-iotests/077.out | 202 ++++++++++++++++++++++++++++++++
> >  tests/qemu-iotests/group   |   1 +
> >  6 files changed, 504 insertions(+)
> >  create mode 100755 tests/qemu-iotests/077
> >  create mode 100644 tests/qemu-iotests/077.out
> >
> >diff --git a/block.c b/block.c
> >index 812b1b2..12af7fb 100644
> >--- a/block.c
> >+++ b/block.c
> >@@ -2961,10 +2961,13 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> >      if (ret < 0) {
> >          /* Do nothing, write notifier decided to fail this request */
> >      } else if (flags & BDRV_REQ_ZERO_WRITE) {
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_ZERO);
> >          ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, flags);
> >      } else {
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV);
> >          ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
> >      }
> >+    BLKDBG_EVENT(bs, BLKDBG_PWRITEV_DONE);
> >      if (ret == 0 && !bs->enable_write_cache) {
> >          ret = bdrv_co_flush(bs);
> >@@ -3035,11 +3038,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> >          };
> >          qemu_iovec_init_external(&head_qiov, &head_iov, 1);
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_HEAD);
> >          ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
> >                                    align, &head_qiov, 0);
> >          if (ret < 0) {
> >              goto fail;
> >          }
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_HEAD);
> >          qemu_iovec_init(&local_qiov, qiov->niov + 2);
> >          qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
> >@@ -3067,11 +3072,13 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> >          };
> >          qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_TAIL);
> >          ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
> >                                    align, &tail_qiov, 0);
> >          if (ret < 0) {
> >              goto fail;
> >          }
> >+        BLKDBG_EVENT(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
> >          if (!use_local_qiov) {
> >              qemu_iovec_init(&local_qiov, qiov->niov + 1);
> >diff --git a/block/blkdebug.c b/block/blkdebug.c
> >index dc4ba46..6657671 100644
> >--- a/block/blkdebug.c
> >+++ b/block/blkdebug.c
> >@@ -186,6 +186,14 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
> >      [BLKDBG_FLUSH_TO_OS]                    = "flush_to_os",
> >      [BLKDBG_FLUSH_TO_DISK]                  = "flush_to_disk",
> >+
> >+    [BLKDBG_PWRITEV_RMW_HEAD]               = "pwritev_rmw.head",
> >+    [BLKDBG_PWRITEV_RMW_AFTER_HEAD]         = "pwritev_rmw.after_head",
> >+    [BLKDBG_PWRITEV_RMW_TAIL]               = "pwritev_rmw.tail",
> >+    [BLKDBG_PWRITEV_RMW_AFTER_TAIL]         = "pwritev_rmw.after_tail",
> >+    [BLKDBG_PWRITEV]                        = "pwritev",
> >+    [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
> >+    [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
> >  };
> >  static int get_event_by_name(const char *name, BlkDebugEvent *event)
> >diff --git a/include/block/block.h b/include/block/block.h
> >index 7e40ccc..127fc2f 100644
> >--- a/include/block/block.h
> >+++ b/include/block/block.h
> >@@ -518,6 +518,14 @@ typedef enum {
> >      BLKDBG_FLUSH_TO_OS,
> >      BLKDBG_FLUSH_TO_DISK,
> >+    BLKDBG_PWRITEV_RMW_HEAD,
> >+    BLKDBG_PWRITEV_RMW_AFTER_HEAD,
> >+    BLKDBG_PWRITEV_RMW_TAIL,
> >+    BLKDBG_PWRITEV_RMW_AFTER_TAIL,
> >+    BLKDBG_PWRITEV,
> >+    BLKDBG_PWRITEV_ZERO,
> >+    BLKDBG_PWRITEV_DONE,
> >+
> >      BLKDBG_EVENT_MAX,
> >  } BlkDebugEvent;
> >diff --git a/tests/qemu-iotests/077 b/tests/qemu-iotests/077
> >new file mode 100755
> >index 0000000..58bfc8f
> >--- /dev/null
> >+++ b/tests/qemu-iotests/077
> >@@ -0,0 +1,278 @@
> >+#!/bin/bash
> >+#
> >+# Test concurrent pread/pwrite
> >+#
> >+# Copyright (C) 2014 Red Hat, Inc.
> >+#
> >+# This program is free software; you can redistribute it and/or modify
> >+# it under the terms of the GNU General Public License as published by
> >+# the Free Software Foundation; either version 2 of the License, or
> >+# (at your option) any later version.
> >+#
> >+# This program is distributed in the hope that it will be useful,
> >+# but WITHOUT ANY WARRANTY; without even the implied warranty of
> >+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >+# GNU General Public License for more details.
> >+#
> >+# You should have received a copy of the GNU General Public License
> >+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> >+#
> >+
> >+# creator
> >+owner=kwolf@redhat.com
> >+
> >+seq=`basename $0`
> >+echo "QA output created by $seq"
> >+
> >+here=`pwd`
> >+tmp=/tmp/$$
> >+status=1	# failure is the default!
> >+
> >+_cleanup()
> >+{
> >+	_cleanup_test_img
> >+}
> >+trap "_cleanup; exit \$status" 0 1 2 3 15
> >+
> >+# get standard environment, filters and checks
> >+. ./common.rc
> >+. ./common.filter
> >+
> >+_supported_fmt generic
> >+_supported_proto generic
> >+_supported_os Linux
> >+
> >+CLUSTER_SIZE=4k
> >+size=128M
> >+
> >+_make_test_img $size
> >+
> >+echo
> >+echo "== Some concurrent requests involving RMW =="
> >+
> >+function test_io()
> >+{
> >+echo "open -o file.align=4k blkdebug::$TEST_IMG"
> >+# A simple RMW request
> >+cat  <<EOF
> >+aio_write -P 10 0x200 0x200
> >+aio_flush
> >+EOF
> >+
> >+# Sequential RMW requests on the same physical sector
> >+off=0x1000
> >+for ev in "head" "after_head" "tail" "after_tail"; do
> >+cat  <<EOF
> >+break pwritev_rmw.$ev A
> >+aio_write -P 10 $((off + 0x200)) 0x200
> >+wait_break A
> >+aio_write -P 11 $((off + 0x400)) 0x200
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+off=$((off + 0x1000))
> >+done
> >+
> >+# Chained dependencies
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0x5000 0x200
> >+wait_break A
> >+aio_write -P 11 0x5200 0x200
> >+aio_write -P 12 0x5400 0x200
> >+aio_write -P 13 0x5600 0x200
> >+aio_write -P 14 0x5800 0x200
> >+aio_write -P 15 0x5a00 0x200
> >+aio_write -P 16 0x5c00 0x200
> >+aio_write -P 17 0x5e00 0x200
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+
> >+# Overlapping multiple requests
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0x6000 0x200
> >+wait_break A
> >+break pwritev_rmw.after_head B
> >+aio_write -P 10 0x7e00 0x200
> >+wait_break B
> >+aio_write -P 11 0x6800 0x1000
> >+resume A
> >+sleep 100
> >+resume B
> >+aio_flush
> >+EOF
> >+
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0x8000 0x200
> >+wait_break A
> >+break pwritev_rmw.after_head B
> >+aio_write -P 10 0x9e00 0x200
> >+wait_break B
> >+aio_write -P 11 0x8800 0x1000
> >+resume B
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0xa000 0x200
> >+wait_break A
> >+aio_write -P 11 0xa800 0x1000
> >+break pwritev_rmw.after_head B
> >+aio_write -P 10 0xbe00 0x200
> >+wait_break B
> >+resume A
> >+sleep 100
> >+resume B
> >+aio_flush
> >+EOF
> >+
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0xc000 0x200
> >+wait_break A
> >+aio_write -P 11 0xc800 0x1000
> >+break pwritev_rmw.after_head B
> >+aio_write -P 10 0xde00 0x200
> >+wait_break B
> >+resume B
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+
> >+# Only RMW for the tail part
> >+cat  <<EOF
> >+break pwritev_rmw.after_tail A
> >+aio_write -P 10 0xe000 0x1800
> >+wait_break A
> >+aio_write -P 11 0xf000 0xc00
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+
> >+cat  <<EOF
> >+break pwritev A
> >+aio_write -P 10 0x10000 0x800
> >+wait_break A
> >+break pwritev_rmw.after_tail B
> >+aio_write -P 11 0x10000 0x400
> >+break pwritev_done C
> >+resume A
> >+wait_break C
> >+resume C
> >+sleep 100
> >+wait_break B
> >+resume B
> >+aio_flush
> >+EOF
> >+
> >+cat  <<EOF
> >+break pwritev A
> >+aio_write -P 10 0x11000 0x800
> >+wait_break A
> >+aio_write -P 11 0x11000 0x1000
> >+sleep 100
> >+resume A
> >+aio_flush
> >+EOF
> >+}
> >+
> >+test_io | $QEMU_IO  | _filter_qemu_io | \
> >+    sed -e 's,[0-9/]* bytes at offset [0-9]*,XXX/XXX bytes at offset XXX,g' \
> >+        -e 's/^[0-9]* \(bytes\|KiB\)/XXX bytes/' \
> >+        -e '/Suspended/d'
> >+
> >+echo
> >+echo "== Verify image content =="
> >+
> >+function verify_io()
> >+{
> >+    # A simple RMW request
> >+    echo read -P 0       0 0x200
> >+    echo read -P 10  0x200 0x200
> >+    echo read -P 0   0x400 0xc00
> >+
> >+    # Sequential RMW requests on the same physical sector
> >+    echo read -P 0  0x1000 0x200
> >+    echo read -P 10 0x1200 0x200
> >+    echo read -P 11 0x1400 0x200
> >+    echo read -P 0  0x1600 0xa00
> >+
> >+    echo read -P 0  0x2000 0x200
> >+    echo read -P 10 0x2200 0x200
> >+    echo read -P 11 0x2400 0x200
> >+    echo read -P 0  0x2600 0xa00
> >+
> >+    echo read -P 0  0x3000 0x200
> >+    echo read -P 10 0x3200 0x200
> >+    echo read -P 11 0x3400 0x200
> >+    echo read -P 0  0x3600 0xa00
> >+
> >+    echo read -P 0  0x4000 0x200
> >+    echo read -P 10 0x4200 0x200
> >+    echo read -P 11 0x4400 0x200
> >+    echo read -P 0  0x4600 0xa00
> >+
> >+    # Chained dependencies
> >+    echo read -P 10 0x5000 0x200
> >+    echo read -P 11 0x5200 0x200
> >+    echo read -P 12 0x5400 0x200
> >+    echo read -P 13 0x5600 0x200
> >+    echo read -P 14 0x5800 0x200
> >+    echo read -P 15 0x5a00 0x200
> >+    echo read -P 16 0x5c00 0x200
> >+    echo read -P 17 0x5e00 0x200
> >+
> >+    # Overlapping multiple requests
> >+    echo read -P 10 0x6000 0x200
> >+    echo read -P  0 0x6200 0x600
> >+    echo read -P 11 0x6800 0x1000
> >+    echo read -P  0 0x7800 0x600
> >+    echo read -P 10 0x7e00 0x200
> >+
> >+    echo read -P 10 0x8000 0x200
> >+    echo read -P  0 0x8200 0x600
> >+    echo read -P 11 0x8800 0x1000
> >+    echo read -P  0 0x9800 0x600
> >+    echo read -P 10 0x9e00 0x200
> >+
> >+    echo read -P 10 0xa000 0x200
> >+    echo read -P  0 0xa200 0x600
> >+    echo read -P 11 0xa800 0x1000
> >+    echo read -P  0 0xb800 0x600
> >+    echo read -P 10 0xbe00 0x200
> >+
> >+    echo read -P 10 0xc000 0x200
> >+    echo read -P  0 0xc200 0x600
> >+    echo read -P 11 0xc800 0x1000
> >+    echo read -P  0 0xd800 0x600
> >+    echo read -P 10 0xde00 0x200
> >+
> >+    # Only RMW for the tail part
> >+    echo read -P 10 0xe000 0x200
> >+    echo read -P 11 0xf800 0x400
> 
> I'd propose:
> 
> read -P 10 0xe000 0x1000
> read -P 11 0xf000 0xc00
> 
> instead. That way, this would cover the overlapping section as well;
> but since this is covered by the following reads just as well:
> 
> Reviewed-by: Max Reitz <mreitz@redhat.com>

Not sure why I only read 0x200 bytes from 0xe000, I think this should be
0x1000 indeed. However, we can't test the overlapping part because this
is AIO and the order of the requests is undefined. Both 10 and 11 are
correct results.
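
To illustrate the point about undefined completion order: a check for the
overlapping region would have to accept either pattern. The following is only
a sketch and not qemu-io code; region_matches_any() is a hypothetical helper,
and it assumes that the region must be filled uniformly with one of the
candidate bytes (here that would be 10 or 11).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of qemu-io: returns true if every byte of
 * buf equals one single byte taken from the list of candidate patterns. */
static bool region_matches_any(const uint8_t *buf, size_t len,
                               const uint8_t *patterns, size_t npatterns)
{
    assert(buf != NULL && patterns != NULL);

    for (size_t p = 0; p < npatterns; p++) {
        size_t i = 0;

        /* Scan for the first byte that differs from this candidate */
        while (i < len && buf[i] == patterns[p]) {
            i++;
        }
        if (i == len) {
            return true;    /* the whole region carries this pattern */
        }
    }

    return false;           /* mixed content or an unexpected pattern */
}
```

A region that mixes both patterns is rejected on purpose, on the assumption
that the two writes are serialised and one of them wins as a whole.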

Kevin

* Re: [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2014-01-20  9:31     ` Kevin Wolf
@ 2014-01-20  9:49       ` Peter Lieven
  0 siblings, 0 replies; 70+ messages in thread
From: Peter Lieven @ 2014-01-20  9:49 UTC (permalink / raw)
  To: Kevin Wolf, Benoît Canet
  Cc: pbonzini, xiawenc, qemu-devel, stefanha, mreitz

On 20.01.2014 10:31, Kevin Wolf wrote:
> On 17.01.2014 at 23:39, Benoît Canet wrote:
>> On Friday 17 Jan 2014 at 15:14:51 (+0100), Kevin Wolf wrote:
>>> This function separates filling the BlockLimits from bdrv_open(), which
>>> allows calling it from other operations that may change the limits
>>> (e.g. modifications to the backing file chain or bdrv_reopen).
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   block.c                   | 19 +++++++++++++++++++
>>>   block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
>>>   block/qcow2.c             | 11 ++++++++++-
>>>   block/qed.c               | 11 ++++++++++-
>>>   block/vmdk.c              | 22 ++++++++++++++++++----
>>>   include/block/block_int.h |  2 ++
>>>   6 files changed, 88 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 64e7d22..99e69da 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
>>>       return ret;
>>>   }
>>>   
>>> +static int bdrv_refresh_limits(BlockDriverState *bs)
>>> +{
>>> +    BlockDriver *drv = bs->drv;
>>> +
>>> +    memset(&bs->bl, 0, sizeof(bs->bl));
>>> +
>>> +    if (drv && drv->bdrv_refresh_limits) {
>>> +        return drv->bdrv_refresh_limits(bs);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   /*
>>>    * Create a uniquely-named empty temporary file.
>>>    * Return 0 upon success, otherwise a negative errno value.
>>> @@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>>>           goto free_and_fail;
>>>       }
>>>   
>>> +    bdrv_refresh_limits(bs);
>>> +
>>>   #ifndef _WIN32
>>>       if (bs->is_temporary) {
>>>           assert(bs->filename[0] != '\0');
>>> @@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
>>>       }
>>>       pstrcpy(bs->backing_file, sizeof(bs->backing_file),
>>>               bs->backing_hd->file->filename);
>>> +
>>> +    /* Recalculate the BlockLimits with the backing file */
>>> +    bdrv_refresh_limits(bs);
>>> +
>>>       return 0;
>>>   }
>>>   
>>> diff --git a/block/iscsi.c b/block/iscsi.c
>>> index c0ea0c4..3202dc5 100644
>>> --- a/block/iscsi.c
>>> +++ b/block/iscsi.c
>>> @@ -1265,23 +1265,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>>>                  sizeof(struct scsi_inquiry_block_limits));
>>>           scsi_free_scsi_task(task);
>>>           task = NULL;
>>> -
>>> -        if (iscsilun->bl.max_unmap < 0xffffffff) {
>>> -            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
>>> -                                                 iscsilun);
>>> -        }
>>> -        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
>>> -                                                   iscsilun);
>>> -
>>> -        if (iscsilun->bl.max_ws_len < 0xffffffff) {
>>> -            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
>>> -                                                      iscsilun);
>>> -        }
>>> -        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
>>> -                                                        iscsilun);
>>> -
>>> -        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
>>> -                                                     iscsilun);
>>>       }
>>>   
>>>   #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
>>> @@ -1326,6 +1309,34 @@ static void iscsi_close(BlockDriverState *bs)
>>>       memset(iscsilun, 0, sizeof(IscsiLun));
>>>   }
>>>   
>>> +static int iscsi_refresh_limits(BlockDriverState *bs)
>>> +{
>>> +    IscsiLun *iscsilun = bs->opaque;
>>> +
>>> +    /* We don't actually refresh here, but just return data queried in
>>> +     * iscsi_open(): iscsi targets don't change their limits. */
>>> +    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
>>> +        if (iscsilun->bl.max_unmap < 0xffffffff) {
>>> +            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
>>> +                                                 iscsilun);
>>> +        }
>>> +        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
>>> +                                                   iscsilun);
>>> +
>>> +        if (iscsilun->bl.max_ws_len < 0xffffffff) {
>>> +            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
>>> +                                                      iscsilun);
>>> +        }
>>> +        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
>>> +                                                        iscsilun);
>>> +
>>> +        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
>>> +                                                     iscsilun);
>> Why is bl.opt_transfer_length filled only inside the test for unmap and write same?
>> Is there a relationship?
> Good question. The only thing I can say is that I'm moving existing code
> here, so this aspect doesn't change. I suspect this needs a fix on top
> of this series. Peter?
Yes, opt_transfer_length is independent. It ended up there because, if either
lbp.lbpu or lbp.lbpws is set, we can be sure that the Block Limits page is
supported. I will change this once Kevin's series is merged.

Peter

* Re: [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep'
  2014-01-17 23:55   ` Max Reitz
@ 2014-01-20  9:58     ` Kevin Wolf
  0 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-20  9:58 UTC (permalink / raw)
  To: Max Reitz; +Cc: pbonzini, pl, qemu-devel, stefanha, xiawenc

On 18.01.2014 at 00:55, Max Reitz wrote:
> On 17.01.2014 15:15, Kevin Wolf wrote:
> >There is no easy way to check that a request correctly waits for a
> >different request. With a sleep command we can at least approximate it.
> >
> >Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >---
> >  qemu-io-cmds.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> >
> >diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> >index 85e4982..978a3a0 100644
> >--- a/qemu-io-cmds.c
> >+++ b/qemu-io-cmds.c
> >@@ -12,6 +12,7 @@
> >  #include "block/block_int.h"
> >  #include "block/qapi.h"
> >  #include "qemu/main-loop.h"
> >+#include "qemu/timer.h"
> >  #define CMD_NOFILE_OK   0x01
> >@@ -2038,6 +2039,46 @@ static const cmdinfo_t abort_cmd = {
> >         .oneline        = "simulate a program crash using abort(3)",
> >  };
> >+static void sleep_cb(void *opaque)
> >+{
> >+    bool *expired = opaque;
> >+    *expired = true;
> >+}
> >+
> >+static int sleep_f(BlockDriverState *bs, int argc, char **argv)
> >+{
> >+    char *endptr;
> >+    long ms;
> >+    struct QEMUTimer *timer;
> >+    bool expired = false;
> >+
> >+    ms = strtol(argv[1], &endptr, 0);
> >+    if (ms < 0 || *endptr != '\0') {
> >+        printf("%s is not a valid number\n", argv[1]);
> >+        return 0;
> >+    }
> >+
> >+    timer = timer_new_ns(QEMU_CLOCK_HOST, sleep_cb, &expired);
> >+    timer_mod(timer, qemu_clock_get_ns(QEMU_CLOCK_HOST) + SCALE_MS * ms);
> >+
> >+    while (!expired) {
> 
> I don't know whether compilers don't optimize accesses to variables
> whose addresses have been given to other functions, but shouldn't
> expired be marked volatile just to be sure?

That would be a pretty broken compiler, and we're already using the same
pattern all over qemu without volatile, so no. (And if there was a
problem, I guess we'd rather use barriers than volatile variables.)
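
For illustration, here is the pattern in isolation. This is a sketch with a
made-up single-threaded timer, not qemu code: struct fake_timer and
fake_loop_iteration() are hypothetical stand-ins for whatever the loop body
calls to dispatch timers. The address of the flag has been handed to the
timer, and the callback runs in the same thread from within a normal function
call, so the compiler has to take the store through the pointer into account
when it evaluates the loop condition again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*timer_cb)(void *opaque);

/* Hypothetical one-shot timer, a stand-in for a real event loop */
struct fake_timer {
    timer_cb cb;
    void *opaque;
    int ticks_left;
};

/* One loop iteration: advance time and fire the callback on expiry */
static void fake_loop_iteration(struct fake_timer *t)
{
    assert(t != NULL);

    if (t->cb && --t->ticks_left <= 0) {
        t->cb(t->opaque);
        t->cb = NULL;
    }
}

static void sleep_cb(void *opaque)
{
    bool *expired = opaque;
    *expired = true;
}

/* Returns the number of loop iterations it took until the flag was set */
static int wait_for_timer(int ticks)
{
    bool expired = false;       /* deliberately not volatile */
    struct fake_timer t = { sleep_cb, &expired, ticks };
    int iterations = 0;

    while (!expired) {
        fake_loop_iteration(&t);    /* may set expired via t.opaque */
        iterations++;
    }

    return iterations;
}
```

The situation would be different if the flag were written from a signal
handler or from another thread; the sketch does not cover those cases.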

Kevin

* Re: [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
  2014-01-17 22:39   ` Benoît Canet
@ 2014-01-21 12:49   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 12:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:51 (+0100), Kevin Wolf wrote:
> This function separates filling the BlockLimits from bdrv_open(), which
> allows calling it from other operations that may change the limits
> (e.g. modifications to the backing file chain or bdrv_reopen).
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 19 +++++++++++++++++++
>  block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
>  block/qcow2.c             | 11 ++++++++++-
>  block/qed.c               | 11 ++++++++++-
>  block/vmdk.c              | 22 ++++++++++++++++++----
>  include/block/block_int.h |  2 ++
>  6 files changed, 88 insertions(+), 23 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 64e7d22..99e69da 100644
> --- a/block.c
> +++ b/block.c
> @@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
>      return ret;
>  }
>  
> +static int bdrv_refresh_limits(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +
> +    memset(&bs->bl, 0, sizeof(bs->bl));
> +
> +    if (drv && drv->bdrv_refresh_limits) {
> +        return drv->bdrv_refresh_limits(bs);
> +    }
> +
> +    return 0;
> +}
> +
>  /*
>   * Create a uniquely-named empty temporary file.
>   * Return 0 upon success, otherwise a negative errno value.
> @@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>          goto free_and_fail;
>      }
>  
> +    bdrv_refresh_limits(bs);
> +
>  #ifndef _WIN32
>      if (bs->is_temporary) {
>          assert(bs->filename[0] != '\0');
> @@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
>      }
>      pstrcpy(bs->backing_file, sizeof(bs->backing_file),
>              bs->backing_hd->file->filename);
> +
> +    /* Recalculate the BlockLimits with the backing file */
> +    bdrv_refresh_limits(bs);
> +
>      return 0;
>  }
>  
> diff --git a/block/iscsi.c b/block/iscsi.c
> index c0ea0c4..3202dc5 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -1265,23 +1265,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>                 sizeof(struct scsi_inquiry_block_limits));
>          scsi_free_scsi_task(task);
>          task = NULL;
> -
> -        if (iscsilun->bl.max_unmap < 0xffffffff) {
> -            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> -                                                 iscsilun);
> -        }
> -        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> -                                                   iscsilun);
> -
> -        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> -            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> -                                                      iscsilun);
> -        }
> -        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> -                                                        iscsilun);
> -
> -        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> -                                                     iscsilun);
>      }
>  
>  #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
> @@ -1326,6 +1309,34 @@ static void iscsi_close(BlockDriverState *bs)
>      memset(iscsilun, 0, sizeof(IscsiLun));
>  }
>  
> +static int iscsi_refresh_limits(BlockDriverState *bs)
> +{
> +    IscsiLun *iscsilun = bs->opaque;
> +
> +    /* We don't actually refresh here, but just return data queried in
> +     * iscsi_open(): iscsi targets don't change their limits. */
> +    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
> +        if (iscsilun->bl.max_unmap < 0xffffffff) {
> +            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
> +                                                 iscsilun);
> +        }
> +        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> +                                                   iscsilun);
> +
> +        if (iscsilun->bl.max_ws_len < 0xffffffff) {
> +            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
> +                                                      iscsilun);
> +        }
> +        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
> +                                                        iscsilun);
> +
> +        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
> +                                                     iscsilun);
> +    }
> +
> +    return 0;
> +}
> +
>  static int iscsi_truncate(BlockDriverState *bs, int64_t offset)
>  {
>      IscsiLun *iscsilun = bs->opaque;
> @@ -1438,6 +1449,7 @@ static BlockDriver bdrv_iscsi = {
>      .bdrv_getlength  = iscsi_getlength,
>      .bdrv_get_info   = iscsi_get_info,
>      .bdrv_truncate   = iscsi_truncate,
> +    .bdrv_refresh_limits = iscsi_refresh_limits,
>  
>  #if defined(LIBISCSI_FEATURE_IOVECTOR)
>      .bdrv_co_get_block_status = iscsi_co_get_block_status,
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8ec9db1..6562994 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -718,7 +718,6 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      qemu_opts_del(opts);
> -    bs->bl.write_zeroes_alignment = s->cluster_sectors;
>  
>      if (s->use_lazy_refcounts && s->qcow_version < 3) {
>          error_setg(errp, "Lazy refcounts require a qcow2 image with at least "
> @@ -751,6 +750,15 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
>      return ret;
>  }
>  
> +static int qcow2_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +
> +    bs->bl.write_zeroes_alignment = s->cluster_sectors;
> +
> +    return 0;
> +}
> +
>  static int qcow2_set_key(BlockDriverState *bs, const char *key)
>  {
>      BDRVQcowState *s = bs->opaque;
> @@ -2268,6 +2276,7 @@ static BlockDriver bdrv_qcow2 = {
>  
>      .bdrv_change_backing_file   = qcow2_change_backing_file,
>  
> +    .bdrv_refresh_limits        = qcow2_refresh_limits,
>      .bdrv_invalidate_cache      = qcow2_invalidate_cache,
>  
>      .create_options = qcow2_create_options,
> diff --git a/block/qed.c b/block/qed.c
> index 450a1fa..d35bd1c 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -495,7 +495,6 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
>          }
>      }
>  
> -    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
>      s->need_check_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>                                              qed_need_check_timer_cb, s);
>  
> @@ -507,6 +506,15 @@ out:
>      return ret;
>  }
>  
> +static int bdrv_qed_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +
> +    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
> +
> +    return 0;
> +}
> +
>  /* We have nothing to do for QED reopen, stubs just return
>   * success */
>  static int bdrv_qed_reopen_prepare(BDRVReopenState *state,
> @@ -1616,6 +1624,7 @@ static BlockDriver bdrv_qed = {
>      .bdrv_truncate            = bdrv_qed_truncate,
>      .bdrv_getlength           = bdrv_qed_getlength,
>      .bdrv_get_info            = bdrv_qed_get_info,
> +    .bdrv_refresh_limits      = bdrv_qed_refresh_limits,
>      .bdrv_change_backing_file = bdrv_qed_change_backing_file,
>      .bdrv_invalidate_cache    = bdrv_qed_invalidate_cache,
>      .bdrv_check               = bdrv_qed_check,
> diff --git a/block/vmdk.c b/block/vmdk.c
> index c6b60b4..e26df97 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -428,10 +428,6 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>  
> -    if (!flat) {
> -        bs->bl.write_zeroes_alignment =
> -            MAX(bs->bl.write_zeroes_alignment, cluster_sectors);
> -    }
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
>      } else {
> @@ -891,6 +887,23 @@ fail:
>      return ret;
>  }
>  
> +
> +static int vmdk_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVVmdkState *s = bs->opaque;
> +    int i;
> +
> +    for (i = 0; i < s->num_extents; i++) {
> +        if (!s->extents[i].flat) {
> +            bs->bl.write_zeroes_alignment =
> +                MAX(bs->bl.write_zeroes_alignment,
> +                    s->extents[i].cluster_sectors);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  static int get_whole_cluster(BlockDriverState *bs,
>                  VmdkExtent *extent,
>                  uint64_t cluster_offset,
> @@ -2002,6 +2015,7 @@ static BlockDriver bdrv_vmdk = {
>      .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
>      .bdrv_has_zero_init           = vmdk_has_zero_init,
>      .bdrv_get_specific_info       = vmdk_get_specific_info,
> +    .bdrv_refresh_limits          = vmdk_refresh_limits,
>  
>      .create_options               = vmdk_create_options,
>  };
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 2772f2f..3dfe724 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -226,6 +226,8 @@ struct BlockDriver {
>      int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
>      bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
>  
> +    int (*bdrv_refresh_limits)(BlockDriverState *bs);
> +
>      /*
>       * Returns 1 if newly created images are guaranteed to contain only
>       * zeros, 0 otherwise.
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>
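
As an illustration of the pattern the quoted patch introduces (clear the
limits, then let an optional per-driver hook fill them in), here is a
self-contained sketch. All Demo* names are hypothetical and the structs are
reduced to two fields; it is not the QEMU implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for BlockLimits. */
typedef struct DemoLimits {
    int max_discard;
    int write_zeroes_alignment;
} DemoLimits;

struct DemoState;

typedef struct DemoDriver {
    /* Optional hook; NULL for drivers that have no limits to report. */
    int (*refresh_limits)(struct DemoState *s);
} DemoDriver;

typedef struct DemoState {
    const DemoDriver *drv;
    DemoLimits bl;
    int cluster_sectors;
} DemoState;

/* Reset the limits, then delegate to the driver if it provides a hook. */
int demo_refresh_limits(DemoState *s)
{
    assert(s != NULL);
    memset(&s->bl, 0, sizeof(s->bl));

    if (s->drv && s->drv->refresh_limits) {
        return s->drv->refresh_limits(s);
    }
    return 0;
}

/* Example hook, modelled on the qcow2 one in the quoted diff. */
static int demo_cluster_refresh(DemoState *s)
{
    s->bl.write_zeroes_alignment = s->cluster_sectors;
    return 0;
}

const DemoDriver demo_cluster_driver = { demo_cluster_refresh };
const DemoDriver demo_plain_driver = { NULL };
```

Because the limits are zeroed on every call, a refresh can be repeated
after the configuration changes without stale values surviving.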


* Re: [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2014-01-21 12:54   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 12:54 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:57 (+0100), Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> The alignment field is now set to the value that is promised to the
> guest, rather than required by the host.  The next patches will make
> QEMU aware of the host-provided values, so make this clear.
> 
> The alignment is also not about memory buffers, but about the sectors on
> the disk, so change the documentation of the field.
> 
> At this point, the field is set by the device emulation, but completely
> ignored by the block layer.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 10 +++++-----
>  hw/block/virtio-blk.c     |  2 +-
>  hw/ide/core.c             |  2 +-
>  hw/scsi/scsi-disk.c       |  2 +-
>  hw/scsi/scsi-generic.c    |  2 +-
>  include/block/block.h     |  2 +-
>  include/block/block_int.h |  4 ++--
>  7 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 04c409a..b738abe 100644
> --- a/block.c
> +++ b/block.c
> @@ -812,7 +812,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>      }
>  
>      bs->open_flags = flags;
> -    bs->buffer_alignment = 512;
> +    bs->guest_block_size = 512;
>      bs->zero_beyond_eof = true;
>      open_flags = bdrv_open_flags(bs, flags);
>      bs->read_only = !(open_flags & BDRV_O_RDWR);
> @@ -1648,7 +1648,7 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
>      bs_dest->dev_ops            = bs_src->dev_ops;
>      bs_dest->dev_opaque         = bs_src->dev_opaque;
>      bs_dest->dev                = bs_src->dev;
> -    bs_dest->buffer_alignment   = bs_src->buffer_alignment;
> +    bs_dest->guest_block_size   = bs_src->guest_block_size;
>      bs_dest->copy_on_read       = bs_src->copy_on_read;
>  
>      bs_dest->enable_write_cache = bs_src->enable_write_cache;
> @@ -1800,7 +1800,7 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev)
>      bs->dev = NULL;
>      bs->dev_ops = NULL;
>      bs->dev_opaque = NULL;
> -    bs->buffer_alignment = 512;
> +    bs->guest_block_size = 512;
>  }
>  
>  /* TODO change to return DeviceState * when all users are qdevified */
> @@ -4561,9 +4561,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
>      return NULL;
>  }
>  
> -void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
> +void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>  {
> -    bs->buffer_alignment = align;
> +    bs->guest_block_size = align;
>  }
>  
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 19d0961..8a568e5 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -731,7 +731,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>      register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
>                      virtio_blk_save, virtio_blk_load, s);
>      bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
> -    bdrv_set_buffer_alignment(s->bs, s->conf->logical_block_size);
> +    bdrv_set_guest_block_size(s->bs, s->conf->logical_block_size);
>  
>      bdrv_iostatus_enable(s->bs);
>  
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index e1f4c33..036cd4a 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -2103,7 +2103,7 @@ int ide_init_drive(IDEState *s, BlockDriverState *bs, IDEDriveKind kind,
>      s->smart_selftest_count = 0;
>      if (kind == IDE_CD) {
>          bdrv_set_dev_ops(bs, &ide_cd_block_ops, s);
> -        bdrv_set_buffer_alignment(bs, 2048);
> +        bdrv_set_guest_block_size(bs, 2048);
>      } else {
>          if (!bdrv_is_inserted(s->bs)) {
>              error_report("Device needs media, but drive is empty");
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index bce617c..6491091 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -2254,7 +2254,7 @@ static int scsi_initfn(SCSIDevice *dev)
>      } else {
>          bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s);
>      }
> -    bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize);
> +    bdrv_set_guest_block_size(s->qdev.conf.bs, s->qdev.blocksize);
>  
>      bdrv_iostatus_enable(s->qdev.conf.bs);
>      add_boot_device_path(s->qdev.conf.bootindex, &dev->qdev, NULL);
> diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> index 8f195be..f08b64e 100644
> --- a/hw/scsi/scsi-generic.c
> +++ b/hw/scsi/scsi-generic.c
> @@ -210,7 +210,7 @@ static void scsi_read_complete(void * opaque, int ret)
>              s->blocksize = ldl_be_p(&r->buf[8]);
>              s->max_lba = ldq_be_p(&r->buf[0]);
>          }
> -        bdrv_set_buffer_alignment(s->conf.bs, s->blocksize);
> +        bdrv_set_guest_block_size(s->conf.bs, s->blocksize);
>  
>          scsi_req_data(&r->req, len);
>          if (!r->req.io_canceled) {
> diff --git a/include/block/block.h b/include/block/block.h
> index cf63ee2..05252d5 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -422,7 +422,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
>  /* Returns the alignment in bytes that is required so that no bounce buffer
>   * is required throughout the stack */
>  size_t bdrv_opt_mem_align(BlockDriverState *bs);
> -void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
> +void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
>  void *qemu_blockalign(BlockDriverState *bs, size_t size);
>  bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
>  
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a602e14..36c068f 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -319,8 +319,8 @@ struct BlockDriverState {
>      /* Whether produces zeros when read beyond eof */
>      bool zero_beyond_eof;
>  
> -    /* the memory alignment required for the buffers handled by this driver */
> -    int buffer_alignment;
> +    /* the block size for which the guest device expects atomicity */
> +    int guest_block_size;
>  
>      /* do we need to tell the quest if we have a volatile write cache? */
>      int enable_write_cache;
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>


* Re: [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment Kevin Wolf
@ 2014-01-21 13:03   ` Benoît Canet
  2014-01-21 13:29     ` Kevin Wolf
  0 siblings, 1 reply; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:58 (+0100), Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> Add a bs->request_alignment field that contains the required
> offset/length alignment for I/O requests and fill it in the raw block
> drivers. Use ioctls if possible, else see what alignment it takes for
> O_DIRECT to succeed.
> 
> While at it, also expose the memory alignment requirements, which may be
> (and in practice are) different from the disk alignment requirements.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   |   3 ++
>  block/raw-posix.c         | 102 ++++++++++++++++++++++++++++++++++++++--------
>  block/raw-win32.c         |  41 +++++++++++++++++++
>  include/block/block_int.h |   3 ++
>  4 files changed, 132 insertions(+), 17 deletions(-)
> 
> diff --git a/block.c b/block.c
> index b738abe..25ae896 100644
> --- a/block.c
> +++ b/block.c
> @@ -813,6 +813,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>  
>      bs->open_flags = flags;
>      bs->guest_block_size = 512;
> +    bs->request_alignment = 512;
>      bs->zero_beyond_eof = true;
>      open_flags = bdrv_open_flags(bs, flags);
>      bs->read_only = !(open_flags & BDRV_O_RDWR);
> @@ -881,6 +882,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
>      }
>  
>      bdrv_refresh_limits(bs);
> +    assert(bdrv_opt_mem_align(bs) != 0);
> +    assert(bs->request_alignment != 0);
>  
>  #ifndef _WIN32
>      if (bs->is_temporary) {
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 0676037..126a634 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -127,6 +127,8 @@ typedef struct BDRVRawState {
>      int fd;
>      int type;
>      int open_flags;
> +    size_t buf_align;
> +
>  #if defined(__linux__)
>      /* linux floppy specific */
>      int64_t fd_open_time;
> @@ -213,6 +215,76 @@ static int raw_normalize_devicepath(const char **filename)
>  }
>  #endif
>  
> +static void raw_probe_alignment(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    char *buf;
> +    unsigned int sector_size;
> +
> +    /* For /dev/sg devices the alignment is not really used.
> +       With buffered I/O, we don't have any restrictions. */
> +    if (bs->sg || !(s->open_flags & O_DIRECT)) {
> +        bs->request_alignment = 1;
> +        s->buf_align = 1;
> +        return;
> +    }
> +
> +    /* Try a few ioctls to get the right size */
> +    bs->request_alignment = 0;
> +    s->buf_align = 0;
> +
> +#ifdef BLKSSZGET
> +    if (ioctl(s->fd, BLKSSZGET, &sector_size) >= 0) {
> +        bs->request_alignment = sector_size;
> +    }
> +#endif
> +#ifdef DKIOCGETBLOCKSIZE
> +    if (ioctl(s->fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
> +        bs->request_alignment = sector_size;
> +    }
> +#endif
> +#ifdef DIOCGSECTORSIZE
> +    if (ioctl(s->fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
> +        bs->request_alignment = sector_size;
> +    }
> +#endif
> +#ifdef CONFIG_XFS
> +    if (s->is_xfs) {
> +        struct dioattr da;
> +        if (xfsctl(NULL, s->fd, XFS_IOC_DIOINFO, &da) >= 0) {
> +            bs->request_alignment = da.d_miniosz;
> +            /* The kernel returns wrong information for d_mem */
> +            /* s->buf_align = da.d_mem; */
> +        }
> +    }
> +#endif
> +
> +    /* If we could not get the sizes so far, we can only guess them */
> +    if (!s->buf_align) {
> +        size_t align;
> +        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
> +        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> +            if (pread(s->fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
> +                s->buf_align = align;
> +                break;
> +            }
> +        }
> +        qemu_vfree(buf);
> +    }
> +
> +    if (!bs->request_alignment) {
> +        size_t align;
> +        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
> +        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> +            if (pread(s->fd, buf, align, 0) >= 0) {
> +                bs->request_alignment = align;
> +                break;
> +            }
> +        }
> +        qemu_vfree(buf);
> +    }
> +}
> +
>  static void raw_parse_flags(int bdrv_flags, int *open_flags)
>  {
>      assert(open_flags != NULL);
> @@ -463,7 +535,6 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>      return ret;
>  }
>  
> -
>  static void raw_reopen_commit(BDRVReopenState *state)
>  {
>      BDRVRawReopenState *raw_s = state->opaque;
> @@ -499,23 +570,15 @@ static void raw_reopen_abort(BDRVReopenState *state)
>      state->opaque = NULL;
>  }
>  
> +static int raw_refresh_limits(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
>  
> -/* XXX: use host sector size if necessary with:
> -#ifdef DIOCGSECTORSIZE
> -        {
> -            unsigned int sectorsize = 512;
> -            if (!ioctl(fd, DIOCGSECTORSIZE, &sectorsize) &&
> -                sectorsize > bufsize)
> -                bufsize = sectorsize;
> -        }
> -#endif
> -#ifdef CONFIG_COCOA
> -        uint32_t blockSize = 512;
> -        if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
> -            bufsize = blockSize;
> -        }
> -#endif
> -*/
> +    raw_probe_alignment(bs);
> +    bs->bl.opt_mem_alignment = s->buf_align;
> +
> +    return 0;
> +}
>  
>  static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
>  {
> @@ -1363,6 +1426,7 @@ static BlockDriver bdrv_file = {
>      .bdrv_aio_writev = raw_aio_writev,
>      .bdrv_aio_flush = raw_aio_flush,
>      .bdrv_aio_discard = raw_aio_discard,
> +    .bdrv_refresh_limits = raw_refresh_limits,
>  
>      .bdrv_truncate = raw_truncate,
>      .bdrv_getlength = raw_getlength,
> @@ -1740,6 +1804,7 @@ static BlockDriver bdrv_host_device = {
>      .bdrv_aio_writev	= raw_aio_writev,
>      .bdrv_aio_flush	= raw_aio_flush,
>      .bdrv_aio_discard   = hdev_aio_discard,
> +    .bdrv_refresh_limits = raw_refresh_limits,
>  
>      .bdrv_truncate      = raw_truncate,
>      .bdrv_getlength	= raw_getlength,
> @@ -1871,6 +1936,7 @@ static BlockDriver bdrv_host_floppy = {
>      .bdrv_aio_readv     = raw_aio_readv,
>      .bdrv_aio_writev    = raw_aio_writev,
>      .bdrv_aio_flush	= raw_aio_flush,
> +    .bdrv_refresh_limits = raw_refresh_limits,
>  
>      .bdrv_truncate      = raw_truncate,
>      .bdrv_getlength      = raw_getlength,
> @@ -1981,6 +2047,7 @@ static BlockDriver bdrv_host_cdrom = {
>      .bdrv_aio_readv     = raw_aio_readv,
>      .bdrv_aio_writev    = raw_aio_writev,
>      .bdrv_aio_flush	= raw_aio_flush,
> +    .bdrv_refresh_limits = raw_refresh_limits,
>  
>      .bdrv_truncate      = raw_truncate,
>      .bdrv_getlength      = raw_getlength,
> @@ -2110,6 +2177,7 @@ static BlockDriver bdrv_host_cdrom = {
>      .bdrv_aio_readv     = raw_aio_readv,
>      .bdrv_aio_writev    = raw_aio_writev,
>      .bdrv_aio_flush	= raw_aio_flush,
> +    .bdrv_refresh_limits = raw_refresh_limits,
>  
>      .bdrv_truncate      = raw_truncate,
>      .bdrv_getlength      = raw_getlength,
> diff --git a/block/raw-win32.c b/block/raw-win32.c
> index ce314fd..beb7f23 100644
> --- a/block/raw-win32.c
> +++ b/block/raw-win32.c
> @@ -202,6 +202,35 @@ static int set_sparse(int fd)
>  				 NULL, 0, NULL, 0, &returned, NULL);
>  }
>  
> +static void raw_probe_alignment(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    DWORD sectorsPerCluster, freeClusters, totalClusters, count;
> +    DISK_GEOMETRY_EX dg;
> +    BOOL status;
> +
> +    if (s->type == FTYPE_CD) {
> +        bs->request_alignment = 2048;
> +        return;
> +    }
> +    if (s->type == FTYPE_HARDDISK) {
> +        status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
> +                                 NULL, 0, &dg, sizeof(dg), &count, NULL);
> +        if (status != 0) {
> +            bs->request_alignment = dg.Geometry.BytesPerSector;
> +            return;
> +        }
> +        /* try GetDiskFreeSpace too */
> +    }
> +
> +    if (s->drive_path[0]) {
> +        GetDiskFreeSpace(s->drive_path, &sectorsPerCluster,
> +                         &dg.Geometry.BytesPerSector,
> +                         &freeClusters, &totalClusters);
> +        bs->request_alignment = dg.Geometry.BytesPerSector;
> +    }
> +}
> +
>  static void raw_parse_flags(int flags, int *access_flags, DWORD *overlapped)
>  {
>      assert(access_flags != NULL);
> @@ -269,6 +298,17 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
>          }
>      }
>  
> +    if (filename[0] && filename[1] == ':') {
> +        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
> +    } else if (filename[0] == '\\' && filename[1] == '\\') {
> +        s->drive_path[0] = 0;
> +    } else {
> +        /* Relative path.  */
> +        char buf[MAX_PATH];
> +        GetCurrentDirectory(MAX_PATH, buf);
> +        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
> +    }
> +
This is not really alignment-related.
Does this really belong in this patch?

Best regards

Benoît
>      s->hfile = CreateFile(filename, access_flags,
>                            FILE_SHARE_READ, NULL,
>                            OPEN_EXISTING, overlapped, NULL);
> @@ -293,6 +333,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
>          s->aio = aio;
>      }
>  
> +    raw_probe_alignment(bs);
>      ret = 0;
>  fail:
>      qemu_opts_del(opts);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 36c068f..fb1d355 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -319,6 +319,9 @@ struct BlockDriverState {
>      /* Whether produces zeros when read beyond eof */
>      bool zero_beyond_eof;
>  
> +    /* Alignment requirement for offset/length of I/O requests */
> +    unsigned int request_alignment;
> +
>      /* the block size for which the guest device expects atomicity */
>      int guest_block_size;
>  
> -- 
> 1.8.1.4
> 
> 
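
The fallback probing in the quoted raw_probe_alignment() tries increasing
power-of-two sizes until a read succeeds. A portable sketch of that loop
follows, with the actual I/O replaced by a callback. ProbeFn,
demo_probe_alignment() and the fake device are assumptions made for
illustration; a real probe would issue pread() on an O_DIRECT file
descriptor instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if an I/O attempt with this alignment succeeds. */
typedef bool (*ProbeFn)(size_t align, void *opaque);

/*
 * Try 512, 1024, ... up to max_align and return the first value that
 * works, or 0 if none does. The align != 0 test stops the loop if the
 * shift ever wraps around.
 */
size_t demo_probe_alignment(ProbeFn try_io, void *opaque, size_t max_align)
{
    size_t align;

    assert(try_io != NULL);
    for (align = 512; align != 0 && align <= max_align; align <<= 1) {
        if (try_io(align, opaque)) {
            return align;
        }
    }
    return 0;
}

/* Fake device: accepts any alignment that is a multiple of *required. */
static bool fake_try_io(size_t align, void *opaque)
{
    const size_t *required = opaque;
    return align % *required == 0;
}

/* Convenience wrapper that probes the fake device. */
size_t demo_probe_fake(size_t required, size_t max_align)
{
    return demo_probe_alignment(fake_try_io, &required, max_align);
}
```

Returning 0 when nothing succeeds is a choice made in this sketch. A caller
would still have to handle that case, since the quoted block.c hunk asserts
that request_alignment is non-zero after the limits have been refreshed.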


* Re: [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv()
  2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv() Kevin Wolf
@ 2014-01-21 13:13   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:14:59 (+0100), Kevin Wolf wrote:
> This separates the part of bdrv_co_do_readv() that needs to happen
> before the request is modified to match the backend alignment, and a
> part that needs to be executed afterwards and passes the request to the
> BlockDriver.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 61 +++++++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 25ae896..0c2baf5 100644
> --- a/block.c
> +++ b/block.c
> @@ -2705,26 +2705,24 @@ err:
>  }
>  
>  /*
> - * Handle a read request in coroutine context
> + * Forwards an already correctly aligned request to the BlockDriver. This
> + * handles copy on read and zeroing after EOF; any other features must be
> + * implemented by the caller.
>   */
> -static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> -    BdrvRequestFlags flags)
> +static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
unsigned int bytes <- Would it be worth making this int64_t or size_t for the
not-so-distant future where 4GB will be the new 4MB?

Best regards

Benoît
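
To make the size concern concrete, here is a small sketch of the
sector-to-byte conversion, assuming 512-byte sectors (a shift of 9). The
demo_* helpers are hypothetical and only show where a 32-bit byte count
stops being sufficient; they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SECTOR_BITS 9   /* assumption: 512-byte sectors */

/* Widen to 64 bits before shifting so the result cannot be truncated. */
int64_t demo_sectors_to_bytes(int32_t nb_sectors)
{
    assert(nb_sectors >= 0);
    return (int64_t)nb_sectors << DEMO_SECTOR_BITS;
}

/* True if the byte count can be stored in a 32-bit unsigned integer. */
bool demo_fits_u32(int64_t bytes)
{
    return bytes >= 0 && (uint64_t)bytes <= UINT32_MAX;
}
```

With these definitions, 8388607 sectors still fit in 32 bits, while
8388608 sectors (exactly 4 GiB) do not.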

>  {
>      BlockDriver *drv = bs->drv;
>      BdrvTrackedRequest req;
>      int ret;
>  
> -    if (!drv) {
> -        return -ENOMEDIUM;
> -    }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> -        return -EIO;
> -    }
> +    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> +    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
>  
> -    if (bs->copy_on_read) {
> -        flags |= BDRV_REQ_COPY_ON_READ;
> -    }
> +    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> +
> +    /* Handle Copy on Read and associated serialisation */
>      if (flags & BDRV_REQ_COPY_ON_READ) {
>          bs->copy_on_read_in_flight++;
>      }
> @@ -2733,11 +2731,6 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          wait_for_overlapping_requests(bs, sector_num, nb_sectors);
>      }
>  
> -    /* throttling disk I/O */
> -    if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, false);
> -    }
> -
>      tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>  
>      if (flags & BDRV_REQ_COPY_ON_READ) {
> @@ -2754,6 +2747,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          }
>      }
>  
> +    /* Forward the request to the BlockDriver */
>      if (!(bs->zero_beyond_eof && bs->growable)) {
>          ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
>      } else {
> @@ -2794,6 +2788,37 @@ out:
>      return ret;
>  }
>  
> +/*
> + * Handle a read request in coroutine context
> + */
> +static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +    BdrvRequestFlags flags)
> +{
> +    BlockDriver *drv = bs->drv;
> +    int ret;
> +
> +    if (!drv) {
> +        return -ENOMEDIUM;
> +    }
> +    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +        return -EIO;
> +    }
> +
> +    if (bs->copy_on_read) {
> +        flags |= BDRV_REQ_COPY_ON_READ;
> +    }
> +
> +    /* throttling disk I/O */
> +    if (bs->io_limits_enabled) {
> +        bdrv_io_limits_intercept(bs, nb_sectors, false);
> +    }
> +
> +    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
> +                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +    return ret;
> +}
> +
>  int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
>      int nb_sectors, QEMUIOVector *qiov)
>  {
> -- 
> 1.8.1.4
> 
> 
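
The quoted helper asserts sector alignment with a mask test, and the commit
message refers to modifying the request to match the backend alignment. A
sketch of both operations is given below, assuming the alignment is a power
of two. DemoRange and the demo_* functions are illustrative names, not
taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoRange {
    int64_t offset;
    int64_t bytes;
} DemoRange;

/* Mask test as in the quoted asserts; align must be a power of two. */
bool demo_is_aligned(int64_t value, int64_t align)
{
    return (value & (align - 1)) == 0;
}

/*
 * Widen [offset, offset + bytes) so that both ends fall on a multiple
 * of align: round the start down and the end up.
 */
DemoRange demo_align_range(int64_t offset, int64_t bytes, int64_t align)
{
    int64_t start, end;

    assert(align > 0 && (align & (align - 1)) == 0);
    assert(offset >= 0 && bytes >= 0);

    start = offset & ~(align - 1);
    end = (offset + bytes + align - 1) & ~(align - 1);

    return (DemoRange){ start, end - start };
}
```

For example, a 512-byte request at offset 512 on a backend with 4096-byte
alignment becomes a 4096-byte request at offset 0.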


* Re: [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment
  2014-01-21 13:03   ` Benoît Canet
@ 2014-01-21 13:29     ` Kevin Wolf
  0 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-21 13:29 UTC (permalink / raw)
  To: Benoît Canet; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On 21.01.2014 at 14:03, Benoît Canet wrote:
> On Friday 17 Jan 2014 at 15:14:58 (+0100), Kevin Wolf wrote:
> > From: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > Add a bs->request_alignment field that contains the required
> > offset/length alignment for I/O requests and fill it in the raw block
> > drivers. Use ioctls if possible, else see what alignment it takes for
> > O_DIRECT to succeed.
> > 
> > While at it, also expose the memory alignment requirements, which may be
> > (and in practice are) different from the disk alignment requirements.
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > Reviewed-by: Max Reitz <mreitz@redhat.com>
> > ---
> >  block.c                   |   3 ++
> >  block/raw-posix.c         | 102 ++++++++++++++++++++++++++++++++++++++--------
> >  block/raw-win32.c         |  41 +++++++++++++++++++
> >  include/block/block_int.h |   3 ++
> >  4 files changed, 132 insertions(+), 17 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index b738abe..25ae896 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -813,6 +813,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
> >  
> >      bs->open_flags = flags;
> >      bs->guest_block_size = 512;
> > +    bs->request_alignment = 512;
> >      bs->zero_beyond_eof = true;
> >      open_flags = bdrv_open_flags(bs, flags);
> >      bs->read_only = !(open_flags & BDRV_O_RDWR);
> > @@ -881,6 +882,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
> >      }
> >  
> >      bdrv_refresh_limits(bs);
> > +    assert(bdrv_opt_mem_align(bs) != 0);
> > +    assert(bs->request_alignment != 0);
> >  
> >  #ifndef _WIN32
> >      if (bs->is_temporary) {
> > diff --git a/block/raw-posix.c b/block/raw-posix.c
> > index 0676037..126a634 100644
> > --- a/block/raw-posix.c
> > +++ b/block/raw-posix.c
> > @@ -127,6 +127,8 @@ typedef struct BDRVRawState {
> >      int fd;
> >      int type;
> >      int open_flags;
> > +    size_t buf_align;
> > +
> >  #if defined(__linux__)
> >      /* linux floppy specific */
> >      int64_t fd_open_time;
> > @@ -213,6 +215,76 @@ static int raw_normalize_devicepath(const char **filename)
> >  }
> >  #endif
> >  
> > +static void raw_probe_alignment(BlockDriverState *bs)
> > +{
> > +    BDRVRawState *s = bs->opaque;
> > +    char *buf;
> > +    unsigned int sector_size;
> > +
> > +    /* For /dev/sg devices the alignment is not really used.
> > +       With buffered I/O, we don't have any restrictions. */
> > +    if (bs->sg || !(s->open_flags & O_DIRECT)) {
> > +        bs->request_alignment = 1;
> > +        s->buf_align = 1;
> > +        return;
> > +    }
> > +
> > +    /* Try a few ioctls to get the right size */
> > +    bs->request_alignment = 0;
> > +    s->buf_align = 0;
> > +
> > +#ifdef BLKSSZGET
> > +    if (ioctl(s->fd, BLKSSZGET, &sector_size) >= 0) {
> > +        bs->request_alignment = sector_size;
> > +    }
> > +#endif
> > +#ifdef DKIOCGETBLOCKSIZE
> > +    if (ioctl(s->fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
> > +        bs->request_alignment = sector_size;
> > +    }
> > +#endif
> > +#ifdef DIOCGSECTORSIZE
> > +    if (ioctl(s->fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
> > +        bs->request_alignment = sector_size;
> > +    }
> > +#endif
> > +#ifdef CONFIG_XFS
> > +    if (s->is_xfs) {
> > +        struct dioattr da;
> > +        if (xfsctl(NULL, s->fd, XFS_IOC_DIOINFO, &da) >= 0) {
> > +            bs->request_alignment = da.d_miniosz;
> > +            /* The kernel returns wrong information for d_mem */
> > +            /* s->buf_align = da.d_mem; */
> > +        }
> > +    }
> > +#endif
> > +
> > +    /* If we could not get the sizes so far, we can only guess them */
> > +    if (!s->buf_align) {
> > +        size_t align;
> > +        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
> > +        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> > +            if (pread(s->fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
> > +                s->buf_align = align;
> > +                break;
> > +            }
> > +        }
> > +        qemu_vfree(buf);
> > +    }
> > +
> > +    if (!bs->request_alignment) {
> > +        size_t align;
> > +        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
> > +        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> > +            if (pread(s->fd, buf, align, 0) >= 0) {
> > +                bs->request_alignment = align;
> > +                break;
> > +            }
> > +        }
> > +        qemu_vfree(buf);
> > +    }
> > +}
> > +
> >  static void raw_parse_flags(int bdrv_flags, int *open_flags)
> >  {
> >      assert(open_flags != NULL);
> > @@ -463,7 +535,6 @@ static int raw_reopen_prepare(BDRVReopenState *state,
> >      return ret;
> >  }
> >  
> > -
> >  static void raw_reopen_commit(BDRVReopenState *state)
> >  {
> >      BDRVRawReopenState *raw_s = state->opaque;
> > @@ -499,23 +570,15 @@ static void raw_reopen_abort(BDRVReopenState *state)
> >      state->opaque = NULL;
> >  }
> >  
> > +static int raw_refresh_limits(BlockDriverState *bs)
> > +{
> > +    BDRVRawState *s = bs->opaque;
> >  
> > -/* XXX: use host sector size if necessary with:
> > -#ifdef DIOCGSECTORSIZE
> > -        {
> > -            unsigned int sectorsize = 512;
> > -            if (!ioctl(fd, DIOCGSECTORSIZE, &sectorsize) &&
> > -                sectorsize > bufsize)
> > -                bufsize = sectorsize;
> > -        }
> > -#endif
> > -#ifdef CONFIG_COCOA
> > -        uint32_t blockSize = 512;
> > -        if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
> > -            bufsize = blockSize;
> > -        }
> > -#endif
> > -*/
> > +    raw_probe_alignment(bs);
> > +    bs->bl.opt_mem_alignment = s->buf_align;
> > +
> > +    return 0;
> > +}
> >  
> >  static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
> >  {
> > @@ -1363,6 +1426,7 @@ static BlockDriver bdrv_file = {
> >      .bdrv_aio_writev = raw_aio_writev,
> >      .bdrv_aio_flush = raw_aio_flush,
> >      .bdrv_aio_discard = raw_aio_discard,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> >  
> >      .bdrv_truncate = raw_truncate,
> >      .bdrv_getlength = raw_getlength,
> > @@ -1740,6 +1804,7 @@ static BlockDriver bdrv_host_device = {
> >      .bdrv_aio_writev	= raw_aio_writev,
> >      .bdrv_aio_flush	= raw_aio_flush,
> >      .bdrv_aio_discard   = hdev_aio_discard,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> >  
> >      .bdrv_truncate      = raw_truncate,
> >      .bdrv_getlength	= raw_getlength,
> > @@ -1871,6 +1936,7 @@ static BlockDriver bdrv_host_floppy = {
> >      .bdrv_aio_readv     = raw_aio_readv,
> >      .bdrv_aio_writev    = raw_aio_writev,
> >      .bdrv_aio_flush	= raw_aio_flush,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> >  
> >      .bdrv_truncate      = raw_truncate,
> >      .bdrv_getlength      = raw_getlength,
> > @@ -1981,6 +2047,7 @@ static BlockDriver bdrv_host_cdrom = {
> >      .bdrv_aio_readv     = raw_aio_readv,
> >      .bdrv_aio_writev    = raw_aio_writev,
> >      .bdrv_aio_flush	= raw_aio_flush,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> >  
> >      .bdrv_truncate      = raw_truncate,
> >      .bdrv_getlength      = raw_getlength,
> > @@ -2110,6 +2177,7 @@ static BlockDriver bdrv_host_cdrom = {
> >      .bdrv_aio_readv     = raw_aio_readv,
> >      .bdrv_aio_writev    = raw_aio_writev,
> >      .bdrv_aio_flush	= raw_aio_flush,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> >  
> >      .bdrv_truncate      = raw_truncate,
> >      .bdrv_getlength      = raw_getlength,
> > diff --git a/block/raw-win32.c b/block/raw-win32.c
> > index ce314fd..beb7f23 100644
> > --- a/block/raw-win32.c
> > +++ b/block/raw-win32.c
> > @@ -202,6 +202,35 @@ static int set_sparse(int fd)
> >  				 NULL, 0, NULL, 0, &returned, NULL);
> >  }
> >  
> > +static void raw_probe_alignment(BlockDriverState *bs)
> > +{
> > +    BDRVRawState *s = bs->opaque;
> > +    DWORD sectorsPerCluster, freeClusters, totalClusters, count;
> > +    DISK_GEOMETRY_EX dg;
> > +    BOOL status;
> > +
> > +    if (s->type == FTYPE_CD) {
> > +        bs->request_alignment = 2048;
> > +        return;
> > +    }
> > +    if (s->type == FTYPE_HARDDISK) {
> > +        status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
> > +                                 NULL, 0, &dg, sizeof(dg), &count, NULL);
> > +        if (status != 0) {
> > +            bs->request_alignment = dg.Geometry.BytesPerSector;
> > +            return;
> > +        }
> > +        /* try GetDiskFreeSpace too */
> > +    }
> > +
> > +    if (s->drive_path[0]) {
> > +        GetDiskFreeSpace(s->drive_path, &sectorsPerCluster,
> > +                         &dg.Geometry.BytesPerSector,
> > +                         &freeClusters, &totalClusters);
> > +        bs->request_alignment = dg.Geometry.BytesPerSector;
> > +    }
> > +}
> > +
> >  static void raw_parse_flags(int flags, int *access_flags, DWORD *overlapped)
> >  {
> >      assert(access_flags != NULL);
> > @@ -269,6 +298,17 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
> >          }
> >      }
> >  
> > +    if (filename[0] && filename[1] == ':') {
> > +        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
> > +    } else if (filename[0] == '\\' && filename[1] == '\\') {
> > +        s->drive_path[0] = 0;
> > +    } else {
> > +        /* Relative path.  */
> > +        char buf[MAX_PATH];
> > +        GetCurrentDirectory(MAX_PATH, buf);
> > +        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
> > +    }
> > +
> This is not really alignment related.
> Does this really belong in this patch?

raw_probe_alignment() needs the path, so s->drive_path must now be
initialised for all images. Without this change it is only set in
hdev_open(), i.e. for win32 host_device backends.

Kevin
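
To illustrate the three cases that the quoted raw_open() hunk distinguishes
when it fills in s->drive_path, here is a minimal, portable sketch. The helper
name derive_drive_path() and the cwd parameter are hypothetical; the patch
itself calls GetCurrentDirectory() directly and writes into BDRVRawState.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: computes the drive root
 * that a probe such as GetDiskFreeSpace() would be pointed at.
 * cwd stands in for the result of GetCurrentDirectory(). */
static void derive_drive_path(const char *filename, const char *cwd,
                              char *out, size_t out_len)
{
    assert(out_len >= 4);               /* room for "X:\" plus NUL */

    if (filename[0] && filename[1] == ':') {
        /* Absolute path with a drive letter, e.g. D:\images\disk.img */
        snprintf(out, out_len, "%c:\\", filename[0]);
    } else if (filename[0] == '\\' && filename[1] == '\\') {
        /* UNC path: no drive letter, leave the path empty */
        out[0] = '\0';
    } else {
        /* Relative path: use the drive of the current directory */
        snprintf(out, out_len, "%c:\\", cwd[0]);
    }
}
```

With an empty result (the UNC case), the quoted raw_probe_alignment() skips
the GetDiskFreeSpace() call, so bs->request_alignment keeps whatever value it
had before.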

* Re: [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv() Kevin Wolf
  2014-01-17 23:59   ` Max Reitz
@ 2014-01-21 13:29   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:00 (+0100), Kevin Wolf wrote:
> Similar to bdrv_pread(), which aligns byte-aligned requests to 512 byte
> sectors, bdrv_co_do_preadv() takes a byte-aligned request and aligns it
> to the alignment specified in bs->request_alignment.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 58 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 0c2baf5..f378a10 100644
> --- a/block.c
> +++ b/block.c
> @@ -2791,17 +2791,23 @@ out:
>  /*
>   * Handle a read request in coroutine context
>   */
> -static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
>      BdrvRequestFlags flags)
>  {
>      BlockDriver *drv = bs->drv;
> +    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
> +    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
> +    uint8_t *head_buf = NULL;
> +    uint8_t *tail_buf = NULL;
> +    QEMUIOVector local_qiov;
> +    bool use_local_qiov = false;
>      int ret;
>  
>      if (!drv) {
>          return -ENOMEDIUM;
>      }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +    if (bdrv_check_byte_request(bs, offset, bytes)) {
>          return -EIO;
>      }
>  
> @@ -2811,14 +2817,60 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>  
>      /* throttling disk I/O */
>      if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, false);
> +        /* TODO Switch to byte granularity */
> +        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
> +    }
> +
> +    /* Align read if necessary by padding qiov */
> +    if (offset & (align - 1)) {
> +        head_buf = qemu_blockalign(bs, align);
> +        qemu_iovec_init(&local_qiov, qiov->niov + 2);
> +        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
> +        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
> +        use_local_qiov = true;
> +
> +        bytes += offset & (align - 1);
> +        offset = offset & ~(align - 1);
> +    }
> +
> +    if ((offset + bytes) & (align - 1)) {
> +        if (!use_local_qiov) {
> +            qemu_iovec_init(&local_qiov, qiov->niov + 1);
> +            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
> +            use_local_qiov = true;
> +        }
> +        tail_buf = qemu_blockalign(bs, align);
> +        qemu_iovec_add(&local_qiov, tail_buf,
> +                       align - ((offset + bytes) & (align - 1)));
> +
> +        bytes = ROUND_UP(bytes, align);
> +    }
> +
> +    ret = bdrv_aligned_preadv(bs, offset, bytes,
> +                              use_local_qiov ? &local_qiov : qiov,
> +                              flags);
> +
> +    if (use_local_qiov) {
> +        qemu_iovec_destroy(&local_qiov);
> +        qemu_vfree(head_buf);
> +        qemu_vfree(tail_buf);
>      }
>  
> -    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
> -                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>      return ret;
>  }
>  
> +static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +    BdrvRequestFlags flags)
> +{
> +    if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
> +        return -EINVAL;
> +    }
> +
> +    return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
> +                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +}
> +
>  int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
>      int nb_sectors, QEMUIOVector *qiov)
>  {
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>
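
For readers following the head/tail padding arithmetic in the quoted
bdrv_co_do_preadv(), here is a standalone sketch of just the offset and length
computation. The AlignedRange type and pad_read_request() name are
hypothetical, and the sketch leaves out the qiov handling; it assumes a
power-of-two alignment, as the mask arithmetic in the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, for illustration only. */
typedef struct AlignedRange {
    int64_t offset;   /* request start rounded down to the alignment */
    uint64_t bytes;   /* padded length, a multiple of the alignment */
    uint64_t head;    /* padding bytes read before the caller's data */
    uint64_t tail;    /* padding bytes read after the caller's data */
} AlignedRange;

static AlignedRange pad_read_request(int64_t offset, uint64_t bytes,
                                     uint64_t align)
{
    AlignedRange r;

    assert(offset >= 0);
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */

    /* Head: distance from the previous aligned boundary to offset */
    r.head = (uint64_t)offset & (align - 1);
    r.offset = offset - (int64_t)r.head;
    bytes += r.head;

    /* Tail: distance from the end of the data to the next boundary,
     * or 0 if the end is already aligned */
    r.tail = (align - (bytes & (align - 1))) & (align - 1);
    r.bytes = bytes + r.tail;
    return r;
}
```

For example, a 100-byte read at offset 700 with a 512-byte alignment becomes a
512-byte read at offset 512, with 188 bytes of head padding and 224 bytes of
tail padding.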

* Re: [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
@ 2014-01-21 13:31   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:01 (+0100), Kevin Wolf wrote:
> This separates the part of bdrv_co_do_writev() that needs to happen
> before the request is modified to match the backend alignment, and a
> part that needs to be executed afterwards and passes the request to the
> BlockDriver.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 41 insertions(+), 21 deletions(-)
> 
> diff --git a/block.c b/block.c
> index f378a10..70b72f0 100644
> --- a/block.c
> +++ b/block.c
> @@ -2964,34 +2964,20 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>  }
>  
>  /*
> - * Handle a write request in coroutine context
> + * Forwards an already correctly aligned write request to the BlockDriver.
>   */
> -static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> -    BdrvRequestFlags flags)
> +static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
>  {
>      BlockDriver *drv = bs->drv;
>      BdrvTrackedRequest req;
>      int ret;
>  
> -    if (!bs->drv) {
> -        return -ENOMEDIUM;
> -    }
> -    if (bs->read_only) {
> -        return -EACCES;
> -    }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> -        return -EIO;
> -    }
> -
> -    if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> -    }
> +    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> +    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
>  
> -    /* throttling disk I/O */
> -    if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, true);
> -    }
> +    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
>      tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
>  
> @@ -3023,6 +3009,40 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>      return ret;
>  }
>  
> +/*
> + * Handle a write request in coroutine context
> + */
> +static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +    BdrvRequestFlags flags)
> +{
> +    int ret;
> +
> +    if (!bs->drv) {
> +        return -ENOMEDIUM;
> +    }
> +    if (bs->read_only) {
> +        return -EACCES;
> +    }
> +    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +        return -EIO;
> +    }
> +
> +    if (bs->copy_on_read_in_flight) {
> +        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +    }
> +
> +    /* throttling disk I/O */
> +    if (bs->io_limits_enabled) {
> +        bdrv_io_limits_intercept(bs, nb_sectors, true);
> +    }
> +
> +    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
> +                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +
> +    return ret;
> +}
> +
>  int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
>      int nb_sectors, QEMUIOVector *qiov)
>  {
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>

* Re: [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling Kevin Wolf
@ 2014-01-21 13:33   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:33 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:02 (+0100), Kevin Wolf wrote:
> First waiting for all COR requests to complete and calling the
> throttling function afterwards means that the request could be delayed
> and we still need to wait for the COR request even if it was issued only
> after the throttled write request.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 70b72f0..fe55217 100644
> --- a/block.c
> +++ b/block.c
> @@ -2979,6 +2979,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
> +    if (bs->copy_on_read_in_flight) {
> +        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +    }
> +
>      tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
> @@ -3028,10 +3032,6 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>          return -EIO;
>      }
>  
> -    if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> -    }
> -
>      /* throttling disk I/O */
>      if (bs->io_limits_enabled) {
>          bdrv_io_limits_intercept(bs, nb_sectors, true);
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>

* Re: [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
  2014-01-18  0:00   ` Max Reitz
@ 2014-01-21 13:36   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:03 (+0100), Kevin Wolf wrote:
> This is going to become the bdrv_co_do_preadv() equivalent for writes.
> In this patch, however, only a function taking byte offsets is created;
> it doesn't align anything yet.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fe55217..328f592 100644
> --- a/block.c
> +++ b/block.c
> @@ -3016,8 +3016,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>  /*
>   * Handle a write request in coroutine context
>   */
> -static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
>      BdrvRequestFlags flags)
>  {
>      int ret;
> @@ -3028,21 +3028,33 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>      if (bs->read_only) {
>          return -EACCES;
>      }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +    if (bdrv_check_byte_request(bs, offset, bytes)) {
>          return -EIO;
>      }
>  
>      /* throttling disk I/O */
>      if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, true);
> +        /* TODO Switch to byte granularity */
> +        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
>      }
>  
> -    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
> -                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
>  
>      return ret;
>  }
>  
> +static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +    BdrvRequestFlags flags)
> +{
> +    if (nb_sectors < 0 || nb_sectors > (INT_MAX >> BDRV_SECTOR_BITS)) {
> +        return -EINVAL;
> +    }
> +
> +    return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
> +                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +}
> +
>  int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
>      int nb_sectors, QEMUIOVector *qiov)
>  {
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>
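
The sector-based wrapper in the quoted patch has to reject sector counts whose
byte length would not fit the new byte-granularity interface. A minimal sketch
of that conversion follows; sectors_to_byte_request() and SECTOR_BITS are
hypothetical names, with SECTOR_BITS assumed to equal BDRV_SECTOR_BITS (9,
i.e. 512-byte sectors). The sketch uses the INT_MAX bound from this write
wrapper; the read wrapper quoted for patch 10 checks against UINT_MAX instead.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define SECTOR_BITS 9   /* assumption: same value as BDRV_SECTOR_BITS */

/* Hypothetical helper: converts a sector-based request to a byte-based
 * one. Returns 0 on success, -EINVAL if the byte count would overflow. */
static int sectors_to_byte_request(int64_t sector_num, int nb_sectors,
                                   int64_t *offset, unsigned int *bytes)
{
    if (nb_sectors < 0 || nb_sectors > (INT_MAX >> SECTOR_BITS)) {
        return -EINVAL;
    }

    /* Multiply instead of shifting so a signed value is never shifted */
    *offset = sector_num * ((int64_t)1 << SECTOR_BITS);
    *bytes = (unsigned int)nb_sectors << SECTOR_BITS;
    return 0;
}
```

Checking the count before shifting matters because nb_sectors is a plain int:
without the bound, nb_sectors << BDRV_SECTOR_BITS could overflow before the
value ever reaches the unsigned int parameter.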

* Re: [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
  2014-01-17 23:19   ` Max Reitz
@ 2014-01-21 13:49   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-21 13:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:04 (+0100), Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c                   | 52 +++++++++++++++++++++++++++++++----------------
>  block/backup.c            |  7 ++++++-
>  include/block/block_int.h |  4 ++--
>  3 files changed, 42 insertions(+), 21 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 328f592..85f28ab 100644
> --- a/block.c
> +++ b/block.c
> @@ -2037,13 +2037,13 @@ static void tracked_request_end(BdrvTrackedRequest *req)
>   */
>  static void tracked_request_begin(BdrvTrackedRequest *req,
>                                    BlockDriverState *bs,
> -                                  int64_t sector_num,
> -                                  int nb_sectors, bool is_write)
> +                                  int64_t offset,
> +                                  unsigned int bytes, bool is_write)
>  {
>      *req = (BdrvTrackedRequest){
>          .bs = bs,
> -        .sector_num = sector_num,
> -        .nb_sectors = nb_sectors,
> +        .offset = offset,
> +        .bytes = bytes,
>          .is_write = is_write,
>          .co = qemu_coroutine_self(),
>      };
> @@ -2074,25 +2074,43 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
>      }
>  }
>  
> +static void round_bytes_to_clusters(BlockDriverState *bs,
> +                                    int64_t offset, unsigned int bytes,
> +                                    int64_t *cluster_offset,
> +                                    unsigned int *cluster_bytes)
> +{
> +    BlockDriverInfo bdi;
> +
> +    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
> +        *cluster_offset = offset;
> +        *cluster_bytes = bytes;
> +    } else {
> +        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
> +        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
> +                                       bdi.cluster_size);
> +    }
> +}
> +
>  static bool tracked_request_overlaps(BdrvTrackedRequest *req,
> -                                     int64_t sector_num, int nb_sectors) {
> +                                     int64_t offset, unsigned int bytes)
> +{
>      /*        aaaa   bbbb */
> -    if (sector_num >= req->sector_num + req->nb_sectors) {
> +    if (offset >= req->offset + req->bytes) {
>          return false;
>      }
>      /* bbbb   aaaa        */
> -    if (req->sector_num >= sector_num + nb_sectors) {
> +    if (req->offset >= offset + bytes) {
>          return false;
>      }
>      return true;
>  }
>  
>  static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
> -        int64_t sector_num, int nb_sectors)
> +        int64_t offset, unsigned int bytes)
>  {
>      BdrvTrackedRequest *req;
> -    int64_t cluster_sector_num;
> -    int cluster_nb_sectors;
> +    int64_t cluster_offset;
> +    unsigned int cluster_bytes;
>      bool retry;
>  
>      /* If we touch the same cluster it counts as an overlap.  This guarantees
> @@ -2101,14 +2119,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>       * CoR read and write operations are atomic and guest writes cannot
>       * interleave between them.
>       */
> -    bdrv_round_to_clusters(bs, sector_num, nb_sectors,
> -                           &cluster_sector_num, &cluster_nb_sectors);
> +    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
>  
>      do {
>          retry = false;
>          QLIST_FOREACH(req, &bs->tracked_requests, list) {
> -            if (tracked_request_overlaps(req, cluster_sector_num,
> -                                         cluster_nb_sectors)) {
> +            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
>                  /* Hitting this means there was a reentrant request, for
>                   * example, a block driver issuing nested requests.  This must
>                   * never happen since it means deadlock.
> @@ -2728,10 +2744,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>      }
>  
>      if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +        wait_for_overlapping_requests(bs, offset, bytes);
>      }
>  
> -    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +    tracked_request_begin(&req, bs, offset, bytes, false);
>  
>      if (flags & BDRV_REQ_COPY_ON_READ) {
>          int pnum;
> @@ -2980,10 +2996,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
>      if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +        wait_for_overlapping_requests(bs, offset, bytes);
>      }
>  
> -    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
> +    tracked_request_begin(&req, bs, offset, bytes, true);
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
>  
> diff --git a/block/backup.c b/block/backup.c
> index 0198514..15a2e55 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -181,8 +181,13 @@ static int coroutine_fn backup_before_write_notify(
>          void *opaque)
>  {
>      BdrvTrackedRequest *req = opaque;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
>  
> -    return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
> +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> +
> +    return backup_do_cow(req->bs, sector_num, nb_sectors, NULL);
>  }
>  
>  static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index fb1d355..97a4d23 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -57,8 +57,8 @@
>  
>  typedef struct BdrvTrackedRequest {
>      BlockDriverState *bs;
> -    int64_t sector_num;
> -    int nb_sectors;
> +    int64_t offset;
> +    unsigned int bytes;
>      bool is_write;
>      QLIST_ENTRY(BdrvTrackedRequest) list;
>      Coroutine *co; /* owner, used for deadlock detection */
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
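
As a rough illustration of the byte-granularity overlap check and the cluster
rounding in the quoted patch, here is a self-contained sketch. The function
names are hypothetical, the cluster size is passed in directly rather than
obtained from bdrv_get_info(), and lengths are widened to uint64_t for
simplicity, whereas the patch uses unsigned int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two half-open byte ranges [off, off + len) overlap unless one of them
 * ends at or before the start of the other. */
static bool byte_ranges_overlap(int64_t a_off, uint64_t a_len,
                                int64_t b_off, uint64_t b_len)
{
    if (b_off >= a_off + (int64_t)a_len) {
        return false;                   /* aaaa   bbbb */
    }
    if (a_off >= b_off + (int64_t)b_len) {
        return false;                   /* bbbb   aaaa */
    }
    return true;
}

/* Widen a byte range to whole clusters. A cluster size of 0 means
 * "unknown" and leaves the range unchanged, as in the quoted code. */
static void round_to_cluster(int64_t offset, uint64_t bytes,
                             uint64_t cluster_size,
                             int64_t *c_off, uint64_t *c_bytes)
{
    uint64_t span;

    if (cluster_size == 0) {
        *c_off = offset;
        *c_bytes = bytes;
        return;
    }
    *c_off = offset - offset % (int64_t)cluster_size;
    span = (uint64_t)(offset - *c_off) + bytes;
    *c_bytes = (span + cluster_size - 1) / cluster_size * cluster_size;
}
```

Two requests that touch different sectors of the same cluster do not overlap
as raw byte ranges, but they do once either one is rounded to cluster
boundaries, which is the property the copy-on-read comment in the quoted code
relies on.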

* Re: [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
@ 2014-01-22 19:46   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 19:46 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:05 (+0100), Kevin Wolf wrote:
> Previously, it was not possible to use wait_for_overlapping_requests()
> between tracked_request_begin()/end() because it would wait for itself.
> 
> Ignore the current request in the overlap check and run more of the
> bdrv_co_do_preadv/pwritev code with a BdrvTrackedRequest present.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 85f28ab..18dcd43 100644
> --- a/block.c
> +++ b/block.c
> @@ -2106,7 +2106,7 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
>  }
>  
>  static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
> -        int64_t offset, unsigned int bytes)
> +        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
>  {
>      BdrvTrackedRequest *req;
>      int64_t cluster_offset;
> @@ -2124,6 +2124,9 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>      do {
>          retry = false;
>          QLIST_FOREACH(req, &bs->tracked_requests, list) {
> +            if (req == self) {
> +                continue;
> +            }
>              if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
>                  /* Hitting this means there was a reentrant request, for
>                   * example, a block driver issuing nested requests.  This must
> @@ -2726,10 +2729,10 @@ err:
>   * implemented by the caller.
>   */
>  static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
> -    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
> +    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
> +    QEMUIOVector *qiov, int flags)
>  {
>      BlockDriver *drv = bs->drv;
> -    BdrvTrackedRequest req;
>      int ret;
>  
>      int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> @@ -2744,11 +2747,9 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>      }
>  
>      if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, offset, bytes);
> +        wait_for_overlapping_requests(bs, req, offset, bytes);
>      }
>  
> -    tracked_request_begin(&req, bs, offset, bytes, false);
> -
>      if (flags & BDRV_REQ_COPY_ON_READ) {
>          int pnum;
>  
> @@ -2795,8 +2796,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>      }
>  
>  out:
> -    tracked_request_end(&req);
> -
>      if (flags & BDRV_REQ_COPY_ON_READ) {
>          bs->copy_on_read_in_flight--;
>      }
> @@ -2812,6 +2811,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>      BdrvRequestFlags flags)
>  {
>      BlockDriver *drv = bs->drv;
> +    BdrvTrackedRequest req;
> +
>      /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
>      uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
>      uint8_t *head_buf = NULL;
> @@ -2862,9 +2863,11 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>          bytes = ROUND_UP(bytes, align);
>      }
>  
> -    ret = bdrv_aligned_preadv(bs, offset, bytes,
> +    tracked_request_begin(&req, bs, offset, bytes, false);
> +    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
>                                use_local_qiov ? &local_qiov : qiov,
>                                flags);
> +    tracked_request_end(&req);
>  
>      if (use_local_qiov) {
>          qemu_iovec_destroy(&local_qiov);
> @@ -2983,10 +2986,10 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>   * Forwards an already correctly aligned write request to the BlockDriver.
>   */
>  static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> -    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
> +    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
> +    QEMUIOVector *qiov, int flags)
>  {
>      BlockDriver *drv = bs->drv;
> -    BdrvTrackedRequest req;
>      int ret;
>  
>      int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> @@ -2996,12 +2999,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
>      if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, offset, bytes);
> +        wait_for_overlapping_requests(bs, req, offset, bytes);
>      }
>  
> -    tracked_request_begin(&req, bs, offset, bytes, true);
> -
> -    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
> +    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
>      if (ret < 0) {
>          /* Do nothing, write notifier decided to fail this request */
> @@ -3024,8 +3025,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>          bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
>      }
>  
> -    tracked_request_end(&req);
> -
>      return ret;
>  }
>  
> @@ -3036,6 +3035,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>      int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
>      BdrvRequestFlags flags)
>  {
> +    BdrvTrackedRequest req;
>      int ret;
>  
>      if (!bs->drv) {
> @@ -3054,7 +3054,9 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>          bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
>      }
>  
> -    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
> +    tracked_request_begin(&req, bs, offset, bytes, true);
> +    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
> +    tracked_request_end(&req);
>  
>      return ret;
>  }
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
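
For readers following the thread, the effect of the new "req == self"
check can be sketched outside of QEMU. This is only an illustration under
stated assumptions: SketchRequest, sketch_overlaps() and
sketch_find_conflict() are hypothetical stand-ins, not the block.c code,
and the list is a plain singly linked list rather than a QLIST.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for BdrvTrackedRequest. */
typedef struct SketchRequest {
    int64_t offset;
    unsigned int bytes;
    struct SketchRequest *next;
} SketchRequest;

/* Same interval test as tracked_request_overlaps() in the quoted patch. */
bool sketch_overlaps(const SketchRequest *req, int64_t offset,
                     unsigned int bytes)
{
    if (offset >= req->offset + req->bytes) {
        return false;               /*        aaaa   bbbb */
    }
    if (req->offset >= offset + bytes) {
        return false;               /* bbbb   aaaa        */
    }
    return true;
}

/* Return the first tracked request that overlaps 'self', skipping 'self'
 * itself so that a request already on the list never waits for itself. */
SketchRequest *sketch_find_conflict(SketchRequest *head,
                                    const SketchRequest *self)
{
    for (SketchRequest *req = head; req != NULL; req = req->next) {
        if (req == self) {
            continue;
        }
        if (sketch_overlaps(req, self->offset, self->bytes)) {
            return req;
        }
    }
    return NULL;
}
```

Without the self check, a request that is already tracked would always
find itself as a conflict, which is the "wait for itself" problem the
commit message describes.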

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment Kevin Wolf
  2014-01-17 23:21   ` Max Reitz
@ 2014-01-22 19:50   ` Benoît Canet
  1 sibling, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 19:50 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:06 (+0100), Kevin Wolf wrote:
> Odd file sizes could make bdrv_aligned_preadv() shorten the request in
> non-aligned ways. Fix it by rounding to the required alignment instead
> of 512 bytes.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 18dcd43..d7156ce 100644
> --- a/block.c
> +++ b/block.c
> @@ -2730,7 +2730,7 @@ err:
>   */
>  static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>      BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
> -    QEMUIOVector *qiov, int flags)
> +    int64_t align, QEMUIOVector *qiov, int flags)
>  {
>      BlockDriver *drv = bs->drv;
>      int ret;
> @@ -2778,7 +2778,8 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>          }
>  
>          total_sectors = DIV_ROUND_UP(len, BDRV_SECTOR_SIZE);
> -        max_nb_sectors = MAX(0, total_sectors - sector_num);
> +        max_nb_sectors = MAX(0, ROUND_UP(total_sectors - sector_num,
> +                                         align >> BDRV_SECTOR_BITS));
>          if (max_nb_sectors > 0) {
>              ret = drv->bdrv_co_readv(bs, sector_num,
>                                       MIN(nb_sectors, max_nb_sectors), qiov);
> @@ -2864,7 +2865,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>      }
>  
>      tracked_request_begin(&req, bs, offset, bytes, false);
> -    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
> +    ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,
>                                use_local_qiov ? &local_qiov : qiov,
>                                flags);
>      tracked_request_end(&req);
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
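
A small worked example of the rounding may help here. The sketch below is
an assumption-laden illustration, not the block.c code:
sketch_max_nb_sectors() is a hypothetical helper, the 512-byte sector size
is hard-coded, and non-positive remainders are clamped explicitly instead
of going through MAX(0, ROUND_UP(...)).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_BITS 9
#define SKETCH_SECTOR_SIZE (1 << SKETCH_SECTOR_BITS)

/* Number of sectors that may be passed to the driver when reading at
 * 'sector_num' from an image of 'len' bytes, rounded up to the required
 * alignment 'align' (in bytes, assumed to be a multiple of 512). */
int64_t sketch_max_nb_sectors(int64_t len, int64_t sector_num, int64_t align)
{
    int64_t total_sectors = (len + SKETCH_SECTOR_SIZE - 1) / SKETCH_SECTOR_SIZE;
    int64_t remaining = total_sectors - sector_num;
    int64_t align_sectors = align >> SKETCH_SECTOR_BITS;

    assert(align_sectors > 0);
    if (remaining <= 0) {
        return 0;                   /* request starts at or after EOF */
    }
    /* Round up so that the shortened request stays aligned. */
    return (remaining + align_sectors - 1) / align_sectors * align_sectors;
}
```

For a 5000-byte image and a read starting at sector 8, two sectors remain.
With 512-byte alignment the result is 2, but with 4096-byte alignment it
is rounded up to 8, so the driver still sees a 4k-aligned length.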

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation Kevin Wolf
@ 2014-01-22 20:00   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 20:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:07 (+0100), Kevin Wolf wrote:
> Change the API so that specific requests can be marked serialising. Only

I find the spelling "serialising" rather than "serializing" really odd, since
QEMU code is full of "z" spellings such as "virtualization".

Reviewed-by: Benoit Canet <benoit@irqsave.net>

> these requests are checked for overlaps then.
> 
> This means that during a Copy on Read operation, not all requests
> overlapping other requests are serialised any more, but only those that
> actually overlap with the specific COR request.
> 
> Also remove COR from function and variable names because this
> functionality can be useful in other contexts.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 48 ++++++++++++++++++++++++++++-------------------
>  include/block/block_int.h |  5 +++--
>  2 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/block.c b/block.c
> index d7156ce..efa8979 100644
> --- a/block.c
> +++ b/block.c
> @@ -2028,6 +2028,10 @@ int bdrv_commit_all(void)
>   */
>  static void tracked_request_end(BdrvTrackedRequest *req)
>  {
> +    if (req->serialising) {
> +        req->bs->serialising_in_flight--;
> +    }
> +
>      QLIST_REMOVE(req, list);
>      qemu_co_queue_restart_all(&req->wait_queue);
>  }
> @@ -2042,10 +2046,11 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>  {
>      *req = (BdrvTrackedRequest){
>          .bs = bs,
> -        .offset = offset,
> -        .bytes = bytes,
> -        .is_write = is_write,
> -        .co = qemu_coroutine_self(),
> +        .offset         = offset,
> +        .bytes          = bytes,
> +        .is_write       = is_write,
> +        .co             = qemu_coroutine_self(),
> +        .serialising    = false,
>      };
>  
>      qemu_co_queue_init(&req->wait_queue);
> @@ -2053,6 +2058,14 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>      QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
>  }
>  
> +static void mark_request_serialising(BdrvTrackedRequest *req)
> +{
> +    if (!req->serialising) {
> +        req->bs->serialising_in_flight++;
> +        req->serialising = true;
> +    }
> +}
> +
>  /**
>   * Round a region to cluster boundaries
>   */
> @@ -2105,26 +2118,31 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
>      return true;
>  }
>  
> -static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
> -        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
> +static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>  {
> +    BlockDriverState *bs = self->bs;
>      BdrvTrackedRequest *req;
>      int64_t cluster_offset;
>      unsigned int cluster_bytes;
>      bool retry;
>  
> +    if (!bs->serialising_in_flight) {
> +        return;
> +    }
> +
>      /* If we touch the same cluster it counts as an overlap.  This guarantees
>       * that allocating writes will be serialized and not race with each other
>       * for the same cluster.  For example, in copy-on-read it ensures that the
>       * CoR read and write operations are atomic and guest writes cannot
>       * interleave between them.
>       */
> -    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
> +    round_bytes_to_clusters(bs, self->offset, self->bytes,
> +                            &cluster_offset, &cluster_bytes);
>  
>      do {
>          retry = false;
>          QLIST_FOREACH(req, &bs->tracked_requests, list) {
> -            if (req == self) {
> +            if (req == self || (!req->serialising && !self->serialising)) {
>                  continue;
>              }
>              if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
> @@ -2743,12 +2761,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>  
>      /* Handle Copy on Read and associated serialisation */
>      if (flags & BDRV_REQ_COPY_ON_READ) {
> -        bs->copy_on_read_in_flight++;
> +        mark_request_serialising(req);
>      }
>  
> -    if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, req, offset, bytes);
> -    }
> +    wait_serialising_requests(req);
>  
>      if (flags & BDRV_REQ_COPY_ON_READ) {
>          int pnum;
> @@ -2797,10 +2813,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>      }
>  
>  out:
> -    if (flags & BDRV_REQ_COPY_ON_READ) {
> -        bs->copy_on_read_in_flight--;
> -    }
> -
>      return ret;
>  }
>  
> @@ -2999,9 +3011,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
> -    if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, req, offset, bytes);
> -    }
> +    wait_serialising_requests(req);
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 97a4d23..d8443df 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -60,6 +60,7 @@ typedef struct BdrvTrackedRequest {
>      int64_t offset;
>      unsigned int bytes;
>      bool is_write;
> +    bool serialising;
>      QLIST_ENTRY(BdrvTrackedRequest) list;
>      Coroutine *co; /* owner, used for deadlock detection */
>      CoQueue wait_queue; /* coroutines blocked on this request */
> @@ -296,8 +297,8 @@ struct BlockDriverState {
>      /* Callback before write request is processed */
>      NotifierWithReturnList before_write_notifiers;
>  
> -    /* number of in-flight copy-on-read requests */
> -    unsigned int copy_on_read_in_flight;
> +    /* number of in-flight serialising requests */
> +    unsigned int serialising_in_flight;
>  
>      /* I/O throttling */
>      ThrottleState throttle_state;
> -- 
> 1.8.1.4
> 
> 
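
The bookkeeping in this patch can be sketched in isolation as follows.
This is a minimal model under assumptions: SketchState, SketchReq and the
sketch_* functions are hypothetical names, and the real code additionally
maintains the request list and wait queues.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BlockDriverState and BdrvTrackedRequest. */
typedef struct SketchState {
    unsigned int serialising_in_flight;
} SketchState;

typedef struct SketchReq {
    SketchState *bs;
    bool serialising;
} SketchReq;

/* Idempotent: marking a request twice must not count it twice. */
void sketch_mark_serialising(SketchReq *req)
{
    if (!req->serialising) {
        req->bs->serialising_in_flight++;
        req->serialising = true;
    }
}

/* Called when the request ends; drops its contribution to the counter. */
void sketch_request_end(SketchReq *req)
{
    if (req->serialising) {
        req->bs->serialising_in_flight--;
    }
}

/* A pair of requests only needs an overlap check if at least one of them
 * is serialising (and they are not the same request). */
bool sketch_needs_overlap_check(const SketchReq *self, const SketchReq *req)
{
    if (req == self) {
        return false;
    }
    return req->serialising || self->serialising;
}
```

The counter lets the common case (no serialising request in flight) return
early without walking the list at all.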

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic Kevin Wolf
@ 2014-01-22 20:15   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 20:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:08 (+0100), Kevin Wolf wrote:
> Copy on Read wants to serialise with all requests touching the same
> cluster, so wait_serialising_requests() rounded to cluster boundaries.
> Other users like alignment RMW will have different requirements, though
> (requests touching the same sector), so make it dynamic.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 53 ++++++++++++++++++++++++-----------------------
>  include/block/block_int.h |  4 ++++
>  2 files changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/block.c b/block.c
> index efa8979..e72966a 100644
> --- a/block.c
> +++ b/block.c
> @@ -2051,6 +2051,8 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>          .is_write       = is_write,
>          .co             = qemu_coroutine_self(),
>          .serialising    = false,
> +        .overlap_offset = offset,
> +        .overlap_bytes  = bytes,
>      };
>  
>      qemu_co_queue_init(&req->wait_queue);
> @@ -2058,12 +2060,19 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>      QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
>  }
>  
> -static void mark_request_serialising(BdrvTrackedRequest *req)
> +static void mark_request_serialising(BdrvTrackedRequest *req, size_t align)
>  {
> +    int64_t overlap_offset = req->offset & ~(align - 1);
> +    int overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
> +                      - overlap_offset;
> +
>      if (!req->serialising) {
>          req->bs->serialising_in_flight++;
>          req->serialising = true;
>      }
> +
> +    req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
> +    req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
>  }
>  
>  /**
> @@ -2087,20 +2096,16 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
>      }
>  }
>  
> -static void round_bytes_to_clusters(BlockDriverState *bs,
> -                                    int64_t offset, unsigned int bytes,
> -                                    int64_t *cluster_offset,
> -                                    unsigned int *cluster_bytes)
> +static int bdrv_get_cluster_size(BlockDriverState *bs)
>  {
>      BlockDriverInfo bdi;
> +    int ret;
>  
> -    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
> -        *cluster_offset = offset;
> -        *cluster_bytes = bytes;
> +    ret = bdrv_get_info(bs, &bdi);
> +    if (ret < 0 || bdi.cluster_size == 0) {
> +        return bs->request_alignment;
>      } else {
> -        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
> -        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
> -                                       bdi.cluster_size);
> +        return bdi.cluster_size;
>      }
>  }
>  
> @@ -2108,11 +2113,11 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
>                                       int64_t offset, unsigned int bytes)
>  {
>      /*        aaaa   bbbb */
> -    if (offset >= req->offset + req->bytes) {
> +    if (offset >= req->overlap_offset + req->overlap_bytes) {
>          return false;
>      }
>      /* bbbb   aaaa        */
> -    if (req->offset >= offset + bytes) {
> +    if (req->overlap_offset >= offset + bytes) {
>          return false;
>      }
>      return true;
> @@ -2122,30 +2127,21 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>  {
>      BlockDriverState *bs = self->bs;
>      BdrvTrackedRequest *req;
> -    int64_t cluster_offset;
> -    unsigned int cluster_bytes;
>      bool retry;
>  
>      if (!bs->serialising_in_flight) {
>          return;
>      }
>  
> -    /* If we touch the same cluster it counts as an overlap.  This guarantees
> -     * that allocating writes will be serialized and not race with each other
> -     * for the same cluster.  For example, in copy-on-read it ensures that the
> -     * CoR read and write operations are atomic and guest writes cannot
> -     * interleave between them.
> -     */
> -    round_bytes_to_clusters(bs, self->offset, self->bytes,
> -                            &cluster_offset, &cluster_bytes);
> -
>      do {
>          retry = false;
>          QLIST_FOREACH(req, &bs->tracked_requests, list) {
>              if (req == self || (!req->serialising && !self->serialising)) {
>                  continue;
>              }
> -            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
> +            if (tracked_request_overlaps(req, self->overlap_offset,
> +                                         self->overlap_bytes))
> +            {
>                  /* Hitting this means there was a reentrant request, for
>                   * example, a block driver issuing nested requests.  This must
>                   * never happen since it means deadlock.
> @@ -2761,7 +2757,12 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>  
>      /* Handle Copy on Read and associated serialisation */
>      if (flags & BDRV_REQ_COPY_ON_READ) {
> -        mark_request_serialising(req);
> +        /* If we touch the same cluster it counts as an overlap.  This
> +         * guarantees that allocating writes will be serialized and not race
> +         * with each other for the same cluster.  For example, in copy-on-read
> +         * it ensures that the CoR read and write operations are atomic and
> +         * guest writes cannot interleave between them. */
> +        mark_request_serialising(req, bdrv_get_cluster_size(bs));
>      }
>  
>      wait_serialising_requests(req);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index d8443df..ccd2c68 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -60,7 +60,11 @@ typedef struct BdrvTrackedRequest {
>      int64_t offset;
>      unsigned int bytes;
>      bool is_write;
> +
>      bool serialising;
> +    int64_t overlap_offset;
> +    unsigned int overlap_bytes;
> +
>      QLIST_ENTRY(BdrvTrackedRequest) list;
>      Coroutine *co; /* owner, used for deadlock detection */
>      CoQueue wait_queue; /* coroutines blocked on this request */
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
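
The overlap range computed by mark_request_serialising() can be checked
with a few numbers. The helper below is a hypothetical sketch that assumes
a power-of-two alignment and uses 64-bit values throughout; it only
reproduces the rounding, not the MIN/MAX widening of an existing range.

```c
#include <assert.h>
#include <stdint.h>

typedef struct SketchRange {
    int64_t offset;
    int64_t bytes;
} SketchRange;

/* Expand [offset, offset + bytes) outwards to 'align' boundaries. */
SketchRange sketch_overlap_range(int64_t offset, int64_t bytes, int64_t align)
{
    /* The bit masks below only work for power-of-two alignments. */
    assert(align > 0 && (align & (align - 1)) == 0);

    int64_t start = offset & ~(align - 1);
    int64_t end = (offset + bytes + align - 1) & ~(align - 1);

    return (SketchRange){ .offset = start, .bytes = end - start };
}
```

A 200-byte request at offset 5000 with 4k alignment serialises against
everything in [4096, 8192). The same request at offset 4000 straddles a
boundary, so its range grows to [0, 8192).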

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point Kevin Wolf
@ 2014-01-22 20:21   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 20:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:09 (+0100), Kevin Wolf wrote:
> We can only have a single wait_serialising_requests() call per request
> because otherwise we can run into deadlocks where requests are waiting
> for each other. The same is true when wait_serialising_requests() is not
> at the very beginning of a request, so that other requests can be issued
> between the start of the tracking and wait_serialising_requests().
> 
> Fix this by changing wait_serialising_requests() to ignore requests that
> are already (directly or indirectly) waiting for the calling request.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 13 ++++++++++---
>  include/block/block_int.h |  2 ++
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/block.c b/block.c
> index e72966a..55e8c69 100644
> --- a/block.c
> +++ b/block.c
> @@ -2148,9 +2148,16 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>                   */
>                  assert(qemu_coroutine_self() != req->co);
>  
> -                qemu_co_queue_wait(&req->wait_queue);
> -                retry = true;
> -                break;
> +                /* If the request is already (indirectly) waiting for us, or
> +                 * will wait for us as soon as it wakes up, then just go on
> +                 * (instead of producing a deadlock in the former case). */
> +                if (!req->waiting_for) {
> +                    self->waiting_for = req;
> +                    qemu_co_queue_wait(&req->wait_queue);
> +                    self->waiting_for = NULL;
> +                    retry = true;
> +                    break;
> +                }
>              }
>          }
>      } while (retry);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index ccd2c68..fdf0e0b 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -68,6 +68,8 @@ typedef struct BdrvTrackedRequest {
>      QLIST_ENTRY(BdrvTrackedRequest) list;
>      Coroutine *co; /* owner, used for deadlock detection */
>      CoQueue wait_queue; /* coroutines blocked on this request */
> +
> +    struct BdrvTrackedRequest *waiting_for;
>  } BdrvTrackedRequest;
>  
>  struct BlockDriver {
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
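
The deadlock avoidance rule can be modelled without coroutines. The
following is a sketch under assumptions: SketchWaiter, sketch_try_wait()
and sketch_wake() are hypothetical names, and the actual blocking via
qemu_co_queue_wait() is replaced by a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the waiting_for field of BdrvTrackedRequest. */
typedef struct SketchWaiter {
    struct SketchWaiter *waiting_for;
} SketchWaiter;

/* Returns true if 'self' should block on 'req'. If 'req' is itself
 * already waiting for some request, 'self' just goes on, which prevents
 * two requests from ending up waiting for each other. */
bool sketch_try_wait(SketchWaiter *self, SketchWaiter *req)
{
    if (req->waiting_for != NULL) {
        return false;
    }
    self->waiting_for = req;
    return true;
}

/* Models the point where the wait returns and the loop is retried. */
void sketch_wake(SketchWaiter *self)
{
    self->waiting_for = NULL;
}
```

If A blocks on B and B then encounters A in its own overlap check, B sees
that A is already waiting and continues instead of blocking, so no wait
cycle can form between the two.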

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev()
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-22 20:29   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-22 20:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:10 (+0100), Kevin Wolf wrote:
> This patch changes bdrv_co_do_pwritev() to actually be what its name
> promises. If requests aren't properly aligned, it performs a RMW.
> 
> Requests touching the same block are serialised against the RMW request.
> Further optimisation of this is possible by differentiating types of
> requests (concurrent reads should actually be okay here).
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index 55e8c69..859e1aa 100644
> --- a/block.c
> +++ b/block.c
> @@ -3055,6 +3055,12 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>      BdrvRequestFlags flags)
>  {
>      BdrvTrackedRequest req;
> +    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
> +    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
> +    uint8_t *head_buf = NULL;
> +    uint8_t *tail_buf = NULL;
> +    QEMUIOVector local_qiov;
> +    bool use_local_qiov = false;
>      int ret;
>  
>      if (!bs->drv) {
> @@ -3073,10 +3079,88 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>          bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, true);
>      }
>  
> +    /*
> +     * Align write if necessary by performing a read-modify-write cycle.
> +     * Pad qiov with the read parts and be sure to have a tracked request not
> +     * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
> +     */
>      tracked_request_begin(&req, bs, offset, bytes, true);
> -    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
> +
> +    if (offset & (align - 1)) {
> +        QEMUIOVector head_qiov;
> +        struct iovec head_iov;
> +
> +        mark_request_serialising(&req, align);
> +        wait_serialising_requests(&req);
> +
> +        head_buf = qemu_blockalign(bs, align);
> +        head_iov = (struct iovec) {
> +            .iov_base   = head_buf,
> +            .iov_len    = align,
> +        };
> +        qemu_iovec_init_external(&head_qiov, &head_iov, 1);
> +
> +        ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
> +                                  align, &head_qiov, 0);
> +        if (ret < 0) {
> +            goto fail;
> +        }
> +
> +        qemu_iovec_init(&local_qiov, qiov->niov + 2);
> +        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
> +        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
> +        use_local_qiov = true;
> +
> +        bytes += offset & (align - 1);
> +        offset = offset & ~(align - 1);
> +    }
> +
> +    if ((offset + bytes) & (align - 1)) {
> +        QEMUIOVector tail_qiov;
> +        struct iovec tail_iov;
> +        size_t tail_bytes;
> +
> +        mark_request_serialising(&req, align);
> +        wait_serialising_requests(&req);
> +
> +        tail_buf = qemu_blockalign(bs, align);
> +        tail_iov = (struct iovec) {
> +            .iov_base   = tail_buf,
> +            .iov_len    = align,
> +        };
> +        qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
> +
> +        ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
> +                                  align, &tail_qiov, 0);
> +        if (ret < 0) {
> +            goto fail;
> +        }
> +
> +        if (!use_local_qiov) {
> +            qemu_iovec_init(&local_qiov, qiov->niov + 1);
> +            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
> +            use_local_qiov = true;
> +        }
> +
> +        tail_bytes = (offset + bytes) & (align - 1);
> +        qemu_iovec_add(&local_qiov, tail_buf + tail_bytes, align - tail_bytes);
> +
> +        bytes = ROUND_UP(bytes, align);
> +    }
> +
> +    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes,
> +                               use_local_qiov ? &local_qiov : qiov,
> +                               flags);
> +
> +fail:
>      tracked_request_end(&req);
>  
> +    if (use_local_qiov) {
> +        qemu_iovec_destroy(&local_qiov);
> +        qemu_vfree(head_buf);
> +        qemu_vfree(tail_buf);
> +    }
> +
>      return ret;
>  }
>  
> -- 
> 1.8.1.4
> 
> 
Reviewed-by: Benoit Canet <benoit@irqsave.net>
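
The head and tail arithmetic of the RMW path can be followed with concrete
numbers. The sketch below only reproduces the offset and length
calculation. SketchPadding and sketch_pad_request() are hypothetical
names, the alignment is assumed to be a power of two, and no buffers or
I/O vectors are involved.

```c
#include <assert.h>
#include <stdint.h>

typedef struct SketchPadding {
    int64_t offset;   /* aligned start of the request sent to the driver */
    int64_t bytes;    /* aligned length of the request sent to the driver */
    int64_t head;     /* bytes taken from the head read (before the data) */
    int64_t tail;     /* bytes taken from the tail read (after the data) */
} SketchPadding;

SketchPadding sketch_pad_request(int64_t offset, int64_t bytes, int64_t align)
{
    SketchPadding p = { offset, bytes, 0, 0 };
    int64_t tail_bytes;

    assert(align > 0 && (align & (align - 1)) == 0);

    /* Head: extend the request back to the previous alignment boundary. */
    p.head = offset & (align - 1);
    p.offset = offset - p.head;
    p.bytes += p.head;

    /* Tail: extend the request forward to the next alignment boundary. */
    tail_bytes = (p.offset + p.bytes) & (align - 1);
    if (tail_bytes != 0) {
        p.tail = align - tail_bytes;
        p.bytes += p.tail;
    }
    return p;
}
```

For a 200-byte write at offset 5000 with 4k alignment, 904 bytes come from
the head read and 2992 bytes from the tail read, giving one 4096-byte
write at offset 4096 (904 + 200 + 2992 = 4096).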

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation
  2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
                   ` (28 preceding siblings ...)
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity Kevin Wolf
@ 2014-01-22 20:30 ` Christian Borntraeger
  2014-01-23 10:29   ` Kevin Wolf
  29 siblings, 1 reply; 70+ messages in thread
From: Christian Borntraeger @ 2014-01-22 20:30 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, xiawenc, stefanha, mreitz

On 17/01/14 15:14, Kevin Wolf wrote:
> This patch series adds code to the block layer that allows performing
> I/O requests in smaller granularities than required by the host backend
> (most importantly, O_DIRECT restrictions). It achieves this for reads
> by rounding the request to host-side block boundary, and for writes by
> performing a read-modify-write cycle (and serialising requests
> touching the same block so that the RMW doesn't write back stale data).

Nice, this might really help on s390 (also for KVM) since dasd disks usually
have a 4k sector size. We also have flash systems with 4k block size. Both
disk systems cause lots of trouble with cache=none and friends.

Do you have a tree with these patches, so that I can test those on s390?



> 
> Originally I intended to reuse a lot of code from Paolo's previous
> patch series, however as I tried to integrate pread/pwrite, which
> already do a very similar thing (except for considering concurrency),
> and because I wanted to implement zero-copy, most of this series ended
> up being new code.
> 
> Zero-copy is possible in a common case because while XFS defaults to a
> 4k sector size and therefore 4k on-disk O_DIRECT alignment for 512E
> disks, it still only has a 512 byte memory alignment requirement.
> (Unfortunately the XFS_IOC_DIOINFO ioctl claims 4k even for memory, but
> we know that the value is wrong and can probe it.)
> 
> 
> Changes in v2 -> v3:
> - Fixed I/O throttling bypass by converting to byte granularity [Wenchao]
> - Made 'bytes' argument to tracked_request_overlaps() unsigned [Max]
> - Fixed a corruption bug that came from using outdated RMW buffers after
>   waiting for another request and added some assertions to check the
>   assumptions [Peter]
> - Fixed bytes vs. sectors error in zero-after-EOF code of
>   bdrv_co_do_preadv [Max]
> - Removed orphaned prototype in block.h [Max]
> - A qemu-iotests case and some infrastructure to support it
> 
> Changes in v1 -> v2:
> - Fixed overlap_bytes calculation in mark_request_serialising()
> - Fixed wait_serialising_requests() deadlock
> - iscsi: Set bs->request_alignment [Peter]
> - iscsi: Query block limits only in iscsi_open() when no other requests
>   are in flight, and in iscsi_refresh_limits() copy the stored values
>   into bs->bl [Peter]
> 
> Changes in RFC -> v1:
> - Moved opt_mem_alignment into BlockLimits [Paolo]
> - Changed BlockLimits in turn to work a bit more like the
>   .bdrv_opt_mem_align() callback of the RFC; allows updating the
>   BlockLimits later when the chain changes or bdrv_reopen() toggles
>   O_DIRECT
> - Fixed a typo in a commit message [Eric]
> 
> 
> Kevin Wolf (26):
>   block: Move initialisation of BlockLimits to bdrv_refresh_limits()
>   block: Inherit opt_transfer_length
>   block: Update BlockLimits when they might have changed
>   qemu_memalign: Allow small alignments
>   block: Detect unaligned length in bdrv_qiov_is_aligned()
>   block: Don't use guest sector size for qemu_blockalign()
>   block: Introduce bdrv_aligned_preadv()
>   block: Introduce bdrv_co_do_preadv()
>   block: Introduce bdrv_aligned_pwritev()
>   block: write: Handle COR dependency after I/O throttling
>   block: Introduce bdrv_co_do_pwritev()
>   block: Switch BdrvTrackedRequest to byte granularity
>   block: Allow waiting for overlapping requests between begin/end
>   block: Make zero-after-EOF work with larger alignment
>   block: Generalise and optimise COR serialisation
>   block: Make overlap range for serialisation dynamic
>   block: Allow wait_serialising_requests() at any point
>   block: Align requests in bdrv_co_do_pwritev()
>   block: Assert serialisation assumptions in pwritev
>   block: Change coroutine wrapper to byte granularity
>   block: Make bdrv_pread() a bdrv_prwv_co() wrapper
>   block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
>   blkdebug: Make required alignment configurable
>   qemu-io: New command 'sleep'
>   qemu-iotests: Test pwritev RMW logic
>   block: Switch bdrv_io_limits_intercept() to byte granularity
> 
> Paolo Bonzini (3):
>   block: rename buffer_alignment to guest_block_size
>   raw: Probe required direct I/O alignment
>   iscsi: Set bs->request_alignment
> 
>  block.c                    | 644 +++++++++++++++++++++++++++++++--------------
>  block/backup.c             |   7 +-
>  block/blkdebug.c           |  24 ++
>  block/iscsi.c              |  47 ++--
>  block/qcow2.c              |  11 +-
>  block/qed.c                |  11 +-
>  block/raw-posix.c          | 102 +++++--
>  block/raw-win32.c          |  41 +++
>  block/stream.c             |   2 +
>  block/vmdk.c               |  22 +-
>  hw/block/virtio-blk.c      |   2 +-
>  hw/ide/core.c              |   2 +-
>  hw/scsi/scsi-disk.c        |   2 +-
>  hw/scsi/scsi-generic.c     |   2 +-
>  include/block/block.h      |  15 +-
>  include/block/block_int.h  |  27 +-
>  qemu-io-cmds.c             |  42 +++
>  tests/qemu-iotests/077     | 278 +++++++++++++++++++
>  tests/qemu-iotests/077.out | 202 ++++++++++++++
>  tests/qemu-iotests/group   |   1 +
>  util/oslib-posix.c         |   5 +
>  21 files changed, 1234 insertions(+), 255 deletions(-)
>  create mode 100755 tests/qemu-iotests/077
>  create mode 100644 tests/qemu-iotests/077.out
> 

* Re: [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation
  2014-01-22 20:30 ` [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Christian Borntraeger
@ 2014-01-23 10:29   ` Kevin Wolf
  2014-01-23 11:12     ` Christian Borntraeger
  0 siblings, 1 reply; 70+ messages in thread
From: Kevin Wolf @ 2014-01-23 10:29 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On 22.01.2014 at 21:30, Christian Borntraeger wrote:
> On 17/01/14 15:14, Kevin Wolf wrote:
> > This patch series adds code to the block layer that allows performing
> > I/O requests in smaller granularities than required by the host backend
> > (most importantly, O_DIRECT restrictions). It achieves this for reads
> > by rounding the request to host-side block boundary, and for writes by
> > performing a read-modify-write cycle (and serialising requests
> > touching the same block so that the RMW doesn't write back stale data).
> 
> Nice, this might really help on s390 (also for KVM) since dasd disks usually
> have a 4k sector size. We also have flash systems with 4k block size. Both
> disk systems cause lots of trouble with cache=none and friends.

Can't you configure guests to use 4k sector size as well? With these
patches, it should work, but there is an obvious performance penalty, so
you want the guest to make as little use of it as possible.

For PCs, it's essentially just the boot process that needs 512 byte
accesses because that's the BIOS interface, and the OS generally won't
send misaligned requests later.
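
To make the cost of such an access concrete, here is a minimal sketch of the
rounding described in the cover letter. It is not the code from the series:
align_request() and struct aligned_req are hypothetical names used only for
illustration, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type, not part of QEMU: a request widened to the alignment. */
struct aligned_req {
    uint64_t offset;     /* start, rounded down to the alignment */
    uint64_t bytes;      /* total length of the aligned request */
    uint64_t head_bytes; /* padding in front of the guest data */
    uint64_t tail_bytes; /* padding behind the guest data */
};

/* Widen [offset, offset + bytes) so that both ends are multiples of align. */
static struct aligned_req align_request(uint64_t offset, uint64_t bytes,
                                        uint64_t align)
{
    struct aligned_req r;
    uint64_t end = offset + bytes;
    uint64_t aligned_end;

    /* The mask arithmetic below only works for powers of two. */
    assert(align != 0 && (align & (align - 1)) == 0);

    r.offset = offset & ~(align - 1);
    aligned_end = (end + align - 1) & ~(align - 1);

    r.head_bytes = offset - r.offset;
    r.tail_bytes = aligned_end - end;
    r.bytes = aligned_end - r.offset;
    return r;
}
```

For a 512 byte write at offset 512 with a 4k alignment, this yields a single
4k request with 512 bytes of head and 3072 bytes of tail. For a write, those
head and tail bytes are what the RMW cycle has to read first.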

What's the situation on s390?

> Do you have a tree with these patches, so that I can test those on s390?

Sure, and any testing is appreciated:

    git://repo.or.cz/qemu/kevin.git align

Kevin

* Re: [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation
  2014-01-23 10:29   ` Kevin Wolf
@ 2014-01-23 11:12     ` Christian Borntraeger
  0 siblings, 0 replies; 70+ messages in thread
From: Christian Borntraeger @ 2014-01-23 11:12 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On 23/01/14 11:29, Kevin Wolf wrote:
> On 22.01.2014 at 21:30, Christian Borntraeger wrote:
>> On 17/01/14 15:14, Kevin Wolf wrote:
>>> This patch series adds code to the block layer that allows performing
>>> I/O requests in smaller granularities than required by the host backend
>>> (most importantly, O_DIRECT restrictions). It achieves this for reads
>>> by rounding the request to host-side block boundary, and for writes by
>>> performing a read-modify-write cycle (and serialising requests
>>> touching the same block so that the RMW doesn't write back stale data).
>>
>> Nice, this might really help on s390 (also for KVM) since dasd disks usually
>> have a 4k sector size. We also have flash systems with 4k block size. Both
>> disk systems cause lots of trouble with cache=none and friends.
> 
> Can't you configure guests to use 4k sector size as well? With these
> patches, it should work, but there is an obvious performance penalty, so
> you want the guest to make as little use of it as possible.

Yes, that's what we do now, but it requires a command line parameter.
About a year ago we posted patches for block size detection (as well as
geometry pass-through), but the necessary rework after feedback is still
pending. It got lost among a bunch of other to-dos, but we will restart it.

> For PCs, it's essentially just the boot process that needs 512 byte
> accesses because that's the BIOS interface, and the OS generally won't
> send misaligned requests later.
> 
> What's the situation on s390?

Yes, it's mostly the boot process. But there is also a problem if you have
an image file on a file system that is on a 4k disk. Then the guest might
send non-aligned requests, IIRC.
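
By "non-aligned" I mean something like the following check failing. This is
only a sketch for illustration: request_is_aligned() is a made-up helper, not
a function from the series, and it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the series: returns true if a request
 * starts and ends on a multiple of 'align' (which must be a power of two).
 */
static bool request_is_aligned(uint64_t offset, uint64_t bytes, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Any low bit set in either value means one end is off the boundary. */
    return ((offset | bytes) & (align - 1)) == 0;
}
```

With align = 4096, a 512 byte request at offset 512 is reported as unaligned,
while the same request passes with align = 512.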
> 
>> Do you have a tree with these patches, so that I can test those on s390?
> 
> Sure, and any testing is appreciated:
> 
>     git://repo.or.cz/qemu/kevin.git align
> 
> Kevin
> 

* Re: [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev Kevin Wolf
  2014-01-17 23:42   ` Max Reitz
@ 2014-01-24 16:09   ` Benoît Canet
  2014-01-24 16:18     ` Kevin Wolf
  1 sibling, 1 reply; 70+ messages in thread
From: Benoît Canet @ 2014-01-24 16:09 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:11 (+0100), Kevin Wolf wrote:
> If a request calls wait_serialising_requests() and actually has to wait
> in this function (i.e. a coroutine yield), other requests can run and
> previously read data (like the head or tail buffer) could become
> outdated. In this case, we would have to restart from the beginning to
> read in the updated data.
> 
> However, we're lucky and don't actually need to do that: A request can
> only wait in the first call of wait_serialising_requests() because we
> mark it as serialising before that call, so any later requests would
> wait. So as we don't wait in practice, we don't have to reload the data.
> 
> This is an important assumption that may not be broken or data
> corruption will happen. Document it with some assertions.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 859e1aa..53d9bd5 100644
> --- a/block.c
> +++ b/block.c
> @@ -2123,14 +2123,15 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
>      return true;
>  }
>  
> -static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> +static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>  {
>      BlockDriverState *bs = self->bs;
>      BdrvTrackedRequest *req;
>      bool retry;
> +    bool waited = false;
>  
>      if (!bs->serialising_in_flight) {
> -        return;
> +        return false;
>      }
>  
>      do {
> @@ -2156,11 +2157,14 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>                      qemu_co_queue_wait(&req->wait_queue);
>                      self->waiting_for = NULL;
>                      retry = true;
> +                    waited = true;
>                      break;
>                  }
>              }
>          }
>      } while (retry);
> +
> +    return waited;
>  }
>  
>  /*
> @@ -3011,6 +3015,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      QEMUIOVector *qiov, int flags)
>  {
>      BlockDriver *drv = bs->drv;
> +    bool waited;
>      int ret;
>  
>      int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> @@ -3019,7 +3024,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>      assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
>      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>  
> -    wait_serialising_requests(req);
> +    waited = wait_serialising_requests(req);
> +    assert(!waited || !req->serialising);

If we apply De Morgan's law to the expression, we have:

    assert(!(waited && req->serialising));

Is this really the condition we want?

Best regards

Benoît

>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
> @@ -3119,9 +3125,11 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
>          QEMUIOVector tail_qiov;
>          struct iovec tail_iov;
>          size_t tail_bytes;
> +        bool waited;
>  
>          mark_request_serialising(&req, align);
> -        wait_serialising_requests(&req);
> +        waited = wait_serialising_requests(&req);
> +        assert(!waited || !use_local_qiov);
>  
>          tail_buf = qemu_blockalign(bs, align);
>          tail_iov = (struct iovec) {
> -- 
> 1.8.1.4
> 
> 

* Re: [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev
  2014-01-24 16:09   ` Benoît Canet
@ 2014-01-24 16:18     ` Kevin Wolf
  0 siblings, 0 replies; 70+ messages in thread
From: Kevin Wolf @ 2014-01-24 16:18 UTC (permalink / raw)
  To: Benoît Canet; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On 24.01.2014 at 17:09, Benoît Canet wrote:
> On Friday 17 Jan 2014 at 15:15:11 (+0100), Kevin Wolf wrote:
> > If a request calls wait_serialising_requests() and actually has to wait
> > in this function (i.e. a coroutine yield), other requests can run and
> > previously read data (like the head or tail buffer) could become
> > outdated. In this case, we would have to restart from the beginning to
> > read in the updated data.
> > 
> > However, we're lucky and don't actually need to do that: A request can
> > only wait in the first call of wait_serialising_requests() because we
> > mark it as serialising before that call, so any later requests would
> > wait. So as we don't wait in practice, we don't have to reload the data.
> > 
> > This is an important assumption that may not be broken or data
> > corruption will happen. Document it with some assertions.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block.c | 16 ++++++++++++----
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 859e1aa..53d9bd5 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2123,14 +2123,15 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
> >      return true;
> >  }
> >  
> > -static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> > +static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> >  {
> >      BlockDriverState *bs = self->bs;
> >      BdrvTrackedRequest *req;
> >      bool retry;
> > +    bool waited = false;
> >  
> >      if (!bs->serialising_in_flight) {
> > -        return;
> > +        return false;
> >      }
> >  
> >      do {
> > @@ -2156,11 +2157,14 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
> >                      qemu_co_queue_wait(&req->wait_queue);
> >                      self->waiting_for = NULL;
> >                      retry = true;
> > +                    waited = true;
> >                      break;
> >                  }
> >              }
> >          }
> >      } while (retry);
> > +
> > +    return waited;
> >  }
> >  
> >  /*
> > @@ -3011,6 +3015,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> >      QEMUIOVector *qiov, int flags)
> >  {
> >      BlockDriver *drv = bs->drv;
> > +    bool waited;
> >      int ret;
> >  
> >      int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> > @@ -3019,7 +3024,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> >      assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> >      assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> >  
> > -    wait_serialising_requests(req);
> > +    waited = wait_serialising_requests(req);
> > +    assert(!waited || !req->serialising);
> 
> If we apply De Morgan's law to the expression, we have:
> 
>     assert(!(waited && req->serialising));
> 
> Is this really the condition we want?

Yes. If req->serialising is true here, it's an RMW case and we already
had a mark_request_serialising() + wait_serialising_requests() pair. So
we have already waited for earlier requests and newer requests must be
waiting for us. Therefore, it can't happen that a serialising request
has to wait here.
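
As a sanity check on the rewrite itself, the two forms are equivalent. A
small sketch that compares them over all four input combinations (the helper
names are made up for illustration and are not part of block.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Form used in the patch: !waited || !serialising */
static bool patch_form(bool waited, bool serialising)
{
    return !waited || !serialising;
}

/* Form after applying De Morgan's law: !(waited && serialising) */
static bool de_morgan_form(bool waited, bool serialising)
{
    return !(waited && serialising);
}

/* Returns true if both forms agree for every combination of inputs. */
static bool forms_agree(void)
{
    for (int w = 0; w <= 1; w++) {
        for (int s = 0; s <= 1; s++) {
            if (patch_form(w, s) != de_morgan_form(w, s)) {
                return false;
            }
        }
    }
    return true;
}
```

The only combination that makes either form false is waited && serialising,
which is exactly the case the assertion is meant to rule out.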

Kevin

* Re: [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment
  2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment Kevin Wolf
@ 2014-01-24 16:29   ` Benoît Canet
  0 siblings, 0 replies; 70+ messages in thread
From: Benoît Canet @ 2014-01-24 16:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pl, qemu-devel, mreitz, stefanha, pbonzini, xiawenc

On Friday 17 Jan 2014 at 15:15:15 (+0100), Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> The iSCSI backend already gets the block size from the READ CAPACITY
> command it sends.  Save it so that the generic block layer gets it
> too.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/iscsi.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 3202dc5..23c3fc4 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -1217,6 +1217,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>          goto out;
>      }
>      bs->total_sectors = sector_lun2qemu(iscsilun->num_blocks, iscsilun);
> +    bs->request_alignment = iscsilun->block_size;
>  
>      /* Medium changer or tape. We dont have any emulation for this so this must
>       * be sg ioctl compatible. We force it to be sg, otherwise qemu will try
> -- 
> 1.8.1.4
> 
> 

Reviewed-by: Benoit Canet <benoit@irqsave.net>

Thread overview: 70+ messages
2014-01-17 14:14 [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Kevin Wolf
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 01/29] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
2014-01-17 22:39   ` Benoît Canet
2014-01-20  9:31     ` Kevin Wolf
2014-01-20  9:49       ` Peter Lieven
2014-01-21 12:49   ` Benoît Canet
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 02/29] block: Inherit opt_transfer_length Kevin Wolf
2014-01-17 22:42   ` Benoît Canet
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 03/29] block: Update BlockLimits when they might have changed Kevin Wolf
2014-01-17 22:47   ` Benoît Canet
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 04/29] qemu_memalign: Allow small alignments Kevin Wolf
2014-01-17 22:49   ` Benoît Canet
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 05/29] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 06/29] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 07/29] block: rename buffer_alignment to guest_block_size Kevin Wolf
2014-01-21 12:54   ` Benoît Canet
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 08/29] raw: Probe required direct I/O alignment Kevin Wolf
2014-01-21 13:03   ` Benoît Canet
2014-01-21 13:29     ` Kevin Wolf
2014-01-17 14:14 ` [Qemu-devel] [PATCH v3 09/29] block: Introduce bdrv_aligned_preadv() Kevin Wolf
2014-01-21 13:13   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 10/29] block: Introduce bdrv_co_do_preadv() Kevin Wolf
2014-01-17 23:59   ` Max Reitz
2014-01-21 13:29   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 11/29] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
2014-01-21 13:31   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 12/29] block: write: Handle COR dependency after I/O throttling Kevin Wolf
2014-01-21 13:33   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 13/29] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
2014-01-18  0:00   ` Max Reitz
2014-01-21 13:36   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 14/29] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
2014-01-17 23:19   ` Max Reitz
2014-01-21 13:49   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 15/29] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
2014-01-22 19:46   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 16/29] block: Make zero-after-EOF work with larger alignment Kevin Wolf
2014-01-17 23:21   ` Max Reitz
2014-01-22 19:50   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 17/29] block: Generalise and optimise COR serialisation Kevin Wolf
2014-01-22 20:00   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 18/29] block: Make overlap range for serialisation dynamic Kevin Wolf
2014-01-22 20:15   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 19/29] block: Allow wait_serialising_requests() at any point Kevin Wolf
2014-01-22 20:21   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 20/29] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
2014-01-22 20:29   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 21/29] block: Assert serialisation assumptions in pwritev Kevin Wolf
2014-01-17 23:42   ` Max Reitz
2014-01-24 16:09   ` Benoît Canet
2014-01-24 16:18     ` Kevin Wolf
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 22/29] block: Change coroutine wrapper to byte granularity Kevin Wolf
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 23/29] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 24/29] block: Make bdrv_pwrite() " Kevin Wolf
2014-01-17 23:43   ` Max Reitz
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 25/29] iscsi: Set bs->request_alignment Kevin Wolf
2014-01-24 16:29   ` Benoît Canet
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 26/29] blkdebug: Make required alignment configurable Kevin Wolf
2014-01-17 23:50   ` Max Reitz
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 27/29] qemu-io: New command 'sleep' Kevin Wolf
2014-01-17 23:55   ` Max Reitz
2014-01-20  9:58     ` Kevin Wolf
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 28/29] qemu-iotests: Test pwritev RMW logic Kevin Wolf
2014-01-18 16:01   ` Max Reitz
2014-01-20  9:44     ` Kevin Wolf
2014-01-17 14:15 ` [Qemu-devel] [PATCH v3 29/29] block: Switch bdrv_io_limits_intercept() to byte granularity Kevin Wolf
2014-01-17 23:59   ` Max Reitz
2014-01-22 20:30 ` [Qemu-devel] [PATCH v3 00/29] block: Support for 512b-on-4k emulation Christian Borntraeger
2014-01-23 10:29   ` Kevin Wolf
2014-01-23 11:12     ` Christian Borntraeger
