All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio
@ 2013-12-13 13:22 Kevin Wolf
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
                   ` (24 more replies)
  0 siblings, 25 replies; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This patch series adds code to the block layer that allows performing
I/O requests in smaller granularities than required by the host backend
(most importantly, O_DIRECT restrictions). It achieves this for reads
by rounding the request to host-side block boundary, and for writes by
performing a read-modify-write cycle (and serialising requests
touching the same block so that the RMW doesn't write back stale data).

Originally I intended to reuse a lot of code from Paolo's previous
patch series, however as I tried to integrate pread/pwrite, which
already do a very similar thing (except for considering concurrency),
and because I wanted to implement zero-copy, most of this series ended
up being new code.

Zero-copy is possible in a common case because while XFS defauls to a
4k sector size and therefore 4k on-disk O_DIRECT alignment for 512E
disks, it still only has a 512 byte memory alignment requirement.
(Unfortunately the XFS_IOC_DIOINFO ioctl claims 4k even for memory, but
we know that the value is wrong and can probe it.)


Changes in v1 -> v2:
- Fixed overlap_bytes calculation in mark_request_serialising()
- Fixed wait_serialising_requests() deadlock
- iscsi: Set bs->request_alignment [Peter]
- iscsi: Query block limits only in iscsi_open() when no other request
  are in flight, and in iscsi_refresh_limits() copy the stored values
  into bs->bl [Peter]

Changes in RFC -> v1:
- Moved opt_mem_alignment into BlockLimits [Paolo]
- Changed BlockLimits in turn to work a bit more like the
  .bdrv_opt_mem_align() callback of the RFC; allows updating the
  BlockLimits later when the chain changes or bdrv_reopen() toggles
  O_DIRECT
- Fixed a typo in a commit message [Eric]

Kevin Wolf (21):
  block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  block: Inherit opt_transfer_length
  block: Update BlockLimits when they might have changed
  qemu_memalign: Allow small alignments
  block: Detect unaligned length in bdrv_qiov_is_aligned()
  block: Don't use guest sector size for qemu_blockalign()
  block: Introduce bdrv_aligned_preadv()
  block: Introduce bdrv_co_do_preadv()
  block: Introduce bdrv_aligned_pwritev()
  block: write: Handle COR dependency after I/O throttling
  block: Introduce bdrv_co_do_pwritev()
  block: Switch BdrvTrackedRequest to byte granularity
  block: Allow waiting for overlapping requests between begin/end
  block: Make zero-after-EOF work with larger alignment
  block: Generalise and optimise COR serialisation
  block: Make overlap range for serialisation dynamic
  block: Allow wait_serialising_requests() at any point
  block: Align requests in bdrv_co_do_pwritev()
  block: Change coroutine wrapper to byte granularity
  block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper

Paolo Bonzini (3):
  block: rename buffer_alignment to guest_block_size
  raw: Probe required direct I/O alignment
  iscsi: Set bs->request_alignment

 block.c                   | 634 +++++++++++++++++++++++++++++++---------------
 block/backup.c            |   7 +-
 block/iscsi.c             |  47 ++--
 block/qcow2.c             |  11 +-
 block/qed.c               |  11 +-
 block/raw-posix.c         | 102 ++++++--
 block/raw-win32.c         |  41 +++
 block/stream.c            |   2 +
 block/vmdk.c              |  22 +-
 hw/block/virtio-blk.c     |   2 +-
 hw/ide/core.c             |   2 +-
 hw/scsi/scsi-disk.c       |   2 +-
 hw/scsi/scsi-generic.c    |   2 +-
 include/block/block.h     |   6 +-
 include/block/block_int.h |  27 +-
 util/oslib-posix.c        |   5 +
 16 files changed, 673 insertions(+), 250 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:24   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length Kevin Wolf
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This function separates filling the BlockLimits from bdrv_open(), which
allows it to call it from other operations which may change the limits
(e.g. modifications to the backing file chain or bdrv_reopen)

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 19 +++++++++++++++++++
 block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
 block/qcow2.c             | 11 ++++++++++-
 block/qed.c               | 11 ++++++++++-
 block/vmdk.c              | 22 ++++++++++++++++++----
 include/block/block_int.h |  2 ++
 6 files changed, 88 insertions(+), 23 deletions(-)

diff --git a/block.c b/block.c
index 13f001a..3bbcad4 100644
--- a/block.c
+++ b/block.c
@@ -479,6 +479,19 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
     return ret;
 }
 
+static int bdrv_refresh_limits(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+
+    memset(&bs->bl, 0, sizeof(bs->bl));
+
+    if (drv && drv->bdrv_refresh_limits) {
+        return drv->bdrv_refresh_limits(bs);
+    }
+
+    return 0;
+}
+
 /*
  * Create a uniquely-named empty temporary file.
  * Return 0 upon success, otherwise a negative errno value.
@@ -833,6 +846,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
         goto free_and_fail;
     }
 
+    bdrv_refresh_limits(bs);
+
 #ifndef _WIN32
     if (bs->is_temporary) {
         assert(bs->filename[0] != '\0');
@@ -1018,6 +1033,10 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
     }
     pstrcpy(bs->backing_file, sizeof(bs->backing_file),
             bs->backing_hd->file->filename);
+
+    /* Recalculate the BlockLimits with the backing file */
+    bdrv_refresh_limits(bs);
+
     return 0;
 }
 
diff --git a/block/iscsi.c b/block/iscsi.c
index 829d444..18bc093 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1443,23 +1443,6 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
                sizeof(struct scsi_inquiry_block_limits));
         scsi_free_scsi_task(task);
         task = NULL;
-
-        if (iscsilun->bl.max_unmap < 0xffffffff) {
-            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
-                                                 iscsilun);
-        }
-        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
-                                                   iscsilun);
-
-        if (iscsilun->bl.max_ws_len < 0xffffffff) {
-            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
-                                                      iscsilun);
-        }
-        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
-                                                        iscsilun);
-
-        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
-                                                     iscsilun);
     }
 
 #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
@@ -1504,6 +1487,34 @@ static void iscsi_close(BlockDriverState *bs)
     memset(iscsilun, 0, sizeof(IscsiLun));
 }
 
+static int iscsi_refresh_limits(BlockDriverState *bs)
+{
+    IscsiLun *iscsilun = bs->opaque;
+
+    /* We don't actually refresh here, but just return data queried in
+     * iscsi_open(): iscsi targets don't change their limits. */
+    if (iscsilun->lbp.lbpu || iscsilun->lbp.lbpws) {
+        if (iscsilun->bl.max_unmap < 0xffffffff) {
+            bs->bl.max_discard = sector_lun2qemu(iscsilun->bl.max_unmap,
+                                                 iscsilun);
+        }
+        bs->bl.discard_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
+                                                   iscsilun);
+
+        if (iscsilun->bl.max_ws_len < 0xffffffff) {
+            bs->bl.max_write_zeroes = sector_lun2qemu(iscsilun->bl.max_ws_len,
+                                                      iscsilun);
+        }
+        bs->bl.write_zeroes_alignment = sector_lun2qemu(iscsilun->bl.opt_unmap_gran,
+                                                        iscsilun);
+
+        bs->bl.opt_transfer_length = sector_lun2qemu(iscsilun->bl.opt_xfer_len,
+                                                     iscsilun);
+    }
+
+    return 0;
+}
+
 static int iscsi_truncate(BlockDriverState *bs, int64_t offset)
 {
     IscsiLun *iscsilun = bs->opaque;
@@ -1616,6 +1627,7 @@ static BlockDriver bdrv_iscsi = {
     .bdrv_getlength  = iscsi_getlength,
     .bdrv_get_info   = iscsi_get_info,
     .bdrv_truncate   = iscsi_truncate,
+    .bdrv_refresh_limits = iscsi_refresh_limits,
 
 #if defined(LIBISCSI_FEATURE_IOVECTOR)
     .bdrv_co_get_block_status = iscsi_co_get_block_status,
diff --git a/block/qcow2.c b/block/qcow2.c
index f29aa88..f40355c 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -718,7 +718,6 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     qemu_opts_del(opts);
-    bs->bl.write_zeroes_alignment = s->cluster_sectors;
 
     if (s->use_lazy_refcounts && s->qcow_version < 3) {
         error_setg(errp, "Lazy refcounts require a qcow2 image with at least "
@@ -751,6 +750,15 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     return ret;
 }
 
+static int qcow2_refresh_limits(BlockDriverState *bs)
+{
+    BDRVQcowState *s = bs->opaque;
+
+    bs->bl.write_zeroes_alignment = s->cluster_sectors;
+
+    return 0;
+}
+
 static int qcow2_set_key(BlockDriverState *bs, const char *key)
 {
     BDRVQcowState *s = bs->opaque;
@@ -2268,6 +2276,7 @@ static BlockDriver bdrv_qcow2 = {
 
     .bdrv_change_backing_file   = qcow2_change_backing_file,
 
+    .bdrv_refresh_limits        = qcow2_refresh_limits,
     .bdrv_invalidate_cache      = qcow2_invalidate_cache,
 
     .create_options = qcow2_create_options,
diff --git a/block/qed.c b/block/qed.c
index 450a1fa..d35bd1c 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -495,7 +495,6 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
-    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
     s->need_check_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                             qed_need_check_timer_cb, s);
 
@@ -507,6 +506,15 @@ out:
     return ret;
 }
 
+static int bdrv_qed_refresh_limits(BlockDriverState *bs)
+{
+    BDRVQEDState *s = bs->opaque;
+
+    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
+
+    return 0;
+}
+
 /* We have nothing to do for QED reopen, stubs just return
  * success */
 static int bdrv_qed_reopen_prepare(BDRVReopenState *state,
@@ -1616,6 +1624,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
     .bdrv_get_info            = bdrv_qed_get_info,
+    .bdrv_refresh_limits      = bdrv_qed_refresh_limits,
     .bdrv_change_backing_file = bdrv_qed_change_backing_file,
     .bdrv_invalidate_cache    = bdrv_qed_invalidate_cache,
     .bdrv_check               = bdrv_qed_check,
diff --git a/block/vmdk.c b/block/vmdk.c
index 0734bc2..f8a387f 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -428,10 +428,6 @@ static int vmdk_add_extent(BlockDriverState *bs,
     extent->l2_size = l2_size;
     extent->cluster_sectors = flat ? sectors : cluster_sectors;
 
-    if (!flat) {
-        bs->bl.write_zeroes_alignment =
-            MAX(bs->bl.write_zeroes_alignment, cluster_sectors);
-    }
     if (s->num_extents > 1) {
         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
     } else {
@@ -886,6 +882,23 @@ fail:
     return ret;
 }
 
+
+static int vmdk_refresh_limits(BlockDriverState *bs)
+{
+    BDRVVmdkState *s = bs->opaque;
+    int i;
+
+    for (i = 0; i < s->num_extents; i++) {
+        if (!s->extents[i].flat) {
+            bs->bl.write_zeroes_alignment =
+                MAX(bs->bl.write_zeroes_alignment,
+                    s->extents[i].cluster_sectors);
+        }
+    }
+
+    return 0;
+}
+
 static int get_whole_cluster(BlockDriverState *bs,
                 VmdkExtent *extent,
                 uint64_t cluster_offset,
@@ -1971,6 +1984,7 @@ static BlockDriver bdrv_vmdk = {
     .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
     .bdrv_has_zero_init           = vmdk_has_zero_init,
     .bdrv_get_specific_info       = vmdk_get_specific_info,
+    .bdrv_refresh_limits          = vmdk_refresh_limits,
 
     .create_options               = vmdk_create_options,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 8b132d7..c49fa6b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -226,6 +226,8 @@ struct BlockDriver {
     int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
     bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
 
+    int (*bdrv_refresh_limits)(BlockDriverState *bs);
+
     /*
      * Returns 1 if newly created images are guaranteed to contain only
      * zeros, 0 otherwise.
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:24   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed Kevin Wolf
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

When there is a format driver between the backend, it's not guaranteed
that exposing the opt_transfer_length for the format driver results in
the optimal requests (because of fragmentation etc.), but it can't make
things worse, so let's just do it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 block.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 3bbcad4..79a325f 100644
--- a/block.c
+++ b/block.c
@@ -485,7 +485,25 @@ static int bdrv_refresh_limits(BlockDriverState *bs)
 
     memset(&bs->bl, 0, sizeof(bs->bl));
 
-    if (drv && drv->bdrv_refresh_limits) {
+    if (!drv) {
+        return 0;
+    }
+
+    /* Take some limits from the children as a default */
+    if (bs->file) {
+        bdrv_refresh_limits(bs->file);
+        bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
+    }
+
+    if (bs->backing_hd) {
+        bdrv_refresh_limits(bs->backing_hd);
+        bs->bl.opt_transfer_length =
+            MAX(bs->bl.opt_transfer_length,
+                bs->backing_hd->bl.opt_transfer_length);
+    }
+
+    /* Then let the driver override it */
+    if (drv->bdrv_refresh_limits) {
         return drv->bdrv_refresh_limits(bs);
     }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:24   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments Kevin Wolf
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

When reopening with different flags, or when backing files disappear
from the chain, the limits may change. Make sure they get updated in
these cases.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 block.c               | 5 ++++-
 block/stream.c        | 2 ++
 include/block/block.h | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 79a325f..8eae359 100644
--- a/block.c
+++ b/block.c
@@ -479,7 +479,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options,
     return ret;
 }
 
-static int bdrv_refresh_limits(BlockDriverState *bs)
+int bdrv_refresh_limits(BlockDriverState *bs)
 {
     BlockDriver *drv = bs->drv;
 
@@ -1464,6 +1464,8 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
     reopen_state->bs->enable_write_cache = !!(reopen_state->flags &
                                               BDRV_O_CACHE_WB);
     reopen_state->bs->read_only = !(reopen_state->flags & BDRV_O_RDWR);
+
+    bdrv_refresh_limits(reopen_state->bs);
 }
 
 /*
@@ -2261,6 +2263,7 @@ int bdrv_drop_intermediate(BlockDriverState *active, BlockDriverState *top,
     }
     new_top_bs->backing_hd = base_bs;
 
+    bdrv_refresh_limits(new_top_bs);
 
     QSIMPLEQ_FOREACH_SAFE(intermediate_state, &states_to_delete, entry, next) {
         /* so that bdrv_close() does not recursively close the chain */
diff --git a/block/stream.c b/block/stream.c
index 46bec7d..dd0b4ac 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -75,6 +75,8 @@ static void close_unused_images(BlockDriverState *top, BlockDriverState *base,
         unused->backing_hd = NULL;
         bdrv_unref(unused);
     }
+
+    bdrv_refresh_limits(top);
 }
 
 static void coroutine_fn stream_run(void *opaque)
diff --git a/include/block/block.h b/include/block/block.h
index 36efaea..64fb319 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -249,6 +249,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset);
 int64_t bdrv_getlength(BlockDriverState *bs);
 int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
 void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
+int bdrv_refresh_limits(BlockDriverState *bs);
 int bdrv_commit(BlockDriverState *bs);
 int bdrv_commit_all(void);
 int bdrv_change_backing_file(BlockDriverState *bs,
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (2 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:25   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

The functions used by qemu_memalign() require an alignment that is at
least sizeof(void*). Adjust it if it is too small.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 util/oslib-posix.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e00a44c..54f8932 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -85,6 +85,11 @@ void *qemu_oom_check(void *ptr)
 void *qemu_memalign(size_t alignment, size_t size)
 {
     void *ptr;
+
+    if (alignment < sizeof(void*)) {
+        alignment = sizeof(void*);
+    }
+
 #if defined(_POSIX_C_SOURCE) && !defined(__sun__)
     int ret;
     ret = posix_memalign(&ptr, alignment, size);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (3 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:25   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

For an O_DIRECT request to succeed, it's not only necessary that all
base addresses in the qiov are aligned, but also that each length in it
is aligned.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 block.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block.c b/block.c
index 8eae359..62c5a75 100644
--- a/block.c
+++ b/block.c
@@ -4561,6 +4561,9 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
         if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
             return false;
         }
+        if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+            return false;
+        }
     }
 
     return true;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (4 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:25   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size Kevin Wolf
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

bs->buffer_alignment is set by the device emulation and contains the
logical block size of the guest device. This isn't something that the
block layer should know, and even less something to use for determining
the right alignment of buffers to be used for the host.

The new BlockLimits field opt_mem_alignment tells the qemu block layer
the optimal alignment to be used so that no bounce buffer must be used
in the driver.

This patch may change the buffer alignment from 4k to 512 for all
callers that used qemu_blockalign() with the top-level image format
BlockDriverState. The value was never propagated to other levels in the
tree, so in particular raw-posix never required anything else than 512.

While on disks with 4k sectors direct I/O requires a 4k alignment,
memory may still be okay when aligned to 512 byte boundaries. This is
what must have happened in practice, because otherwise this would
already have failed earlier. Therefore I don't expect regressions even
with this intermediate state. Later, raw-posix can implement the hook
and expose a different memory alignment requirement.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 block.c                   | 23 ++++++++++++++++++++---
 include/block/block.h     |  3 +++
 include/block/block_int.h |  3 +++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 62c5a75..d50a4e4 100644
--- a/block.c
+++ b/block.c
@@ -214,6 +214,16 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
     qemu_co_queue_next(&bs->throttled_reqs[is_write]);
 }
 
+size_t bdrv_opt_mem_align(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        /* 4k should be on the safe side */
+        return 4096;
+    }
+
+    return bs->bl.opt_mem_alignment;
+}
+
 /* check if the path starts with "<protocol>:" */
 static int path_has_protocol(const char *path)
 {
@@ -493,6 +503,9 @@ int bdrv_refresh_limits(BlockDriverState *bs)
     if (bs->file) {
         bdrv_refresh_limits(bs->file);
         bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
+        bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
+    } else {
+        bs->bl.opt_mem_alignment = 512;
     }
 
     if (bs->backing_hd) {
@@ -500,6 +513,9 @@ int bdrv_refresh_limits(BlockDriverState *bs)
         bs->bl.opt_transfer_length =
             MAX(bs->bl.opt_transfer_length,
                 bs->backing_hd->bl.opt_transfer_length);
+        bs->bl.opt_mem_alignment =
+            MAX(bs->bl.opt_mem_alignment,
+                bs->backing_hd->bl.opt_mem_alignment);
     }
 
     /* Then let the driver override it */
@@ -4547,7 +4563,7 @@ void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
-    return qemu_memalign((bs && bs->buffer_alignment) ? bs->buffer_alignment : 512, size);
+    return qemu_memalign(bdrv_opt_mem_align(bs), size);
 }
 
 /*
@@ -4556,12 +4572,13 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size)
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
 {
     int i;
+    size_t alignment = bdrv_opt_mem_align(bs);
 
     for (i = 0; i < qiov->niov; i++) {
-        if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
+        if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
             return false;
         }
-        if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+        if (qiov->iov[i].iov_len % alignment) {
             return false;
         }
     }
diff --git a/include/block/block.h b/include/block/block.h
index 64fb319..cf63ee2 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -419,6 +419,9 @@ void bdrv_img_create(const char *filename, const char *fmt,
                      char *options, uint64_t img_size, int flags,
                      Error **errp, bool quiet);
 
+/* Returns the alignment in bytes that is required so that no bounce buffer
+ * is required throughout the stack */
+size_t bdrv_opt_mem_align(BlockDriverState *bs);
 void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index c49fa6b..6ffae7c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -252,6 +252,9 @@ typedef struct BlockLimits {
 
     /* optimal transfer length in sectors */
     int opt_transfer_length;
+
+    /* memory alignment so that no bounce buffer is needed */
+    size_t opt_mem_alignment;
 } BlockLimits;
 
 /*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (5 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:26   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment Kevin Wolf
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

From: Paolo Bonzini <pbonzini@redhat.com>

The alignment field is now set to the value that is promised to the
guest, rather than required by the host.  The next patches will make
QEMU aware of the host-provided values, so make this clear.

The alignment is also not about memory buffers, but about the sectors on
the disk, change the documentation of the field.

At this point, the field is set by the device emulation, but completely
ignored by the block layer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
---
 block.c                   | 10 +++++-----
 hw/block/virtio-blk.c     |  2 +-
 hw/ide/core.c             |  2 +-
 hw/scsi/scsi-disk.c       |  2 +-
 hw/scsi/scsi-generic.c    |  2 +-
 include/block/block.h     |  2 +-
 include/block/block_int.h |  4 ++--
 7 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index d50a4e4..3504d17 100644
--- a/block.c
+++ b/block.c
@@ -812,7 +812,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
     }
 
     bs->open_flags = flags;
-    bs->buffer_alignment = 512;
+    bs->guest_block_size = 512;
     bs->zero_beyond_eof = true;
     open_flags = bdrv_open_flags(bs, flags);
     bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -1648,7 +1648,7 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->dev_ops            = bs_src->dev_ops;
     bs_dest->dev_opaque         = bs_src->dev_opaque;
     bs_dest->dev                = bs_src->dev;
-    bs_dest->buffer_alignment   = bs_src->buffer_alignment;
+    bs_dest->guest_block_size   = bs_src->guest_block_size;
     bs_dest->copy_on_read       = bs_src->copy_on_read;
 
     bs_dest->enable_write_cache = bs_src->enable_write_cache;
@@ -1800,7 +1800,7 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev)
     bs->dev = NULL;
     bs->dev_ops = NULL;
     bs->dev_opaque = NULL;
-    bs->buffer_alignment = 512;
+    bs->guest_block_size = 512;
 }
 
 /* TODO change to return DeviceState * when all users are qdevified */
@@ -4556,9 +4556,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
     return NULL;
 }
 
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
 {
-    bs->buffer_alignment = align;
+    bs->guest_block_size = align;
 }
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 13f6d82..323e9ec 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -720,7 +720,7 @@ static int virtio_blk_device_init(VirtIODevice *vdev)
     register_savevm(qdev, "virtio-blk", virtio_blk_id++, 2,
                     virtio_blk_save, virtio_blk_load, s);
     bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
-    bdrv_set_buffer_alignment(s->bs, s->conf->logical_block_size);
+    bdrv_set_guest_block_size(s->bs, s->conf->logical_block_size);
 
     bdrv_iostatus_enable(s->bs);
 
diff --git a/hw/ide/core.c b/hw/ide/core.c
index e1f4c33..036cd4a 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2103,7 +2103,7 @@ int ide_init_drive(IDEState *s, BlockDriverState *bs, IDEDriveKind kind,
     s->smart_selftest_count = 0;
     if (kind == IDE_CD) {
         bdrv_set_dev_ops(bs, &ide_cd_block_ops, s);
-        bdrv_set_buffer_alignment(bs, 2048);
+        bdrv_set_guest_block_size(bs, 2048);
     } else {
         if (!bdrv_is_inserted(s->bs)) {
             error_report("Device needs media, but drive is empty");
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index efadfc0..6cf6040 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2225,7 +2225,7 @@ static int scsi_initfn(SCSIDevice *dev)
     } else {
         bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s);
     }
-    bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize);
+    bdrv_set_guest_block_size(s->qdev.conf.bs, s->qdev.blocksize);
 
     bdrv_iostatus_enable(s->qdev.conf.bs);
     add_boot_device_path(s->qdev.conf.bootindex, &dev->qdev, NULL);
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 8f195be..f08b64e 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -210,7 +210,7 @@ static void scsi_read_complete(void * opaque, int ret)
             s->blocksize = ldl_be_p(&r->buf[8]);
             s->max_lba = ldq_be_p(&r->buf[0]);
         }
-        bdrv_set_buffer_alignment(s->conf.bs, s->blocksize);
+        bdrv_set_guest_block_size(s->conf.bs, s->blocksize);
 
         scsi_req_data(&r->req, len);
         if (!r->req.io_canceled) {
diff --git a/include/block/block.h b/include/block/block.h
index cf63ee2..05252d5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -422,7 +422,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
 /* Returns the alignment in bytes that is required so that no bounce buffer
  * is required throughout the stack */
 size_t bdrv_opt_mem_align(BlockDriverState *bs);
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 6ffae7c..9921cd3 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,8 +319,8 @@ struct BlockDriverState {
     /* Whether produces zeros when read beyond eof */
     bool zero_beyond_eof;
 
-    /* the memory alignment required for the buffers handled by this driver */
-    int buffer_alignment;
+    /* the block size for which the guest device expects atomicity */
+    int guest_block_size;
 
     /* do we need to tell the quest if we have a volatile write cache? */
     int enable_write_cache;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (6 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2013-12-24  3:39   ` Wenchao Xia
  2014-01-11 22:26   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv() Kevin Wolf
                   ` (16 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

From: Paolo Bonzini <pbonzini@redhat.com>

Add a bs->request_alignment field that contains the required
offset/length alignment for I/O requests and fill it in the raw block
drivers. Use ioctls if possible, else see what alignment it takes for
O_DIRECT to succeed.

While at it, also expose the memory alignment requirements, which may be
(and in practice are) different from the disk alignment requirements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   |   3 ++
 block/raw-posix.c         | 102 ++++++++++++++++++++++++++++++++++++++--------
 block/raw-win32.c         |  41 +++++++++++++++++++
 include/block/block_int.h |   3 ++
 4 files changed, 132 insertions(+), 17 deletions(-)

diff --git a/block.c b/block.c
index 3504d17..b86e754 100644
--- a/block.c
+++ b/block.c
@@ -813,6 +813,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
 
     bs->open_flags = flags;
     bs->guest_block_size = 512;
+    bs->request_alignment = 512;
     bs->zero_beyond_eof = true;
     open_flags = bdrv_open_flags(bs, flags);
     bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -881,6 +882,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
     }
 
     bdrv_refresh_limits(bs);
+    assert(bdrv_opt_mem_align(bs) != 0);
+    assert(bs->request_alignment != 0);
 
 #ifndef _WIN32
     if (bs->is_temporary) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 10c6b34..e8e75a7 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -127,6 +127,8 @@ typedef struct BDRVRawState {
     int fd;
     int type;
     int open_flags;
+    size_t buf_align;
+
 #if defined(__linux__)
     /* linux floppy specific */
     int64_t fd_open_time;
@@ -213,6 +215,76 @@ static int raw_normalize_devicepath(const char **filename)
 }
 #endif
 
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+    unsigned int sector_size;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !(s->open_flags & O_DIRECT)) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    /* Try a few ioctls to get the right size */
+    bs->request_alignment = 0;
+    s->buf_align = 0;
+
+#ifdef BLKSSZGET
+    if (ioctl(s->fd, BLKSSZGET, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef DKIOCGETBLOCKSIZE
+    if (ioctl(s->fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef DIOCGSECTORSIZE
+    if (ioctl(s->fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
+        bs->request_alignment = sector_size;
+    }
+#endif
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        struct dioattr da;
+        if (xfsctl(NULL, s->fd, XFS_IOC_DIOINFO, &da) >= 0) {
+            bs->request_alignment = da.d_miniosz;
+            /* The kernel returns wrong information for d_mem */
+            /* s->buf_align = da.d_mem; */
+        }
+    }
+#endif
+
+    /* If we could not get the sizes so far, we can only guess them */
+    if (!s->buf_align) {
+        size_t align;
+        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
+        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+            if (pread(s->fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
+                s->buf_align = align;
+                break;
+            }
+        }
+        qemu_vfree(buf);
+    }
+
+    if (!bs->request_alignment) {
+        size_t align;
+        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
+        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+            if (pread(s->fd, buf, align, 0) >= 0) {
+                bs->request_alignment = align;
+                break;
+            }
+        }
+        qemu_vfree(buf);
+    }
+}
+
 static void raw_parse_flags(int bdrv_flags, int *open_flags)
 {
     assert(open_flags != NULL);
@@ -463,7 +535,6 @@ static int raw_reopen_prepare(BDRVReopenState *state,
     return ret;
 }
 
-
 static void raw_reopen_commit(BDRVReopenState *state)
 {
     BDRVRawReopenState *raw_s = state->opaque;
@@ -499,23 +570,15 @@ static void raw_reopen_abort(BDRVReopenState *state)
     state->opaque = NULL;
 }
 
+static int raw_refresh_limits(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
 
-/* XXX: use host sector size if necessary with:
-#ifdef DIOCGSECTORSIZE
-        {
-            unsigned int sectorsize = 512;
-            if (!ioctl(fd, DIOCGSECTORSIZE, &sectorsize) &&
-                sectorsize > bufsize)
-                bufsize = sectorsize;
-        }
-#endif
-#ifdef CONFIG_COCOA
-        uint32_t blockSize = 512;
-        if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
-            bufsize = blockSize;
-        }
-#endif
-*/
+    raw_probe_alignment(bs);
+    bs->bl.opt_mem_alignment = s->buf_align;
+
+    return 0;
+}
 
 static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
 {
@@ -1363,6 +1426,7 @@ static BlockDriver bdrv_file = {
     .bdrv_aio_writev = raw_aio_writev,
     .bdrv_aio_flush = raw_aio_flush,
     .bdrv_aio_discard = raw_aio_discard,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
@@ -1740,6 +1804,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_aio_writev	= raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_aio_discard   = hdev_aio_discard,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength	= raw_getlength,
@@ -1871,6 +1936,7 @@ static BlockDriver bdrv_host_floppy = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
@@ -1981,6 +2047,7 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
@@ -2110,6 +2177,7 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush	= raw_aio_flush,
+    .bdrv_refresh_limits = raw_refresh_limits,
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength      = raw_getlength,
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 2bad5a3..6fe64d2 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -202,6 +202,35 @@ static int set_sparse(int fd)
 				 NULL, 0, NULL, 0, &returned, NULL);
 }
 
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    DWORD sectorsPerCluster, freeClusters, totalClusters, count;
+    DISK_GEOMETRY_EX dg;
+    BOOL status;
+
+    if (s->type == FTYPE_CD) {
+        bs->request_alignment = 2048;
+        return;
+    }
+    if (s->type == FTYPE_HARDDISK) {
+        status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
+                                 NULL, 0, &dg, sizeof(dg), &count, NULL);
+        if (status != 0) {
+            bs->request_alignment = dg.Geometry.BytesPerSector;
+            return;
+        }
+        /* try GetDiskFreeSpace too */
+    }
+
+    if (s->drive_path[0]) {
+        GetDiskFreeSpace(s->drive_path, &sectorsPerCluster,
+                         &dg.Geometry.BytesPerSector,
+                         &freeClusters, &totalClusters);
+        bs->request_alignment = dg.Geometry.BytesPerSector;
+    }
+}
+
 static void raw_parse_flags(int flags, int *access_flags, DWORD *overlapped)
 {
     assert(access_flags != NULL);
@@ -269,6 +298,17 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    if (filename[0] && filename[1] == ':') {
+        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
+    } else if (filename[0] == '\\' && filename[1] == '\\') {
+        s->drive_path[0] = 0;
+    } else {
+        /* Relative path.  */
+        char buf[MAX_PATH];
+        GetCurrentDirectory(MAX_PATH, buf);
+        snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
+    }
+
     s->hfile = CreateFile(filename, access_flags,
                           FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, overlapped, NULL);
@@ -293,6 +333,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
         s->aio = aio;
     }
 
+    raw_probe_alignment(bs);
     ret = 0;
 fail:
     qemu_opts_del(opts);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9921cd3..0a01b69 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,6 +319,9 @@ struct BlockDriverState {
     /* Whether produces zeros when read beyond eof */
     bool zero_beyond_eof;
 
+    /* Alignment requirement for offset/length of I/O requests */
+    unsigned int request_alignment;
+
     /* the block size for which the guest device expects atomicity */
     int guest_block_size;
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (7 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2013-12-24  7:23   ` Wenchao Xia
  2014-01-11 22:27   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv() Kevin Wolf
                   ` (15 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This separates the part of bdrv_co_do_readv() that needs to happen
before the request is modified to match the backend alignment, and a
part that needs to be executed afterwards and passes the request to the
BlockDriver.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 61 +++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index b86e754..cbc9f6d 100644
--- a/block.c
+++ b/block.c
@@ -2700,26 +2700,24 @@ err:
 }
 
 /*
- * Handle a read request in coroutine context
+ * Forwards an already correctly aligned request to the BlockDriver. This
+ * handles copy on read and zeroing after EOF; any other features must be
+ * implemented by the caller.
  */
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
-    BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     BdrvTrackedRequest req;
     int ret;
 
-    if (!drv) {
-        return -ENOMEDIUM;
-    }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
-        return -EIO;
-    }
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
 
-    if (bs->copy_on_read) {
-        flags |= BDRV_REQ_COPY_ON_READ;
-    }
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
         bs->copy_on_read_in_flight++;
     }
@@ -2728,11 +2726,6 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
     }
 
-    /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, false);
-    }
-
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -2749,6 +2742,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         }
     }
 
+    /* Forward the request to the BlockDriver */
     if (!(bs->zero_beyond_eof && bs->growable)) {
         ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
     } else {
@@ -2789,6 +2783,37 @@ out:
     return ret;
 }
 
+/*
+ * Handle a read request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    BlockDriver *drv = bs->drv;
+    int ret;
+
+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+        return -EIO;
+    }
+
+    if (bs->copy_on_read) {
+        flags |= BDRV_REQ_COPY_ON_READ;
+    }
+
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_intercept(bs, nb_sectors, false);
+    }
+
+    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+    return ret;
+}
+
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (8 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2013-12-26  4:08   ` Wenchao Xia
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Similar to bdrv_pread(), which aligns byte-aligned request to 512 byte
sectors, bdrv_co_do_preadv() takes a byte-aligned request and aligns it
to the alignment specified in bs->request_alignment.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index cbc9f6d..7812d8f 100644
--- a/block.c
+++ b/block.c
@@ -2786,17 +2786,23 @@ out:
 /*
  * Handle a read request in coroutine context
  */
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
+    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+    uint8_t *head_buf = NULL;
+    uint8_t *tail_buf = NULL;
+    QEMUIOVector local_qiov;
+    bool use_local_qiov = false;
     int ret;
 
     if (!drv) {
         return -ENOMEDIUM;
     }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+    if (bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
@@ -2806,14 +2812,59 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, false);
+        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
+    }
+
+    /* Align read if necessary by padding qiov */
+    if (offset & (align - 1)) {
+        head_buf = qemu_blockalign(bs, align);
+        qemu_iovec_init(&local_qiov, qiov->niov + 2);
+        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+        use_local_qiov = true;
+
+        bytes += offset & (align - 1);
+        offset = offset & ~(align - 1);
+    }
+
+    if ((offset + bytes) & (align - 1)) {
+        if (!use_local_qiov) {
+            qemu_iovec_init(&local_qiov, qiov->niov + 1);
+            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+            use_local_qiov = true;
+        }
+        tail_buf = qemu_blockalign(bs, align);
+        qemu_iovec_add(&local_qiov, tail_buf,
+                       align - ((offset + bytes) & (align - 1)));
+
+        bytes = ROUND_UP(bytes, align);
+    }
+
+    ret = bdrv_aligned_preadv(bs, offset, bytes,
+                              use_local_qiov ? &local_qiov : qiov,
+                              flags);
+
+    if (use_local_qiov) {
+        qemu_iovec_destroy(&local_qiov);
+        qemu_vfree(head_buf);
+        qemu_vfree(tail_buf);
     }
 
-    ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
-                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
     return ret;
 }
 
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
+        return -EINVAL;
+    }
+
+    return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+                             nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (9 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:27   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling Kevin Wolf
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This separates the part of bdrv_co_do_writev() that needs to happen
before the request is modified to match the backend alignment, and a
part that needs to be executed afterwards and passes the request to the
BlockDriver.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 41 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index 7812d8f..e26e31c 100644
--- a/block.c
+++ b/block.c
@@ -2958,34 +2958,20 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 }
 
 /*
- * Handle a write request in coroutine context
+ * Forwards an already correctly aligned write request to the BlockDriver.
  */
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
-    BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     BdrvTrackedRequest req;
     int ret;
 
-    if (!bs->drv) {
-        return -ENOMEDIUM;
-    }
-    if (bs->read_only) {
-        return -EACCES;
-    }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
-        return -EIO;
-    }
-
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
-    }
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
 
-    /* throttling disk I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, true);
-    }
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
@@ -3017,6 +3003,40 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     return ret;
 }
 
+/*
+ * Handle a write request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    int ret;
+
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+    if (bs->read_only) {
+        return -EACCES;
+    }
+    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+        return -EIO;
+    }
+
+    if (bs->copy_on_read_in_flight) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_intercept(bs, nb_sectors, true);
+    }
+
+    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+
+    return ret;
+}
+
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (10 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:27   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

First waiting for all COR requests to complete and calling the
throttling function afterwards means that the request could be delayed
and we still need to wait for the COR request even if it was issued only
after the throttled write request.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index e26e31c..385fb8a 100644
--- a/block.c
+++ b/block.c
@@ -2973,6 +2973,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
+    if (bs->copy_on_read_in_flight) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
@@ -3022,10 +3026,6 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         return -EIO;
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
-    }
-
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
         bdrv_io_limits_intercept(bs, nb_sectors, true);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (11 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-10 18:11   ` Max Reitz
  2014-01-16 12:25   ` Kevin Wolf
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
                   ` (11 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This is going to become the bdrv_co_do_preadv() equivalent for writes.
In this patch, however, just a function taking byte offsets is created,
it doesn't align anything yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 385fb8a..a80db2e 100644
--- a/block.c
+++ b/block.c
@@ -3010,8 +3010,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
 /*
  * Handle a write request in coroutine context
  */
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
     int ret;
@@ -3022,21 +3022,32 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     if (bs->read_only) {
         return -EACCES;
     }
-    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+    if (bdrv_check_byte_request(bs, offset, bytes)) {
         return -EIO;
     }
 
     /* throttling disk I/O */
     if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, nb_sectors, true);
+        bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
     }
 
-    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
-                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
 
     return ret;
 }
 
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    BdrvRequestFlags flags)
+{
+    if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
+        return -EINVAL;
+    }
+
+    return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov)
 {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (12 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-10 18:34   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 52 +++++++++++++++++++++++++++++++----------------
 block/backup.c            |  7 ++++++-
 include/block/block_int.h |  4 ++--
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index a80db2e..fa888d9 100644
--- a/block.c
+++ b/block.c
@@ -2037,13 +2037,13 @@ static void tracked_request_end(BdrvTrackedRequest *req)
  */
 static void tracked_request_begin(BdrvTrackedRequest *req,
                                   BlockDriverState *bs,
-                                  int64_t sector_num,
-                                  int nb_sectors, bool is_write)
+                                  int64_t offset,
+                                  unsigned int bytes, bool is_write)
 {
     *req = (BdrvTrackedRequest){
         .bs = bs,
-        .sector_num = sector_num,
-        .nb_sectors = nb_sectors,
+        .offset = offset,
+        .bytes = bytes,
         .is_write = is_write,
         .co = qemu_coroutine_self(),
     };
@@ -2074,25 +2074,43 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
     }
 }
 
+static void round_bytes_to_clusters(BlockDriverState *bs,
+                                    int64_t offset, unsigned int bytes,
+                                    int64_t *cluster_offset,
+                                    unsigned int *cluster_bytes)
+{
+    BlockDriverInfo bdi;
+
+    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+        *cluster_offset = offset;
+        *cluster_bytes = bytes;
+    } else {
+        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
+        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
+                                       bdi.cluster_size);
+    }
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
-                                     int64_t sector_num, int nb_sectors) {
+                                     int64_t offset, int bytes)
+{
     /*        aaaa   bbbb */
-    if (sector_num >= req->sector_num + req->nb_sectors) {
+    if (offset >= req->offset + req->bytes) {
         return false;
     }
     /* bbbb   aaaa        */
-    if (req->sector_num >= sector_num + nb_sectors) {
+    if (req->offset >= offset + bytes) {
         return false;
     }
     return true;
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors)
+        int64_t offset, unsigned int bytes)
 {
     BdrvTrackedRequest *req;
-    int64_t cluster_sector_num;
-    int cluster_nb_sectors;
+    int64_t cluster_offset;
+    unsigned int cluster_bytes;
     bool retry;
 
     /* If we touch the same cluster it counts as an overlap.  This guarantees
@@ -2101,14 +2119,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
      * CoR read and write operations are atomic and guest writes cannot
      * interleave between them.
      */
-    bdrv_round_to_clusters(bs, sector_num, nb_sectors,
-                           &cluster_sector_num, &cluster_nb_sectors);
+    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
 
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (tracked_request_overlaps(req, cluster_sector_num,
-                                         cluster_nb_sectors)) {
+            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
                  * never happen since it means deadlock.
@@ -2723,10 +2739,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+        wait_for_overlapping_requests(bs, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+    tracked_request_begin(&req, bs, offset, bytes, false);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
@@ -2974,10 +2990,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+        wait_for_overlapping_requests(bs, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+    tracked_request_begin(&req, bs, offset, bytes, true);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
 
diff --git a/block/backup.c b/block/backup.c
index 0198514..15a2e55 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -181,8 +181,13 @@ static int coroutine_fn backup_before_write_notify(
         void *opaque)
 {
     BdrvTrackedRequest *req = opaque;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
 
-    return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    return backup_do_cow(req->bs, sector_num, nb_sectors, NULL);
 }
 
 static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0a01b69..a11e5c9 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -57,8 +57,8 @@
 
 typedef struct BdrvTrackedRequest {
     BlockDriverState *bs;
-    int64_t sector_num;
-    int nb_sectors;
+    int64_t offset;
+    unsigned int bytes;
     bool is_write;
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (13 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:28   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment Kevin Wolf
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Previously, it was not possible to use wait_for_overlapping_requests()
between tracked_request_begin()/end() because it would wait for itself.

Ignore the current request in the overlap check and run more of the
bdrv_co_do_preadv/pwritev code with a BdrvTrackedRequest present.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index fa888d9..b4f6ead 100644
--- a/block.c
+++ b/block.c
@@ -2106,7 +2106,7 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        int64_t offset, unsigned int bytes)
+        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
 {
     BdrvTrackedRequest *req;
     int64_t cluster_offset;
@@ -2124,6 +2124,9 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
+            if (req == self) {
+                continue;
+            }
             if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
@@ -2721,10 +2724,10 @@ err:
  * implemented by the caller.
  */
 static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
-    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+    QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
-    BdrvTrackedRequest req;
     int ret;
 
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2739,11 +2742,9 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, offset, bytes);
+        wait_for_overlapping_requests(bs, req, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, offset, bytes, false);
-
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
 
@@ -2790,8 +2791,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
 out:
-    tracked_request_end(&req);
-
     if (flags & BDRV_REQ_COPY_ON_READ) {
         bs->copy_on_read_in_flight--;
     }
@@ -2807,6 +2806,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
+
     /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
     uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
     uint8_t *head_buf = NULL;
@@ -2856,9 +2857,11 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
         bytes = ROUND_UP(bytes, align);
     }
 
-    ret = bdrv_aligned_preadv(bs, offset, bytes,
+    tracked_request_begin(&req, bs, offset, bytes, false);
+    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
+    tracked_request_end(&req);
 
     if (use_local_qiov) {
         qemu_iovec_destroy(&local_qiov);
@@ -2977,10 +2980,10 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
  * Forwards an already correctly aligned write request to the BlockDriver.
  */
 static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
-    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+    BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+    QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
-    BdrvTrackedRequest req;
     int ret;
 
     int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2990,12 +2993,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, offset, bytes);
+        wait_for_overlapping_requests(bs, req, offset, bytes);
     }
 
-    tracked_request_begin(&req, bs, offset, bytes, true);
-
-    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
+    ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
     if (ret < 0) {
         /* Do nothing, write notifier decided to fail this request */
@@ -3018,8 +3019,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
         bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
     }
 
-    tracked_request_end(&req);
-
     return ret;
 }
 
@@ -3030,6 +3029,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
 {
+    BdrvTrackedRequest req;
     int ret;
 
     if (!bs->drv) {
@@ -3047,7 +3047,9 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
     }
 
-    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
+    tracked_request_begin(&req, bs, offset, bytes, true);
+    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+    tracked_request_end(&req);
 
     return ret;
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (14 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-10 19:07   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation Kevin Wolf
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Odd file sizes could make bdrv_aligned_preadv() shorten the request in
non-aligned ways. Fix it by rounding to the required alignment instead
of 512 bytes.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index b4f6ead..6dddb7c 100644
--- a/block.c
+++ b/block.c
@@ -2725,7 +2725,7 @@ err:
  */
 static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
-    QEMUIOVector *qiov, int flags)
+    int64_t align, QEMUIOVector *qiov, int flags)
 {
     BlockDriver *drv = bs->drv;
     int ret;
@@ -2773,7 +2773,7 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
         }
 
         total_sectors = DIV_ROUND_UP(len, BDRV_SECTOR_SIZE);
-        max_nb_sectors = MAX(0, total_sectors - sector_num);
+        max_nb_sectors = MAX(0, ROUND_UP(total_sectors - sector_num, align));
         if (max_nb_sectors > 0) {
             ret = drv->bdrv_co_readv(bs, sector_num,
                                      MIN(nb_sectors, max_nb_sectors), qiov);
@@ -2858,7 +2858,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
     }
 
     tracked_request_begin(&req, bs, offset, bytes, false);
-    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
+    ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,
                               use_local_qiov ? &local_qiov : qiov,
                               flags);
     tracked_request_end(&req);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (15 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:28   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic Kevin Wolf
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Change the API so that specific requests can be marked serialising. Only
these requests are checked for overlaps then.

This means that during a Copy on Read operation, not all requests
overlapping other requests are serialised any more, but only those that
actually overlap with the specific COR request.

Also remove COR from function and variable names because this
functionality can be useful in other contexts.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 48 ++++++++++++++++++++++++++++-------------------
 include/block/block_int.h |  5 +++--
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index 6dddb7c..bf8b46e 100644
--- a/block.c
+++ b/block.c
@@ -2028,6 +2028,10 @@ int bdrv_commit_all(void)
  */
 static void tracked_request_end(BdrvTrackedRequest *req)
 {
+    if (req->serialising) {
+        req->bs->serialising_in_flight--;
+    }
+
     QLIST_REMOVE(req, list);
     qemu_co_queue_restart_all(&req->wait_queue);
 }
@@ -2042,10 +2046,11 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
 {
     *req = (BdrvTrackedRequest){
         .bs = bs,
-        .offset = offset,
-        .bytes = bytes,
-        .is_write = is_write,
-        .co = qemu_coroutine_self(),
+        .offset         = offset,
+        .bytes          = bytes,
+        .is_write       = is_write,
+        .co             = qemu_coroutine_self(),
+        .serialising    = false,
     };
 
     qemu_co_queue_init(&req->wait_queue);
@@ -2053,6 +2058,14 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+static void mark_request_serialising(BdrvTrackedRequest *req)
+{
+    if (!req->serialising) {
+        req->bs->serialising_in_flight++;
+        req->serialising = true;
+    }
+}
+
 /**
  * Round a region to cluster boundaries
  */
@@ -2105,26 +2118,31 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
-static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
+static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
+    BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     int64_t cluster_offset;
     unsigned int cluster_bytes;
     bool retry;
 
+    if (!bs->serialising_in_flight) {
+        return;
+    }
+
     /* If we touch the same cluster it counts as an overlap.  This guarantees
      * that allocating writes will be serialized and not race with each other
      * for the same cluster.  For example, in copy-on-read it ensures that the
      * CoR read and write operations are atomic and guest writes cannot
      * interleave between them.
      */
-    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+    round_bytes_to_clusters(bs, self->offset, self->bytes,
+                            &cluster_offset, &cluster_bytes);
 
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (req == self) {
+            if (req == self || (!req->serialising && !self->serialising)) {
                 continue;
             }
             if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
@@ -2738,12 +2756,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
-        bs->copy_on_read_in_flight++;
+        mark_request_serialising(req);
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, req, offset, bytes);
-    }
+    wait_serialising_requests(req);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
@@ -2791,10 +2807,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
     }
 
 out:
-    if (flags & BDRV_REQ_COPY_ON_READ) {
-        bs->copy_on_read_in_flight--;
-    }
-
     return ret;
 }
 
@@ -2992,9 +3004,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
     assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
     assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, req, offset, bytes);
-    }
+    wait_serialising_requests(req);
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a11e5c9..b00402b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -60,6 +60,7 @@ typedef struct BdrvTrackedRequest {
     int64_t offset;
     unsigned int bytes;
     bool is_write;
+    bool serialising;
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
@@ -296,8 +297,8 @@ struct BlockDriverState {
     /* Callback before write request is processed */
     NotifierWithReturnList before_write_notifiers;
 
-    /* number of in-flight copy-on-read requests */
-    unsigned int copy_on_read_in_flight;
+    /* number of in-flight serialising requests */
+    unsigned int serialising_in_flight;
 
     /* I/O throttling */
     ThrottleState throttle_state;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (16 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:28   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point Kevin Wolf
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Copy on Read wants to serialise with all requests touching the same
cluster, so wait_serialising_requests() rounded to cluster boundaries.
Other users like alignment RMW will have different requirements, though
(requests touching the same sector), so make it dynamic.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 53 ++++++++++++++++++++++++-----------------------
 include/block/block_int.h |  4 ++++
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index bf8b46e..73bba47 100644
--- a/block.c
+++ b/block.c
@@ -2051,6 +2051,8 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
         .is_write       = is_write,
         .co             = qemu_coroutine_self(),
         .serialising    = false,
+        .overlap_offset = offset,
+        .overlap_bytes  = bytes,
     };
 
     qemu_co_queue_init(&req->wait_queue);
@@ -2058,12 +2060,19 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
-static void mark_request_serialising(BdrvTrackedRequest *req)
+static void mark_request_serialising(BdrvTrackedRequest *req, size_t align)
 {
+    int64_t overlap_offset = req->offset & ~(align - 1);
+    int overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
+                      - overlap_offset;
+
     if (!req->serialising) {
         req->bs->serialising_in_flight++;
         req->serialising = true;
     }
+
+    req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
+    req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
 }
 
 /**
@@ -2087,20 +2096,16 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
     }
 }
 
-static void round_bytes_to_clusters(BlockDriverState *bs,
-                                    int64_t offset, unsigned int bytes,
-                                    int64_t *cluster_offset,
-                                    unsigned int *cluster_bytes)
+static int bdrv_get_cluster_size(BlockDriverState *bs)
 {
     BlockDriverInfo bdi;
+    int ret;
 
-    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
-        *cluster_offset = offset;
-        *cluster_bytes = bytes;
+    ret = bdrv_get_info(bs, &bdi);
+    if (ret < 0 || bdi.cluster_size == 0) {
+        return bs->request_alignment;
     } else {
-        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
-        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
-                                       bdi.cluster_size);
+        return bdi.cluster_size;
     }
 }
 
@@ -2108,11 +2113,11 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t offset, int bytes)
 {
     /*        aaaa   bbbb */
-    if (offset >= req->offset + req->bytes) {
+    if (offset >= req->overlap_offset + req->overlap_bytes) {
         return false;
     }
     /* bbbb   aaaa        */
-    if (req->offset >= offset + bytes) {
+    if (req->overlap_offset >= offset + bytes) {
         return false;
     }
     return true;
@@ -2122,30 +2127,21 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
-    int64_t cluster_offset;
-    unsigned int cluster_bytes;
     bool retry;
 
     if (!bs->serialising_in_flight) {
         return;
     }
 
-    /* If we touch the same cluster it counts as an overlap.  This guarantees
-     * that allocating writes will be serialized and not race with each other
-     * for the same cluster.  For example, in copy-on-read it ensures that the
-     * CoR read and write operations are atomic and guest writes cannot
-     * interleave between them.
-     */
-    round_bytes_to_clusters(bs, self->offset, self->bytes,
-                            &cluster_offset, &cluster_bytes);
-
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
             if (req == self || (!req->serialising && !self->serialising)) {
                 continue;
             }
-            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
+            if (tracked_request_overlaps(req, self->overlap_offset,
+                                         self->overlap_bytes))
+            {
                 /* Hitting this means there was a reentrant request, for
                  * example, a block driver issuing nested requests.  This must
                  * never happen since it means deadlock.
@@ -2756,7 +2752,12 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
-        mark_request_serialising(req);
+        /* If we touch the same cluster it counts as an overlap.  This
+         * guarantees that allocating writes will be serialized and not race
+         * with each other for the same cluster.  For example, in copy-on-read
+         * it ensures that the CoR read and write operations are atomic and
+         * guest writes cannot interleave between them. */
+        mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
     wait_serialising_requests(req);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index b00402b..1aee02b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -60,7 +60,11 @@ typedef struct BdrvTrackedRequest {
     int64_t offset;
     unsigned int bytes;
     bool is_write;
+
     bool serialising;
+    int64_t overlap_offset;
+    unsigned int overlap_bytes;
+
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (17 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2013-12-27  4:17   ` Wenchao Xia
  2014-01-11 22:29   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
                   ` (5 subsequent siblings)
  24 siblings, 2 replies; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

We can only have a single wait_serialising_requests() call per request
because otherwise we can run into deadlocks where requests are waiting
for each other. The same is true when wait_serialising_requests() is not
at the very beginning of a request, so that other requests can be issued
between the start of the tracking and wait_serialising_requests().

Fix this by changing wait_serialising_requests() to ignore requests that
are already (directly or indirectly) waiting for the calling request.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 26 +++++++++++++++++++++++---
 include/block/block_int.h |  2 ++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 73bba47..8a0225d 100644
--- a/block.c
+++ b/block.c
@@ -2123,6 +2123,20 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
     return true;
 }
 
+static bool coroutine_fn is_waiting_for(BdrvTrackedRequest *waiting,
+                                        BdrvTrackedRequest *waited_for)
+{
+    BdrvTrackedRequest *req;
+
+    for (req = waiting; req != NULL; req = req->waiting_for) {
+        if (req == waited_for) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
 static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
@@ -2148,9 +2162,15 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                  */
                 assert(qemu_coroutine_self() != req->co);
 
-                qemu_co_queue_wait(&req->wait_queue);
-                retry = true;
-                break;
+                /* If the request is already (indircetly) waiting for us, then
+                 * just go on instead of producing a deadlock. */
+                if (!is_waiting_for(req, self)) {
+                    self->waiting_for = req;
+                    qemu_co_queue_wait(&req->wait_queue);
+                    self->waiting_for = NULL;
+                    retry = true;
+                    break;
+                }
             }
         }
     } while (retry);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1aee02b..0f7fcef 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -68,6 +68,8 @@ typedef struct BdrvTrackedRequest {
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
     CoQueue wait_queue; /* coroutines blocked on this request */
+
+    struct BdrvTrackedRequest *waiting_for;
 } BdrvTrackedRequest;
 
 struct BlockDriver {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev()
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (18 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:29   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity Kevin Wolf
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

This patch changes bdrv_co_do_pwritev() to actually be what its name
promises. If requests aren't properly aligned, it performs a RMW.

Requests touching the same block are serialised against the RMW request.
Further optimisation of this is possible by differentiating types of
requests (concurrent reads should actually be okay here).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 8a0225d..bfa2233 100644
--- a/block.c
+++ b/block.c
@@ -3061,6 +3061,12 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     BdrvRequestFlags flags)
 {
     BdrvTrackedRequest req;
+    /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+    uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+    uint8_t *head_buf = NULL;
+    uint8_t *tail_buf = NULL;
+    QEMUIOVector local_qiov;
+    bool use_local_qiov = false;
     int ret;
 
     if (!bs->drv) {
@@ -3078,10 +3084,88 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
     }
 
+    /*
+     * Align write if necessary by performing a read-modify-write cycle.
+     * Pad qiov with the read parts and be sure to have a tracked request not
+     * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
+     */
     tracked_request_begin(&req, bs, offset, bytes, true);
-    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+
+    if (offset & (align - 1)) {
+        QEMUIOVector head_qiov;
+        struct iovec head_iov;
+
+        mark_request_serialising(&req, align);
+        wait_serialising_requests(&req);
+
+        head_buf = qemu_blockalign(bs, align);
+        head_iov = (struct iovec) {
+            .iov_base   = head_buf,
+            .iov_len    = align,
+        };
+        qemu_iovec_init_external(&head_qiov, &head_iov, 1);
+
+        ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
+                                  align, &head_qiov, 0);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        qemu_iovec_init(&local_qiov, qiov->niov + 2);
+        qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+        qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+        use_local_qiov = true;
+
+        bytes += offset & (align - 1);
+        offset = offset & ~(align - 1);
+    }
+
+    if ((offset + bytes) & (align - 1)) {
+        QEMUIOVector tail_qiov;
+        struct iovec tail_iov;
+        size_t tail_bytes;
+
+        mark_request_serialising(&req, align);
+        wait_serialising_requests(&req);
+
+        tail_buf = qemu_blockalign(bs, align);
+        tail_iov = (struct iovec) {
+            .iov_base   = tail_buf,
+            .iov_len    = align,
+        };
+        qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
+
+        ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
+                                  align, &tail_qiov, 0);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        if (!use_local_qiov) {
+            qemu_iovec_init(&local_qiov, qiov->niov + 1);
+            qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+            use_local_qiov = true;
+        }
+
+        tail_bytes = (offset + bytes) & (align - 1);
+        qemu_iovec_add(&local_qiov, tail_buf + tail_bytes, align - tail_bytes);
+
+        bytes = ROUND_UP(bytes, align);
+    }
+
+    ret = bdrv_aligned_pwritev(bs, &req, offset, bytes,
+                               use_local_qiov ? &local_qiov : qiov,
+                               flags);
+
+fail:
     tracked_request_end(&req);
 
+    if (use_local_qiov) {
+        qemu_iovec_destroy(&local_qiov);
+        qemu_vfree(head_buf);
+        qemu_vfree(tail_buf);
+    }
+
     return ret;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (19 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:29   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 48 ++++++++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/block.c b/block.c
index bfa2233..5b00d23 100644
--- a/block.c
+++ b/block.c
@@ -69,11 +69,11 @@ static int coroutine_fn bdrv_co_readv_em(BlockDriverState *bs,
 static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
                                          int64_t sector_num, int nb_sectors,
                                          QEMUIOVector *iov);
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
@@ -2383,8 +2383,7 @@ static int bdrv_check_request(BlockDriverState *bs, int64_t sector_num,
 
 typedef struct RwCo {
     BlockDriverState *bs;
-    int64_t sector_num;
-    int nb_sectors;
+    int64_t offset;
     QEMUIOVector *qiov;
     bool is_write;
     int ret;
@@ -2396,34 +2395,32 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
     RwCo *rwco = opaque;
 
     if (!rwco->is_write) {
-        rwco->ret = bdrv_co_do_readv(rwco->bs, rwco->sector_num,
-                                     rwco->nb_sectors, rwco->qiov,
-                                     rwco->flags);
-    } else {
-        rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
-                                      rwco->nb_sectors, rwco->qiov,
+        rwco->ret = bdrv_co_do_preadv(rwco->bs, rwco->offset,
+                                      rwco->qiov->size, rwco->qiov,
                                       rwco->flags);
+    } else {
+        rwco->ret = bdrv_co_do_pwritev(rwco->bs, rwco->offset,
+                                       rwco->qiov->size, rwco->qiov,
+                                       rwco->flags);
     }
 }
 
 /*
  * Process a vectored synchronous request using coroutines
  */
-static int bdrv_rwv_co(BlockDriverState *bs, int64_t sector_num,
-                       QEMUIOVector *qiov, bool is_write,
-                       BdrvRequestFlags flags)
+static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
+                        QEMUIOVector *qiov, bool is_write,
+                        BdrvRequestFlags flags)
 {
     Coroutine *co;
     RwCo rwco = {
         .bs = bs,
-        .sector_num = sector_num,
-        .nb_sectors = qiov->size >> BDRV_SECTOR_BITS,
+        .offset = offset,
         .qiov = qiov,
         .is_write = is_write,
         .ret = NOT_DONE,
         .flags = flags,
     };
-    assert((qiov->size & (BDRV_SECTOR_SIZE - 1)) == 0);
 
     /**
      * In sync call context, when the vcpu is blocked, this throttling timer
@@ -2462,7 +2459,8 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
     };
 
     qemu_iovec_init_external(&qiov, &iov, 1);
-    return bdrv_rwv_co(bs, sector_num, &qiov, is_write, flags);
+    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS,
+                        &qiov, is_write, flags);
 }
 
 /* return < 0 if error. See bdrv_write() for the return codes */
@@ -2500,7 +2498,7 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
 
 int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
 {
-    return bdrv_rwv_co(bs, sector_num, qiov, true, 0);
+    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
 }
 
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
@@ -4608,9 +4606,15 @@ int bdrv_flush(BlockDriverState *bs)
     return rwco.ret;
 }
 
+typedef struct DiscardCo {
+    BlockDriverState *bs;
+    int64_t sector_num;
+    int nb_sectors;
+    int ret;
+} DiscardCo;
 static void coroutine_fn bdrv_discard_co_entry(void *opaque)
 {
-    RwCo *rwco = opaque;
+    DiscardCo *rwco = opaque;
 
     rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
 }
@@ -4694,7 +4698,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
 {
     Coroutine *co;
-    RwCo rwco = {
+    DiscardCo rwco = {
         .bs = bs,
         .sector_num = sector_num,
         .nb_sectors = nb_sectors,
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (20 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:29   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() " Kevin Wolf
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_preadv().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 49 +++++++++++++------------------------------------
 1 file changed, 13 insertions(+), 36 deletions(-)

diff --git a/block.c b/block.c
index 5b00d23..ab165c5 100644
--- a/block.c
+++ b/block.c
@@ -2545,49 +2545,26 @@ int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags)
     }
 }
 
-int bdrv_pread(BlockDriverState *bs, int64_t offset,
-               void *buf, int count1)
+int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
 {
-    uint8_t tmp_buf[BDRV_SECTOR_SIZE];
-    int len, nb_sectors, count;
-    int64_t sector_num;
+    QEMUIOVector qiov;
+    struct iovec iov = {
+        .iov_base = (void *)buf,
+        .iov_len = bytes,
+    };
     int ret;
 
-    count = count1;
-    /* first read to align to sector start */
-    len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
-    if (len > count)
-        len = count;
-    sector_num = offset >> BDRV_SECTOR_BITS;
-    if (len > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        memcpy(buf, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)), len);
-        count -= len;
-        if (count == 0)
-            return count1;
-        sector_num++;
-        buf += len;
+    if (bytes < 0) {
+        return -EINVAL;
     }
 
-    /* read the sectors "in place" */
-    nb_sectors = count >> BDRV_SECTOR_BITS;
-    if (nb_sectors > 0) {
-        if ((ret = bdrv_read(bs, sector_num, buf, nb_sectors)) < 0)
-            return ret;
-        sector_num += nb_sectors;
-        len = nb_sectors << BDRV_SECTOR_BITS;
-        buf += len;
-        count -= len;
+    qemu_iovec_init_external(&qiov, &iov, 1);
+    ret = bdrv_prwv_co(bs, offset, &qiov, false, 0);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* add data from the last sector */
-    if (count > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        memcpy(buf, tmp_buf, count);
-    }
-    return count1;
+    return bytes;
 }
 
 int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (21 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:22   ` Max Reitz
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment Kevin Wolf
  2013-12-27  4:27 ` [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Wenchao Xia
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_pwritev().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 64 +++++++++-------------------------------------------------------
 1 file changed, 9 insertions(+), 55 deletions(-)

diff --git a/block.c b/block.c
index ab165c5..350cb81 100644
--- a/block.c
+++ b/block.c
@@ -2496,11 +2496,6 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
     return bdrv_rw_co(bs, sector_num, (uint8_t *)buf, nb_sectors, true, 0);
 }
 
-int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
-{
-    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
-}
-
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
                       int nb_sectors, BdrvRequestFlags flags)
 {
@@ -2569,70 +2564,29 @@ int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
 
 int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
 {
-    uint8_t tmp_buf[BDRV_SECTOR_SIZE];
-    int len, nb_sectors, count;
-    int64_t sector_num;
     int ret;
 
-    count = qiov->size;
-
-    /* first write to align to sector start */
-    len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
-    if (len > count)
-        len = count;
-    sector_num = offset >> BDRV_SECTOR_BITS;
-    if (len > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        qemu_iovec_to_buf(qiov, 0, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)),
-                          len);
-        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        count -= len;
-        if (count == 0)
-            return qiov->size;
-        sector_num++;
-    }
-
-    /* write the sectors "in place" */
-    nb_sectors = count >> BDRV_SECTOR_BITS;
-    if (nb_sectors > 0) {
-        QEMUIOVector qiov_inplace;
-
-        qemu_iovec_init(&qiov_inplace, qiov->niov);
-        qemu_iovec_concat(&qiov_inplace, qiov, len,
-                          nb_sectors << BDRV_SECTOR_BITS);
-        ret = bdrv_writev(bs, sector_num, &qiov_inplace);
-        qemu_iovec_destroy(&qiov_inplace);
-        if (ret < 0) {
-            return ret;
-        }
-
-        sector_num += nb_sectors;
-        len = nb_sectors << BDRV_SECTOR_BITS;
-        count -= len;
+    ret = bdrv_prwv_co(bs, offset, qiov, true, 0);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* add data from the last sector */
-    if (count > 0) {
-        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-        qemu_iovec_to_buf(qiov, qiov->size - count, tmp_buf, count);
-        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
-            return ret;
-    }
     return qiov->size;
 }
 
 int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
-                const void *buf, int count1)
+                const void *buf, int bytes)
 {
     QEMUIOVector qiov;
     struct iovec iov = {
         .iov_base   = (void *) buf,
-        .iov_len    = count1,
+        .iov_len    = bytes,
     };
 
+    if (bytes < 0) {
+        return -EINVAL;
+    }
+
     qemu_iovec_init_external(&qiov, &iov, 1);
     return bdrv_pwritev(bs, offset, &qiov);
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (22 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() " Kevin Wolf
@ 2013-12-13 13:22 ` Kevin Wolf
  2014-01-11 22:30   ` Max Reitz
  2013-12-27  4:27 ` [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Wenchao Xia
  24 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2013-12-13 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, pl, stefanha

From: Paolo Bonzini <pbonzini@redhat.com>

The iSCSI backend already gets the block size from the READ CAPACITY
command it sends.  Save it so that the generic block layer gets it
too.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/iscsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index 18bc093..039b8ad 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1395,6 +1395,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
         goto out;
     }
     bs->total_sectors = sector_lun2qemu(iscsilun->num_blocks, iscsilun);
+    bs->request_alignment = iscsilun->block_size;
 
     /* Medium changer or tape. We dont have any emulation for this so this must
      * be sg ioctl compatible. We force it to be sg, otherwise qemu will try
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment Kevin Wolf
@ 2013-12-24  3:39   ` Wenchao Xia
  2014-01-11 22:26   ` Max Reitz
  1 sibling, 0 replies; 57+ messages in thread
From: Wenchao Xia @ 2013-12-24  3:39 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

I am not sure the win32 API usage, except that part, patch looks OK.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv() Kevin Wolf
@ 2013-12-24  7:23   ` Wenchao Xia
  2014-01-11 22:27   ` Max Reitz
  1 sibling, 0 replies; 57+ messages in thread
From: Wenchao Xia @ 2013-12-24  7:23 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv() Kevin Wolf
@ 2013-12-26  4:08   ` Wenchao Xia
  0 siblings, 0 replies; 57+ messages in thread
From: Wenchao Xia @ 2013-12-26  4:08 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

>       /* throttling disk I/O */
>       if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, false);
> +        bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
> +    }

 Is it possible (bytes >> BDRV_SECTOR_BITS) == 0?

> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point Kevin Wolf
@ 2013-12-27  4:17   ` Wenchao Xia
  2014-01-13 11:29     ` Kevin Wolf
  2014-01-11 22:29   ` Max Reitz
  1 sibling, 1 reply; 57+ messages in thread
From: Wenchao Xia @ 2013-12-27  4:17 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

于 2013/12/13 21:22, Kevin Wolf 写道:
> We can only have a single wait_serialising_requests() call per request
> because otherwise we can run into deadlocks where requests are waiting
> for each other.
  do you mean:
mark_request_serialising(req)
...
wait_serialising_requests(req);
...
wait_serialising_requests(req);

 will have deadlock? I thought it is already resolved by patch 15?
Maybe here is another deadlock reason?

 The same is true when wait_serialising_requests() is not
> at the very beginning of a request, so that other requests can be issued
> between the start of the tracking and wait_serialising_requests().

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio
  2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
                   ` (23 preceding siblings ...)
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment Kevin Wolf
@ 2013-12-27  4:27 ` Wenchao Xia
  24 siblings, 0 replies; 57+ messages in thread
From: Wenchao Xia @ 2013-12-27  4:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

For patch 9, 11-18, 20-24
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>

For patch 10, I am not sure whether there is a risk to bypass I/O limit
when bytes is too small:
bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-10 18:11   ` Max Reitz
  2014-01-16 12:25   ` Kevin Wolf
  1 sibling, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-10 18:11 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> This is going to become the bdrv_co_do_preadv() equivalent for writes.
> In this patch, however, just a function taking byte offsets is created,
> it doesn't align anything yet.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 23 +++++++++++++++++------
>   1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/block.c b/block.c
> index 385fb8a..a80db2e 100644
> --- a/block.c
> +++ b/block.c
> @@ -3010,8 +3010,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>   /*
>    * Handle a write request in coroutine context
>    */
> -static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
>       BdrvRequestFlags flags)
>   {
>       int ret;
> @@ -3022,21 +3022,32 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>       if (bs->read_only) {
>           return -EACCES;
>       }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +    if (bdrv_check_byte_request(bs, offset, bytes)) {
>           return -EIO;
>       }
>   
>       /* throttling disk I/O */
>       if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, true);
> +        bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
>       }
>   
> -    ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
> -                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +    ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
>   
>       return ret;
>   }
>   
> +static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +    BdrvRequestFlags flags)
> +{
> +    if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {

This should probably be INT_MAX, since nb_sectors is an integer. If 
nb_sectors is between INT_MAX >> BDRV_SECTOR_BITS and UINT_MAX >> 
BDRV_SECTOR_BITS, the result of nb_sectors << BDRV_SECTOR_BITS (which 
will be a signed integer) is undefined. It is obviously then implicitly 
casted to an unsigned integer (which is the type of the parameter 
"bytes") which will probably solve the problem as intended, but it is 
still technically undefined.

Thus, I'd either change this to INT_MAX or cast nb_sectors to an 
unsigned int before shifting it below.

Max

> +        return -EINVAL;
> +    }
> +
> +    return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
> +                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> +}
> +
>   int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
>       int nb_sectors, QEMUIOVector *qiov)
>   {

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
@ 2014-01-10 18:34   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-10 18:34 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 52 +++++++++++++++++++++++++++++++----------------
>   block/backup.c            |  7 ++++++-
>   include/block/block_int.h |  4 ++--
>   3 files changed, 42 insertions(+), 21 deletions(-)
>
> diff --git a/block.c b/block.c
> index a80db2e..fa888d9 100644
> --- a/block.c
> +++ b/block.c
> @@ -2037,13 +2037,13 @@ static void tracked_request_end(BdrvTrackedRequest *req)
>    */
>   static void tracked_request_begin(BdrvTrackedRequest *req,
>                                     BlockDriverState *bs,
> -                                  int64_t sector_num,
> -                                  int nb_sectors, bool is_write)
> +                                  int64_t offset,
> +                                  unsigned int bytes, bool is_write)
>   {
>       *req = (BdrvTrackedRequest){
>           .bs = bs,
> -        .sector_num = sector_num,
> -        .nb_sectors = nb_sectors,
> +        .offset = offset,
> +        .bytes = bytes,
>           .is_write = is_write,
>           .co = qemu_coroutine_self(),
>       };
> @@ -2074,25 +2074,43 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
>       }
>   }
>   
> +static void round_bytes_to_clusters(BlockDriverState *bs,
> +                                    int64_t offset, unsigned int bytes,
> +                                    int64_t *cluster_offset,
> +                                    unsigned int *cluster_bytes)
> +{
> +    BlockDriverInfo bdi;
> +
> +    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
> +        *cluster_offset = offset;
> +        *cluster_bytes = bytes;
> +    } else {
> +        *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
> +        *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
> +                                       bdi.cluster_size);
> +    }
> +}
> +
>   static bool tracked_request_overlaps(BdrvTrackedRequest *req,
> -                                     int64_t sector_num, int nb_sectors) {
> +                                     int64_t offset, int bytes)

Shouldn't this be "unsigned int bytes"?

Max

> +{
>       /*        aaaa   bbbb */
> -    if (sector_num >= req->sector_num + req->nb_sectors) {
> +    if (offset >= req->offset + req->bytes) {
>           return false;
>       }
>       /* bbbb   aaaa        */
> -    if (req->sector_num >= sector_num + nb_sectors) {
> +    if (req->offset >= offset + bytes) {
>           return false;
>       }
>       return true;
>   }
>   
>   static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
> -        int64_t sector_num, int nb_sectors)
> +        int64_t offset, unsigned int bytes)
>   {
>       BdrvTrackedRequest *req;
> -    int64_t cluster_sector_num;
> -    int cluster_nb_sectors;
> +    int64_t cluster_offset;
> +    unsigned int cluster_bytes;
>       bool retry;
>   
>       /* If we touch the same cluster it counts as an overlap.  This guarantees
> @@ -2101,14 +2119,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>        * CoR read and write operations are atomic and guest writes cannot
>        * interleave between them.
>        */
> -    bdrv_round_to_clusters(bs, sector_num, nb_sectors,
> -                           &cluster_sector_num, &cluster_nb_sectors);
> +    round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
>   
>       do {
>           retry = false;
>           QLIST_FOREACH(req, &bs->tracked_requests, list) {
> -            if (tracked_request_overlaps(req, cluster_sector_num,
> -                                         cluster_nb_sectors)) {
> +            if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
>                   /* Hitting this means there was a reentrant request, for
>                    * example, a block driver issuing nested requests.  This must
>                    * never happen since it means deadlock.
> @@ -2723,10 +2739,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>       }
>   
>       if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +        wait_for_overlapping_requests(bs, offset, bytes);
>       }
>   
> -    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +    tracked_request_begin(&req, bs, offset, bytes, false);
>   
>       if (flags & BDRV_REQ_COPY_ON_READ) {
>           int pnum;
> @@ -2974,10 +2990,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>       assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
>   
>       if (bs->copy_on_read_in_flight) {
> -        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
> +        wait_for_overlapping_requests(bs, offset, bytes);
>       }
>   
> -    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
> +    tracked_request_begin(&req, bs, offset, bytes, true);
>   
>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
>   
> diff --git a/block/backup.c b/block/backup.c
> index 0198514..15a2e55 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -181,8 +181,13 @@ static int coroutine_fn backup_before_write_notify(
>           void *opaque)
>   {
>       BdrvTrackedRequest *req = opaque;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
>   
> -    return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
> +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> +
> +    return backup_do_cow(req->bs, sector_num, nb_sectors, NULL);
>   }
>   
>   static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 0a01b69..a11e5c9 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -57,8 +57,8 @@
>   
>   typedef struct BdrvTrackedRequest {
>       BlockDriverState *bs;
> -    int64_t sector_num;
> -    int nb_sectors;
> +    int64_t offset;
> +    unsigned int bytes;
>       bool is_write;
>       QLIST_ENTRY(BdrvTrackedRequest) list;
>       Coroutine *co; /* owner, used for deadlock detection */

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment Kevin Wolf
@ 2014-01-10 19:07   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-10 19:07 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Odd file sizes could make bdrv_aligned_preadv() shorten the request in
> non-aligned ways. Fix it by rounding to the required alignment instead
> of 512 bytes.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/block.c b/block.c
> index b4f6ead..6dddb7c 100644
> --- a/block.c
> +++ b/block.c
> @@ -2725,7 +2725,7 @@ err:
>    */
>   static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>       BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
> -    QEMUIOVector *qiov, int flags)
> +    int64_t align, QEMUIOVector *qiov, int flags)
>   {
>       BlockDriver *drv = bs->drv;
>       int ret;
> @@ -2773,7 +2773,7 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
>           }
>   
>           total_sectors = DIV_ROUND_UP(len, BDRV_SECTOR_SIZE);
> -        max_nb_sectors = MAX(0, total_sectors - sector_num);
> +        max_nb_sectors = MAX(0, ROUND_UP(total_sectors - sector_num, align));

It appears this should be an alignment given in sectors…

>           if (max_nb_sectors > 0) {
>               ret = drv->bdrv_co_readv(bs, sector_num,
>                                        MIN(nb_sectors, max_nb_sectors), qiov);
> @@ -2858,7 +2858,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
>       }
>   
>       tracked_request_begin(&req, bs, offset, bytes, false);
> -    ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
> +    ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,

Whereas this one is an alignment given in bytes.

Max

>                                 use_local_qiov ? &local_qiov : qiov,
>                                 flags);
>       tracked_request_end(&req);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() " Kevin Wolf
@ 2014-01-11 22:22   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:22 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Instead of implementing the alignment adjustment here, use the now
> existing functionality of bdrv_co_do_pwritev().
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 64 +++++++++-------------------------------------------------------
>   1 file changed, 9 insertions(+), 55 deletions(-)
>
> diff --git a/block.c b/block.c
> index ab165c5..350cb81 100644
> --- a/block.c
> +++ b/block.c
> @@ -2496,11 +2496,6 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
>       return bdrv_rw_co(bs, sector_num, (uint8_t *)buf, nb_sectors, true, 0);
>   }
>   
> -int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
> -{
> -    return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
> -}

This function seems unused, so removing it should be okay. But I think 
it should be removed from block.h as well.

Max

> -
>   int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>                         int nb_sectors, BdrvRequestFlags flags)
>   {
> @@ -2569,70 +2564,29 @@ int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
>   
>   int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
>   {
> -    uint8_t tmp_buf[BDRV_SECTOR_SIZE];
> -    int len, nb_sectors, count;
> -    int64_t sector_num;
>       int ret;
>   
> -    count = qiov->size;
> -
> -    /* first write to align to sector start */
> -    len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
> -    if (len > count)
> -        len = count;
> -    sector_num = offset >> BDRV_SECTOR_BITS;
> -    if (len > 0) {
> -        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
> -            return ret;
> -        qemu_iovec_to_buf(qiov, 0, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)),
> -                          len);
> -        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
> -            return ret;
> -        count -= len;
> -        if (count == 0)
> -            return qiov->size;
> -        sector_num++;
> -    }
> -
> -    /* write the sectors "in place" */
> -    nb_sectors = count >> BDRV_SECTOR_BITS;
> -    if (nb_sectors > 0) {
> -        QEMUIOVector qiov_inplace;
> -
> -        qemu_iovec_init(&qiov_inplace, qiov->niov);
> -        qemu_iovec_concat(&qiov_inplace, qiov, len,
> -                          nb_sectors << BDRV_SECTOR_BITS);
> -        ret = bdrv_writev(bs, sector_num, &qiov_inplace);
> -        qemu_iovec_destroy(&qiov_inplace);
> -        if (ret < 0) {
> -            return ret;
> -        }
> -
> -        sector_num += nb_sectors;
> -        len = nb_sectors << BDRV_SECTOR_BITS;
> -        count -= len;
> +    ret = bdrv_prwv_co(bs, offset, qiov, true, 0);
> +    if (ret < 0) {
> +        return ret;
>       }
>   
> -    /* add data from the last sector */
> -    if (count > 0) {
> -        if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
> -            return ret;
> -        qemu_iovec_to_buf(qiov, qiov->size - count, tmp_buf, count);
> -        if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
> -            return ret;
> -    }
>       return qiov->size;
>   }
>   
>   int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
> -                const void *buf, int count1)
> +                const void *buf, int bytes)
>   {
>       QEMUIOVector qiov;
>       struct iovec iov = {
>           .iov_base   = (void *) buf,
> -        .iov_len    = count1,
> +        .iov_len    = bytes,
>       };
>   
> +    if (bytes < 0) {
> +        return -EINVAL;
> +    }
> +
>       qemu_iovec_init_external(&qiov, &iov, 1);
>       return bdrv_pwritev(bs, offset, &qiov);
>   }

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
@ 2014-01-11 22:24   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:24 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> This function separates filling the BlockLimits from bdrv_open(), which
> allows it to call it from other operations which may change the limits
> (e.g. modifications to the backing file chain or bdrv_reopen)
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 19 +++++++++++++++++++
>   block/iscsi.c             | 46 +++++++++++++++++++++++++++++-----------------
>   block/qcow2.c             | 11 ++++++++++-
>   block/qed.c               | 11 ++++++++++-
>   block/vmdk.c              | 22 ++++++++++++++++++----
>   include/block/block_int.h |  2 ++
>   6 files changed, 88 insertions(+), 23 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length Kevin Wolf
@ 2014-01-11 22:24   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:24 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> When there is a format driver between the backend, it's not guaranteed
> that exposing the opt_transfer_length for the format driver results in
> the optimal requests (because of fragmentation etc.), but it can't make
> things worse, so let's just do it.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   block.c | 20 +++++++++++++++++++-
>   1 file changed, 19 insertions(+), 1 deletion(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed Kevin Wolf
@ 2014-01-11 22:24   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:24 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> When reopening with different flags, or when backing files disappear
> from the chain, the limits may change. Make sure they get updated in
> these cases.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   block.c               | 5 ++++-
>   block/stream.c        | 2 ++
>   include/block/block.h | 1 +
>   3 files changed, 7 insertions(+), 1 deletion(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments Kevin Wolf
@ 2014-01-11 22:25   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:25 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> The functions used by qemu_memalign() require an alignment that is at
> least sizeof(void*). Adjust it if it is too small.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   util/oslib-posix.c | 5 +++++
>   1 file changed, 5 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
@ 2014-01-11 22:25   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:25 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> For an O_DIRECT request to succeed, it's not only necessary that all
> base addresses in the qiov are aligned, but also that each length in it
> is aligned.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   block.c | 3 +++
>   1 file changed, 3 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
@ 2014-01-11 22:25   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:25 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> bs->buffer_alignment is set by the device emulation and contains the
> logical block size of the guest device. This isn't something that the
> block layer should know, and even less something to use for determining
> the right alignment of buffers to be used for the host.
>
> The new BlockLimits field opt_mem_alignment tells the qemu block layer
> the optimal alignment to be used so that no bounce buffer must be used
> in the driver.
>
> This patch may change the buffer alignment from 4k to 512 for all
> callers that used qemu_blockalign() with the top-level image format
> BlockDriverState. The value was never propagated to other levels in the
> tree, so in particular raw-posix never required anything else than 512.
>
> While on disks with 4k sectors direct I/O requires a 4k alignment,
> memory may still be okay when aligned to 512 byte boundaries. This is
> what must have happened in practice, because otherwise this would
> already have failed earlier. Therefore I don't expect regressions even
> with this intermediate state. Later, raw-posix can implement the hook
> and expose a different memory alignment requirement.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   block.c                   | 23 ++++++++++++++++++++---
>   include/block/block.h     |  3 +++
>   include/block/block_int.h |  3 +++
>   3 files changed, 26 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2014-01-11 22:26   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:26 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> The alignment field is now set to the value that is promised to the
> guest, rather than required by the host.  The next patches will make
> QEMU aware of the host-provided values, so make this clear.
>
> The alignment is also not about memory buffers, but about the sectors on
> the disk, change the documentation of the field.
>
> At this point, the field is set by the device emulation, but completely
> ignored by the block layer.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
> ---
>   block.c                   | 10 +++++-----
>   hw/block/virtio-blk.c     |  2 +-
>   hw/ide/core.c             |  2 +-
>   hw/scsi/scsi-disk.c       |  2 +-
>   hw/scsi/scsi-generic.c    |  2 +-
>   include/block/block.h     |  2 +-
>   include/block/block_int.h |  4 ++--
>   7 files changed, 12 insertions(+), 12 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment Kevin Wolf
  2013-12-24  3:39   ` Wenchao Xia
@ 2014-01-11 22:26   ` Max Reitz
  1 sibling, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:26 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> Add a bs->request_alignment field that contains the required
> offset/length alignment for I/O requests and fill it in the raw block
> drivers. Use ioctls if possible, else see what alignment it takes for
> O_DIRECT to succeed.
>
> While at it, also expose the memory alignment requirements, which may be
> (and in practice are) different from the disk alignment requirements.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   |   3 ++
>   block/raw-posix.c         | 102 ++++++++++++++++++++++++++++++++++++++--------
>   block/raw-win32.c         |  41 +++++++++++++++++++
>   include/block/block_int.h |   3 ++
>   4 files changed, 132 insertions(+), 17 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv() Kevin Wolf
  2013-12-24  7:23   ` Wenchao Xia
@ 2014-01-11 22:27   ` Max Reitz
  1 sibling, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> This separates the part of bdrv_co_do_readv() that needs to happen
> before the request is modified to match the backend alignment, and a
> part that needs to be executed afterwards and passes the request to the
> BlockDriver.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 61 +++++++++++++++++++++++++++++++++++++++++++------------------
>   1 file changed, 43 insertions(+), 18 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
@ 2014-01-11 22:27   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> This separates the part of bdrv_co_do_writev() that needs to happen
> before the request is modified to match the backend alignment, and a
> part that needs to be executed afterwards and passes the request to the
> BlockDriver.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------------
>   1 file changed, 41 insertions(+), 21 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling Kevin Wolf
@ 2014-01-11 22:27   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> First waiting for all COR requests to complete and calling the
> throttling function afterwards means that the request could be delayed
> and we still need to wait for the COR request even if it was issued only
> after the throttled write request.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
@ 2014-01-11 22:28   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:28 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Previously, it was not possible to use wait_for_overlapping_requests()
> between tracked_request_begin()/end() because it would wait for itself.
>
> Ignore the current request in the overlap check and run more of the
> bdrv_co_do_preadv/pwritev code with a BdrvTrackedRequest present.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 38 ++++++++++++++++++++------------------
>   1 file changed, 20 insertions(+), 18 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation Kevin Wolf
@ 2014-01-11 22:28   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:28 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Change the API so that specific requests can be marked serialising. Only
> these requests are checked for overlaps then.
>
> This means that during a Copy on Read operation, not all requests
> overlapping other requests are serialised any more, but only those that
> actually overlap with the specific COR request.
>
> Also remove COR from function and variable names because this
> functionality can be useful in other contexts.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 48 ++++++++++++++++++++++++++++-------------------
>   include/block/block_int.h |  5 +++--
>   2 files changed, 32 insertions(+), 21 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic Kevin Wolf
@ 2014-01-11 22:28   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:28 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Copy on Read wants to serialise with all requests touching the same
> cluster, so wait_serialising_requests() rounded to cluster boundaries.
> Other users like alignment RMW will have different requirements, though
> (requests touching the same sector), so make it dynamic.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 53 ++++++++++++++++++++++++-----------------------
>   include/block/block_int.h |  4 ++++
>   2 files changed, 31 insertions(+), 26 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point Kevin Wolf
  2013-12-27  4:17   ` Wenchao Xia
@ 2014-01-11 22:29   ` Max Reitz
  1 sibling, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:29 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> We can only have a single wait_serialising_requests() call per request
> because otherwise we can run into deadlocks where requests are waiting
> for each other. The same is true when wait_serialising_requests() is not
> at the very beginning of a request, so that other requests can be issued
> between the start of the tracking and wait_serialising_requests().
>
> Fix this by changing wait_serialising_requests() to ignore requests that
> are already (directly or indirectly) waiting for the calling request.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c                   | 26 +++++++++++++++++++++++---
>   include/block/block_int.h |  2 ++
>   2 files changed, 25 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
@ 2014-01-11 22:29   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:29 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> This patch changes bdrv_co_do_pwritev() to actually be what its name
> promises. If requests aren't properly aligned, it performs a RMW.
>
> Requests touching the same block are serialised against the RMW request.
> Further optimisation of this is possible by differentiating types of
> requests (concurrent reads should actually be okay here).
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 85 insertions(+), 1 deletion(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity Kevin Wolf
@ 2014-01-11 22:29   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:29 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 48 ++++++++++++++++++++++++++----------------------
>   1 file changed, 26 insertions(+), 22 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
@ 2014-01-11 22:29   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:29 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> Instead of implementing the alignment adjustment here, use the now
> existing functionality of bdrv_co_do_preadv().
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 49 +++++++++++++------------------------------------
>   1 file changed, 13 insertions(+), 36 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment Kevin Wolf
@ 2014-01-11 22:30   ` Max Reitz
  0 siblings, 0 replies; 57+ messages in thread
From: Max Reitz @ 2014-01-11 22:30 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, pl, stefanha

On 13.12.2013 14:22, Kevin Wolf wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> The iSCSI backend already gets the block size from the READ CAPACITY
> command it sends.  Save it so that the generic block layer gets it
> too.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point
  2013-12-27  4:17   ` Wenchao Xia
@ 2014-01-13 11:29     ` Kevin Wolf
  2014-01-14  3:13       ` Wenchao Xia
  0 siblings, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2014-01-13 11:29 UTC (permalink / raw)
  To: Wenchao Xia; +Cc: pbonzini, pl, qemu-devel, stefanha

Am 27.12.2013 um 05:17 hat Wenchao Xia geschrieben:
> 于 2013/12/13 21:22, Kevin Wolf 写道:
> > We can only have a single wait_serialising_requests() call per request
> > because otherwise we can run into deadlocks where requests are waiting
> > for each other.
>   do you mean:
> mark_request_serialising(req)
> ...
> wait_serialising_requests(req);
> ...
> wait_serialising_requests(req);
> 
>  will have deadlock?

Yes, it can deadlock (it doesn't have to, it depends on whether another
overlapping request is started concurrently). More precisely, the
problematic pattern is:

    mark_request_serialising(req);
    ...
    qemu_coroutine_yield(); /* Other requests may be issued now */
    ...
    wait_serialising_requests(req);

What you mentioned above is a special case of this.

>  I thought it is already resolved by patch 15?
> Maybe here is another deadlock reason?

The problematic pattern in patch 15 was:

    mark_request_serialising(req);
    ... /* no yield here */
    wait_serialising_requests(req);

As opposed to the originally used:

    wait_serialising_requests(req);
    ... /* no yield here */
    mark_request_serialising(req);

Kevin

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point
  2014-01-13 11:29     ` Kevin Wolf
@ 2014-01-14  3:13       ` Wenchao Xia
  0 siblings, 0 replies; 57+ messages in thread
From: Wenchao Xia @ 2014-01-14  3:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pbonzini, pl, qemu-devel, stefanha

于 2014/1/13 19:29, Kevin Wolf 写道:
> Am 27.12.2013 um 05:17 hat Wenchao Xia geschrieben:
>> 于 2013/12/13 21:22, Kevin Wolf 写道:
>>> We can only have a single wait_serialising_requests() call per request
>>> because otherwise we can run into deadlocks where requests are waiting
>>> for each other.
>>    do you mean:
>> mark_request_serialising(req)
>> ...
>> wait_serialising_requests(req);
>> ...
>> wait_serialising_requests(req);
>>
>>   will have deadlock?
>
> Yes, it can deadlock (it doesn't have to, it depends on whether another
> overlapping request is started concurrently). More precisely, the
> problematic pattern is:
>
>      mark_request_serialising(req);
>      ...
>      qemu_coroutine_yield(); /* Other requests may be issued now */
>      ...
>      wait_serialising_requests(req);
>
> What you mentioned above is a special case of this.
>

   It seems when two overlapping request exist in the qeueue, and
both are waiting, problem happens. Thanks for explanation, I am fine
with this patch.


>>   I thought it is already resolved by patch 15?
>> Maybe here is another deadlock reason?
>
> The problematic pattern in patch 15 was:
>
>      mark_request_serialising(req);
>      ... /* no yield here */
>      wait_serialising_requests(req);
>
> As opposed to the originally used:
>
>      wait_serialising_requests(req);
>      ... /* no yield here */
>      mark_request_serialising(req);
>
> Kevin
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev()
  2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
  2014-01-10 18:11   ` Max Reitz
@ 2014-01-16 12:25   ` Kevin Wolf
  2014-01-17  1:38     ` Fam Zheng
  1 sibling, 1 reply; 57+ messages in thread
From: Kevin Wolf @ 2014-01-16 12:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, pl, stefanha

Am 13.12.2013 um 14:22 hat Kevin Wolf geschrieben:
> This is going to become the bdrv_co_do_preadv() equivalent for writes.
> In this patch, however, just a function taking byte offsets is created,
> it doesn't align anything yet.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 385fb8a..a80db2e 100644
> --- a/block.c
> +++ b/block.c
> @@ -3010,8 +3010,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>  /*
>   * Handle a write request in coroutine context
>   */
> -static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> +static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
>      BdrvRequestFlags flags)
>  {
>      int ret;
> @@ -3022,21 +3022,32 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>      if (bs->read_only) {
>          return -EACCES;
>      }
> -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> +    if (bdrv_check_byte_request(bs, offset, bytes)) {
>          return -EIO;
>      }
>  
>      /* throttling disk I/O */
>      if (bs->io_limits_enabled) {
> -        bdrv_io_limits_intercept(bs, nb_sectors, true);
> +        bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
>      }

Oh nice, this shifts in the wrong direction.

If somebody feels like writing a test case, something testing that I/O
throttling actually throttles would be useful...

Kevin

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev()
  2014-01-16 12:25   ` Kevin Wolf
@ 2014-01-17  1:38     ` Fam Zheng
  0 siblings, 0 replies; 57+ messages in thread
From: Fam Zheng @ 2014-01-17  1:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pbonzini, pl, qemu-devel, stefanha

On Thu, 01/16 13:25, Kevin Wolf wrote:
> Am 13.12.2013 um 14:22 hat Kevin Wolf geschrieben:
> > This is going to become the bdrv_co_do_preadv() equivalent for writes.
> > In this patch, however, just a function taking byte offsets is created,
> > it doesn't align anything yet.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block.c | 23 +++++++++++++++++------
> >  1 file changed, 17 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 385fb8a..a80db2e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -3010,8 +3010,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> >  /*
> >   * Handle a write request in coroutine context
> >   */
> > -static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> > -    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> > +static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
> > +    int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
> >      BdrvRequestFlags flags)
> >  {
> >      int ret;
> > @@ -3022,21 +3022,32 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
> >      if (bs->read_only) {
> >          return -EACCES;
> >      }
> > -    if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> > +    if (bdrv_check_byte_request(bs, offset, bytes)) {
> >          return -EIO;
> >      }
> >  
> >      /* throttling disk I/O */
> >      if (bs->io_limits_enabled) {
> > -        bdrv_io_limits_intercept(bs, nb_sectors, true);
> > +        bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
> >      }
> 
> Oh nice, this shifts in the wrong direction.
> 
> If somebody feels like writing a test case, something testing that I/O
> throttling actually throttles would be useful...
> 

Good idea. Should be possible for a basic test in qemu-iotests. I'll give it a
try.

Thanks,
Fam

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2014-01-17  1:38 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-13 13:22 [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Kevin Wolf
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 01/24] block: Move initialisation of BlockLimits to bdrv_refresh_limits() Kevin Wolf
2014-01-11 22:24   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 02/24] block: Inherit opt_transfer_length Kevin Wolf
2014-01-11 22:24   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 03/24] block: Update BlockLimits when they might have changed Kevin Wolf
2014-01-11 22:24   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 04/24] qemu_memalign: Allow small alignments Kevin Wolf
2014-01-11 22:25   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 05/24] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
2014-01-11 22:25   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 06/24] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
2014-01-11 22:25   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 07/24] block: rename buffer_alignment to guest_block_size Kevin Wolf
2014-01-11 22:26   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 08/24] raw: Probe required direct I/O alignment Kevin Wolf
2013-12-24  3:39   ` Wenchao Xia
2014-01-11 22:26   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 09/24] block: Introduce bdrv_aligned_preadv() Kevin Wolf
2013-12-24  7:23   ` Wenchao Xia
2014-01-11 22:27   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 10/24] block: Introduce bdrv_co_do_preadv() Kevin Wolf
2013-12-26  4:08   ` Wenchao Xia
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 11/24] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
2014-01-11 22:27   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 12/24] block: write: Handle COR dependency after I/O throttling Kevin Wolf
2014-01-11 22:27   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 13/24] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
2014-01-10 18:11   ` Max Reitz
2014-01-16 12:25   ` Kevin Wolf
2014-01-17  1:38     ` Fam Zheng
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 14/24] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
2014-01-10 18:34   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 15/24] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
2014-01-11 22:28   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 16/24] block: Make zero-after-EOF work with larger alignment Kevin Wolf
2014-01-10 19:07   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 17/24] block: Generalise and optimise COR serialisation Kevin Wolf
2014-01-11 22:28   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 18/24] block: Make overlap range for serialisation dynamic Kevin Wolf
2014-01-11 22:28   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 19/24] block: Allow wait_serialising_requests() at any point Kevin Wolf
2013-12-27  4:17   ` Wenchao Xia
2014-01-13 11:29     ` Kevin Wolf
2014-01-14  3:13       ` Wenchao Xia
2014-01-11 22:29   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 20/24] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
2014-01-11 22:29   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 21/24] block: Change coroutine wrapper to byte granularity Kevin Wolf
2014-01-11 22:29   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 22/24] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
2014-01-11 22:29   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 23/24] block: Make bdrv_pwrite() " Kevin Wolf
2014-01-11 22:22   ` Max Reitz
2013-12-13 13:22 ` [Qemu-devel] [PATCH v2 24/24] iscsi: Set bs->request_alignment Kevin Wolf
2014-01-11 22:30   ` Max Reitz
2013-12-27  4:27 ` [Qemu-devel] [PATCH v2 00/24] block: Support for 512b-on-4k emulatio Wenchao Xia

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.