* [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read
@ 2011-11-17 13:40 Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The new -drive copy-on-read=on|off feature populates the image file with data
from the backing file on read.  This is useful when accessing images whose
backing file lives on a slow medium (e.g. HTTP over the internet).  All read
data is stored in the local image file so it does not need to be fetched again
in the future.

This series is a prerequisite for the image streaming feature, which uses
copy-on-read to populate the image file in the background while the VM is
running.  However, the copy-on-read feature is useful on its own.

Copy-on-read is implemented by checking whether or not data is allocated in the
image file before reading it.  If data is not allocated then it needs to be
read and written back to the image file.

The tricky bit is avoiding races with other I/O requests.  These patches add
request tracking to BlockDriverState so that the list of pending requests is
available.  Copy-on-read prevents races by serializing overlapping requests.

Finally, there is a performance impact when enabling this feature since an
additional write is performed.  Serializing overlapping requests also means
that I/O patterns where multiple requests access the same cluster will see a
loss in parallelism.  Perhaps we can be smarter about preventing corruption in
the future and win back some performance.

v2:
 * Based on bdrv_co_is_allocated patch series - now safe in coroutine context
 * Use QEMU_ALIGN_DOWN/UP() macros for copy-on-read cluster calculations [Zhi Yong]
 * Reset bs->copy_on_read on bdrv_close() [Kevin]
 * Refcount bs->copy_on_read so it doesn't get clobbered by multiple users [Marcelo]
 * Use bool instead of int where appropriate [Kevin]
 * Use compound literal assignment to ensure BdrvTrackedRequest fields always get zeroed [Kevin]
 * Comment rationale for copy-on-read bounce buffer [Kevin]

Stefan Hajnoczi (8):
  qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros
  coroutine: add qemu_co_queue_restart_all()
  block: add request tracking
  block: add bdrv_set_copy_on_read()
  block: wait for overlapping requests
  block: request overlap detection
  block: core copy-on-read logic
  block: add -drive copy-on-read=on|off

 block.c               |  213 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block.h               |    3 +
 block/qcow2.c         |    2 +-
 block_int.h           |    6 ++
 blockdev.c            |    6 ++
 hmp-commands.hx       |    5 +-
 qemu-common.h         |    6 ++
 qemu-config.c         |    4 +
 qemu-coroutine-lock.c |   15 ++--
 qemu-coroutine.h      |    5 +
 qemu-options.hx       |    9 ++-
 trace-events          |    1 +
 12 files changed, 263 insertions(+), 12 deletions(-)

-- 
1.7.7.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Add macros for aligning a number to a multiple, for example:

QEMU_ALIGN_DOWN(500, 2000) = 0
QEMU_ALIGN_UP(500, 2000) = 2000

Since ALIGN_UP() is a common macro name, use the QEMU_* namespace prefix.
Hopefully this will protect us from included headers that leak something
with a similar name.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 qemu-common.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/qemu-common.h b/qemu-common.h
index 2ce47aa..44870fe 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -341,6 +341,12 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
     return res.ll;
 }
 
+/* Round number down to multiple */
+#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
+
+/* Round number up to multiple */
+#define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
+
 #include "module.h"
 
 #endif
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 2/8] coroutine: add qemu_co_queue_restart_all()
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 3/8] block: add request tracking Stefan Hajnoczi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

It's common to wake up all waiting coroutines.  Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qcow2.c         |    2 +-
 qemu-coroutine-lock.c |   15 ++++++++-------
 qemu-coroutine.h      |    5 +++++
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index eab35d1..195e1b1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -514,7 +514,7 @@ static void run_dependent_requests(BDRVQcowState *s, QCowL2Meta *m)
     /* Restart all dependent requests */
     if (!qemu_co_queue_empty(&m->dependent_requests)) {
         qemu_co_mutex_unlock(&s->lock);
-        while(qemu_co_queue_next(&m->dependent_requests));
+        qemu_co_queue_restart_all(&m->dependent_requests);
         qemu_co_mutex_lock(&s->lock);
     }
 }
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index 9549c07..26ad76b 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -84,6 +84,13 @@ bool qemu_co_queue_next(CoQueue *queue)
     return (next != NULL);
 }
 
+void qemu_co_queue_restart_all(CoQueue *queue)
+{
+    while (qemu_co_queue_next(queue)) {
+        /* Do nothing */
+    }
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
     return (QTAILQ_FIRST(&queue->entries) == NULL);
@@ -144,13 +151,7 @@ void qemu_co_rwlock_unlock(CoRwlock *lock)
     assert(qemu_in_coroutine());
     if (lock->writer) {
         lock->writer = false;
-        while (!qemu_co_queue_empty(&lock->queue)) {
-            /*
-             * Wakeup every body. This will include some
-             * writers too.
-             */
-            qemu_co_queue_next(&lock->queue);
-        }
+        qemu_co_queue_restart_all(&lock->queue);
     } else {
         lock->reader--;
         assert(lock->reader >= 0);
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
index 8a2e5d2..8a55fe1 100644
--- a/qemu-coroutine.h
+++ b/qemu-coroutine.h
@@ -131,6 +131,11 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue);
 bool qemu_co_queue_next(CoQueue *queue);
 
 /**
+ * Restarts all coroutines in the CoQueue and leaves the queue empty.
+ */
+void qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
  * Checks if the CoQueue is empty.
  */
 bool qemu_co_queue_empty(CoQueue *queue);
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 3/8] block: add request tracking
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The block layer does not know about pending requests.  This information
is necessary for copy-on-read since overlapping requests must be
serialized to prevent races that corrupt the image.

The BlockDriverState gets a new tracked_requests list field which
contains all pending requests.  Each request is a BdrvTrackedRequest
record with sector_num, nb_sectors, and is_write fields.

Note that request tracking is always enabled; the extra work should be small
enough that it does not justify adding an enable/disable flag.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c     |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 block_int.h |    4 ++++
 2 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/block.c b/block.c
index 0df7eb9..27c4e84 100644
--- a/block.c
+++ b/block.c
@@ -1071,6 +1071,42 @@ void bdrv_commit_all(void)
     }
 }
 
+struct BdrvTrackedRequest {
+    BlockDriverState *bs;
+    int64_t sector_num;
+    int nb_sectors;
+    bool is_write;
+    QLIST_ENTRY(BdrvTrackedRequest) list;
+};
+
+/**
+ * Remove an active request from the tracked requests list
+ *
+ * This function should be called when a tracked request is completing.
+ */
+static void tracked_request_end(BdrvTrackedRequest *req)
+{
+    QLIST_REMOVE(req, list);
+}
+
+/**
+ * Add an active request to the tracked requests list
+ */
+static void tracked_request_begin(BdrvTrackedRequest *req,
+                                  BlockDriverState *bs,
+                                  int64_t sector_num,
+                                  int nb_sectors, bool is_write)
+{
+    *req = (BdrvTrackedRequest){
+        .bs = bs,
+        .sector_num = sector_num,
+        .nb_sectors = nb_sectors,
+        .is_write = is_write,
+    };
+
+    QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
+}
+
 /*
  * Return values:
  * 0        - success
@@ -1350,6 +1386,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
+    int ret;
 
     if (!drv) {
         return -ENOMEDIUM;
@@ -1363,7 +1401,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, false, nb_sectors);
     }
 
-    return drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    tracked_request_end(&req);
+    return ret;
 }
 
 int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
@@ -1381,6 +1422,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BdrvTrackedRequest req;
     int ret;
 
     if (!bs->drv) {
@@ -1398,6 +1440,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
+    tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+
     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
 
     if (bs->dirty_bitmap) {
@@ -1408,6 +1452,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bs->wr_highest_sector = sector_num + nb_sectors - 1;
     }
 
+    tracked_request_end(&req);
+
     return ret;
 }
 
diff --git a/block_int.h b/block_int.h
index f9e2c9a..788fde9 100644
--- a/block_int.h
+++ b/block_int.h
@@ -51,6 +51,8 @@
 #define BLOCK_OPT_PREALLOC      "preallocation"
 #define BLOCK_OPT_SUBFMT        "subformat"
 
+typedef struct BdrvTrackedRequest BdrvTrackedRequest;
+
 typedef struct AIOPool {
     void (*cancel)(BlockDriverAIOCB *acb);
     int aiocb_size;
@@ -250,6 +252,8 @@ struct BlockDriverState {
     int in_use; /* users other than guest access, eg. block migration */
     QTAILQ_ENTRY(BlockDriverState) list;
     void *private;
+
+    QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
 };
 
 struct BlockDriverAIOCB {
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 4/8] block: add bdrv_set_copy_on_read()
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 3/8] block: add request tracking Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests Stefan Hajnoczi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

The bdrv_set_copy_on_read() function can be used to programmatically
enable or disable copy-on-read for a block device.  Later patches add
the actual copy-on-read logic.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c     |   22 ++++++++++++++++++++++
 block.h     |    3 +++
 block_int.h |    2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 27c4e84..c90880b 100644
--- a/block.c
+++ b/block.c
@@ -538,6 +538,22 @@ int bdrv_parse_cache_flags(const char *mode, int *flags)
     return 0;
 }
 
+/**
+ * Enable/disable copy-on-read
+ *
+ * This is based on a reference count so multiple users may call this function
+ * without worrying about clobbering the previous state.  Copy-on-read stays
+ * enabled until all users have called to disable it.
+ */
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable)
+{
+    if (enable) {
+        bs->copy_on_read++;
+    } else {
+        bs->copy_on_read--;
+    }
+}
+
 /*
  * Common part for opening disk images and files
  */
@@ -559,6 +575,11 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
     bs->growable = 0;
     bs->buffer_alignment = 512;
 
+    assert(bs->copy_on_read == 0); /* bdrv_new() and bdrv_close() make it so */
+    if (flags & BDRV_O_RDWR) {
+        bdrv_set_copy_on_read(bs, !!(flags & BDRV_O_COPY_ON_READ));
+    }
+
     pstrcpy(bs->filename, sizeof(bs->filename), filename);
     bs->backing_file[0] = '\0';
 
@@ -801,6 +822,7 @@ void bdrv_close(BlockDriverState *bs)
 #endif
         bs->opaque = NULL;
         bs->drv = NULL;
+        bs->copy_on_read = 0;
 
         if (bs->file != NULL) {
             bdrv_close(bs->file);
diff --git a/block.h b/block.h
index ad8dd48..68b4b14 100644
--- a/block.h
+++ b/block.h
@@ -70,6 +70,7 @@ typedef struct BlockDevOps {
 #define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
 #define BDRV_O_NO_BACKING  0x0100 /* don't open the backing file */
 #define BDRV_O_NO_FLUSH    0x0200 /* disable flushing on this disk */
+#define BDRV_O_COPY_ON_READ 0x0400 /* copy read backing sectors into image */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
 
@@ -308,6 +309,8 @@ void bdrv_reset_dirty(BlockDriverState *bs, int64_t cur_sector,
                       int nr_sectors);
 int64_t bdrv_get_dirty_count(BlockDriverState *bs);
 
+void bdrv_set_copy_on_read(BlockDriverState *bs, bool enable);
+
 void bdrv_set_in_use(BlockDriverState *bs, int in_use);
 int bdrv_in_use(BlockDriverState *bs);
 
diff --git a/block_int.h b/block_int.h
index 788fde9..3c5bacb 100644
--- a/block_int.h
+++ b/block_int.h
@@ -193,6 +193,8 @@ struct BlockDriverState {
     int encrypted; /* if true, the media is encrypted */
     int valid_key; /* if true, a valid encryption key has been set */
     int sg;        /* if true, the device is a /dev/sg* */
+    int copy_on_read; /* if true, copy read backing sectors into image
+                         note this is a reference count */
 
     BlockDriver *drv; /* NULL means no media */
     void *opaque;
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-17 13:43   ` Paolo Bonzini
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 6/8] block: request overlap detection Stefan Hajnoczi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

When copy-on-read is enabled it is necessary to wait for overlapping
requests before issuing new requests.  This prevents races between the
copy-on-read and a write request.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index c90880b..da7aaa2 100644
--- a/block.c
+++ b/block.c
@@ -1099,6 +1099,7 @@ struct BdrvTrackedRequest {
     int nb_sectors;
     bool is_write;
     QLIST_ENTRY(BdrvTrackedRequest) list;
+    CoQueue wait_queue; /* coroutines blocked on this request */
 };
 
 /**
@@ -1109,6 +1110,7 @@ struct BdrvTrackedRequest {
 static void tracked_request_end(BdrvTrackedRequest *req)
 {
     QLIST_REMOVE(req, list);
+    qemu_co_queue_restart_all(&req->wait_queue);
 }
 
 /**
@@ -1126,9 +1128,34 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
         .is_write = is_write,
     };
 
+    qemu_co_queue_init(&req->wait_queue);
+
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+static bool tracked_request_overlaps(BdrvTrackedRequest *req,
+                                     int64_t sector_num, int nb_sectors) {
+    return false; /* not yet implemented */
+}
+
+static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors)
+{
+    BdrvTrackedRequest *req;
+    bool retry;
+
+    do {
+        retry = false;
+        QLIST_FOREACH(req, &bs->tracked_requests, list) {
+            if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+                qemu_co_queue_wait(&req->wait_queue);
+                retry = true;
+                break;
+            }
+        }
+    } while (retry);
+}
+
 /*
  * Return values:
  * 0        - success
@@ -1423,6 +1450,10 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, false, nb_sectors);
     }
 
+    if (bs->copy_on_read) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
     tracked_request_end(&req);
@@ -1462,6 +1493,10 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
+    if (bs->copy_on_read) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    }
+
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 6/8] block: request overlap detection
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-22 15:06   ` Kevin Wolf
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic Stefan Hajnoczi
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
  7 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Detect overlapping requests, taking care to align to cluster boundaries if
the image format uses them.  This assumes that allocating I/O is performed
at cluster granularity, which is true for qcow2, qed, etc.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   40 ++++++++++++++++++++++++++++++++++++++--
 1 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index da7aaa2..0eef122 100644
--- a/block.c
+++ b/block.c
@@ -1133,21 +1133,57 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
 }
 
+/**
+ * Round a region to cluster boundaries
+ */
+static void round_to_clusters(BlockDriverState *bs,
+                              int64_t sector_num, int nb_sectors,
+                              int64_t *cluster_sector_num,
+                              int *cluster_nb_sectors)
+{
+    BlockDriverInfo bdi;
+
+    if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+        *cluster_sector_num = sector_num;
+        *cluster_nb_sectors = nb_sectors;
+    } else {
+        int64_t c = bdi.cluster_size / BDRV_SECTOR_SIZE;
+        *cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
+        *cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
+                                            nb_sectors, c);
+    }
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t sector_num, int nb_sectors) {
-    return false; /* not yet implemented */
+    /*        aaaa   bbbb */
+    if (sector_num >= req->sector_num + req->nb_sectors) {
+        return false;
+    }
+    /* bbbb   aaaa        */
+    if (req->sector_num >= sector_num + nb_sectors) {
+        return false;
+    }
+    return true;
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors)
 {
     BdrvTrackedRequest *req;
+    int64_t cluster_sector_num;
+    int cluster_nb_sectors;
     bool retry;
 
+    /* If we touch the same cluster it counts as an overlap */
+    round_to_clusters(bs, sector_num, nb_sectors,
+                      &cluster_sector_num, &cluster_nb_sectors);
+
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (tracked_request_overlaps(req, sector_num, nb_sectors)) {
+            if (tracked_request_overlaps(req, cluster_sector_num,
+                                         cluster_nb_sectors)) {
                 qemu_co_queue_wait(&req->wait_queue);
                 retry = true;
                 break;
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 6/8] block: request overlap detection Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  2011-11-23  3:42   ` Zhi Yong Wu
  2011-11-23  4:42   ` Zhi Yong Wu
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
  7 siblings, 2 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 trace-events |    1 +
 2 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 0eef122..d5faa6c 100644
--- a/block.c
+++ b/block.c
@@ -1464,6 +1464,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
     return 0;
 }
 
+static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+{
+    /* Perform I/O through a temporary buffer so that users who scribble over
+     * their read buffer while the operation is in progress do not end up
+     * modifying the image file.  This is critical for zero-copy guest I/O
+     * where anything might happen inside guest memory.
+     */
+    void *bounce_buffer;
+
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    int64_t cluster_sector_num;
+    int cluster_nb_sectors;
+    size_t skip_bytes;
+    int ret;
+
+    /* Cover entire cluster so no additional backing file I/O is required when
+     * allocating cluster in the image file.
+     */
+    round_to_clusters(bs, sector_num, nb_sectors,
+                      &cluster_sector_num, &cluster_nb_sectors);
+
+    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
+                                cluster_sector_num, cluster_nb_sectors);
+
+    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
+    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+                                 &bounce_qiov);
+    if (ret < 0) {
+        goto err;
+    }
+
+    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+                                  &bounce_qiov);
+    if (ret < 0) {
+        /* It might be okay to ignore write errors for guest requests.  If this
+         * is a deliberate copy-on-read then we don't want to ignore the error.
+         * Simply report it in all cases.
+         */
+        goto err;
+    }
+
+    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
+    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
+                           nb_sectors * BDRV_SECTOR_SIZE);
+
+err:
+    qemu_vfree(bounce_buffer);
+    return ret;
+}
+
 /*
  * Handle a read request in coroutine context
  */
@@ -1491,7 +1546,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+
+    if (bs->copy_on_read) {
+        int pnum;
+
+        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
+        if (ret < 0) {
+            goto out;
+        }
+
+        if (!ret || pnum != nb_sectors) {
+            ret = bdrv_co_copy_on_readv(bs, sector_num, nb_sectors, qiov);
+            goto out;
+        }
+    }
+
     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+
+out:
     tracked_request_end(&req);
     return ret;
 }
diff --git a/trace-events b/trace-events
index 962caca..518b76b 100644
--- a/trace-events
+++ b/trace-events
@@ -69,6 +69,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
+bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
 
 # hw/virtio-blk.c
 virtio_blk_req_complete(void *req, int status) "req %p status %d"
-- 
1.7.7.1


* [Qemu-devel] [PATCH v2 8/8] block: add -drive copy-on-read=on|off
  2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-11-17 13:40 ` Stefan Hajnoczi
  7 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi

This patch adds the -drive copy-on-read=on|off command-line option:

  copy-on-read=on|off
  copy-on-read is "on" or "off" and controls whether read backing file
  sectors are copied into the image file.  Copy-on-read avoids accessing
  the same backing file sectors repeatedly and is useful when the backing
  file is over a slow network.  By default copy-on-read is off.
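
A hypothetical invocation (file names and paths invented for illustration):
create a local qcow2 image backed by a file on slow storage, then let reads
populate the local image as the guest runs:

```shell
# Thin local image on fast storage, backed by an image on slow storage
qemu-img create -f qcow2 -b /slow/nfs/base.img local.qcow2

# Guest reads pull data from base.img into local.qcow2 exactly once
qemu -drive file=local.qcow2,copy-on-read=on
```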

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 blockdev.c      |    6 ++++++
 hmp-commands.hx |    5 +++--
 qemu-config.c   |    4 ++++
 qemu-options.hx |    9 ++++++++-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 9068c5b..af4e239 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -257,6 +257,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
     DriveInfo *dinfo;
     BlockIOLimit io_limits;
     int snapshot = 0;
+    bool copy_on_read;
     int ret;
 
     translation = BIOS_ATA_TRANSLATION_AUTO;
@@ -273,6 +274,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 
     snapshot = qemu_opt_get_bool(opts, "snapshot", 0);
     ro = qemu_opt_get_bool(opts, "readonly", 0);
+    copy_on_read = qemu_opt_get_bool(opts, "copy-on-read", false);
 
     file = qemu_opt_get(opts, "file");
     serial = qemu_opt_get(opts, "serial");
@@ -546,6 +548,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
         bdrv_flags |= (BDRV_O_SNAPSHOT|BDRV_O_CACHE_WB|BDRV_O_NO_FLUSH);
     }
 
+    if (copy_on_read) {
+        bdrv_flags |= BDRV_O_COPY_ON_READ;
+    }
+
     if (media == MEDIA_CDROM) {
         /* CDROM is fine for any interface, don't check.  */
         ro = 1;
diff --git a/hmp-commands.hx b/hmp-commands.hx
index f8d855e..79a9195 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -860,9 +860,10 @@ ETEXI
         .args_type  = "pci_addr:s,opts:s",
         .params     = "[[<domain>:]<bus>:]<slot>\n"
                       "[file=file][,if=type][,bus=n]\n"
-                      "[,unit=m][,media=d][index=i]\n"
+                      "[,unit=m][,media=d][,index=i]\n"
                       "[,cyls=c,heads=h,secs=s[,trans=t]]\n"
-                      "[snapshot=on|off][,cache=on|off]",
+                      "[,snapshot=on|off][,cache=on|off]\n"
+                      "[,readonly=on|off][,copy-on-read=on|off]",
         .help       = "add drive to PCI storage controller",
         .mhandler.cmd = drive_hot_add,
     },
diff --git a/qemu-config.c b/qemu-config.c
index 1aa080f..18f3020 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -109,6 +109,10 @@ static QemuOptsList qemu_drive_opts = {
             .name = "bps_wr",
             .type = QEMU_OPT_NUMBER,
             .help = "limit write bytes per second",
+        },{
+            .name = "copy-on-read",
+            .type = QEMU_OPT_BOOL,
+            .help = "copy read data from backing file into image file",
         },
         { /* end of list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 25a7be7..b3db10c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -135,7 +135,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none|directsync|unsafe][,format=f]\n"
     "       [,serial=s][,addr=A][,id=name][,aio=threads|native]\n"
-    "       [,readonly=on|off]\n"
+    "       [,readonly=on|off][,copy-on-read=on|off]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]][[,iops=i]|[[,iops_rd=r][,iops_wr=w]]\n"
     "                use 'file' as a drive image\n", QEMU_ARCH_ALL)
 STEXI
@@ -187,6 +187,9 @@ host disk is full; report the error to the guest otherwise).
 The default setting is @option{werror=enospc} and @option{rerror=report}.
 @item readonly
 Open drive @option{file} as read-only. Guest write attempts will fail.
+@item copy-on-read=@var{copy-on-read}
+@var{copy-on-read} is "on" or "off" and controls whether read backing file
+sectors are copied into the image file.
 @end table
 
 By default, writethrough caching is used for all block devices.  This means that
@@ -218,6 +221,10 @@ like your host losing power, the disk storage getting disconnected accidentally,
 etc. your image will most probably be rendered unusable.   When using
 the @option{-snapshot} option, unsafe caching is always used.
 
+Copy-on-read avoids accessing the same backing file sectors repeatedly and is
+useful when the backing file is over a slow network.  By default copy-on-read
+is off.
+
 Instead of @option{-cdrom} you can use:
 @example
 qemu -drive file=file,index=2,media=cdrom
-- 
1.7.7.1
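As a usage sketch (file names are placeholders, not taken from the patch), enabling the new option from the command line would look like this:

```shell
# Create a local image backed by a (possibly slow, remote) base image,
# then boot with copy-on-read enabled so data read from the backing
# file is also written into the local image.  Names are illustrative.
qemu-img create -f qcow2 -b base.qcow2 local.qcow2
qemu -drive file=local.qcow2,copy-on-read=on
```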

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests Stefan Hajnoczi
@ 2011-11-17 13:43   ` Paolo Bonzini
  2011-11-17 13:51     ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2011-11-17 13:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

On 11/17/2011 02:40 PM, Stefan Hajnoczi wrote:
> When copy-on-read is enabled it is necessary to wait for overlapping
> requests before issuing new requests.  This prevents races between the
> copy-on-read and a write request.

What about discards?

Paolo


* Re: [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests
  2011-11-17 13:43   ` Paolo Bonzini
@ 2011-11-17 13:51     ` Stefan Hajnoczi
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-17 13:51 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Thu, Nov 17, 2011 at 1:43 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/17/2011 02:40 PM, Stefan Hajnoczi wrote:
>>
>> When copy-on-read is enabled it is necessary to wait for overlapping
>> requests before issuing new requests.  This prevents races between the
>> copy-on-read and a write request.
>
> What about discards?

To get into an interesting scenario the guest would need to issue
overlapping read and discard requests.  QEMU with copy-on-read turns
this into either:

discard, read-from-backing-file, write-to-image-file
read-from-backing-file, discard, write-to-image-file
read-from-backing-file, write-to-image-file, discard

There is no issue with any of these orderings.  In the worst case we
end up with allocated image space where the guest issued a discard.
But since discard is a hint anyway it doesn't matter.

Stefan


* Re: [Qemu-devel] [PATCH v2 6/8] block: request overlap detection
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 6/8] block: request overlap detection Stefan Hajnoczi
@ 2011-11-22 15:06   ` Kevin Wolf
  2011-11-22 15:10     ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Kevin Wolf @ 2011-11-22 15:06 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel

Am 17.11.2011 14:40, schrieb Stefan Hajnoczi:
> Detect overlapping requests and remember to align to cluster boundaries
> if the image format uses them.  This assumes that allocating I/O is
> performed in cluster granularity - which is true for qcow2, qed, etc.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

>  static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>          int64_t sector_num, int nb_sectors)
>  {
>      BdrvTrackedRequest *req;
> +    int64_t cluster_sector_num;
> +    int cluster_nb_sectors;
>      bool retry;
>  
> +    /* If we touch the same cluster it counts as an overlap */
> +    round_to_clusters(bs, sector_num, nb_sectors,
> +                      &cluster_sector_num, &cluster_nb_sectors);

Is this really required? Image formats must be able to deal with two
concurrent write requests to the same cluster, and I don't think it
makes a difference whether it's a guest write request or a COR one.

Or does the queuing protect more than just that a COR never takes
precedence over a guest write?

Kevin


* Re: [Qemu-devel] [PATCH v2 6/8] block: request overlap detection
  2011-11-22 15:06   ` Kevin Wolf
@ 2011-11-22 15:10     ` Stefan Hajnoczi
  2011-11-22 16:36       ` Kevin Wolf
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-22 15:10 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Tue, Nov 22, 2011 at 3:06 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 17.11.2011 14:40, schrieb Stefan Hajnoczi:
>> Detect overlapping requests and remember to align to cluster boundaries
>> if the image format uses them.  This assumes that allocating I/O is
>> performed in cluster granularity - which is true for qcow2, qed, etc.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>
>>  static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>>          int64_t sector_num, int nb_sectors)
>>  {
>>      BdrvTrackedRequest *req;
>> +    int64_t cluster_sector_num;
>> +    int cluster_nb_sectors;
>>      bool retry;
>>
>> +    /* If we touch the same cluster it counts as an overlap */
>> +    round_to_clusters(bs, sector_num, nb_sectors,
>> +                      &cluster_sector_num, &cluster_nb_sectors);
>
> Is this really required? Image formats must be able to deal with two
> concurrent write requests to the same cluster, and I don't think it
> makes a difference whether it's a guest write request or a COR one.
>
> Or does the queuing protect more than just that a COR never takes
> precedence over a guest write?

It guarantees that a write request and a copy-on-read request racing
for the same cluster will be serialized.  Either:
1. CoR, then write.
2. Allocating write, then normal read.

If we don't do this we risk:
3. CoR (part 1: read), allocating write, CoR (part 2: write)

The result is that we dropped the write request!

Stefan


* Re: [Qemu-devel] [PATCH v2 6/8] block: request overlap detection
  2011-11-22 15:10     ` Stefan Hajnoczi
@ 2011-11-22 16:36       ` Kevin Wolf
  0 siblings, 0 replies; 20+ messages in thread
From: Kevin Wolf @ 2011-11-22 16:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

Am 22.11.2011 16:10, schrieb Stefan Hajnoczi:
> On Tue, Nov 22, 2011 at 3:06 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 17.11.2011 14:40, schrieb Stefan Hajnoczi:
>>> Detect overlapping requests and remember to align to cluster boundaries
>>> if the image format uses them.  This assumes that allocating I/O is
>>> performed in cluster granularity - which is true for qcow2, qed, etc.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>
>>>  static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
>>>          int64_t sector_num, int nb_sectors)
>>>  {
>>>      BdrvTrackedRequest *req;
>>> +    int64_t cluster_sector_num;
>>> +    int cluster_nb_sectors;
>>>      bool retry;
>>>
>>> +    /* If we touch the same cluster it counts as an overlap */
>>> +    round_to_clusters(bs, sector_num, nb_sectors,
>>> +                      &cluster_sector_num, &cluster_nb_sectors);
>>
>> Is this really required? Image formats must be able to deal with two
>> concurrent write requests to the same cluster, and I don't think it
>> makes a difference whether it's a guest write request or a COR one.
>>
>> Or does the queuing protect more than just that a COR never takes
>> precedence over a guest write?
> 
> It guarantees that a write request and a copy-on-read request racing
> for the same cluster will be serialized.  Either:
> 1. CoR, then write.
> 2. Allocating write, then normal read.
> 
> If we don't do this we risk:
> 3. CoR (part 1: read), allocating write, CoR (part 2: write)
> 
> The result is that we dropped the write request!

Right, this requirement comes in with the next patch that makes COR
round clusters for optimisation. I think this relationship deserves a
comment here.

Anyway, I merged the series into the block branch (Only to notice that
it would have been better to merge bdrv_co_is_allocated first... Will
review that tomorrow.) If you send another version for this patch to add
a comment, I'll replace it.

Kevin


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic Stefan Hajnoczi
@ 2011-11-23  3:42   ` Zhi Yong Wu
  2011-11-23  9:05     ` Stefan Hajnoczi
  2011-11-23  4:42   ` Zhi Yong Wu
  1 sibling, 1 reply; 20+ messages in thread
From: Zhi Yong Wu @ 2011-11-23  3:42 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, qemu-devel

On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  trace-events |    1 +
>  2 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/block.c b/block.c
> index 0eef122..d5faa6c 100644
> --- a/block.c
> +++ b/block.c
> @@ -1464,6 +1464,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
>     return 0;
>  }
>
> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> +        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> +{
> +    /* Perform I/O through a temporary buffer so that users who scribble over
> +     * their read buffer while the operation is in progress do not end up
> +     * modifying the image file.  This is critical for zero-copy guest I/O
> +     * where anything might happen inside guest memory.
> +     */
> +    void *bounce_buffer;
> +
> +    struct iovec iov;
> +    QEMUIOVector bounce_qiov;
> +    int64_t cluster_sector_num;
> +    int cluster_nb_sectors;
> +    size_t skip_bytes;
> +    int ret;
> +
> +    /* Cover entire cluster so no additional backing file I/O is required when
> +     * allocating cluster in the image file.
> +     */
> +    round_to_clusters(bs, sector_num, nb_sectors,
> +                      &cluster_sector_num, &cluster_nb_sectors);
> +
> +    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
> +                                cluster_sector_num, cluster_nb_sectors);
> +
> +    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
> +    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
> +    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
> +
> +    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
> +                                 &bounce_qiov);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +
> +    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
> +                                  &bounce_qiov);
> +    if (ret < 0) {
> +        /* It might be okay to ignore write errors for guest requests.  If this
> +         * is a deliberate copy-on-read then we don't want to ignore the error.
> +         * Simply report it in all cases.
> +         */
> +        goto err;
> +    }
> +
> +    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
> +    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
> +                           nb_sectors * BDRV_SECTOR_SIZE);
> +
> +err:
> +    qemu_vfree(bounce_buffer);
> +    return ret;
> +}
> +
>  /*
>  * Handle a read request in coroutine context
>  */
> @@ -1491,7 +1546,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>     }
>
>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +
> +    if (bs->copy_on_read) {
Why is tracked_request_begin/end() not put inside the braces guarded by
bs->copy_on_read?
If this COR function is not enabled, I guess the request tracking
functions should not be needed.
> +        int pnum;
> +
> +        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
> +        if (ret < 0) {
> +            goto out;
> +        }
> +
> +        if (!ret || pnum != nb_sectors) {
> +            ret = bdrv_co_copy_on_readv(bs, sector_num, nb_sectors, qiov);
> +            goto out;
> +        }
> +    }
> +
>     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +
> +out:
>     tracked_request_end(&req);
>     return ret;
>  }
> diff --git a/trace-events b/trace-events
> index 962caca..518b76b 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -69,6 +69,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>  bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
> +bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
>
>  # hw/virtio-blk.c
>  virtio_blk_req_complete(void *req, int status) "req %p status %d"
> --
> 1.7.7.1
>
>
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic Stefan Hajnoczi
  2011-11-23  3:42   ` Zhi Yong Wu
@ 2011-11-23  4:42   ` Zhi Yong Wu
  2011-11-23  8:58     ` Stefan Hajnoczi
  1 sibling, 1 reply; 20+ messages in thread
From: Zhi Yong Wu @ 2011-11-23  4:42 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, qemu-devel

On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  block.c      |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  trace-events |    1 +
>  2 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/block.c b/block.c
> index 0eef122..d5faa6c 100644
> --- a/block.c
> +++ b/block.c
> @@ -1464,6 +1464,61 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
>     return 0;
>  }
>
> +static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> +        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
> +{
> +    /* Perform I/O through a temporary buffer so that users who scribble over
> +     * their read buffer while the operation is in progress do not end up
> +     * modifying the image file.  This is critical for zero-copy guest I/O
> +     * where anything might happen inside guest memory.
> +     */
> +    void *bounce_buffer;
> +
> +    struct iovec iov;
> +    QEMUIOVector bounce_qiov;
> +    int64_t cluster_sector_num;
> +    int cluster_nb_sectors;
> +    size_t skip_bytes;
> +    int ret;
> +
> +    /* Cover entire cluster so no additional backing file I/O is required when
> +     * allocating cluster in the image file.
> +     */
> +    round_to_clusters(bs, sector_num, nb_sectors,
> +                      &cluster_sector_num, &cluster_nb_sectors);
Why do we need to round down/up sector_num/nb_sectors in this function?
The detection of racing write requests has already been done before
this function is called.

> +
> +    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
> +                                cluster_sector_num, cluster_nb_sectors);
> +
> +    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
> +    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
> +    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
> +
> +    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
> +                                 &bounce_qiov);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +
> +    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
> +                                  &bounce_qiov);
> +    if (ret < 0) {
> +        /* It might be okay to ignore write errors for guest requests.  If this
> +         * is a deliberate copy-on-read then we don't want to ignore the error.
> +         * Simply report it in all cases.
> +         */
> +        goto err;
> +    }
> +
> +    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
> +    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
> +                           nb_sectors * BDRV_SECTOR_SIZE);
> +
> +err:
> +    qemu_vfree(bounce_buffer);
> +    return ret;
> +}
> +
>  /*
>  * Handle a read request in coroutine context
>  */
> @@ -1491,7 +1546,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>     }
>
>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> +
> +    if (bs->copy_on_read) {
> +        int pnum;
> +
> +        ret = bdrv_co_is_allocated(bs, sector_num, nb_sectors, &pnum);
> +        if (ret < 0) {
> +            goto out;
> +        }
> +
> +        if (!ret || pnum != nb_sectors) {
> +            ret = bdrv_co_copy_on_readv(bs, sector_num, nb_sectors, qiov);
> +            goto out;
> +        }
> +    }
> +
>     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +
> +out:
>     tracked_request_end(&req);
>     return ret;
>  }
> diff --git a/trace-events b/trace-events
> index 962caca..518b76b 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -69,6 +69,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>  bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
> +bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
>
>  # hw/virtio-blk.c
>  virtio_blk_req_complete(void *req, int status) "req %p status %d"
> --
> 1.7.7.1
>
>
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-23  4:42   ` Zhi Yong Wu
@ 2011-11-23  8:58     ` Stefan Hajnoczi
  2011-11-23  9:00       ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23  8:58 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Wed, Nov 23, 2011 at 4:42 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
>> +    /* Cover entire cluster so no additional backing file I/O is required when
>> +     * allocating cluster in the image file.
>> +     */
>> +    round_to_clusters(bs, sector_num, nb_sectors,
>> +                      &cluster_sector_num, &cluster_nb_sectors);
> Why do we need to round down/up sector_num/nb_sectors in this function?
> The detection of racing write requests has already been done before
> this function is called.

If we write less than a cluster then the image format will have to
perform additional to populate the regions of the cluster that we did
not touch.  So we touch entire clusters and therefore no extra I/O is
generated.

The comment right above the line you are asking about explains this.

Stefan


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-23  8:58     ` Stefan Hajnoczi
@ 2011-11-23  9:00       ` Stefan Hajnoczi
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23  9:00 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Wed, Nov 23, 2011 at 8:58 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Nov 23, 2011 at 4:42 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
>> <stefanha@linux.vnet.ibm.com> wrote:
>>> +    /* Cover entire cluster so no additional backing file I/O is required when
>>> +     * allocating cluster in the image file.
>>> +     */
>>> +    round_to_clusters(bs, sector_num, nb_sectors,
>>> +                      &cluster_sector_num, &cluster_nb_sectors);
>> Why do we need to round down/up sector_num/nb_sectors in this function?
>> The detection of racing write requests has already been done before
>> this function is called.
>
> If we write less than a cluster then the image format will have to
> perform additional to populate the regions of the cluster that we did

"perform additional I/O"


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-23  3:42   ` Zhi Yong Wu
@ 2011-11-23  9:05     ` Stefan Hajnoczi
  2011-11-23  9:51       ` Zhi Yong Wu
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2011-11-23  9:05 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Wed, Nov 23, 2011 at 3:42 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
>>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>> +
>> +    if (bs->copy_on_read) {
> Why is tracked_request_begin/end() not put inside the braces guarded by
> bs->copy_on_read?
> If this COR function is not enabled, I guess the request tracking
> functions should not be needed.

It's not safe to put the calls inside the "if (bs->copy_on_read) {"
body because turning off copy_on_read while a request is pending would
leave the request in the tracked list forever!

In a previous version of the series there was a flag to turn request
tracking on/off.  Pending requests would still remove themselves from
the list even after request tracking was disabled.

But request tracking is cheap - it involves filling in fields on the
stack and adding them to a linked list.  So to keep things simple we
always maintain this list.

Stefan


* Re: [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic
  2011-11-23  9:05     ` Stefan Hajnoczi
@ 2011-11-23  9:51       ` Zhi Yong Wu
  0 siblings, 0 replies; 20+ messages in thread
From: Zhi Yong Wu @ 2011-11-23  9:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Marcelo Tosatti, Stefan Hajnoczi, qemu-devel

On Wed, Nov 23, 2011 at 5:05 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Nov 23, 2011 at 3:42 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Thu, Nov 17, 2011 at 9:40 PM, Stefan Hajnoczi
>> <stefanha@linux.vnet.ibm.com> wrote:
>>>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>>> +
>>> +    if (bs->copy_on_read) {
>> Why is tracked_request_begin/end() not put inside the braces guarded by
>> bs->copy_on_read?
>> If this COR function is not enabled, I guess the request tracking
>> functions should not be needed.
>
> It's not safe to put the calls inside the "if (bs->copy_on_read) {"
> body because turning off copy_on_read while a request is pending would
> leave the request in the tracked list forever!
GOT IT, thanks.
>
> In a previous version of the series there was a flag to turn request
> tracking on/off.  Pending requests would still remove themselves from
> the list even after request tracking was disabled.
>
> But request tracking is cheap - it involves filling in fields on the
> stack and adding them to a linked list.  So to keep things simple we
> always maintain this list.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


Thread overview: 20+ messages (newest: 2011-11-23  9:51 UTC)
2011-11-17 13:40 [Qemu-devel] [PATCH v2 0/8] block: generic copy-on-read Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 1/8] qemu-common: add QEMU_ALIGN_DOWN() and QEMU_ALIGN_UP() macros Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 2/8] coroutine: add qemu_co_queue_restart_all() Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 3/8] block: add request tracking Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 4/8] block: add bdrv_set_copy_on_read() Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 5/8] block: wait for overlapping requests Stefan Hajnoczi
2011-11-17 13:43   ` Paolo Bonzini
2011-11-17 13:51     ` Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 6/8] block: request overlap detection Stefan Hajnoczi
2011-11-22 15:06   ` Kevin Wolf
2011-11-22 15:10     ` Stefan Hajnoczi
2011-11-22 16:36       ` Kevin Wolf
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 7/8] block: core copy-on-read logic Stefan Hajnoczi
2011-11-23  3:42   ` Zhi Yong Wu
2011-11-23  9:05     ` Stefan Hajnoczi
2011-11-23  9:51       ` Zhi Yong Wu
2011-11-23  4:42   ` Zhi Yong Wu
2011-11-23  8:58     ` Stefan Hajnoczi
2011-11-23  9:00       ` Stefan Hajnoczi
2011-11-17 13:40 ` [Qemu-devel] [PATCH v2 8/8] block: add -drive copy-on-read=on|off Stefan Hajnoczi
