All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
@ 2011-01-21 22:19 Chunqiang Tang
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-21 22:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: Chunqiang Tang

This patch is part of the Fast Virtual Disk (FVD) proposal. See the related
discussions at
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg00426.html.

This patch adds the 'update' command to qemu-img. FVD stores various
image-specific configurable parameters in the image header. A user can use
'qemu-img update' to modify those parameters and accordingly controls FVD's
runtime behavior. This command may also be leveraged by other block device
drivers, e.g., to set the size of the in-memory metadata cache. Currently
those parameters are hard-coded in a one-size-fit-all manner.

Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
---
 block_int.h      |    1 +
 qemu-img-cmds.hx |    6 ++++++
 qemu-img.c       |   43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/block_int.h b/block_int.h
index 12663e8..e98872a 100644
--- a/block_int.h
+++ b/block_int.h
@@ -98,6 +98,7 @@ struct BlockDriver {
     int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
                                   const char *snapshot_name);
     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
+    int (*bdrv_update)(BlockDriverState *bs, int argc, char **argv);
 
     int (*bdrv_save_vmstate)(BlockDriverState *bs, const uint8_t *buf,
                              int64_t pos, int size);
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 6c7176f..1ad378b 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -39,6 +39,12 @@ STEXI
 @item info [-f @var{fmt}] @var{filename}
 ETEXI
 
+DEF("update", img_update,
+    "update [-f fmt] filename [attr1=val1 attr2=val2 ...]")
+STEXI
+@item update [-f @var{fmt}] @var{filename} [@var{attr1=val1 attr2=val2 ...}]")
+ETEXI
+
 DEF("snapshot", img_snapshot,
     "snapshot [-l | -a snapshot | -c snapshot | -d snapshot] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index afd9ed2..5f35c4d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1054,6 +1054,49 @@ static int img_info(int argc, char **argv)
     return 0;
 }
 
+static int img_update(int argc, char **argv)
+{
+    int c;
+    const char *filename, *fmt;
+    BlockDriverState *bs;
+
+    fmt = NULL;
+    for(;;) {
+        c = getopt(argc, argv, "f:h");
+        if (c == -1)
+            break;
+        switch(c) {
+        case 'h':
+            help();
+            break;
+        case 'f':
+            fmt = optarg;
+            break;
+        }
+    }
+    if (optind >= argc)
+        help();
+    filename = argv[optind++];
+    bs = bdrv_new_open(filename, fmt, BDRV_O_FLAGS | BDRV_O_NO_BACKING
+                       | BDRV_O_RDWR);
+    if (!bs) {
+        return 1;
+    }
+
+    if (bs->drv->bdrv_update) {
+        bs->drv->bdrv_update(bs, argc-optind, &argv[optind]);
+    }
+    else {
+        char fmt_name[128];
+        bdrv_get_format(bs, fmt_name, sizeof(fmt_name));
+        error_report ("The 'update' command is not supported for "
+                      "the '%s' image format.", fmt_name);
+    }
+
+    bdrv_delete(bs);
+    return 0;
+}
+
 #define SNAPSHOT_LIST   1
 #define SNAPSHOT_CREATE 2
 #define SNAPSHOT_APPLY  3
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-21 22:19 [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Chunqiang Tang
@ 2011-01-21 22:19 ` Chunqiang Tang
  2011-01-21 22:49   ` Anthony Liguori
  2011-01-23 15:26   ` Andreas Färber
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 3/3] FVD: Made qemu-io working with simulation (blksim) Chunqiang Tang
  2011-01-28  9:57 ` [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Stefan Hajnoczi
  2 siblings, 2 replies; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-21 22:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: Chunqiang Tang

This patch is part of the Fast Virtual Disk (FVD) proposal. See the related
discussions at
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg00426.html.

This patch adds the 'blksim' block device driver, which is a tool to
facilitate testing and debugging. blksim operates on a RAW image, but it uses
neither AIO nor posix threads to perform I/Os. Instead, blksim functions like
an event-driven disk simulator, and allows a block device driver developer to
fully control the order of disk I/Os, the order of callbacks, and the return
code of every I/O operation. The purpose is to comprehensively test a block
device driver under failures and race conditions. Bugs found by blksim under
rare race conditions are guaranteed to be precisely reproducible, with no
dependency on thread timing etc., which makes debugging much easier.

Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
---
 Makefile.objs  |    1 +
 block/blksim.c |  752 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blksim.h |   35 +++
 3 files changed, 788 insertions(+), 0 deletions(-)
 create mode 100644 block/blksim.c
 create mode 100644 block/blksim.h

diff --git a/Makefile.objs b/Makefile.objs
index c3e52c5..ce5cc8d 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -23,6 +23,7 @@ block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += blksim.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/blksim.c b/block/blksim.c
new file mode 100644
index 0000000..a92ba11
--- /dev/null
+++ b/block/blksim.c
@@ -0,0 +1,752 @@
+/*
+ * Copyright (c) 2010-2011 IBM
+ *
+ * Authors:
+ *         Chunqiang Tang <ctang@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*=============================================================================
+ *  A short description: this module implements a simulated block device
+ *  driver "blksim". It works with qemu-io and qemu-test to perform testing,
+ *  allowing changing the  order of disk I/O and callback activities to test
+ *  rare race conditions. See qemu-test.c, qemu-io.c, and qemu-io-sim.c.
+ *============================================================================*/
+
+#include <sys/vfs.h>
+#include <sys/mman.h>
+#include <pthread.h>
+#include <execinfo.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <inttypes.h>
+#include "block_int.h"
+#include "osdep.h"
+#include "qemu-option.h"
+#include "qemu-timer.h"
+#include "block.h"
+#include "qemu-queue.h"
+#include "qemu-common.h"
+#include "block/blksim.h"
+
+#ifndef TRUE
+# define TRUE 1
+#endif
+
+#ifndef FALSE
+# define FALSE 0
+#endif
+
+#if 0
+# define QDEBUG printf
+#else
+# define QDEBUG(format,...) do {} while (0)
+#endif
+
+typedef enum {
+    SIM_NULL,
+    SIM_READ,
+    SIM_WRITE,
+    SIM_FLUSH,
+    SIM_READ_CALLBACK,
+    SIM_WRITE_CALLBACK,
+    SIM_FLUSH_CALLBACK,
+    SIM_TIMER
+} sim_op_t;
+
+static void sim_aio_cancel (BlockDriverAIOCB * acb);
+static int64_t sim_uuid = 0;
+static int64_t current_time = 0;
+static int64_t rand_time = 0;
+static int interactive_print = TRUE;
+static int blksim_invoked = FALSE;
+static int instant_qemubh = TRUE;
+struct SimAIOCB;
+
+/*
+ * Note: disk_io_return_code, set_disk_io_return_code(), and insert_task() work
+ * together to ensure that multiple subrequests triggered by the same
+ * outtermost request either succeed together or fail together. This behavior
+ * is required by qemu-test.  Here is one example of problems caused by
+ * departuring from this behavior.  Consider a write request that generates
+ * two subrequests, w1 and w2. If w1 succeeds but w2 fails, the data will not
+ * be written into qemu-test's "truth image" but the part of the data handled
+ * by w1 will be written into qemu-test's "test image". As a result, their
+ * contents diverge can automated testing cannot continue.
+ */
+static int disk_io_return_code = 0;
+
+typedef struct BDRVSimState {
+    int fd;
+} BDRVSimState;
+
+typedef struct SimAIOCB {
+    BlockDriverAIOCB common;
+    int64_t uuid;
+    sim_op_t op;
+    int64_t sector_num;
+    QEMUIOVector *qiov;
+    int nb_sectors;
+    int ret;
+    int64_t time;
+    struct SimAIOCB *next;
+    struct SimAIOCB *prev;
+
+} SimAIOCB;
+
+static AIOPool sim_aio_pool = {
+    .aiocb_size = sizeof (SimAIOCB),
+    .cancel = sim_aio_cancel,
+};
+
+static SimAIOCB head = {
+    .uuid = -1,
+    .time = (int64_t) (9223372036854775807ULL),
+    .op = SIM_NULL,
+    .next = &head,
+    .prev = &head,
+};
+
+/* Debug a specific task.*/
+#if 1
+# define CHECK_TASK(acb) do { } while (0)
+#else
+static inline void CHECK_TASK (int64_t uuid)
+{
+    if (uuid == 19LL) {
+        printf ("CHECK_TASK pause for task %" PRId64 "\n", uuid);
+    }
+}
+#endif
+
+/* do_io() should never fail. A failure indicates a bug in the upper layer
+ * block device driver, or failure in the real hardware. */
+static int do_io (BlockDriverState * bs, int64_t sector_num, uint8_t * buf,
+                  int nb_sectors, int do_read)
+{
+    BDRVSimState *s = bs->opaque;
+    size_t size = nb_sectors * 512;
+    int ret;
+
+    if (lseek (s->fd, sector_num * 512, SEEK_SET) < 0) {
+        fprintf (stderr, "Error: lseek %s sector_num=%" PRId64 ". "
+                 "Pause process %d for debugging...\n",
+                 bs->filename, sector_num, getpid ());
+        fgetc (stdin);
+    }
+
+    while (size > 0) {
+
+        if (do_read) {
+            ret = read (s->fd, buf, size);
+            if (ret == 0) {
+                fprintf (stderr,
+                         "Error: read beyond the size of %s sector_num=%" PRId64
+                         " nb_sectors=%d. Pause process %d for debugging...\n",
+                         bs->filename, sector_num, nb_sectors, getpid ());
+                fgetc (stdin);
+            }
+        } else {
+            ret = write (s->fd, buf, size);
+        }
+
+        if (ret >= 0) {
+            size -= ret;
+            buf += ret;
+        } else if (errno != EINTR) {
+            fprintf (stderr, "Error: %s %s sector_num=%" PRId64
+                     " nb_sectors=%d. Pause process %d for debugging...\n",
+                     do_read ? "READ" : "WRITE", bs->filename, sector_num,
+                     nb_sectors, getpid ());
+            fgetc (stdin);
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
+static int blksim_read (BlockDriverState * bs, int64_t sector_num, 
+                        uint8_t * buf, int nb_sectors)
+{
+    return do_io (bs, sector_num, buf, nb_sectors, TRUE);
+}
+
+static int blksim_write (BlockDriverState * bs, int64_t sector_num,
+                      const uint8_t * buf, int nb_sectors)
+{
+    return do_io (bs, sector_num, (uint8_t *) buf, nb_sectors, FALSE);
+}
+
+static void insert_in_list (SimAIOCB * acb)
+{
+    int64_t new_id = sim_uuid++;
+    CHECK_TASK (new_id);
+    acb->uuid = new_id;
+
+    if (rand_time <= 0) {
+        /* Working with qemu-io.c and not doing delay randomization.
+         * Insert it to the tail. */
+        acb->time = 0;
+        acb->prev = head.prev;
+        acb->next = &head;
+        head.prev->next = acb;
+        head.prev = acb;
+        return;
+    }
+
+    SimAIOCB *p = head.next;
+
+    if (acb->time >= 0) {
+        /* Introduce a random delay to better trigger rare race conditions. */
+        acb->time += random () % rand_time;
+
+        /* Find the position to insert. The list is sorted in ascending time. */
+        while (1) {
+            if (p->time > acb->time) {
+                break;
+            }
+            if (p->time == acb->time && (random () % 2 == 0)) {
+                break;
+            }
+            p = p->next;
+        }
+    }
+
+    /* Insert acb before p. */
+    acb->next = p;
+    acb->prev = p->prev;
+    p->prev->next = acb;
+    p->prev = acb;
+}
+
+/* Debug problems related to reusing task objects. Problem already solved.*/
+#if 1
+# define my_qemu_aio_get qemu_aio_get
+# define my_qemu_aio_release qemu_aio_release
+
+#else
+static SimAIOCB *search_task_list (SimAIOCB * acb)
+{
+    SimAIOCB *p;
+    for (p = head.next; p != &head; p = p->next) {
+        if (p == acb) {
+            return p;
+        }
+    }
+
+    return NULL;
+}
+
+static inline void *my_qemu_aio_get (AIOPool * pool, BlockDriverState * bs,
+                                     BlockDriverCompletionFunc * cb,
+                                     void *opaque)
+{
+    SimAIOCB *acb = (SimAIOCB *) qemu_aio_get (&sim_aio_pool, bs, cb, opaque);
+    QDEBUG ("SIM: qemu_aio_get reuse old task%" PRId64 "\n", acb->uuid);
+    ASSERT (!search_task_list (acb));
+    return acb;
+}
+
+static inline void my_qemu_aio_release (SimAIOCB * acb)
+{
+    QDEBUG ("SIM: qemu_aio_release task%" PRId64 "\n", acb->uuid);
+    qemu_aio_release (acb);
+}
+#endif
+
+static BlockDriverAIOCB *insert_task (int op, BlockDriverState * bs,
+                                      int64_t sector_num, QEMUIOVector * qiov,
+                                      int nb_sectors,
+                                      BlockDriverCompletionFunc * cb,
+                                      void *opaque)
+{
+    SimAIOCB *acb = my_qemu_aio_get (&sim_aio_pool, bs, cb, opaque);
+    if (!acb) {
+        return NULL;
+    }
+
+    acb->op = op;
+    acb->sector_num = sector_num;
+    acb->qiov = qiov;
+    acb->nb_sectors = nb_sectors;
+    acb->ret = disk_io_return_code;
+    acb->time = current_time;
+    insert_in_list (acb);
+
+    if (interactive_print) {
+        if (op == SIM_READ) {
+            printf ("Added READ uuid=%" PRId64 "  filename=%s  sector_num=%"
+                    PRId64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else if (op == SIM_WRITE) {
+            printf ("Added WRITE uuid=%" PRId64 "  filename=%s  sector_num=%"
+                    PRId64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else {
+            fprintf (stderr, "Unknown op %d\n", op);
+            exit (1);
+        }
+    }
+
+    return &acb->common;
+}
+
+static void insert_aio_callback (SimAIOCB * acb)
+{
+    acb->time = current_time;
+    insert_in_list (acb);
+
+    if (acb->op == SIM_FLUSH) {
+        acb->op = SIM_FLUSH_CALLBACK;
+        if (interactive_print) {
+            printf ("Added FLUSH_CALLBACK uuid=%" PRId64 "  filename=%s\n",
+                    acb->uuid, acb->common.bs->filename);
+        }
+    } else if (acb->op == SIM_READ) {
+        acb->op = SIM_READ_CALLBACK;
+        if (interactive_print) {
+            printf ("Added READ_CALLBACK uuid=%" PRId64
+                    "  filename=%s  sector_num=%" PRId64 "  nb_sectors=%d\n",
+                    acb->uuid, acb->common.bs->filename, acb->sector_num,
+                    acb->nb_sectors);
+        }
+    } else if (acb->op == SIM_WRITE) {
+        acb->op = SIM_WRITE_CALLBACK;
+        if (interactive_print) {
+            printf ("Added WRITE_CALLBACK uuid=%" PRId64
+                    "  filename=%s  sector_num=%" PRId64 "  nb_sectors=%d\n",
+                    acb->uuid, acb->common.bs->filename, acb->sector_num,
+                    acb->nb_sectors);
+        }
+    } else {
+        fprintf (stderr, "Wrong op %d\n", acb->op);
+        exit (1);
+    }
+}
+
+void blksim_list_tasks (void)
+{
+    SimAIOCB *acb;
+
+    for (acb = head.next; acb != &head; acb = acb->next) {
+        if (acb->op == SIM_READ) {
+            printf ("uuid=%" PRId64 "  READ           file=%s  sector_num=%"
+                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else if (acb->op == SIM_WRITE) {
+            printf ("uuid=%" PRId64 "  WRITE          file=%s  sector_num=%"
+                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else if (acb->op == SIM_READ_CALLBACK) {
+            printf ("uuid=%" PRId64 "  CALLBACK READ  file=%s  sector_num=%"
+                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else if (acb->op == SIM_WRITE_CALLBACK) {
+            printf ("uuid=%" PRId64 "  CALLBACK WRITE file=%s  sector_num=%"
+                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
+                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
+        } else {
+            fprintf (stderr, "Wrong OP %d\n", acb->op);
+            exit (1);
+        }
+    }
+}
+
+static inline void sim_callback (SimAIOCB * acb)
+{
+    acb->common.cb (acb->common.opaque, acb->ret);
+}
+
+int64_t blksim_get_time (void)
+{
+    return current_time;
+}
+
+void *blksim_new_timer (void *cb, void *opaque)
+{
+    SimAIOCB *acb = my_qemu_aio_get (&sim_aio_pool, NULL, cb, opaque);
+    acb->op = SIM_TIMER;
+    acb->prev = NULL;
+    return acb;
+}
+
+void blksim_mod_timer (void *ts, int64_t expire_time)
+{
+    SimAIOCB *acb = ts;
+
+    if (acb->prev) {
+        /* Remove it first. */
+        acb->next->prev = acb->prev;
+        acb->prev->next = acb->next;
+    }
+    acb->time = expire_time;
+    insert_in_list (acb);
+
+    if (interactive_print) {
+        printf ("Added TIMER uuid=%" PRId64 "  expire_time=%"PRId64
+                " current_time=%"PRId64"\n", 
+                acb->uuid, expire_time, current_time);
+    }
+}
+
+void blksim_free_timer (void *ts)
+{
+    SimAIOCB *acb = ts;
+    CHECK_TASK (acb->uuid);
+    my_qemu_aio_release (acb);
+}
+
+void blksim_del_timer (void *ts)
+{
+    SimAIOCB *acb = ts;
+
+    CHECK_TASK (acb->uuid);
+    if (acb->prev) {
+        /* Remove it from the list. */
+        acb->next->prev = acb->prev;
+        acb->prev->next = acb->next;
+
+        /* Mark it as not in list. */
+        acb->prev = NULL;
+    }
+}
+
+void blksim_bh_schedule (void *bh)
+{
+    if (instant_qemubh) {
+        blksim_mod_timer (bh, -1);
+    } else {
+        blksim_mod_timer (bh, current_time);
+    }
+}
+
+void blksim_set_instant_qemubh (int instant)
+{
+    instant_qemubh = instant;
+}
+
+void blksim_set_disk_io_return_code (int ret)
+{
+    disk_io_return_code = ret;
+}
+
+static void run_task_by_acb (SimAIOCB * acb)
+{
+    CHECK_TASK (acb->uuid);
+
+    /* Remove it from the list. */
+    acb->next->prev = acb->prev;
+    acb->prev->next = acb->next;
+    acb->prev = NULL;        /* Indicate that it is no longer in the list. */
+
+    if (acb->time > current_time) {
+        current_time = acb->time;
+    }
+
+    if (acb->op == SIM_TIMER) {
+        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " TIMER \n",
+                acb->uuid, acb->time);
+
+        ((QEMUTimerCB *) acb->common.cb) (acb->common.opaque);
+        return;
+    }
+
+    BlockDriverState *bs = acb->common.bs;
+
+    if (acb->op == SIM_READ) {
+        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64
+                " READ %s sector_num=%" PRId64 " nb_sectors=%d\n",
+                acb->uuid, acb->time, bs->filename, acb->sector_num, 
+                acb->nb_sectors);
+
+        if (acb->ret == 0) {
+            if (acb->qiov->niov == 1) {
+                if (blksim_read
+                    (bs, acb->sector_num, acb->qiov->iov->iov_base,
+                     acb->nb_sectors) != 0) {
+                    fprintf (stderr, "Error in reading %s sector_num=%lld "
+                             "nb_sectors=%d\n", acb->common.bs->filename,
+                             acb->sector_num, acb->nb_sectors);
+                    exit (1);
+                }
+            } else {
+                uint8_t *buf=qemu_blockalign (acb->common.bs, acb->qiov->size);
+                if (blksim_read (bs, acb->sector_num, buf, 
+                                 acb->nb_sectors) != 0) {
+                    fprintf (stderr, "Error in reading %s sector_num=%lld "
+                             "nb_sectors=%d\n", acb->common.bs->filename,
+                             acb->sector_num, acb->nb_sectors);
+                    exit (1);
+                }
+                qemu_iovec_from_buffer (acb->qiov, buf, acb->qiov->size);
+                qemu_vfree (buf);
+            }
+        }
+
+        insert_aio_callback (acb);
+    } else if (acb->op == SIM_WRITE) {
+        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64
+                " WRITE %s sector_num=%" PRId64 " nb_sectors=%d\n",
+                acb->uuid, acb->time, bs->filename, 
+                acb->sector_num, acb->nb_sectors);
+
+        if (acb->ret == 0) {
+            if (acb->qiov->niov == 1) {
+                if (blksim_write (bs, acb->sector_num, acb->qiov->iov->iov_base,
+                                  acb->nb_sectors) != 0) {
+                    fprintf (stderr, "Error in writing %s sector_num=%lld "
+                             "nb_sectors=%d\n", acb->common.bs->filename,
+                             acb->sector_num, acb->nb_sectors);
+                    exit (1);
+                }
+            } else {
+                uint8_t *buf = qemu_blockalign (acb->common.bs,
+                                                acb->qiov->size);
+                qemu_iovec_to_buffer (acb->qiov, buf);
+                if (blksim_write (bs, acb->sector_num, buf, 
+                                  acb->nb_sectors)!= 0) {
+                    fprintf (stderr, "Error in writing %s sector_num=%lld "
+                             "nb_sectors=%d\n", acb->common.bs->filename,
+                             acb->sector_num, acb->nb_sectors);
+                    exit (1);
+                }
+                qemu_vfree (buf);
+            }
+        }
+
+        insert_aio_callback (acb);
+    } else if (acb->op == SIM_FLUSH) {
+        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " FLUSH %s\n",
+                acb->uuid, acb->time, bs->filename);
+        /* Skip real flushing to speed up simulation:
+         *         if (ret == 0) { * fdatasync (s->fd); } */
+        insert_aio_callback (acb);
+    } else if (acb->op == SIM_WRITE_CALLBACK || acb->op == SIM_READ_CALLBACK
+               || acb->op == SIM_FLUSH_CALLBACK) {
+        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " CALLBACK\n",
+                acb->uuid, acb->time);
+        sim_callback (acb);
+        CHECK_TASK (acb->uuid);
+        my_qemu_aio_release (acb);
+    } else {
+        fprintf (stderr, "Unknown op %d\n", acb->op);
+        exit (1);
+    }
+}
+
+int blksim_run_task_by_uuid (int64_t uuid)
+{
+    SimAIOCB *acb;
+
+    for (acb = head.next; acb != &head; acb = acb->next) {
+        if (acb->uuid == uuid) {
+            run_task_by_acb (acb);
+            return 0;
+        }
+    }
+
+    return -1;
+}
+
+int blksim_run_all_tasks (void)
+{
+    int n = 0;
+
+    while (1) {
+        SimAIOCB *acb = head.next;
+        if (acb == &head) {
+            return n; /* No more tasks.*/
+        }
+
+        run_task_by_acb (acb);
+        n++;
+    }
+}
+
+static BlockDriverAIOCB *blksim_aio_readv (BlockDriverState * bs,
+                                        int64_t sector_num,
+                                        QEMUIOVector * qiov,
+                                        int nb_sectors,
+                                        BlockDriverCompletionFunc * cb,
+                                        void *opaque)
+{
+    return insert_task (SIM_READ, bs, sector_num, qiov, nb_sectors, cb, opaque);
+}
+
+static BlockDriverAIOCB *blksim_aio_writev (BlockDriverState * bs,
+                                         int64_t sector_num,
+                                         QEMUIOVector * qiov,
+                                         int nb_sectors,
+                                         BlockDriverCompletionFunc * cb,
+                                         void *opaque)
+{
+    return insert_task (SIM_WRITE, bs, sector_num, qiov, nb_sectors, cb,
+                        opaque);
+}
+
+static BlockDriverAIOCB *blksim_aio_flush (BlockDriverState * bs,
+                                        BlockDriverCompletionFunc * cb,
+                                        void *opaque)
+{
+    return insert_task (SIM_FLUSH, bs, 0, NULL, 0, cb, opaque);
+}
+
+static void sim_aio_cancel (BlockDriverAIOCB * blockacb)
+{
+    SimAIOCB *acb = container_of (blockacb, SimAIOCB, common);
+
+    CHECK_TASK (acb->uuid);
+    QDEBUG ("SIM: cancel task%" PRId64 "\n", acb->uuid);
+
+    if (acb->prev) {
+        acb->next->prev = acb->prev;
+        acb->prev->next = acb->next;
+        acb->prev = NULL;
+        my_qemu_aio_release (acb);
+    } else {
+        fprintf (stderr, "Error: cancel a blksim task that does not exist: "
+                 "uuid=%"PRId64". Halt process %d for debugging...\n",
+                 acb->uuid, getpid());
+        fgetc (stdin); 
+        exit (1);
+    }
+}
+
+static int blksim_open (BlockDriverState * bs, const char *filename,
+                     int bdrv_flags)
+{
+    BDRVSimState *s = bs->opaque;
+    int open_flags = O_BINARY | O_LARGEFILE;
+
+    blksim_invoked  = TRUE;
+
+    if ((bdrv_flags & BDRV_O_RDWR)) {
+        open_flags |= O_RDWR;
+    } else {
+        open_flags |= O_RDONLY;
+    }
+
+    if ((bdrv_flags & BDRV_O_NOCACHE)) {
+        open_flags |= O_DIRECT;
+    } else if (!(bdrv_flags & BDRV_O_CACHE_WB)) {
+        open_flags |= O_DSYNC;
+    }
+
+    /* Parse the "blksim:" prefix */
+    if (!strncmp(filename, "blksim:", strlen("blksim:"))) {
+        filename += strlen("blksim:");
+    }
+
+    s->fd = open (filename, open_flags);
+    if (s->fd < 0)
+        return -1;
+
+    int64_t len = lseek (s->fd, 0, SEEK_END);
+    if (len >= 0) {
+        bs->total_sectors = len / 512;
+    } else {
+        bs->total_sectors = 0;
+    }
+
+    bs->growable = 1;
+    return 0;
+}
+
+static void blksim_close (BlockDriverState * bs)
+{
+    BDRVSimState *s = bs->opaque;
+    close (s->fd);
+}
+
+static int blksim_flush (BlockDriverState * bs)
+{
+    /*
+     * Skip real flushing to speed up simulation.
+         * BDRVSimState *s = bs->opaque;
+         * fdatasync (s->fd);
+     */
+    return 0;
+}
+
+static int blksim_has_zero_init (BlockDriverState * bs)
+{
+    struct stat buf;
+
+    if (stat (bs->filename, &buf) != 0) {
+        fprintf (stderr, "Failed to stat() %s\n", bs->filename);
+        exit (1);
+    }
+
+    if (S_ISBLK (buf.st_mode) || S_ISCHR (buf.st_mode)) {
+        return 0;
+    }
+
+    return 1;
+}
+
+static int blksim_truncate (BlockDriverState * bs, int64_t offset)
+{
+    BDRVSimState *s = bs->opaque;
+    return ftruncate (s->fd, offset);
+}
+
+static BlockDriver bdrv_blksim = {
+    .format_name = "blksim",
+    .protocol_name = "blksim",
+    .instance_size = sizeof (BDRVSimState),
+    .bdrv_file_open = blksim_open,
+    .bdrv_close = blksim_close,
+    .bdrv_flush = blksim_flush,
+    .bdrv_read = blksim_read,
+    .bdrv_write = blksim_write,
+    .bdrv_aio_readv = blksim_aio_readv,
+    .bdrv_aio_writev = blksim_aio_writev,
+    .bdrv_aio_flush = blksim_aio_flush,
+    .bdrv_has_zero_init = blksim_has_zero_init,
+    .bdrv_truncate = blksim_truncate,
+};
+
+static void bdrv_blksim_init(void)
+{
+    bdrv_register(&bdrv_blksim);
+}
+block_init(bdrv_blksim_init);
+
+void init_blksim (int print, int64_t _rand_time)
+{
+    interactive_print = print;
+    rand_time = _rand_time;
+}
+
+/* 
+ * To work properly in the simulation mode, block device drivers that
+ * explicitly invoke qemu_aio_wait() should invoke blksim_qemu_aio_wait() if 
+ * the block device is openned using blksim. Most block device drivers do not
+ * invoke qemu_aio_wait() and hence should not be concerned about this.
+ */
+int blksim_qemu_aio_wait (void)
+{
+    SimAIOCB *acb = head.next;
+    if (acb == &head) {
+        return 0;
+    }
+    else {
+        run_task_by_acb (acb);
+        return 1;
+    }
+}
+
+int blksim_has_task (void)
+{
+    return head.next != &head;
+}
+
+int using_blksim (void)
+{
+    return blksim_invoked;
+}
diff --git a/block/blksim.h b/block/blksim.h
new file mode 100644
index 0000000..fa1e20d
--- /dev/null
+++ b/block/blksim.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2010-2011 IBM
+ *
+ * Authors:
+ *         Chunqiang Tang <ctang@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*=============================================================================
+ *  A short description: this is the header of the simulated block device
+ *  driver "blksim".
+ *============================================================================*/
+
+#ifndef __block_sim_h__
+#define __block_sim_h__
+
+void init_blksim (int print, int64_t _rand_time);
+int using_blksim (void);
+int blksim_has_task (void);
+void blksim_list_tasks (void);
+int blksim_run_task_by_uuid (int64_t uuid);
+int blksim_run_all_tasks (void);
+int64_t blksim_get_time (void);
+void *blksim_new_timer (void *cb, void *opaque);
+void blksim_mod_timer (void *ts, int64_t expire_time);
+void blksim_free_timer (void *ts);
+void blksim_del_timer (void *ts);
+void blksim_bh_schedule (void *bh);
+void blksim_set_disk_io_return_code (int ret);
+int blksim_qemu_aio_wait(void);
+void blksim_set_instant_qemubh (int instant /* TRUE or FALSE */);
+
+#endif
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH 3/3] FVD: Made qemu-io working with simulation (blksim)
  2011-01-21 22:19 [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Chunqiang Tang
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
@ 2011-01-21 22:19 ` Chunqiang Tang
  2011-01-28  9:57 ` [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Stefan Hajnoczi
  2 siblings, 0 replies; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-21 22:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: Chunqiang Tang

This patch is part of the Fast Virtual Disk (FVD) proposal. See the related
discussions at
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg00426.html.

This patch adds the 'sim' command to qemu-io, which works  with the blksim
driver. With this extension, now a developer can use qemu-io to precisely
control the order of disk I/Os, the order of callbacks, and the return code of
every I/O operation. The purpose is to comprehensively test a block device
driver under failures and race conditions. Bugs found by blksim under rare
race conditions are guaranteed to be precisely reproducible, with no
dependency on thread timing etc., which makes debugging much easier.

Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
---
 qemu-io-sim.c |  109 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-io.c     |   38 +++++++++++++++-----
 2 files changed, 138 insertions(+), 9 deletions(-)
 create mode 100644 qemu-io-sim.c

diff --git a/qemu-io-sim.c b/qemu-io-sim.c
new file mode 100644
index 0000000..816f8ae
--- /dev/null
+++ b/qemu-io-sim.c
@@ -0,0 +1,109 @@
+/*
+ * Copyright (c) 2010-2011 IBM
+ *
+ * Authors:
+ *         Chunqiang Tang <ctang@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*=============================================================================
+ * qemu-io-sim works with qemu-io to perform simulated testing. The 'sim'
+ * command allows the user to control the order of disk I/O and callback
+ * activities in order to test rare race conditions. See block/blksim.c
+ * Note that in the manual mode, qemu-io's 'sim' command can only work with
+ * qemu-io's 'aio_read' and 'aio_write' commands. The automated testing mode,
+ * 'qemu-io --auto', performs a much more comprehensive fully automated test
+ * (see qemu-io-auto.c). Below is one example of using qemu-io to perform
+ * manual testing in the simulation mode.
+ 
+$ qemu-img create -f qcow2 -obacking_fmt=blksim -b base.raw test.qcow2
+$ qemu-io -f qcow2 blksim:test.qcow2
+qemu-io> aio_read 0 512
+Added READ uuid=0  filename=base.raw  sector_num=0  nb_sectors=1
+qemu-io> aio_read 0 1024
+Added READ uuid=1  filename=base.raw  sector_num=0  nb_sectors=2
+qemu-io> sim list
+uuid=0  READ           file=base.raw  sector_num=0  nb_sectors=1
+uuid=1  READ           file=base.raw  sector_num=0  nb_sectors=2
+qemu-io> aio_write 512 1024
+Added WRITE uuid=2  filename=test.qcow2  sector_num=33297  nb_sectors=2
+qemu-io> sim list
+uuid=0  READ           file=base.raw  sector_num=0  nb_sectors=1
+uuid=1  READ           file=base.raw  sector_num=0  nb_sectors=2
+uuid=2  WRITE          file=test.qcow2  sector_num=33297  nb_sectors=2
+qemu-io> sim 2
+Added WRITE_CALLBACK uuid=3  filename=test.qcow2  sector_num=33297  nb_sectors=2
+qemu-io> sim list
+uuid=0  READ           file=base.raw  sector_num=0  nb_sectors=1
+uuid=1  READ           file=base.raw  sector_num=0  nb_sectors=2
+uuid=3  CALLBACK WRITE file=test.qcow2  sector_num=33297  nb_sectors=2
+qemu-io> sim 1
+Added READ_CALLBACK uuid=4  filename=base.raw  sector_num=0  nb_sectors=2
+qemu-io> sim list
+uuid=0  READ           file=base.raw  sector_num=0  nb_sectors=1
+uuid=3  CALLBACK WRITE file=test.qcow2  sector_num=33297  nb_sectors=2
+uuid=4  CALLBACK READ  file=base.raw  sector_num=0  nb_sectors=2
+qemu-io> sim 3
+Added WRITE uuid=5  filename=test.qcow2  sector_num=528  nb_sectors=1
+qemu-io> sim list
+uuid=0  READ           file=base.raw  sector_num=0  nb_sectors=1
+uuid=4  CALLBACK READ  file=base.raw  sector_num=0  nb_sectors=2
+uuid=5  WRITE          file=test.qcow2  sector_num=528  nb_sectors=1
+
+*=============================================================================*/
+
+#include "block/blksim.h"
+
+static void sim_help (void)
+{
+    printf ("\n"
+            " sim list\t\tlist all simulation tasks\n"
+            " sim <#task> [#ret]\trun a simulation task, optionally "
+                    "using #ret as the return value of a read/write operation\n"
+            " sim all [#ret]\t\trun all tasks, optionally using #ret as "
+                    "the return value of read/write tasks\n"
+            " sim prefetch\t\tstart prefetching\n");
+}
+
+static int sim_f (int argc, char **argv)
+{
+    int ret = 0;
+
+    if (argc == 3) {
+        ret = atoi (argv[2]);
+    }
+    else if (argc != 2) {
+        sim_help ();
+        return 0;
+    }
+
+    if (strcmp (argv[1], "list") == 0) {
+        blksim_list_tasks ();
+    }
+    else if (strcmp (argv[1], "all") == 0) {
+        blksim_set_disk_io_return_code (ret);
+        int n = blksim_run_all_tasks ();
+        blksim_set_disk_io_return_code (0);
+        printf ("Executed %d tasks.\n", n);
+    }
+    else {
+        blksim_set_disk_io_return_code (ret);
+        blksim_run_task_by_uuid (atoll (argv[1]));
+        blksim_set_disk_io_return_code (0);
+    }
+
+    return 0;
+}
+
+static const cmdinfo_t sim_cmd = {
+    .name = "sim",
+    .altname = "s",
+    .cfunc = sim_f,
+    .argmin = 1,
+    .argmax = 2,
+    .args = "",
+    .oneline = "use simulation to control the order of disk I/Os and callbacks",
+    .help = sim_help,
+};
diff --git a/qemu-io.c b/qemu-io.c
index 5b24c5e..9144a6a 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1584,7 +1584,7 @@ static const cmdinfo_t close_cmd = {
 	.oneline	= "close the current open file",
 };
 
-static int openfile(char *name, int flags, int growable)
+static int openfile(char *name, const char *fmt, int flags, int growable)
 {
 	if (bs) {
 		fprintf(stderr, "file open already, try 'help close'\n");
@@ -1597,9 +1597,17 @@ static int openfile(char *name, int flags, int growable)
 			return 1;
 		}
 	} else {
+                BlockDriver *drv = NULL;
+                if (fmt && !(drv = bdrv_find_format (fmt))) {
+                        fprintf(stderr, "%s: can't find driver for format "
+                                "%s \n", progname, fmt);
+                        bs = NULL;
+                        return 1;
+                }
+
 		bs = bdrv_new("hda");
 
-		if (bdrv_open(bs, name, flags, NULL) < 0) {
+		if (bdrv_open(bs, name, flags, drv) < 0) {
 			fprintf(stderr, "%s: can't open device %s\n", progname, name);
 			bs = NULL;
 			return 1;
@@ -1636,7 +1644,7 @@ static const cmdinfo_t open_cmd = {
 	.argmin		= 1,
 	.argmax		= -1,
 	.flags		= CMD_NOFILE_OK,
-	.args		= "[-Crsn] [path]",
+	.args		= "[-Crsn] [-f <format>] [path]",
 	.oneline	= "open the file specified by path",
 	.help		= open_help,
 };
@@ -1648,8 +1656,9 @@ open_f(int argc, char **argv)
 	int readonly = 0;
 	int growable = 0;
 	int c;
+        const char *fmt = NULL;
 
-	while ((c = getopt(argc, argv, "snrg")) != EOF) {
+	while ((c = getopt(argc, argv, "snrgf:")) != EOF) {
 		switch (c) {
 		case 's':
 			flags |= BDRV_O_SNAPSHOT;
@@ -1663,6 +1672,9 @@ open_f(int argc, char **argv)
 		case 'g':
 			growable = 1;
 			break;
+		case 'f':
+                        fmt = optarg;
+			break;
 		default:
 			return command_usage(&open_cmd);
 		}
@@ -1675,7 +1687,7 @@ open_f(int argc, char **argv)
 	if (optind != argc - 1)
 		return command_usage(&open_cmd);
 
-	return openfile(argv[optind], flags, growable);
+	return openfile(argv[optind], fmt, flags, growable);
 }
 
 static int
@@ -1701,10 +1713,12 @@ init_check_command(
 	return 1;
 }
 
+#include "qemu-io-sim.c"
+
 static void usage(const char *name)
 {
 	printf(
-"Usage: %s [-h] [-V] [-rsnm] [-c cmd] ... [file]\n"
+"Usage: %s [-h] [-a] [-V] [-rsnm] [-c cmd] ... [file]\n"
 "QEMU Disk exerciser\n"
 "\n"
 "  -c, --cmd            command to execute\n"
@@ -1714,18 +1728,18 @@ static void usage(const char *name)
 "  -g, --growable       allow file to grow (only applies to protocols)\n"
 "  -m, --misalign       misalign allocations for O_DIRECT\n"
 "  -k, --native-aio     use kernel AIO implementation (on Linux only)\n"
+"  -f, --format         image format of the file\n"
 "  -h, --help           display this help and exit\n"
 "  -V, --version        output version information and exit\n"
 "\n",
 	name);
 }
 
-
 int main(int argc, char **argv)
 {
 	int readonly = 0;
 	int growable = 0;
-	const char *sopt = "hVc:rsnmgk";
+	const char *sopt = "hVc:rsnmgkaf:";
         const struct option lopt[] = {
 		{ "help", 0, NULL, 'h' },
 		{ "version", 0, NULL, 'V' },
@@ -1737,11 +1751,13 @@ int main(int argc, char **argv)
 		{ "misalign", 0, NULL, 'm' },
 		{ "growable", 0, NULL, 'g' },
 		{ "native-aio", 0, NULL, 'k' },
+		{ "format", 1, NULL, 'f' },
 		{ NULL, 0, NULL, 0 }
 	};
 	int c;
 	int opt_index = 0;
 	int flags = 0;
+        const char *fmt = NULL;
 
 	progname = basename(argv[0]);
 
@@ -1768,6 +1784,9 @@ int main(int argc, char **argv)
 		case 'k':
 			flags |= BDRV_O_NATIVE_AIO;
 			break;
+		case 'f':
+                        fmt = optarg;
+                        break;
 		case 'V':
 			printf("%s version %s\n", progname, VERSION);
 			exit(0);
@@ -1807,6 +1826,7 @@ int main(int argc, char **argv)
 	add_command(&discard_cmd);
 	add_command(&alloc_cmd);
 	add_command(&map_cmd);
+        add_command(&sim_cmd);
 
 	add_args_command(init_args_command);
 	add_check_command(init_check_command);
@@ -1817,7 +1837,7 @@ int main(int argc, char **argv)
         }
 
 	if ((argc - optind) == 1)
-		openfile(argv[optind], flags, growable);
+		openfile(argv[optind], fmt, flags, growable);
 	command_loop();
 
 	/*
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
@ 2011-01-21 22:49   ` Anthony Liguori
  2011-01-22  3:09     ` Chunqiang Tang
  2011-01-23 15:26   ` Andreas Färber
  1 sibling, 1 reply; 16+ messages in thread
From: Anthony Liguori @ 2011-01-21 22:49 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On 01/21/2011 04:19 PM, Chunqiang Tang wrote:
> This patch is part of the Fast Virtual Disk (FVD) proposal. See the related
> discussions at
> http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg00426.html.
>
> This patch adds the 'blksim' block device driver, which is a tool to
> facilitate testing and debugging. blksim operates on a RAW image, but it uses
> neither AIO nor posix threads to perform I/Os. Instead, blksim functions like
> an event-driven disk simulator, and allows a block device driver developer to
> fully control the order of disk I/Os, the order of callbacks, and the return
> code of every I/O operation. The purpose is to comprehensively test a block
> device driver under failures and race conditions. Bugs found by blksim under
> rare race conditions are guaranteed to be precisely reproducible, with no
> dependency on thread timing etc., which makes debugging much easier.
>
> Signed-off-by: Chunqiang Tang<ctang@us.ibm.com>
> ---
>   Makefile.objs  |    1 +
>   block/blksim.c |  752 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   block/blksim.h |   35 +++
>   3 files changed, 788 insertions(+), 0 deletions(-)
>   create mode 100644 block/blksim.c
>   create mode 100644 block/blksim.h
>
> diff --git a/Makefile.objs b/Makefile.objs
> index c3e52c5..ce5cc8d 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -23,6 +23,7 @@ block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
>   block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>   block-nested-y += qed-check.o
>   block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
> +block-nested-y += blksim.o
>   block-nested-$(CONFIG_WIN32) += raw-win32.o
>   block-nested-$(CONFIG_POSIX) += raw-posix.o
>   block-nested-$(CONFIG_CURL) += curl.o
> diff --git a/block/blksim.c b/block/blksim.c
> new file mode 100644
> index 0000000..a92ba11
> --- /dev/null
> +++ b/block/blksim.c
> @@ -0,0 +1,752 @@
> +/*
> + * Copyright (c) 2010-2011 IBM
> + *
> + * Authors:
> + *         Chunqiang Tang<ctang@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +/*=============================================================================
> + *  A short description: this module implements a simulated block device
> + *  driver "blksim". It works with qemu-io and qemu-test to perform testing,
> + *  allowing changing the  order of disk I/O and callback activities to test
> + *  rare race conditions. See qemu-test.c, qemu-io.c, and qemu-io-sim.c.
> + *============================================================================*/
> +
> +#include<sys/vfs.h>
> +#include<sys/mman.h>
> +#include<pthread.h>
> +#include<execinfo.h>
> +#include<stdlib.h>
> +#include<sys/ioctl.h>
> +#include<stdint.h>
> +#include<stdio.h>
> +#include<inttypes.h>
>    

QEMU code shouldn't include headers like this.  It almost guarantees 
breaking portability.

> +#include "block_int.h"
> +#include "osdep.h"
> +#include "qemu-option.h"
> +#include "qemu-timer.h"
> +#include "block.h"
> +#include "qemu-queue.h"
> +#include "qemu-common.h"
> +#include "block/blksim.h"
> +
> +#ifndef TRUE
> +# define TRUE 1
> +#endif
> +
> +#ifndef FALSE
> +# define FALSE 0
> +#endif
>    

C99 introduces stdbool.h.  That's the appropriate defines to use for 
booleans.

> +
> +#if 0
> +# define QDEBUG printf
> +#else
> +# define QDEBUG(format,...) do {} while (0)
> +#endif
> +
> +typedef enum {
> +    SIM_NULL,
> +    SIM_READ,
> +    SIM_WRITE,
> +    SIM_FLUSH,
> +    SIM_READ_CALLBACK,
> +    SIM_WRITE_CALLBACK,
> +    SIM_FLUSH_CALLBACK,
> +    SIM_TIMER
> +} sim_op_t;
>    

Breaks coding style (and the C standard).

> +static void sim_aio_cancel (BlockDriverAIOCB * acb);
> +static int64_t sim_uuid = 0;
> +static int64_t current_time = 0;
> +static int64_t rand_time = 0;
> +static int interactive_print = TRUE;
> +static int blksim_invoked = FALSE;
> +static int instant_qemubh = TRUE;
> +struct SimAIOCB;
> +
> +/*
> + * Note: disk_io_return_code, set_disk_io_return_code(), and insert_task() work
> + * together to ensure that multiple subrequests triggered by the same
> + * outtermost request either succeed together or fail together. This behavior
> + * is required by qemu-test.  Here is one example of problems caused by
> + * departuring from this behavior.  Consider a write request that generates
> + * two subrequests, w1 and w2. If w1 succeeds but w2 fails, the data will not
> + * be written into qemu-test's "truth image" but the part of the data handled
> + * by w1 will be written into qemu-test's "test image". As a result, their
> + * contents diverge can automated testing cannot continue.
> + */
> +static int disk_io_return_code = 0;
> +
> +typedef struct BDRVSimState {
> +    int fd;
> +} BDRVSimState;
> +
> +typedef struct SimAIOCB {
> +    BlockDriverAIOCB common;
> +    int64_t uuid;
> +    sim_op_t op;
> +    int64_t sector_num;
> +    QEMUIOVector *qiov;
> +    int nb_sectors;
> +    int ret;
> +    int64_t time;
> +    struct SimAIOCB *next;
> +    struct SimAIOCB *prev;
>    

Use qemu-queue instead of open coding data structures.

> +} SimAIOCB;
> +
> +static AIOPool sim_aio_pool = {
> +    .aiocb_size = sizeof (SimAIOCB),
> +    .cancel = sim_aio_cancel,
> +};
> +
> +static SimAIOCB head = {
> +    .uuid = -1,
> +    .time = (int64_t) (9223372036854775807ULL),
>    

This number has to mean something but I'll be damned if I know what it 
means.

> +    .op = SIM_NULL,
> +    .next =&head,
> +    .prev =&head,
> +};
> +
> +/* Debug a specific task.*/
> +#if 1
> +# define CHECK_TASK(acb) do { } while (0)
> +#else
> +static inline void CHECK_TASK (int64_t uuid)
> +{
> +    if (uuid == 19LL) {
> +        printf ("CHECK_TASK pause for task %" PRId64 "\n", uuid);
> +    }
> +}
> +#endif
>    

19LL?  Why is 19 significant?

> +/* do_io() should never fail. A failure indicates a bug in the upper layer
> + * block device driver, or failure in the real hardware. */
> +static int do_io (BlockDriverState * bs, int64_t sector_num, uint8_t * buf,
> +                  int nb_sectors, int do_read)
> +{
> +    BDRVSimState *s = bs->opaque;
> +    size_t size = nb_sectors * 512;
> +    int ret;
> +
> +    if (lseek (s->fd, sector_num * 512, SEEK_SET)<  0) {
> +        fprintf (stderr, "Error: lseek %s sector_num=%" PRId64 ". "
> +                 "Pause process %d for debugging...\n",
> +                 bs->filename, sector_num, getpid ());
> +        fgetc (stdin);
> +    }
> +
> +    while (size>  0) {
> +
> +        if (do_read) {
> +            ret = read (s->fd, buf, size);
> +            if (ret == 0) {
> +                fprintf (stderr,
> +                         "Error: read beyond the size of %s sector_num=%" PRId64
> +                         " nb_sectors=%d. Pause process %d for debugging...\n",
> +                         bs->filename, sector_num, nb_sectors, getpid ());
> +                fgetc (stdin);
> +            }
> +        } else {
> +            ret = write (s->fd, buf, size);
> +        }
> +
> +        if (ret>= 0) {
> +            size -= ret;
> +            buf += ret;
>    

You'll hit an infinite loop on EOF.

> +        } else if (errno != EINTR) {
> +            fprintf (stderr, "Error: %s %s sector_num=%" PRId64
> +                     " nb_sectors=%d. Pause process %d for debugging...\n",
> +                     do_read ? "READ" : "WRITE", bs->filename, sector_num,
> +                     nb_sectors, getpid ());
> +            fgetc (stdin);
> +            return -errno;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int blksim_read (BlockDriverState * bs, int64_t sector_num,
> +                        uint8_t * buf, int nb_sectors)
> +{
> +    return do_io (bs, sector_num, buf, nb_sectors, TRUE);
> +}
> +
> +static int blksim_write (BlockDriverState * bs, int64_t sector_num,
> +                      const uint8_t * buf, int nb_sectors)
> +{
> +    return do_io (bs, sector_num, (uint8_t *) buf, nb_sectors, FALSE);
> +}
> +
> +static void insert_in_list (SimAIOCB * acb)
> +{
> +    int64_t new_id = sim_uuid++;
> +    CHECK_TASK (new_id);
> +    acb->uuid = new_id;
> +
> +    if (rand_time<= 0) {
> +        /* Working with qemu-io.c and not doing delay randomization.
> +         * Insert it to the tail. */
> +        acb->time = 0;
> +        acb->prev = head.prev;
> +        acb->next =&head;
> +        head.prev->next = acb;
> +        head.prev = acb;
> +        return;
> +    }
> +
> +    SimAIOCB *p = head.next;
> +
> +    if (acb->time>= 0) {
> +        /* Introduce a random delay to better trigger rare race conditions. */
> +        acb->time += random () % rand_time;
> +
> +        /* Find the position to insert. The list is sorted in ascending time. */
> +        while (1) {
> +            if (p->time>  acb->time) {
> +                break;
> +            }
> +            if (p->time == acb->time&&  (random () % 2 == 0)) {
> +                break;
> +            }
> +            p = p->next;
> +        }
> +    }
> +
> +    /* Insert acb before p. */
> +    acb->next = p;
> +    acb->prev = p->prev;
> +    p->prev->next = acb;
> +    p->prev = acb;
> +}
> +
> +/* Debug problems related to reusing task objects. Problem already solved.*/
> +#if 1
> +# define my_qemu_aio_get qemu_aio_get
> +# define my_qemu_aio_release qemu_aio_release
> +
> +#else
> +static SimAIOCB *search_task_list (SimAIOCB * acb)
> +{
> +    SimAIOCB *p;
> +    for (p = head.next; p !=&head; p = p->next) {
> +        if (p == acb) {
> +            return p;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static inline void *my_qemu_aio_get (AIOPool * pool, BlockDriverState * bs,
> +                                     BlockDriverCompletionFunc * cb,
> +                                     void *opaque)
> +{
> +    SimAIOCB *acb = (SimAIOCB *) qemu_aio_get (&sim_aio_pool, bs, cb, opaque);
> +    QDEBUG ("SIM: qemu_aio_get reuse old task%" PRId64 "\n", acb->uuid);
> +    ASSERT (!search_task_list (acb));
> +    return acb;
> +}
> +
> +static inline void my_qemu_aio_release (SimAIOCB * acb)
> +{
> +    QDEBUG ("SIM: qemu_aio_release task%" PRId64 "\n", acb->uuid);
> +    qemu_aio_release (acb);
> +}
> +#endif
> +
> +static BlockDriverAIOCB *insert_task (int op, BlockDriverState * bs,
> +                                      int64_t sector_num, QEMUIOVector * qiov,
> +                                      int nb_sectors,
> +                                      BlockDriverCompletionFunc * cb,
> +                                      void *opaque)
> +{
> +    SimAIOCB *acb = my_qemu_aio_get (&sim_aio_pool, bs, cb, opaque);
> +    if (!acb) {
> +        return NULL;
> +    }
> +
> +    acb->op = op;
> +    acb->sector_num = sector_num;
> +    acb->qiov = qiov;
> +    acb->nb_sectors = nb_sectors;
> +    acb->ret = disk_io_return_code;
> +    acb->time = current_time;
> +    insert_in_list (acb);
> +
> +    if (interactive_print) {
> +        if (op == SIM_READ) {
> +            printf ("Added READ uuid=%" PRId64 "  filename=%s  sector_num=%"
> +                    PRId64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else if (op == SIM_WRITE) {
> +            printf ("Added WRITE uuid=%" PRId64 "  filename=%s  sector_num=%"
> +                    PRId64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else {
> +            fprintf (stderr, "Unknown op %d\n", op);
> +            exit (1);
> +        }
> +    }
> +
> +    return&acb->common;
> +}
> +
> +static void insert_aio_callback (SimAIOCB * acb)
> +{
> +    acb->time = current_time;
> +    insert_in_list (acb);
> +
> +    if (acb->op == SIM_FLUSH) {
> +        acb->op = SIM_FLUSH_CALLBACK;
> +        if (interactive_print) {
> +            printf ("Added FLUSH_CALLBACK uuid=%" PRId64 "  filename=%s\n",
> +                    acb->uuid, acb->common.bs->filename);
> +        }
> +    } else if (acb->op == SIM_READ) {
> +        acb->op = SIM_READ_CALLBACK;
> +        if (interactive_print) {
> +            printf ("Added READ_CALLBACK uuid=%" PRId64
> +                    "  filename=%s  sector_num=%" PRId64 "  nb_sectors=%d\n",
> +                    acb->uuid, acb->common.bs->filename, acb->sector_num,
> +                    acb->nb_sectors);
> +        }
> +    } else if (acb->op == SIM_WRITE) {
> +        acb->op = SIM_WRITE_CALLBACK;
> +        if (interactive_print) {
> +            printf ("Added WRITE_CALLBACK uuid=%" PRId64
> +                    "  filename=%s  sector_num=%" PRId64 "  nb_sectors=%d\n",
> +                    acb->uuid, acb->common.bs->filename, acb->sector_num,
> +                    acb->nb_sectors);
> +        }
> +    } else {
> +        fprintf (stderr, "Wrong op %d\n", acb->op);
> +        exit (1);
> +    }
> +}
> +
> +void blksim_list_tasks (void)
> +{
> +    SimAIOCB *acb;
> +
> +    for (acb = head.next; acb !=&head; acb = acb->next) {
> +        if (acb->op == SIM_READ) {
> +            printf ("uuid=%" PRId64 "  READ           file=%s  sector_num=%"
> +                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else if (acb->op == SIM_WRITE) {
> +            printf ("uuid=%" PRId64 "  WRITE          file=%s  sector_num=%"
> +                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else if (acb->op == SIM_READ_CALLBACK) {
> +            printf ("uuid=%" PRId64 "  CALLBACK READ  file=%s  sector_num=%"
> +                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else if (acb->op == SIM_WRITE_CALLBACK) {
> +            printf ("uuid=%" PRId64 "  CALLBACK WRITE file=%s  sector_num=%"
> +                    PRIu64 "  nb_sectors=%d\n", acb->uuid,
> +                    acb->common.bs->filename, acb->sector_num, acb->nb_sectors);
> +        } else {
> +            fprintf (stderr, "Wrong OP %d\n", acb->op);
> +            exit (1);
> +        }
> +    }
> +}
> +
> +static inline void sim_callback (SimAIOCB * acb)
> +{
> +    acb->common.cb (acb->common.opaque, acb->ret);
> +}
> +
> +int64_t blksim_get_time (void)
> +{
> +    return current_time;
> +}
> +
> +void *blksim_new_timer (void *cb, void *opaque)
> +{
> +    SimAIOCB *acb = my_qemu_aio_get (&sim_aio_pool, NULL, cb, opaque);
> +    acb->op = SIM_TIMER;
> +    acb->prev = NULL;
> +    return acb;
> +}
> +
> +void blksim_mod_timer (void *ts, int64_t expire_time)
> +{
> +    SimAIOCB *acb = ts;
> +
> +    if (acb->prev) {
> +        /* Remove it first. */
> +        acb->next->prev = acb->prev;
> +        acb->prev->next = acb->next;
> +    }
> +    acb->time = expire_time;
> +    insert_in_list (acb);
> +
> +    if (interactive_print) {
> +        printf ("Added TIMER uuid=%" PRId64 "  expire_time=%"PRId64
> +                " current_time=%"PRId64"\n",
> +                acb->uuid, expire_time, current_time);
> +    }
> +}
> +
> +void blksim_free_timer (void *ts)
> +{
> +    SimAIOCB *acb = ts;
> +    CHECK_TASK (acb->uuid);
> +    my_qemu_aio_release (acb);
> +}
> +
> +void blksim_del_timer (void *ts)
> +{
> +    SimAIOCB *acb = ts;
> +
> +    CHECK_TASK (acb->uuid);
> +    if (acb->prev) {
> +        /* Remove it from the list. */
> +        acb->next->prev = acb->prev;
> +        acb->prev->next = acb->next;
> +
> +        /* Mark it as not in list. */
> +        acb->prev = NULL;
> +    }
> +}
> +
> +void blksim_bh_schedule (void *bh)
> +{
> +    if (instant_qemubh) {
> +        blksim_mod_timer (bh, -1);
> +    } else {
> +        blksim_mod_timer (bh, current_time);
> +    }
> +}
> +
> +void blksim_set_instant_qemubh (int instant)
> +{
> +    instant_qemubh = instant;
> +}
> +
> +void blksim_set_disk_io_return_code (int ret)
> +{
> +    disk_io_return_code = ret;
> +}
> +
> +static void run_task_by_acb (SimAIOCB * acb)
> +{
> +    CHECK_TASK (acb->uuid);
> +
> +    /* Remove it from the list. */
> +    acb->next->prev = acb->prev;
> +    acb->prev->next = acb->next;
> +    acb->prev = NULL;        /* Indicate that it is no longer in the list. */
> +
> +    if (acb->time>  current_time) {
> +        current_time = acb->time;
> +    }
> +
> +    if (acb->op == SIM_TIMER) {
> +        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " TIMER \n",
> +                acb->uuid, acb->time);
> +
> +        ((QEMUTimerCB *) acb->common.cb) (acb->common.opaque);
> +        return;
> +    }
> +
> +    BlockDriverState *bs = acb->common.bs;
> +
> +    if (acb->op == SIM_READ) {
> +        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64
> +                " READ %s sector_num=%" PRId64 " nb_sectors=%d\n",
> +                acb->uuid, acb->time, bs->filename, acb->sector_num,
> +                acb->nb_sectors);
> +
> +        if (acb->ret == 0) {
> +            if (acb->qiov->niov == 1) {
> +                if (blksim_read
> +                    (bs, acb->sector_num, acb->qiov->iov->iov_base,
> +                     acb->nb_sectors) != 0) {
> +                    fprintf (stderr, "Error in reading %s sector_num=%lld "
> +                             "nb_sectors=%d\n", acb->common.bs->filename,
> +                             acb->sector_num, acb->nb_sectors);
> +                    exit (1);
> +                }
> +            } else {
> +                uint8_t *buf=qemu_blockalign (acb->common.bs, acb->qiov->size);
> +                if (blksim_read (bs, acb->sector_num, buf,
> +                                 acb->nb_sectors) != 0) {
> +                    fprintf (stderr, "Error in reading %s sector_num=%lld "
> +                             "nb_sectors=%d\n", acb->common.bs->filename,
> +                             acb->sector_num, acb->nb_sectors);
> +                    exit (1);
> +                }
> +                qemu_iovec_from_buffer (acb->qiov, buf, acb->qiov->size);
> +                qemu_vfree (buf);
> +            }
> +        }
> +
> +        insert_aio_callback (acb);
> +    } else if (acb->op == SIM_WRITE) {
> +        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64
> +                " WRITE %s sector_num=%" PRId64 " nb_sectors=%d\n",
> +                acb->uuid, acb->time, bs->filename,
> +                acb->sector_num, acb->nb_sectors);
> +
> +        if (acb->ret == 0) {
> +            if (acb->qiov->niov == 1) {
> +                if (blksim_write (bs, acb->sector_num, acb->qiov->iov->iov_base,
> +                                  acb->nb_sectors) != 0) {
> +                    fprintf (stderr, "Error in writing %s sector_num=%lld "
> +                             "nb_sectors=%d\n", acb->common.bs->filename,
> +                             acb->sector_num, acb->nb_sectors);
> +                    exit (1);
> +                }
> +            } else {
> +                uint8_t *buf = qemu_blockalign (acb->common.bs,
> +                                                acb->qiov->size);
> +                qemu_iovec_to_buffer (acb->qiov, buf);
> +                if (blksim_write (bs, acb->sector_num, buf,
> +                                  acb->nb_sectors)!= 0) {
> +                    fprintf (stderr, "Error in writing %s sector_num=%lld "
> +                             "nb_sectors=%d\n", acb->common.bs->filename,
> +                             acb->sector_num, acb->nb_sectors);
> +                    exit (1);
> +                }
> +                qemu_vfree (buf);
> +            }
> +        }
> +
> +        insert_aio_callback (acb);
> +    } else if (acb->op == SIM_FLUSH) {
> +        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " FLUSH %s\n",
> +                acb->uuid, acb->time, bs->filename);
> +        /* Skip real flushing to speed up simulation:
> +         *         if (ret == 0) { * fdatasync (s->fd); } */
> +        insert_aio_callback (acb);
> +    } else if (acb->op == SIM_WRITE_CALLBACK || acb->op == SIM_READ_CALLBACK
> +               || acb->op == SIM_FLUSH_CALLBACK) {
> +        QDEBUG ("SIM: execute task%" PRId64 " time=%" PRId64 " CALLBACK\n",
> +                acb->uuid, acb->time);
> +        sim_callback (acb);
> +        CHECK_TASK (acb->uuid);
> +        my_qemu_aio_release (acb);
> +    } else {
> +        fprintf (stderr, "Unknown op %d\n", acb->op);
> +        exit (1);
> +    }
> +}
> +
> +int blksim_run_task_by_uuid (int64_t uuid)
> +{
> +    SimAIOCB *acb;
> +
> +    for (acb = head.next; acb !=&head; acb = acb->next) {
> +        if (acb->uuid == uuid) {
> +            run_task_by_acb (acb);
> +            return 0;
> +        }
> +    }
> +
> +    return -1;
> +}
> +
> +int blksim_run_all_tasks (void)
> +{
> +    int n = 0;
> +
> +    while (1) {
> +        SimAIOCB *acb = head.next;
> +        if (acb ==&head) {
> +            return n; /* No more tasks.*/
> +        }
> +
> +        run_task_by_acb (acb);
> +        n++;
> +    }
> +}
> +
> +static BlockDriverAIOCB *blksim_aio_readv (BlockDriverState * bs,
> +                                        int64_t sector_num,
> +                                        QEMUIOVector * qiov,
> +                                        int nb_sectors,
> +                                        BlockDriverCompletionFunc * cb,
> +                                        void *opaque)
> +{
> +    return insert_task (SIM_READ, bs, sector_num, qiov, nb_sectors, cb, opaque);
> +}
> +
> +static BlockDriverAIOCB *blksim_aio_writev (BlockDriverState * bs,
> +                                         int64_t sector_num,
> +                                         QEMUIOVector * qiov,
> +                                         int nb_sectors,
> +                                         BlockDriverCompletionFunc * cb,
> +                                         void *opaque)
> +{
> +    return insert_task (SIM_WRITE, bs, sector_num, qiov, nb_sectors, cb,
> +                        opaque);
> +}
> +
> +static BlockDriverAIOCB *blksim_aio_flush (BlockDriverState * bs,
> +                                        BlockDriverCompletionFunc * cb,
> +                                        void *opaque)
> +{
> +    return insert_task (SIM_FLUSH, bs, 0, NULL, 0, cb, opaque);
> +}
> +
> +static void sim_aio_cancel (BlockDriverAIOCB * blockacb)
> +{
> +    SimAIOCB *acb = container_of (blockacb, SimAIOCB, common);
> +
> +    CHECK_TASK (acb->uuid);
> +    QDEBUG ("SIM: cancel task%" PRId64 "\n", acb->uuid);
> +
> +    if (acb->prev) {
> +        acb->next->prev = acb->prev;
> +        acb->prev->next = acb->next;
> +        acb->prev = NULL;
> +        my_qemu_aio_release (acb);
> +    } else {
> +        fprintf (stderr, "Error: cancel a blksim task that does not exist: "
> +                 "uuid=%"PRId64". Halt process %d for debugging...\n",
> +                 acb->uuid, getpid());
> +        fgetc (stdin);
> +        exit (1);
> +    }
> +}
> +
> +static int blksim_open (BlockDriverState * bs, const char *filename,
> +                     int bdrv_flags)
> +{
> +    BDRVSimState *s = bs->opaque;
> +    int open_flags = O_BINARY | O_LARGEFILE;
> +
> +    blksim_invoked  = TRUE;
> +
> +    if ((bdrv_flags&  BDRV_O_RDWR)) {
> +        open_flags |= O_RDWR;
> +    } else {
> +        open_flags |= O_RDONLY;
> +    }
> +
> +    if ((bdrv_flags&  BDRV_O_NOCACHE)) {
> +        open_flags |= O_DIRECT;
> +    } else if (!(bdrv_flags&  BDRV_O_CACHE_WB)) {
> +        open_flags |= O_DSYNC;
> +    }
> +
> +    /* Parse the "blksim:" prefix */
> +    if (!strncmp(filename, "blksim:", strlen("blksim:"))) {
> +        filename += strlen("blksim:");
> +    }
> +
> +    s->fd = open (filename, open_flags);
> +    if (s->fd<  0)
> +        return -1;
> +
> +    int64_t len = lseek (s->fd, 0, SEEK_END);
> +    if (len>= 0) {
> +        bs->total_sectors = len / 512;
> +    } else {
> +        bs->total_sectors = 0;
> +    }
> +
> +    bs->growable = 1;
> +    return 0;
> +}
> +
> +static void blksim_close (BlockDriverState * bs)
> +{
> +    BDRVSimState *s = bs->opaque;
> +    close (s->fd);
> +}
> +
> +static int blksim_flush (BlockDriverState * bs)
> +{
> +    /*
> +     * Skip real flushing to speed up simulation.
> +         * BDRVSimState *s = bs->opaque;
> +         * fdatasync (s->fd);
> +     */
> +    return 0;
> +}
> +
> +static int blksim_has_zero_init (BlockDriverState * bs)
> +{
> +    struct stat buf;
> +
> +    if (stat (bs->filename,&buf) != 0) {
> +        fprintf (stderr, "Failed to stat() %s\n", bs->filename);
> +        exit (1);
> +    }
> +
> +    if (S_ISBLK (buf.st_mode) || S_ISCHR (buf.st_mode)) {
> +        return 0;
> +    }
> +
> +    return 1;
> +}
> +
> +static int blksim_truncate (BlockDriverState * bs, int64_t offset)
> +{
> +    BDRVSimState *s = bs->opaque;
> +    return ftruncate (s->fd, offset);
> +}
> +
> +static BlockDriver bdrv_blksim = {
> +    .format_name = "blksim",
> +    .protocol_name = "blksim",
> +    .instance_size = sizeof (BDRVSimState),
> +    .bdrv_file_open = blksim_open,
> +    .bdrv_close = blksim_close,
> +    .bdrv_flush = blksim_flush,
> +    .bdrv_read = blksim_read,
> +    .bdrv_write = blksim_write,
> +    .bdrv_aio_readv = blksim_aio_readv,
> +    .bdrv_aio_writev = blksim_aio_writev,
> +    .bdrv_aio_flush = blksim_aio_flush,
> +    .bdrv_has_zero_init = blksim_has_zero_init,
> +    .bdrv_truncate = blksim_truncate,
> +};
> +
> +static void bdrv_blksim_init(void)
> +{
> +    bdrv_register(&bdrv_blksim);
> +}
> +block_init(bdrv_blksim_init);
> +
> +void init_blksim (int print, int64_t _rand_time)
> +{
> +    interactive_print = print;
> +    rand_time = _rand_time;
> +}
> +
> +/*
> + * To work properly in the simulation mode, block device drivers that
> + * explicitly invoke qemu_aio_wait() should invoke blksim_qemu_aio_wait() if
> + * the block device is openned using blksim. Most block device drivers do not
> + * invoke qemu_aio_wait() and hence should not be concerned about this.
> + */
> +int blksim_qemu_aio_wait (void)
> +{
> +    SimAIOCB *acb = head.next;
> +    if (acb ==&head) {
> +        return 0;
> +    }
> +    else {
> +        run_task_by_acb (acb);
> +        return 1;
> +    }
> +}
> +
> +int blksim_has_task (void)
> +{
> +    return head.next !=&head;
> +}
> +
> +int using_blksim (void)
> +{
> +    return blksim_invoked;
> +}
> diff --git a/block/blksim.h b/block/blksim.h
> new file mode 100644
> index 0000000..fa1e20d
> --- /dev/null
> +++ b/block/blksim.h
> @@ -0,0 +1,35 @@
> +/*
> + * Copyright (c) 2010-2011 IBM
> + *
> + * Authors:
> + *         Chunqiang Tang<ctang@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +/*=============================================================================
> + *  A short description: this is the header of the simulated block device
> + *  driver "blksim".
> + *============================================================================*/
> +
> +#ifndef __block_sim_h__
> +#define __block_sim_h__
>    

Coding style.

In general, I like the idea of the simulator but the coding style is off 
quite a bit.

Regards,

Anthony Liguori

> +void init_blksim (int print, int64_t _rand_time);
> +int using_blksim (void);
> +int blksim_has_task (void);
> +void blksim_list_tasks (void);
> +int blksim_run_task_by_uuid (int64_t uuid);
> +int blksim_run_all_tasks (void);
> +int64_t blksim_get_time (void);
> +void *blksim_new_timer (void *cb, void *opaque);
> +void blksim_mod_timer (void *ts, int64_t expire_time);
> +void blksim_free_timer (void *ts);
> +void blksim_del_timer (void *ts);
> +void blksim_bh_schedule (void *bh);
> +void blksim_set_disk_io_return_code (int ret);
> +int blksim_qemu_aio_wait(void);
> +void blksim_set_instant_qemubh (int instant /* TRUE or FALSE */);
> +
> +#endif
>    

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-21 22:49   ` Anthony Liguori
@ 2011-01-22  3:09     ` Chunqiang Tang
  2011-01-23 23:26       ` Anthony Liguori
  0 siblings, 1 reply; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-22  3:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

> Coding style.
> 
> In general, I like the idea of the simulator but the coding style is off 

> quite a bit.

Please be specific and I would be happy to take suggestions. The header 
issue should be easy to fix. 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
  2011-01-21 22:49   ` Anthony Liguori
@ 2011-01-23 15:26   ` Andreas Färber
  2011-01-25 16:54     ` Chunqiang Tang
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Färber @ 2011-01-23 15:26 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, QEMU Developers

Am 21.01.2011 um 23:19 schrieb Chunqiang Tang:

> diff --git a/block/blksim.c b/block/blksim.c
> new file mode 100644
> index 0000000..a92ba11
> --- /dev/null
> +++ b/block/blksim.c

Some formal comments, since you're introducing a new file:

> @@ -0,0 +1,752 @@
> +/*

Headers usually start with a one-line summary, "QEMU simulated block  
driver" maybe?

> + * Copyright (c) 2010-2011 IBM
> + *
> + * Authors:
> + *         Chunqiang Tang <ctang@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.

Can you make this GPLv2-or-later to avoid future hassles?

> +#ifndef TRUE
> +# define TRUE 1
> +#endif
> +
> +#ifndef FALSE
> +# define FALSE 0
> +#endif

I don't think these two belong here.

stdbool.h defines true and false with identical values.
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/stdbool.h.html

Not sure about TRUE and FALSE.
If we want them as local definitions, they should rather go to qemu- 
common.h than to individual source files.

Regards,
Andreas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-22  3:09     ` Chunqiang Tang
@ 2011-01-23 23:26       ` Anthony Liguori
  2011-01-24 15:07         ` Chunqiang Tang
  0 siblings, 1 reply; 16+ messages in thread
From: Anthony Liguori @ 2011-01-23 23:26 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On 01/21/2011 09:09 PM, Chunqiang Tang wrote:
>> Coding style.
>>
>> In general, I like the idea of the simulator but the coding style is off
>>      
>    
>> quite a bit.
>>      
> Please be specific and I would be happy to take suggestions. The header
> issue should be easy to fix.
>    

Read CODING_STYLE and go through your code.

The code pretty consistently doesn't follow it.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-23 23:26       ` Anthony Liguori
@ 2011-01-24 15:07         ` Chunqiang Tang
  0 siblings, 0 replies; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-24 15:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

> Read CODING_STYLE and go through your code.

Went through CODING_STYLE. The white space issue in FVD was already been 
fixed previously. FVD’s variable and type names are fine, and line width 
is fine. The only remaining issue in FVD is '}' before 'else', which will 
be fixed. CODING_STYLE does not require, but I noticed through example 
that, function calls are 'do_something()', while FVD uses 'do_something 
()' (with a white space before '()'). Is this a hard requirement and need 
be fixed? Is there anything else that are not specified in CODING_STYLE 
but is adopted in QEMU by convention? I would like to take all suggestions 
and fix code style in one pass, rather than doing it again and again. 
Thanks.

CODING_STYLE:
    if (a == 5) {
    } else if (a == 6) {
    }

FVD: 
    if (a == 5) {
    } 
    else if (a == 6) {
    }

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver
  2011-01-23 15:26   ` Andreas Färber
@ 2011-01-25 16:54     ` Chunqiang Tang
  0 siblings, 0 replies; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-25 16:54 UTC (permalink / raw)
  To: Andreas Färber; +Cc: Kevin Wolf, QEMU Developers

> Headers usually start with a one-line summary, "QEMU simulated block 
> driver" maybe?
> > + * Copyright (c) 2010-2011 IBM
> > + *
> > + * Authors:
> > + *         Chunqiang Tang <ctang@us.ibm.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + * See the COPYING file in the top-level directory.
> 
> Can you make this GPLv2-or-later to avoid future hassles?

Will do.

> > +#ifndef TRUE
> > +# define TRUE 1
> > +#endif
> > +
> > +#ifndef FALSE
> > +# define FALSE 0
> > +#endif
> 
> I don't think these two belong here.
> 
> stdbool.h defines true and false with identical values.
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/stdbool.h.html
> 
> Not sure about TRUE and FALSE.
> If we want them as local definitions, they should rather go to qemu- 
> common.h than to individual source files.

You are right. "true" and "false" should be used instead, and 
qemu-common.h already includes stdbool.h.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-21 22:19 [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Chunqiang Tang
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
  2011-01-21 22:19 ` [Qemu-devel] [PATCH 3/3] FVD: Made qemu-io working with simulation (blksim) Chunqiang Tang
@ 2011-01-28  9:57 ` Stefan Hajnoczi
  2011-01-28 14:51   ` Chunqiang Tang
  2 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-01-28  9:57 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
> This patch adds the 'update' command to qemu-img. FVD stores various
> image-specific configurable parameters in the image header. A user can use
> 'qemu-img update' to modify those parameters and accordingly controls FVD's
> runtime behavior. This command may also be leveraged by other block device
> drivers, e.g., to set the size of the in-memory metadata cache. Currently
> those parameters are hard-coded in a one-size-fit-all manner.

There's a high risk that users will try this command while the VM is
running.  A safe-guard is needed here in order to avoid corrupting the
image.

Please use qemu-option.h instead of int argc, char **argv just like
qemu-img create -o does.

Finally, is this interface really necessary?  As a developer it can be
useful to tweak image values (in QED I actually have a free-standing
tool that can query and manipulate image internals).  But should users
need to micromanage every image file in order to achieve desired
functionality/performance?  What's the real need here?

> diff --git a/qemu-img.c b/qemu-img.c
> index afd9ed2..5f35c4d 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1054,6 +1054,49 @@ static int img_info(int argc, char **argv)
>      return 0;
>  }
>  
> +static int img_update(int argc, char **argv)
> +{
> +    int c;
> +    const char *filename, *fmt;
> +    BlockDriverState *bs;
> +
> +    fmt = NULL;
> +    for(;;) {
> +        c = getopt(argc, argv, "f:h");
> +        if (c == -1)
> +            break;

{}, see CODING_STYLE and HACKING.

> +        switch(c) {
> +        case 'h':
> +            help();
> +            break;
> +        case 'f':
> +            fmt = optarg;
> +            break;
> +        }
> +    }
> +    if (optind >= argc)
> +        help();

{}

> +    filename = argv[optind++];
> +    bs = bdrv_new_open(filename, fmt, BDRV_O_FLAGS | BDRV_O_NO_BACKING
> +                       | BDRV_O_RDWR);
> +    if (!bs) {
> +        return 1;
> +    }
> +
> +    if (bs->drv->bdrv_update) {
> +        bs->drv->bdrv_update(bs, argc-optind, &argv[optind]);
> +    }
> +    else {

} else {, see CODING_STYLE

> +        char fmt_name[128];
> +        bdrv_get_format(bs, fmt_name, sizeof(fmt_name));
> +        error_report ("The 'update' command is not supported for "
> +                      "the '%s' image format.", fmt_name);

Return value should be 1?

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-28  9:57 ` [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Stefan Hajnoczi
@ 2011-01-28 14:51   ` Chunqiang Tang
  2011-01-28 16:16     ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-28 14:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
> > This patch adds the 'update' command to qemu-img. FVD stores various
> > image-specific configurable parameters in the image header. A user can 
use
> > 'qemu-img update' to modify those parameters and accordingly controls 
FVD's
> > runtime behavior. This command may also be leveraged by other block 
device
> > drivers, e.g., to set the size of the in-memory metadata cache. 
Currently
> > those parameters are hard-coded in a one-size-fit-all manner.
> 
> There's a high risk that users will try this command while the VM is
> running.  A safe-guard is needed here in order to avoid corrupting the
> image.

Good observation and this should be added. In FVD, once an image is open, 
a dirty bit is set in the image header. qemu-img update can refuse to work 
or ask for confirmation if it sees the dirty bit.
 
> Please use qemu-option.h instead of int argc, char **argv just like
> qemu-img create -o does.

Will do.
 
> Finally, is this interface really necessary?  As a developer it can be
> useful to tweak image values (in QED I actually have a free-standing
> tool that can query and manipulate image internals).  But should users
> need to micromanage every image file in order to achieve desired
> functionality/performance?  What's the real need here?

Default values of the image parameters may be suitable for most users, but 
there are cases where more control is needed, not only for performance 
tuning but also for functional control. This is especially true for 
copy-on-read and prefetching, and when many VMs are created based on the 
same image template (like that in a Cloud or the publicly shared VMware 
appliance http://www.vmware.com/appliances/). For example, the image 
template may turn on copy-on-read and prefetching, but a user wants to set 
it off. Even for prefetching, there are parameters that control when to 
start prefetching and how much bandwidth prefetching can use (as 
prefetching has to be conservative). This can be different in different 
environments, e.g., 100MB vs 10GB Ethernet, SAN disk array  vs. a local 
disk. Without control, users share the same image template may often find 
it hard to use.

Performance tuning may also be important. Because QEMU supports so many 
diverse platforms, we cannot expect one size fit all. Specifically, when 
doing benchmarking, I found QCOW2's metadata cache was way too small even 
for my 32-bit host, but I simply couldn't control that.


Regards,
ChunQiang (CQ) Tang, 
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-28 14:51   ` Chunqiang Tang
@ 2011-01-28 16:16     ` Stefan Hajnoczi
  2011-01-28 21:26       ` Chunqiang Tang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-01-28 16:16 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Fri, Jan 28, 2011 at 2:51 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
>> > This patch adds the 'update' command to qemu-img. FVD stores various
>> > image-specific configurable parameters in the image header. A user can
> use
>> > 'qemu-img update' to modify those parameters and accordingly controls
> FVD's
>> > runtime behavior. This command may also be leveraged by other block
> device
>> > drivers, e.g., to set the size of the in-memory metadata cache.
> Currently
>> > those parameters are hard-coded in a one-size-fit-all manner.
>>
>> There's a high risk that users will try this command while the VM is
>> running.  A safe-guard is needed here in order to avoid corrupting the
>> image.
>
> Good observation and this should be added. In FVD, once an image is open,
> a dirty bit is set in the image header. qemu-img update can refuse to work
> or ask for confirmation if it sees the dirty bit.
>
>> Please use qemu-option.h instead of int argc, char **argv just like
>> qemu-img create -o does.
>
> Will do.
>
>> Finally, is this interface really necessary?  As a developer it can be
>> useful to tweak image values (in QED I actually have a free-standing
>> tool that can query and manipulate image internals).  But should users
>> need to micromanage every image file in order to achieve desired
>> functionality/performance?  What's the real need here?
>
> Default values of the image parameters may be suitable for most users, but
> there are cases where more control is needed, not only for performance
> tuning but also for functional control. This is especially true for
> copy-on-read and prefetching, and when many VMs are created based on the
> same image template (like that in a Cloud or the publicly shared VMware
> appliance http://www.vmware.com/appliances/). For example, the image
> template may turn on copy-on-read and prefetching, but a user wants to set
> it off. Even for prefetching, there are parameters that control when to
> start prefetching and how much bandwidth prefetching can use (as
> prefetching has to be conservative). This can be different in different
> environments, e.g., 100MB vs 10GB Ethernet, SAN disk array  vs. a local
> disk. Without control, users share the same image template may often find
> it hard to use.

It should be possible to change prefetching and copy-on-read while the
VM is running.  For example, having to shut down a VM in order to
pause the prefetching is not workable.  In the QED image streaming
tree there are monitor commands for this:

http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-28 16:16     ` Stefan Hajnoczi
@ 2011-01-28 21:26       ` Chunqiang Tang
  2011-01-29 10:05         ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-28 21:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> It should be possible to change prefetching and copy-on-read while the
> VM is running.  For example, having to shut down a VM in order to
> pause the prefetching is not workable.  In the QED image streaming
> tree there are monitor commands for this:
> 
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command

I took a quick look at the code. Using a monitor command to dynamically 
control copy-on-read and prefetching is a good idea. This should be 
adopted in FVD as well. 

On another note, I saw that the code does not support (copy_on_read=off && 
stream=on). My previous benchmarking shows that copy_on_read does slow 
down other normal reads and writes, because it needs to save data to disk. 
For example, numbers in my papers show that, on ext3, FVD with 
copy_on_read=on actually boots a VM slower than QCOW2 does, even if FVD's 
copy-on-read is already heavily optimized and is not on the  critical path 
of read (i.e., callback is invoked and data is returned to the VM first, 
and then save copy-on-read data asynchronously in the background). 
Therefore, it might be possible that a user does not want to enable 
copy-on-read and only wants to do prefetching when resources are idle.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-28 21:26       ` Chunqiang Tang
@ 2011-01-29 10:05         ` Stefan Hajnoczi
  2011-01-31 14:49           ` Chunqiang Tang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-01-29 10:05 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Fri, Jan 28, 2011 at 9:26 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> It should be possible to change prefetching and copy-on-read while the
>> VM is running.  For example, having to shut down a VM in order to
>> pause the prefetching is not workable.  In the QED image streaming
>> tree there are monitor commands for this:
>>
>> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command
>
> I took a quick look at the code. Using a monitor command to dynamically
> control copy-on-read and prefetching is a good idea. This should be
> adopted in FVD as well.

After thinking about it more, qemu-img update does also serve a
purpose.  Sometimes it is necessary to set options on many images in
bulk or from provisioning scripts instead of at runtime.

I guess my main fear of qemu-img update is that it adds a new
interface that only FVD exploits so far.  If it never catches on with
other formats then we have this special feature that must be
maintained but is rarely used.  I'd hold off this patch until code
that can make use of it has been merged into qemu.git.

> On another note, I saw that the code does not support (copy_on_read=off &&
> stream=on). My previous benchmarking shows that copy_on_read does slow
> down other normal reads and writes, because it needs to save data to disk.
> For example, numbers in my papers show that, on ext3, FVD with
> copy_on_read=on actually boots a VM slower than QCOW2 does, even if FVD's
> copy-on-read is already heavily optimized and is not on the  critical path
> of read (i.e., callback is invoked and data is returned to the VM first,
> and then save copy-on-read data asynchronously in the background).
> Therefore, it might be possible that a user does not want to enable
> copy-on-read and only wants to do prefetching when resources are idle.

The current implementation basically takes advantage of copy-on-read
in order to populate the image.

There's a lot of room for studying the behavior and making
improvements.  Coming up with throttling strategies that make the
prefetch I/O an "idle task" only when there's bandwidth available is
difficult because the problem is more complex than just one greedy
QEMU process.  In a cloud environment there will be any physical
hosts, each with multiple VMs, on a shared network and no single QEMU
process has global knowledge.  It's more like TCP where you need to
try seeing how much data the connection can carry, fall back on packet
loss, and then gradually try again.  But I'm not sure we have a
feedback mechanism to say "you're doing too much prefetching".

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-29 10:05         ` Stefan Hajnoczi
@ 2011-01-31 14:49           ` Chunqiang Tang
  2011-02-01 13:53             ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: Chunqiang Tang @ 2011-01-31 14:49 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> After thinking about it more, qemu-img update does also serve a
> purpose.  Sometimes it is necessary to set options on many images in
> bulk or from provisioning scripts instead of at runtime.
> 
> I guess my main fear of qemu-img update is that it adds a new
> interface that only FVD exploits so far.  If it never catches on with
> other formats then we have this special feature that must be
> maintained but is rarely used.  I'd hold off this patch until code
> that can make use of it has been merged into qemu.git.

I am fine with holding it off. Actually, 'qemu-img rebase' is already a 
special type of 'qemu-img update'. Without a general update interface, 
every parameter change would have to use a special interface like 
'rebase'.

> There's a lot of room for studying the behavior and making
> improvements.  Coming up with throttling strategies that make the
> prefetch I/O an "idle task" only when there's bandwidth available is
> difficult because the problem is more complex than just one greedy
> QEMU process.  In a cloud environment there will be any physical
> hosts, each with multiple VMs, on a shared network and no single QEMU
> process has global knowledge.  It's more like TCP where you need to
> try seeing how much data the connection can carry, fall back on packet
> loss, and then gradually try again.  But I'm not sure we have a
> feedback mechanism to say "you're doing too much prefetching".

Your observation on the similarity to TCP is incisive. To my knowledge, 
the two papers below are the most relevant work on congestion control for 
VM storage. I evaluated [1] below, and found that it is not most reliable. 
I copied some text from my paper below. Basically, since a storage system 
does not have packet loss, there are two ways of detecting congestion in a 
storage system: 1) making decisions based on increased latency, or 2) 
making decisions based on reduced throughput. The paper [1] below uses 
latency but I found it not always reliable. The current implementation in 
FVD is based on throughput, which tends to be more reliable. But as you 
said,this is still a problem that has a lot of room for future study.

===========
"[1] also performs adaptive prefetching. It halves the prefetch rate
if a certain “percentage” of recent requests experiencing
a high latency. Our experiments show that it is hard to set
a proper “percentage” to reliably detect contentions. Because
storage servers and disk controllers perform readahead
in large chunks for sequential reads, a very large
percentage (e.g., 90%) of a VM’s prefetching reads hit in
read-ahead caches and experience a low latency. When a
storage server becomes busy, the “percentage” of requests
that hit in read-ahead caches may change little, but the response
time of those cache-miss requests may increase
dramatically. In other words, this “percentage” does not
correlate well with the achieved disk I/O throughput."
============

The paper [2] below from VMware is informative but cannot be adopted by us 
directly, as the problem domain is different. I previously had a paper on 
general congestion control, 
https://sites.google.com/site/tangchq/papers/NCI-USENIX09.pdf?attredirects=0 
, and attended a conference together with an author of paper [2] , and had 
some discussions. It is an interesting paper.

[1] http://suif.stanford.edu/papers/nsdi05.pdf . 
[2] http://www.usenix.org/events/fast09/tech/full_papers/gulati/gulati.pdf 

 


Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update'
  2011-01-31 14:49           ` Chunqiang Tang
@ 2011-02-01 13:53             ` Stefan Hajnoczi
  0 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2011-02-01 13:53 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Mon, Jan 31, 2011 at 2:49 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> The paper [2] below from VMware is informative but cannot be adopted by us
> directly, as the problem domain is different. I previously had a paper on
> general congestion control,
> https://sites.google.com/site/tangchq/papers/NCI-USENIX09.pdf?attredirects=0
> , and attended a conference together with an author of paper [2] , and had
> some discussions. It is an interesting paper.
>
> [1] http://suif.stanford.edu/papers/nsdi05.pdf .
> [2] http://www.usenix.org/events/fast09/tech/full_papers/gulati/gulati.pdf

Thanks for the links, I will take a look.  I can see both latency and
throughput being problematic.  With throughput the problem is that the
I/O pattern may be sporadic or bursty.  It's hard for a heuristic to
follow large changes in the guest's I/O pattern.

Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-02-01 13:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-21 22:19 [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Chunqiang Tang
2011-01-21 22:19 ` [Qemu-devel] [PATCH 2/3] FVD: Added the simulated 'blksim' driver Chunqiang Tang
2011-01-21 22:49   ` Anthony Liguori
2011-01-22  3:09     ` Chunqiang Tang
2011-01-23 23:26       ` Anthony Liguori
2011-01-24 15:07         ` Chunqiang Tang
2011-01-23 15:26   ` Andreas Färber
2011-01-25 16:54     ` Chunqiang Tang
2011-01-21 22:19 ` [Qemu-devel] [PATCH 3/3] FVD: Made qemu-io working with simulation (blksim) Chunqiang Tang
2011-01-28  9:57 ` [Qemu-devel] [PATCH 1/3] FVD: Added support for 'qemu-img update' Stefan Hajnoczi
2011-01-28 14:51   ` Chunqiang Tang
2011-01-28 16:16     ` Stefan Hajnoczi
2011-01-28 21:26       ` Chunqiang Tang
2011-01-29 10:05         ` Stefan Hajnoczi
2011-01-31 14:49           ` Chunqiang Tang
2011-02-01 13:53             ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.