* [RFC v9 00/27] virtio: virtio-blk data plane
@ 2012-07-18 15:07 ` Stefan Hajnoczi
  0 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

This series implements a dedicated thread for virtio-blk processing, using Linux
AIO for raw image files only.  It is based on qemu-kvm.git a0bc8c3 and is
somewhat old, but I wanted to share it on the list since it has been mentioned
on mailing lists and IRC recently.

These patches can be used for benchmarking and discussion about how to improve
block performance.  Paolo Bonzini has also worked in this area and might want
to share his patches.

The basic approach is:
1. Each virtio-blk device has a thread dedicated to handling ioeventfd
   signalling when the guest kicks the virtqueue.
2. Requests are processed without going through the QEMU block layer using
   Linux AIO directly.
3. Completion interrupts are injected via ioctl from the dedicated thread.

The series also contains request merging as a bdrv_aio_multiwrite() equivalent.
This was added only to get a fair comparison against the QEMU block layer, and
I would drop it for other types of analysis.

The effect of this series is that O_DIRECT Linux AIO on raw files can bypass
the QEMU global mutex and block layer.  This means higher performance.

A cleaned-up version of this approach could be added to QEMU as a raw O_DIRECT
Linux AIO fast path.  Image file formats, protocols, and other block layer
features are not supported by virtio-blk-data-plane.

Git repo:
http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane

Stefan Hajnoczi (27):
  virtio-blk: Remove virtqueue request handling code
  virtio-blk: Set up host notifier for data plane
  virtio-blk: Data plane thread event loop
  virtio-blk: Map vring
  virtio-blk: Do cheapest possible memory mapping
  virtio-blk: Take PCI memory range into account
  virtio-blk: Put dataplane code into its own directory
  virtio-blk: Read requests from the vring
  virtio-blk: Add Linux AIO queue
  virtio-blk: Stop data plane thread cleanly
  virtio-blk: Indirect vring and flush support
  virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
  virtio-blk: Increase max requests for indirect vring
  virtio-blk: Use pthreads instead of qemu-thread
  notifier: Add a function to set the notifier
  virtio-blk: Kick data plane thread using event notifier set
  virtio-blk: Use guest notifier to raise interrupts
  virtio-blk: Call ioctl() directly instead of irqfd
  virtio-blk: Disable guest->host notifies while processing vring
  virtio-blk: Add ioscheduler to detect mergable requests
  virtio-blk: Add basic request merging
  virtio-blk: Fix request merging
  virtio-blk: Stub out SCSI commands
  virtio-blk: fix incorrect length
  msix: fix irqchip breakage in msix_try_notify_from_thread()
  msix: use upstream kvm_irqchip_set_irq()
  virtio-blk: add EVENT_IDX support to dataplane

 event_notifier.c          |    7 +
 event_notifier.h          |    1 +
 hw/dataplane/event-poll.h |  116 +++++++
 hw/dataplane/ioq.h        |  128 ++++++++
 hw/dataplane/iosched.h    |   97 ++++++
 hw/dataplane/vring.h      |  334 ++++++++++++++++++++
 hw/msix.c                 |   15 +
 hw/msix.h                 |    1 +
 hw/virtio-blk.c           |  753 +++++++++++++++++++++------------------------
 hw/virtio-pci.c           |    8 +
 hw/virtio.c               |    9 +
 hw/virtio.h               |    3 +
 12 files changed, 1074 insertions(+), 398 deletions(-)
 create mode 100644 hw/dataplane/event-poll.h
 create mode 100644 hw/dataplane/ioq.h
 create mode 100644 hw/dataplane/iosched.h
 create mode 100644 hw/dataplane/vring.h

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [RFC v9 01/27] virtio-blk: Remove virtqueue request handling code
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, Khoa Huynh, Paolo Bonzini, Asias He

Start with a clean slate, a virtio-blk device that supports virtio
lifecycle operations and configuration but doesn't do any actual I/O.
The I/O is going to happen in a separate optimized data plane thread.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |  496 +------------------------------------------------------
 1 file changed, 3 insertions(+), 493 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 49990f8..a627427 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -16,18 +16,12 @@
 #include "trace.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
-#include "scsi-defs.h"
-#ifdef __linux__
-# include <scsi/sg.h>
-#endif
 
 typedef struct VirtIOBlock
 {
     VirtIODevice vdev;
     BlockDriverState *bs;
     VirtQueue *vq;
-    void *rq;
-    QEMUBH *bh;
     BlockConf *conf;
     char *serial;
     unsigned short sector_mask;
@@ -39,439 +33,11 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-typedef struct VirtIOBlockReq
-{
-    VirtIOBlock *dev;
-    VirtQueueElement elem;
-    struct virtio_blk_inhdr *in;
-    struct virtio_blk_outhdr *out;
-    struct virtio_scsi_inhdr *scsi;
-    QEMUIOVector qiov;
-    struct VirtIOBlockReq *next;
-    BlockAcctCookie acct;
-} VirtIOBlockReq;
-
-static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
-{
-    VirtIOBlock *s = req->dev;
-
-    trace_virtio_blk_req_complete(req, status);
-
-    stb_p(&req->in->status, status);
-    virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
-    virtio_notify(&s->vdev, s->vq);
-}
-
-static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
-    int is_read)
-{
-    BlockErrorAction action = bdrv_get_on_error(req->dev->bs, is_read);
-    VirtIOBlock *s = req->dev;
-
-    if (action == BLOCK_ERR_IGNORE) {
-        bdrv_emit_qmp_error_event(s->bs, BDRV_ACTION_IGNORE, is_read);
-        return 0;
-    }
-
-    if ((error == ENOSPC && action == BLOCK_ERR_STOP_ENOSPC)
-            || action == BLOCK_ERR_STOP_ANY) {
-        req->next = s->rq;
-        s->rq = req;
-        bdrv_emit_qmp_error_event(s->bs, BDRV_ACTION_STOP, is_read);
-        vm_stop(RUN_STATE_IO_ERROR);
-        bdrv_iostatus_set_err(s->bs, error);
-    } else {
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
-        bdrv_acct_done(s->bs, &req->acct);
-        g_free(req);
-        bdrv_emit_qmp_error_event(s->bs, BDRV_ACTION_REPORT, is_read);
-    }
-
-    return 1;
-}
-
-static void virtio_blk_rw_complete(void *opaque, int ret)
-{
-    VirtIOBlockReq *req = opaque;
-
-    trace_virtio_blk_rw_complete(req, ret);
-
-    if (ret) {
-        int is_read = !(ldl_p(&req->out->type) & VIRTIO_BLK_T_OUT);
-        if (virtio_blk_handle_rw_error(req, -ret, is_read))
-            return;
-    }
-
-    virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
-    bdrv_acct_done(req->dev->bs, &req->acct);
-    g_free(req);
-}
-
-static void virtio_blk_flush_complete(void *opaque, int ret)
-{
-    VirtIOBlockReq *req = opaque;
-
-    if (ret) {
-        if (virtio_blk_handle_rw_error(req, -ret, 0)) {
-            return;
-        }
-    }
-
-    virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
-    bdrv_acct_done(req->dev->bs, &req->acct);
-    g_free(req);
-}
-
-static VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
-{
-    VirtIOBlockReq *req = g_malloc(sizeof(*req));
-    req->dev = s;
-    req->qiov.size = 0;
-    req->next = NULL;
-    return req;
-}
-
-static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
-{
-    VirtIOBlockReq *req = virtio_blk_alloc_request(s);
-
-    if (req != NULL) {
-        if (!virtqueue_pop(s->vq, &req->elem)) {
-            g_free(req);
-            return NULL;
-        }
-    }
-
-    return req;
-}
-
-#ifdef __linux__
-static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
-{
-    struct sg_io_hdr hdr;
-    int ret;
-    int status;
-    int i;
-
-    if ((req->dev->vdev.guest_features & (1 << VIRTIO_BLK_F_SCSI)) == 0) {
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
-        g_free(req);
-        return;
-    }
-
-    /*
-     * We require at least one output segment each for the virtio_blk_outhdr
-     * and the SCSI command block.
-     *
-     * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
-     * and the sense buffer pointer in the input segments.
-     */
-    if (req->elem.out_num < 2 || req->elem.in_num < 3) {
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
-        g_free(req);
-        return;
-    }
-
-    /*
-     * No support for bidirection commands yet.
-     */
-    if (req->elem.out_num > 2 && req->elem.in_num > 3) {
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
-        g_free(req);
-        return;
-    }
-
-    /*
-     * The scsi inhdr is placed in the second-to-last input segment, just
-     * before the regular inhdr.
-     */
-    req->scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
-
-    memset(&hdr, 0, sizeof(struct sg_io_hdr));
-    hdr.interface_id = 'S';
-    hdr.cmd_len = req->elem.out_sg[1].iov_len;
-    hdr.cmdp = req->elem.out_sg[1].iov_base;
-    hdr.dxfer_len = 0;
-
-    if (req->elem.out_num > 2) {
-        /*
-         * If there are more than the minimally required 2 output segments
-         * there is write payload starting from the third iovec.
-         */
-        hdr.dxfer_direction = SG_DXFER_TO_DEV;
-        hdr.iovec_count = req->elem.out_num - 2;
-
-        for (i = 0; i < hdr.iovec_count; i++)
-            hdr.dxfer_len += req->elem.out_sg[i + 2].iov_len;
-
-        hdr.dxferp = req->elem.out_sg + 2;
-
-    } else if (req->elem.in_num > 3) {
-        /*
-         * If we have more than 3 input segments the guest wants to actually
-         * read data.
-         */
-        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
-        hdr.iovec_count = req->elem.in_num - 3;
-        for (i = 0; i < hdr.iovec_count; i++)
-            hdr.dxfer_len += req->elem.in_sg[i].iov_len;
-
-        hdr.dxferp = req->elem.in_sg;
-    } else {
-        /*
-         * Some SCSI commands don't actually transfer any data.
-         */
-        hdr.dxfer_direction = SG_DXFER_NONE;
-    }
-
-    hdr.sbp = req->elem.in_sg[req->elem.in_num - 3].iov_base;
-    hdr.mx_sb_len = req->elem.in_sg[req->elem.in_num - 3].iov_len;
-
-    ret = bdrv_ioctl(req->dev->bs, SG_IO, &hdr);
-    if (ret) {
-        status = VIRTIO_BLK_S_UNSUPP;
-        hdr.status = ret;
-        hdr.resid = hdr.dxfer_len;
-    } else if (hdr.status) {
-        status = VIRTIO_BLK_S_IOERR;
-    } else {
-        status = VIRTIO_BLK_S_OK;
-    }
-
-    /*
-     * From SCSI-Generic-HOWTO: "Some lower level drivers (e.g. ide-scsi)
-     * clear the masked_status field [hence status gets cleared too, see
-     * block/scsi_ioctl.c] even when a CHECK_CONDITION or COMMAND_TERMINATED
-     * status has occurred.  However they do set DRIVER_SENSE in driver_status
-     * field. Also a (sb_len_wr > 0) indicates there is a sense buffer.
-     */
-    if (hdr.status == 0 && hdr.sb_len_wr > 0) {
-        hdr.status = CHECK_CONDITION;
-    }
-
-    stl_p(&req->scsi->errors,
-          hdr.status | (hdr.msg_status << 8) |
-          (hdr.host_status << 16) | (hdr.driver_status << 24));
-    stl_p(&req->scsi->residual, hdr.resid);
-    stl_p(&req->scsi->sense_len, hdr.sb_len_wr);
-    stl_p(&req->scsi->data_len, hdr.dxfer_len);
-
-    virtio_blk_req_complete(req, status);
-    g_free(req);
-}
-#else
-static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
-{
-    virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
-    g_free(req);
-}
-#endif /* __linux__ */
-
-typedef struct MultiReqBuffer {
-    BlockRequest        blkreq[32];
-    unsigned int        num_writes;
-} MultiReqBuffer;
-
-static void virtio_submit_multiwrite(BlockDriverState *bs, MultiReqBuffer *mrb)
-{
-    int i, ret;
-
-    if (!mrb->num_writes) {
-        return;
-    }
-
-    ret = bdrv_aio_multiwrite(bs, mrb->blkreq, mrb->num_writes);
-    if (ret != 0) {
-        for (i = 0; i < mrb->num_writes; i++) {
-            if (mrb->blkreq[i].error) {
-                virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
-            }
-        }
-    }
-
-    mrb->num_writes = 0;
-}
-
-static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
-{
-    bdrv_acct_start(req->dev->bs, &req->acct, 0, BDRV_ACCT_FLUSH);
-
-    /*
-     * Make sure all outstanding writes are posted to the backing device.
-     */
-    virtio_submit_multiwrite(req->dev->bs, mrb);
-    bdrv_aio_flush(req->dev->bs, virtio_blk_flush_complete, req);
-}
-
-static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
-{
-    BlockRequest *blkreq;
-    uint64_t sector;
-
-    sector = ldq_p(&req->out->sector);
-
-    bdrv_acct_start(req->dev->bs, &req->acct, req->qiov.size, BDRV_ACCT_WRITE);
-
-    trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
-
-    if (sector & req->dev->sector_mask) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-    if (req->qiov.size % req->dev->conf->logical_block_size) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-
-    if (mrb->num_writes == 32) {
-        virtio_submit_multiwrite(req->dev->bs, mrb);
-    }
-
-    blkreq = &mrb->blkreq[mrb->num_writes];
-    blkreq->sector = sector;
-    blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
-    blkreq->qiov = &req->qiov;
-    blkreq->cb = virtio_blk_rw_complete;
-    blkreq->opaque = req;
-    blkreq->error = 0;
-
-    mrb->num_writes++;
-}
-
-static void virtio_blk_handle_read(VirtIOBlockReq *req)
-{
-    uint64_t sector;
-
-    sector = ldq_p(&req->out->sector);
-
-    bdrv_acct_start(req->dev->bs, &req->acct, req->qiov.size, BDRV_ACCT_READ);
-
-    trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
-
-    if (sector & req->dev->sector_mask) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-    if (req->qiov.size % req->dev->conf->logical_block_size) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-    bdrv_aio_readv(req->dev->bs, sector, &req->qiov,
-                   req->qiov.size / BDRV_SECTOR_SIZE,
-                   virtio_blk_rw_complete, req);
-}
-
-static void virtio_blk_handle_request(VirtIOBlockReq *req,
-    MultiReqBuffer *mrb)
-{
-    uint32_t type;
-
-    if (req->elem.out_num < 1 || req->elem.in_num < 1) {
-        error_report("virtio-blk missing headers");
-        exit(1);
-    }
-
-    if (req->elem.out_sg[0].iov_len < sizeof(*req->out) ||
-        req->elem.in_sg[req->elem.in_num - 1].iov_len < sizeof(*req->in)) {
-        error_report("virtio-blk header not in correct element");
-        exit(1);
-    }
-
-    req->out = (void *)req->elem.out_sg[0].iov_base;
-    req->in = (void *)req->elem.in_sg[req->elem.in_num - 1].iov_base;
-
-    type = ldl_p(&req->out->type);
-
-    if (type & VIRTIO_BLK_T_FLUSH) {
-        virtio_blk_handle_flush(req, mrb);
-    } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
-        virtio_blk_handle_scsi(req);
-    } else if (type & VIRTIO_BLK_T_GET_ID) {
-        VirtIOBlock *s = req->dev;
-
-        /*
-         * NB: per existing s/n string convention the string is
-         * terminated by '\0' only when shorter than buffer.
-         */
-        strncpy(req->elem.in_sg[0].iov_base,
-                s->serial ? s->serial : "",
-                MIN(req->elem.in_sg[0].iov_len, VIRTIO_BLK_ID_BYTES));
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
-        g_free(req);
-    } else if (type & VIRTIO_BLK_T_OUT) {
-        qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
-                                 req->elem.out_num - 1);
-        virtio_blk_handle_write(req, mrb);
-    } else {
-        qemu_iovec_init_external(&req->qiov, &req->elem.in_sg[0],
-                                 req->elem.in_num - 1);
-        virtio_blk_handle_read(req);
-    }
-}
-
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIOBlock *s = to_virtio_blk(vdev);
-    VirtIOBlockReq *req;
-    MultiReqBuffer mrb = {
-        .num_writes = 0,
-    };
-
-    while ((req = virtio_blk_get_request(s))) {
-        virtio_blk_handle_request(req, &mrb);
-    }
-
-    virtio_submit_multiwrite(s->bs, &mrb);
-
-    /*
-     * FIXME: Want to check for completions before returning to guest mode,
-     * so cached reads and writes are reported as quickly as possible. But
-     * that should be done in the generic block layer.
-     */
-}
-
-static void virtio_blk_dma_restart_bh(void *opaque)
-{
-    VirtIOBlock *s = opaque;
-    VirtIOBlockReq *req = s->rq;
-    MultiReqBuffer mrb = {
-        .num_writes = 0,
-    };
-
-    qemu_bh_delete(s->bh);
-    s->bh = NULL;
-
-    s->rq = NULL;
-
-    while (req) {
-        virtio_blk_handle_request(req, &mrb);
-        req = req->next;
-    }
-
-    virtio_submit_multiwrite(s->bs, &mrb);
-}
-
-static void virtio_blk_dma_restart_cb(void *opaque, int running,
-                                      RunState state)
-{
-    VirtIOBlock *s = opaque;
-
-    if (!running)
-        return;
-
-    if (!s->bh) {
-        s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
-        qemu_bh_schedule(s->bh);
-    }
-}
-
-static void virtio_blk_reset(VirtIODevice *vdev)
-{
-    /*
-     * This should cancel pending requests, but can't do nicely until there
-     * are per-device request lists.
-     */
-    bdrv_drain_all();
+    fprintf(stderr, "virtio_blk_handle_output: should never get here,"
+                    "data plane thread should process requests\n");
+    exit(1);
 }
 
 /* coalesce internal state, copy to pci i/o region 0
@@ -519,61 +85,11 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
     return features;
 }
 
-static void virtio_blk_save(QEMUFile *f, void *opaque)
-{
-    VirtIOBlock *s = opaque;
-    VirtIOBlockReq *req = s->rq;
-
-    virtio_save(&s->vdev, f);
-    
-    while (req) {
-        qemu_put_sbyte(f, 1);
-        qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
-        req = req->next;
-    }
-    qemu_put_sbyte(f, 0);
-}
-
-static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
-{
-    VirtIOBlock *s = opaque;
-
-    if (version_id != 2)
-        return -EINVAL;
-
-    virtio_load(&s->vdev, f);
-    while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = virtio_blk_alloc_request(s);
-        qemu_get_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
-        req->next = s->rq;
-        s->rq = req;
-
-        virtqueue_map_sg(req->elem.in_sg, req->elem.in_addr,
-            req->elem.in_num, 1);
-        virtqueue_map_sg(req->elem.out_sg, req->elem.out_addr,
-            req->elem.out_num, 0);
-    }
-
-    return 0;
-}
-
-static void virtio_blk_resize(void *opaque)
-{
-    VirtIOBlock *s = opaque;
-
-    virtio_notify_config(&s->vdev);
-}
-
-static const BlockDevOps virtio_block_ops = {
-    .resize_cb = virtio_blk_resize,
-};
-
 VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
                               char **serial)
 {
     VirtIOBlock *s;
     int cylinders, heads, secs;
-    static int virtio_blk_id;
     DriveInfo *dinfo;
 
     if (!conf->bs) {
@@ -599,21 +115,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
 
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
-    s->vdev.reset = virtio_blk_reset;
     s->bs = conf->bs;
     s->conf = conf;
     s->serial = *serial;
-    s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
 
-    qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     s->qdev = dev;
-    register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
-                    virtio_blk_save, virtio_blk_load, s);
-    bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
     bdrv_set_buffer_alignment(s->bs, conf->logical_block_size);
 
     bdrv_iostatus_enable(s->bs);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 90+ messages in thread
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-
-    if (mrb->num_writes == 32) {
-        virtio_submit_multiwrite(req->dev->bs, mrb);
-    }
-
-    blkreq = &mrb->blkreq[mrb->num_writes];
-    blkreq->sector = sector;
-    blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
-    blkreq->qiov = &req->qiov;
-    blkreq->cb = virtio_blk_rw_complete;
-    blkreq->opaque = req;
-    blkreq->error = 0;
-
-    mrb->num_writes++;
-}
-
-static void virtio_blk_handle_read(VirtIOBlockReq *req)
-{
-    uint64_t sector;
-
-    sector = ldq_p(&req->out->sector);
-
-    bdrv_acct_start(req->dev->bs, &req->acct, req->qiov.size, BDRV_ACCT_READ);
-
-    trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
-
-    if (sector & req->dev->sector_mask) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-    if (req->qiov.size % req->dev->conf->logical_block_size) {
-        virtio_blk_rw_complete(req, -EIO);
-        return;
-    }
-    bdrv_aio_readv(req->dev->bs, sector, &req->qiov,
-                   req->qiov.size / BDRV_SECTOR_SIZE,
-                   virtio_blk_rw_complete, req);
-}
-
-static void virtio_blk_handle_request(VirtIOBlockReq *req,
-    MultiReqBuffer *mrb)
-{
-    uint32_t type;
-
-    if (req->elem.out_num < 1 || req->elem.in_num < 1) {
-        error_report("virtio-blk missing headers");
-        exit(1);
-    }
-
-    if (req->elem.out_sg[0].iov_len < sizeof(*req->out) ||
-        req->elem.in_sg[req->elem.in_num - 1].iov_len < sizeof(*req->in)) {
-        error_report("virtio-blk header not in correct element");
-        exit(1);
-    }
-
-    req->out = (void *)req->elem.out_sg[0].iov_base;
-    req->in = (void *)req->elem.in_sg[req->elem.in_num - 1].iov_base;
-
-    type = ldl_p(&req->out->type);
-
-    if (type & VIRTIO_BLK_T_FLUSH) {
-        virtio_blk_handle_flush(req, mrb);
-    } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
-        virtio_blk_handle_scsi(req);
-    } else if (type & VIRTIO_BLK_T_GET_ID) {
-        VirtIOBlock *s = req->dev;
-
-        /*
-         * NB: per existing s/n string convention the string is
-         * terminated by '\0' only when shorter than buffer.
-         */
-        strncpy(req->elem.in_sg[0].iov_base,
-                s->serial ? s->serial : "",
-                MIN(req->elem.in_sg[0].iov_len, VIRTIO_BLK_ID_BYTES));
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
-        g_free(req);
-    } else if (type & VIRTIO_BLK_T_OUT) {
-        qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
-                                 req->elem.out_num - 1);
-        virtio_blk_handle_write(req, mrb);
-    } else {
-        qemu_iovec_init_external(&req->qiov, &req->elem.in_sg[0],
-                                 req->elem.in_num - 1);
-        virtio_blk_handle_read(req);
-    }
-}
-
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIOBlock *s = to_virtio_blk(vdev);
-    VirtIOBlockReq *req;
-    MultiReqBuffer mrb = {
-        .num_writes = 0,
-    };
-
-    while ((req = virtio_blk_get_request(s))) {
-        virtio_blk_handle_request(req, &mrb);
-    }
-
-    virtio_submit_multiwrite(s->bs, &mrb);
-
-    /*
-     * FIXME: Want to check for completions before returning to guest mode,
-     * so cached reads and writes are reported as quickly as possible. But
-     * that should be done in the generic block layer.
-     */
-}
-
-static void virtio_blk_dma_restart_bh(void *opaque)
-{
-    VirtIOBlock *s = opaque;
-    VirtIOBlockReq *req = s->rq;
-    MultiReqBuffer mrb = {
-        .num_writes = 0,
-    };
-
-    qemu_bh_delete(s->bh);
-    s->bh = NULL;
-
-    s->rq = NULL;
-
-    while (req) {
-        virtio_blk_handle_request(req, &mrb);
-        req = req->next;
-    }
-
-    virtio_submit_multiwrite(s->bs, &mrb);
-}
-
-static void virtio_blk_dma_restart_cb(void *opaque, int running,
-                                      RunState state)
-{
-    VirtIOBlock *s = opaque;
-
-    if (!running)
-        return;
-
-    if (!s->bh) {
-        s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
-        qemu_bh_schedule(s->bh);
-    }
-}
-
-static void virtio_blk_reset(VirtIODevice *vdev)
-{
-    /*
-     * This should cancel pending requests, but can't do nicely until there
-     * are per-device request lists.
-     */
-    bdrv_drain_all();
+    fprintf(stderr, "virtio_blk_handle_output: should never get here,"
+                    "data plane thread should process requests\n");
+    exit(1);
 }
 
 /* coalesce internal state, copy to pci i/o region 0
@@ -519,61 +85,11 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
     return features;
 }
 
-static void virtio_blk_save(QEMUFile *f, void *opaque)
-{
-    VirtIOBlock *s = opaque;
-    VirtIOBlockReq *req = s->rq;
-
-    virtio_save(&s->vdev, f);
-    
-    while (req) {
-        qemu_put_sbyte(f, 1);
-        qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
-        req = req->next;
-    }
-    qemu_put_sbyte(f, 0);
-}
-
-static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
-{
-    VirtIOBlock *s = opaque;
-
-    if (version_id != 2)
-        return -EINVAL;
-
-    virtio_load(&s->vdev, f);
-    while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = virtio_blk_alloc_request(s);
-        qemu_get_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
-        req->next = s->rq;
-        s->rq = req;
-
-        virtqueue_map_sg(req->elem.in_sg, req->elem.in_addr,
-            req->elem.in_num, 1);
-        virtqueue_map_sg(req->elem.out_sg, req->elem.out_addr,
-            req->elem.out_num, 0);
-    }
-
-    return 0;
-}
-
-static void virtio_blk_resize(void *opaque)
-{
-    VirtIOBlock *s = opaque;
-
-    virtio_notify_config(&s->vdev);
-}
-
-static const BlockDevOps virtio_block_ops = {
-    .resize_cb = virtio_blk_resize,
-};
-
 VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
                               char **serial)
 {
     VirtIOBlock *s;
     int cylinders, heads, secs;
-    static int virtio_blk_id;
     DriveInfo *dinfo;
 
     if (!conf->bs) {
@@ -599,21 +115,15 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
 
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
-    s->vdev.reset = virtio_blk_reset;
     s->bs = conf->bs;
     s->conf = conf;
     s->serial = *serial;
-    s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
 
-    qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     s->qdev = dev;
-    register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
-                    virtio_blk_save, virtio_blk_load, s);
-    bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
     bdrv_set_buffer_alignment(s->bs, conf->logical_block_size);
 
     bdrv_iostatus_enable(s->bs);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [RFC v9 02/27] virtio-blk: Set up host notifier for data plane
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Set up the virtqueue notify ioeventfd that the data plane will monitor.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index a627427..0389294 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -26,6 +26,8 @@ typedef struct VirtIOBlock
     char *serial;
     unsigned short sector_mask;
     DeviceState *qdev;
+
+    bool data_plane_started;
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -33,6 +35,39 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
+static void virtio_blk_data_plane_start(VirtIOBlock *s)
+{
+    if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set host notifier\n");
+        return;
+    }
+
+    s->data_plane_started = true;
+}
+
+static void virtio_blk_data_plane_stop(VirtIOBlock *s)
+{
+    s->data_plane_started = false;
+
+    s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
+}
+
+static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
+{
+    VirtIOBlock *s = to_virtio_blk(vdev);
+
+    /* Toggle host notifier only on status change */
+    if (s->data_plane_started == !!(val & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        return;
+    }
+
+    if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
+        virtio_blk_data_plane_start(s);
+    } else {
+        virtio_blk_data_plane_stop(s);
+    }
+}
+
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     fprintf(stderr, "virtio_blk_handle_output: should never get here,"
@@ -115,6 +150,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
 
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
+    s->vdev.set_status = virtio_blk_set_status;
     s->bs = conf->bs;
     s->conf = conf;
     s->serial = *serial;
@@ -122,6 +158,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
+    s->data_plane_started = false;
 
     s->qdev = dev;
     bdrv_set_buffer_alignment(s->bs, conf->logical_block_size);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread


* [RFC v9 03/27] virtio-blk: Data plane thread event loop
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Add a simple event handling loop based on epoll(2).  The data plane
thread now receives virtqueue notify and Linux AIO completion events.

The data plane thread currently does not shut down.  Either it needs to
be a detached thread or have clean shutdown support.

Most of the data plane start/stop code can be done once on virtio-blk
init/cleanup instead of each time the virtio device is brought up/down
by the driver.  Only the vring address and the notify pio address
change.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |  125 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 116 insertions(+), 9 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 0389294..f6043bc 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -11,12 +11,25 @@
  *
  */
 
+#include <sys/epoll.h>
+#include <sys/eventfd.h>
+#include <libaio.h>
 #include "qemu-common.h"
+#include "qemu-thread.h"
 #include "qemu-error.h"
-#include "trace.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
 
+enum {
+    SEG_MAX = 126, /* maximum number of I/O segments */
+};
+
+typedef struct
+{
+    EventNotifier *notifier;        /* eventfd */
+    void (*handler)(void);          /* handler function */
+} EventHandler;
+
 typedef struct VirtIOBlock
 {
     VirtIODevice vdev;
@@ -28,6 +41,13 @@ typedef struct VirtIOBlock
     DeviceState *qdev;
 
     bool data_plane_started;
+    QemuThread data_plane_thread;
+
+    int epoll_fd;                   /* epoll(2) file descriptor */
+    io_context_t io_ctx;            /* Linux AIO context */
+    EventNotifier io_notifier;      /* Linux AIO eventfd */
+    EventHandler io_handler;        /* Linux AIO completion handler */
+    EventHandler notify_handler;    /* virtqueue notify handler */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -35,21 +55,108 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-static void virtio_blk_data_plane_start(VirtIOBlock *s)
+static void handle_io(void)
+{
+    fprintf(stderr, "io completion happened\n");
+}
+
+static void handle_notify(void)
+{
+    fprintf(stderr, "virtqueue notify happened\n");
+}
+
+static void *data_plane_thread(void *opaque)
 {
+    VirtIOBlock *s = opaque;
+    struct epoll_event event;
+    int nevents;
+    EventHandler *event_handler;
+
+    /* Signals are masked, EINTR should never happen */
+
+    for (;;) {
+        /* Wait for the next event.  Only do one event per call to keep the
+         * function simple, this could be changed later. */
+        nevents = epoll_wait(s->epoll_fd, &event, 1, -1);
+        if (unlikely(nevents != 1)) {
+            fprintf(stderr, "epoll_wait failed: %m\n");
+            continue; /* should never happen */
+        }
+
+        /* Find out which event handler has become active */
+        event_handler = event.data.ptr;
+
+        /* Clear the eventfd */
+        event_notifier_test_and_clear(event_handler->notifier);
+
+        /* Handle the event */
+        event_handler->handler();
+    }
+    return NULL;
+}
+
+static void add_event_handler(int epoll_fd, EventHandler *event_handler)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = event_handler,
+    };
+    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(event_handler->notifier), &event) != 0) {
+        fprintf(stderr, "virtio-blk failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+static void data_plane_start(VirtIOBlock *s)
+{
+    /* Create epoll file descriptor */
+    s->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+    if (s->epoll_fd < 0) {
+        fprintf(stderr, "epoll_create1 failed: %m\n");
+        return; /* TODO error handling */
+    }
+
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set host notifier\n");
-        return;
+        return; /* TODO error handling */
+    }
+
+    s->notify_handler.notifier = virtio_queue_get_host_notifier(s->vq),
+    s->notify_handler.handler = handle_notify;
+    add_event_handler(s->epoll_fd, &s->notify_handler);
+
+    /* Create aio context */
+    if (io_setup(SEG_MAX, &s->io_ctx) != 0) {
+        fprintf(stderr, "virtio-blk io_setup failed\n");
+        return; /* TODO error handling */
     }
 
+    if (event_notifier_init(&s->io_notifier, 0) != 0) {
+        fprintf(stderr, "virtio-blk io event notifier creation failed\n");
+        return; /* TODO error handling */
+    }
+
+    s->io_handler.notifier = &s->io_notifier;
+    s->io_handler.handler = handle_io;
+    add_event_handler(s->epoll_fd, &s->io_handler);
+
+    qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
+
     s->data_plane_started = true;
 }
 
-static void virtio_blk_data_plane_stop(VirtIOBlock *s)
+static void data_plane_stop(VirtIOBlock *s)
 {
     s->data_plane_started = false;
 
+    /* TODO stop data plane thread */
+
+    event_notifier_cleanup(&s->io_notifier);
+    io_destroy(s->io_ctx);
+
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
+
+    close(s->epoll_fd);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
@@ -62,15 +169,15 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
     }
 
     if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
-        virtio_blk_data_plane_start(s);
+        data_plane_start(s);
     } else {
-        virtio_blk_data_plane_stop(s);
+        data_plane_stop(s);
     }
 }
 
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
-    fprintf(stderr, "virtio_blk_handle_output: should never get here,"
+    fprintf(stderr, "virtio_blk_handle_output: should never get here, "
                     "data plane thread should process requests\n");
     exit(1);
 }
@@ -89,7 +196,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
     memset(&blkcfg, 0, sizeof(blkcfg));
     stq_raw(&blkcfg.capacity, capacity);
-    stl_raw(&blkcfg.seg_max, 128 - 2);
+    stl_raw(&blkcfg.seg_max, SEG_MAX);
     stw_raw(&blkcfg.cylinders, cylinders);
     stl_raw(&blkcfg.blk_size, blk_size);
     stw_raw(&blkcfg.min_io_size, s->conf->min_io_size / blk_size);
@@ -157,7 +264,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
-    s->vq = virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output);
+    s->vq = virtio_add_queue(&s->vdev, SEG_MAX + 2, virtio_blk_handle_output);
     s->data_plane_started = false;
 
     s->qdev = dev;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread


* [RFC v9 04/27] virtio-blk: Map vring
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Map the vring to host memory so it can be accessed without the overhead
of the QEMU memory functions.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index f6043bc..4c790a3 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -14,6 +14,7 @@
 #include <sys/epoll.h>
 #include <sys/eventfd.h>
 #include <libaio.h>
+#include <linux/virtio_ring.h>
 #include "qemu-common.h"
 #include "qemu-thread.h"
 #include "qemu-error.h"
@@ -43,6 +44,8 @@ typedef struct VirtIOBlock
     bool data_plane_started;
     QemuThread data_plane_thread;
 
+    struct vring vring;
+
     int epoll_fd;                   /* epoll(2) file descriptor */
     io_context_t io_ctx;            /* Linux AIO context */
     EventNotifier io_notifier;      /* Linux AIO eventfd */
@@ -55,6 +58,43 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
+/* Map the guest's vring to host memory
+ *
+ * This is not allowed but we know the ring won't move.
+ */
+static void map_vring(struct vring *vring, VirtIODevice *vdev, int n)
+{
+    target_phys_addr_t physaddr, len;
+
+    vring->num = virtio_queue_get_num(vdev, n);
+
+    physaddr = virtio_queue_get_desc_addr(vdev, n);
+    len = virtio_queue_get_desc_size(vdev, n);
+    vring->desc = cpu_physical_memory_map(physaddr, &len, 0);
+
+    physaddr = virtio_queue_get_avail_addr(vdev, n);
+    len = virtio_queue_get_avail_size(vdev, n);
+    vring->avail = cpu_physical_memory_map(physaddr, &len, 0);
+
+    physaddr = virtio_queue_get_used_addr(vdev, n);
+    len = virtio_queue_get_used_size(vdev, n);
+    vring->used = cpu_physical_memory_map(physaddr, &len, 0);
+
+    if (!vring->desc || !vring->avail || !vring->used) {
+        fprintf(stderr, "virtio-blk failed to map vring\n");
+        exit(1);
+    }
+
+    fprintf(stderr, "virtio-blk vring physical=%#lx desc=%p avail=%p used=%p\n",
+            virtio_queue_get_ring_addr(vdev, n),
+            vring->desc, vring->avail, vring->used);
+}
+
+static void unmap_vring(struct vring *vring, VirtIODevice *vdev, int n)
+{
+    cpu_physical_memory_unmap(vring->desc, virtio_queue_get_ring_size(vdev, n), 0, 0);
+}
+
 static void handle_io(void)
 {
     fprintf(stderr, "io completion happened\n");
@@ -109,6 +149,8 @@ static void add_event_handler(int epoll_fd, EventHandler *event_handler)
 
 static void data_plane_start(VirtIOBlock *s)
 {
+    map_vring(&s->vring, &s->vdev, 0);
+
     /* Create epoll file descriptor */
     s->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
     if (s->epoll_fd < 0) {
@@ -157,6 +199,8 @@ static void data_plane_stop(VirtIOBlock *s)
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
     close(s->epoll_fd);
+
+    unmap_vring(&s->vring, &s->vdev, 0);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
-- 
1.7.10.4




* [RFC v9 05/27] virtio-blk: Do cheapest possible memory mapping
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Instead of using QEMU memory access functions, grab the host address of
guest physical address zero and simply add each guest physical address
to this base address.

This not only simplifies vring mapping but will also make virtqueue
element access cheap by avoiding QEMU memory access functions in the I/O
code path.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   58 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 28 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 4c790a3..abd9386 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -51,6 +51,8 @@ typedef struct VirtIOBlock
     EventNotifier io_notifier;      /* Linux AIO eventfd */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
+
+    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -58,43 +60,44 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
+/* Map target physical address to host address
+ */
+static inline void *phys_to_host(VirtIOBlock *s, target_phys_addr_t phys)
+{
+    return s->phys_mem_zero_host_ptr + phys;
+}
+
+/* Setup for cheap target physical to host address conversion
+ *
+ * This is a hack for direct access to guest memory, we're not really allowed
+ * to do this.
+ */
+static void setup_phys_to_host(VirtIOBlock *s)
+{
+    target_phys_addr_t len = 4096; /* RAM is really much larger but we cheat */
+    s->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
+    if (!s->phys_mem_zero_host_ptr) {
+        fprintf(stderr, "setup_phys_to_host failed\n");
+        exit(1);
+    }
+}
+
 /* Map the guest's vring to host memory
  *
  * This is not allowed but we know the ring won't move.
  */
-static void map_vring(struct vring *vring, VirtIODevice *vdev, int n)
+static void map_vring(struct vring *vring, VirtIOBlock *s, VirtIODevice *vdev, int n)
 {
-    target_phys_addr_t physaddr, len;
-
     vring->num = virtio_queue_get_num(vdev, n);
-
-    physaddr = virtio_queue_get_desc_addr(vdev, n);
-    len = virtio_queue_get_desc_size(vdev, n);
-    vring->desc = cpu_physical_memory_map(physaddr, &len, 0);
-
-    physaddr = virtio_queue_get_avail_addr(vdev, n);
-    len = virtio_queue_get_avail_size(vdev, n);
-    vring->avail = cpu_physical_memory_map(physaddr, &len, 0);
-
-    physaddr = virtio_queue_get_used_addr(vdev, n);
-    len = virtio_queue_get_used_size(vdev, n);
-    vring->used = cpu_physical_memory_map(physaddr, &len, 0);
-
-    if (!vring->desc || !vring->avail || !vring->used) {
-        fprintf(stderr, "virtio-blk failed to map vring\n");
-        exit(1);
-    }
+    vring->desc = phys_to_host(s, virtio_queue_get_desc_addr(vdev, n));
+    vring->avail = phys_to_host(s, virtio_queue_get_avail_addr(vdev, n));
+    vring->used = phys_to_host(s, virtio_queue_get_used_addr(vdev, n));
 
     fprintf(stderr, "virtio-blk vring physical=%#lx desc=%p avail=%p used=%p\n",
             virtio_queue_get_ring_addr(vdev, n),
             vring->desc, vring->avail, vring->used);
 }
 
-static void unmap_vring(struct vring *vring, VirtIODevice *vdev, int n)
-{
-    cpu_physical_memory_unmap(vring->desc, virtio_queue_get_ring_size(vdev, n), 0, 0);
-}
-
 static void handle_io(void)
 {
     fprintf(stderr, "io completion happened\n");
@@ -149,7 +152,8 @@ static void add_event_handler(int epoll_fd, EventHandler *event_handler)
 
 static void data_plane_start(VirtIOBlock *s)
 {
-    map_vring(&s->vring, &s->vdev, 0);
+    setup_phys_to_host(s);
+    map_vring(&s->vring, s, &s->vdev, 0);
 
     /* Create epoll file descriptor */
     s->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
@@ -199,8 +203,6 @@ static void data_plane_stop(VirtIOBlock *s)
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
     close(s->epoll_fd);
-
-    unmap_vring(&s->vring, &s->vdev, 0);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
-- 
1.7.10.4




* [RFC v9 06/27] virtio-blk: Take PCI memory range into account
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Support guest physical memory accesses above 4 GB by adjusting for the
PCI memory range below 4 GB, which does not back guest RAM.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index abd9386..99654f1 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -64,6 +64,13 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
  */
 static inline void *phys_to_host(VirtIOBlock *s, target_phys_addr_t phys)
 {
+    /* Adjust for 3.6-4 GB PCI memory range */
+    if (phys >= 0x100000000) {
+        phys -= 0x100000000 - 0xe0000000;
+    } else if (phys >= 0xe0000000) {
+        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
+        exit(1);
+    }
     return s->phys_mem_zero_host_ptr + phys;
 }
 
-- 
1.7.10.4




* [RFC v9 07/27] virtio-blk: Put dataplane code into its own directory
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/event-poll.h |   79 +++++++++++++++++++
 hw/dataplane/vring.h      |  191 +++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio-blk.c           |  149 ++++++++---------------------------
 3 files changed, 304 insertions(+), 115 deletions(-)
 create mode 100644 hw/dataplane/event-poll.h
 create mode 100644 hw/dataplane/vring.h

diff --git a/hw/dataplane/event-poll.h b/hw/dataplane/event-poll.h
new file mode 100644
index 0000000..f38e969
--- /dev/null
+++ b/hw/dataplane/event-poll.h
@@ -0,0 +1,79 @@
+#ifndef EVENT_POLL_H
+#define EVENT_POLL_H
+
+#include <sys/epoll.h>
+#include "event_notifier.h"
+
+typedef struct EventHandler EventHandler;
+typedef void EventCallback(EventHandler *handler);
+struct EventHandler
+{
+    EventNotifier *notifier;    /* eventfd */
+    EventCallback *callback;    /* callback function */
+};
+
+typedef struct {
+    int epoll_fd;               /* epoll(2) file descriptor */
+} EventPoll;
+
+static void event_poll_init(EventPoll *poll)
+{
+    /* Create epoll file descriptor */
+    poll->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+    if (poll->epoll_fd < 0) {
+        fprintf(stderr, "epoll_create1 failed: %m\n");
+        exit(1);
+    }
+}
+
+static void event_poll_cleanup(EventPoll *poll)
+{
+    close(poll->epoll_fd);
+    poll->epoll_fd = -1;
+}
+
+/* Add an event notifier and its callback for polling */
+static void event_poll_add(EventPoll *poll, EventHandler *handler, EventNotifier *notifier, EventCallback *callback)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = handler,
+    };
+    handler->notifier = notifier;
+    handler->callback = callback;
+    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(notifier), &event) != 0) {
+        fprintf(stderr, "failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+/* Block until the next event and invoke its callback
+ *
+ * Signals must be masked, EINTR should never happen.  This is true for QEMU
+ * threads.
+ */
+static void event_poll(EventPoll *poll)
+{
+    EventHandler *handler;
+    struct epoll_event event;
+    int nevents;
+
+    /* Wait for the next event.  Only do one event per call to keep the
+     * function simple, this could be changed later. */
+    nevents = epoll_wait(poll->epoll_fd, &event, 1, -1);
+    if (unlikely(nevents != 1)) {
+        fprintf(stderr, "epoll_wait failed: %m\n");
+        exit(1); /* should never happen */
+    }
+
+    /* Find out which event handler has become active */
+    handler = event.data.ptr;
+
+    /* Clear the eventfd */
+    event_notifier_test_and_clear(handler->notifier);
+
+    /* Handle the event */
+    handler->callback(handler);
+}
+
+#endif /* EVENT_POLL_H */
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
new file mode 100644
index 0000000..7099a99
--- /dev/null
+++ b/hw/dataplane/vring.h
@@ -0,0 +1,191 @@
+#ifndef VRING_H
+#define VRING_H
+
+#include <linux/virtio_ring.h>
+#include "qemu-common.h"
+
+typedef struct {
+    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
+    struct vring vr;                /* virtqueue vring mapped to host memory */
+    __u16 last_avail_idx;           /* last processed avail ring index */
+    __u16 last_used_idx;            /* last processed used ring index */
+} Vring;
+
+static inline unsigned int vring_get_num(Vring *vring)
+{
+    return vring->vr.num;
+}
+
+/* Map target physical address to host address
+ */
+static inline void *phys_to_host(Vring *vring, target_phys_addr_t phys)
+{
+    /* Adjust for 3.6-4 GB PCI memory range */
+    if (phys >= 0x100000000) {
+        phys -= 0x100000000 - 0xe0000000;
+    } else if (phys >= 0xe0000000) {
+        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
+        exit(1);
+    }
+    return vring->phys_mem_zero_host_ptr + phys;
+}
+
+/* Setup for cheap target physical to host address conversion
+ *
+ * This is a hack for direct access to guest memory, we're not really allowed
+ * to do this.
+ */
+static void setup_phys_to_host(Vring *vring)
+{
+    target_phys_addr_t len = 4096; /* RAM is really much larger but we cheat */
+    vring->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
+    if (!vring->phys_mem_zero_host_ptr) {
+        fprintf(stderr, "setup_phys_to_host failed\n");
+        exit(1);
+    }
+}
+
+/* Map the guest's vring to host memory
+ *
+ * This is not allowed but we know the ring won't move.
+ */
+static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
+{
+    setup_phys_to_host(vring);
+
+    vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
+               phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
+
+    vring->last_avail_idx = vring->vr.avail->idx;
+    vring->last_used_idx = vring->vr.used->idx;
+
+    fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
+            virtio_queue_get_ring_addr(vdev, n),
+            vring->vr.desc, vring->vr.avail, vring->vr.used);
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+static unsigned int vring_pop(Vring *vring,
+		      struct iovec iov[], unsigned int iov_size,
+		      unsigned int *out_num, unsigned int *in_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head, found = 0, num = vring->vr.num;
+    __u16 avail_idx, last_avail_idx;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vring->last_avail_idx;
+    avail_idx = vring->vr.avail->idx;
+
+	if (unlikely((__u16)(avail_idx - last_avail_idx) > num)) {
+		fprintf(stderr, "Guest moved used index from %u to %u\n",
+		        last_avail_idx, avail_idx);
+		exit(1);
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (avail_idx == last_avail_idx)
+		return num;
+
+	/* Only get avail ring entries after they have been exposed by guest. */
+	__sync_synchronize(); /* smp_rmb() */
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	head = vring->vr.avail->ring[last_avail_idx % num];
+
+	/* If their number is silly, that's an error. */
+	if (unlikely(head >= num)) {
+		fprintf(stderr, "Guest says index %u > %u is available\n",
+		        head, num);
+		exit(1);
+	}
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+
+	i = head;
+	do {
+		if (unlikely(i >= num)) {
+			fprintf(stderr, "Desc index is %u > %u, head = %u\n",
+			        i, num, head);
+            exit(1);
+		}
+		if (unlikely(++found > num)) {
+			fprintf(stderr, "Loop detected: last one at %u "
+			        "vq size %u head %u\n",
+			        i, num, head);
+            exit(1);
+		}
+        desc = vring->vr.desc[i];
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+/*			ret = get_indirect(dev, vq, iov, iov_size,
+					   out_num, in_num,
+					   log, log_num, &desc);
+			if (unlikely(ret < 0)) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return ret;
+			}
+			continue; */
+            fprintf(stderr, "virtio-blk indirect vring not supported\n");
+            exit(1);
+		}
+
+        iov->iov_base = phys_to_host(vring, desc.addr);
+        iov->iov_len  = desc.len;
+        iov++;
+
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += 1;
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (unlikely(*in_num)) {
+				fprintf(stderr, "Descriptor has out after in: "
+				        "idx %d\n", i);
+                exit(1);
+			}
+			*out_num += 1;
+		}
+        i = desc.next;
+	} while (desc.flags & VRING_DESC_F_NEXT);
+
+	/* On success, increment avail index. */
+	vring->last_avail_idx++;
+	return head;
+}
+
+/* After we've used one of their buffers, we tell them about it.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+static __attribute__((unused)) void vring_push(Vring *vring, unsigned int head, int len)
+{
+	struct vring_used_elem *used;
+
+	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
+	 * next entry in that used ring. */
+	used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
+    used->id = head;
+    used->len = len;
+
+	/* Make sure buffer is written before we update index. */
+	__sync_synchronize(); /* smp_wmb() */
+
+    vring->vr.used->idx = ++vring->last_used_idx;
+}
+
+#endif /* VRING_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 99654f1..2c1cce8 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -11,26 +11,21 @@
  *
  */
 
-#include <sys/epoll.h>
-#include <sys/eventfd.h>
 #include <libaio.h>
-#include <linux/virtio_ring.h>
 #include "qemu-common.h"
 #include "qemu-thread.h"
 #include "qemu-error.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
+#include "hw/dataplane/event-poll.h"
+#include "hw/dataplane/vring.h"
+#include "kvm.h"
 
 enum {
-    SEG_MAX = 126, /* maximum number of I/O segments */
+    SEG_MAX = 126,                  /* maximum number of I/O segments */
+    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
 };
 
-typedef struct
-{
-    EventNotifier *notifier;        /* eventfd */
-    void (*handler)(void);          /* handler function */
-} EventHandler;
-
 typedef struct VirtIOBlock
 {
     VirtIODevice vdev;
@@ -44,15 +39,13 @@ typedef struct VirtIOBlock
     bool data_plane_started;
     QemuThread data_plane_thread;
 
-    struct vring vring;
+    Vring vring;                    /* virtqueue vring */
 
-    int epoll_fd;                   /* epoll(2) file descriptor */
+    EventPoll event_poll;           /* event poller */
     io_context_t io_ctx;            /* Linux AIO context */
     EventNotifier io_notifier;      /* Linux AIO eventfd */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
-
-    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -60,138 +53,64 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-/* Map target physical address to host address
- */
-static inline void *phys_to_host(VirtIOBlock *s, target_phys_addr_t phys)
+static void handle_io(EventHandler *handler)
 {
-    /* Adjust for 3.6-4 GB PCI memory range */
-    if (phys >= 0x100000000) {
-        phys -= 0x100000000 - 0xe0000000;
-    } else if (phys >= 0xe0000000) {
-        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
-        exit(1);
-    }
-    return s->phys_mem_zero_host_ptr + phys;
+    fprintf(stderr, "io completion happened\n");
 }
 
-/* Setup for cheap target physical to host address conversion
- *
- * This is a hack for direct access to guest memory, we're not really allowed
- * to do this.
- */
-static void setup_phys_to_host(VirtIOBlock *s)
+static void handle_notify(EventHandler *handler)
 {
-    target_phys_addr_t len = 4096; /* RAM is really much larger but we cheat */
-    s->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
-    if (!s->phys_mem_zero_host_ptr) {
-        fprintf(stderr, "setup_phys_to_host failed\n");
-        exit(1);
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, notify_handler);
+    struct iovec iov[VRING_MAX];
+    unsigned int out_num, in_num;
+    int head;
+
+    head = vring_pop(&s->vring, iov, ARRAY_SIZE(iov), &out_num, &in_num);
+    if (unlikely(head >= vring_get_num(&s->vring))) {
+        fprintf(stderr, "false alarm, nothing on vring\n");
+        return;
     }
-}
 
-/* Map the guest's vring to host memory
- *
- * This is not allowed but we know the ring won't move.
- */
-static void map_vring(struct vring *vring, VirtIOBlock *s, VirtIODevice *vdev, int n)
-{
-    vring->num = virtio_queue_get_num(vdev, n);
-    vring->desc = phys_to_host(s, virtio_queue_get_desc_addr(vdev, n));
-    vring->avail = phys_to_host(s, virtio_queue_get_avail_addr(vdev, n));
-    vring->used = phys_to_host(s, virtio_queue_get_used_addr(vdev, n));
-
-    fprintf(stderr, "virtio-blk vring physical=%#lx desc=%p avail=%p used=%p\n",
-            virtio_queue_get_ring_addr(vdev, n),
-            vring->desc, vring->avail, vring->used);
-}
-
-static void handle_io(void)
-{
-    fprintf(stderr, "io completion happened\n");
-}
-
-static void handle_notify(void)
-{
-    fprintf(stderr, "virtqueue notify happened\n");
+    fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
 }
 
 static void *data_plane_thread(void *opaque)
 {
     VirtIOBlock *s = opaque;
-    struct epoll_event event;
-    int nevents;
-    EventHandler *event_handler;
-
-    /* Signals are masked, EINTR should never happen */
 
     for (;;) {
-        /* Wait for the next event.  Only do one event per call to keep the
-         * function simple, this could be changed later. */
-        nevents = epoll_wait(s->epoll_fd, &event, 1, -1);
-        if (unlikely(nevents != 1)) {
-            fprintf(stderr, "epoll_wait failed: %m\n");
-            continue; /* should never happen */
-        }
-
-        /* Find out which event handler has become active */
-        event_handler = event.data.ptr;
-
-        /* Clear the eventfd */
-        event_notifier_test_and_clear(event_handler->notifier);
-
-        /* Handle the event */
-        event_handler->handler();
+        event_poll(&s->event_poll);
     }
     return NULL;
 }
 
-static void add_event_handler(int epoll_fd, EventHandler *event_handler)
-{
-    struct epoll_event event = {
-        .events = EPOLLIN,
-        .data.ptr = event_handler,
-    };
-    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(event_handler->notifier), &event) != 0) {
-        fprintf(stderr, "virtio-blk failed to add event handler to epoll: %m\n");
-        exit(1);
-    }
-}
-
 static void data_plane_start(VirtIOBlock *s)
 {
-    setup_phys_to_host(s);
-    map_vring(&s->vring, s, &s->vdev, 0);
-
-    /* Create epoll file descriptor */
-    s->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
-    if (s->epoll_fd < 0) {
-        fprintf(stderr, "epoll_create1 failed: %m\n");
-        return; /* TODO error handling */
-    }
+    vring_setup(&s->vring, &s->vdev, 0);
+
+    event_poll_init(&s->event_poll);
 
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier\n");
-        return; /* TODO error handling */
+        fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
+        exit(1);
     }
 
-    s->notify_handler.notifier = virtio_queue_get_host_notifier(s->vq),
-    s->notify_handler.handler = handle_notify;
-    add_event_handler(s->epoll_fd, &s->notify_handler);
+    event_poll_add(&s->event_poll, &s->notify_handler,
+                   virtio_queue_get_host_notifier(s->vq),
+                   handle_notify);
 
     /* Create aio context */
     if (io_setup(SEG_MAX, &s->io_ctx) != 0) {
         fprintf(stderr, "virtio-blk io_setup failed\n");
-        return; /* TODO error handling */
+        exit(1);
     }
 
     if (event_notifier_init(&s->io_notifier, 0) != 0) {
         fprintf(stderr, "virtio-blk io event notifier creation failed\n");
-        return; /* TODO error handling */
+        exit(1);
     }
 
-    s->io_handler.notifier = &s->io_notifier;
-    s->io_handler.handler = handle_io;
-    add_event_handler(s->epoll_fd, &s->io_handler);
+    event_poll_add(&s->event_poll, &s->io_handler, &s->io_notifier, handle_io);
 
     qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
 
@@ -209,7 +128,7 @@ static void data_plane_stop(VirtIOBlock *s)
 
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
-    close(s->epoll_fd);
+    event_poll_cleanup(&s->event_poll);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
@@ -317,7 +236,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
-    s->vq = virtio_add_queue(&s->vdev, SEG_MAX + 2, virtio_blk_handle_output);
+    s->vq = virtio_add_queue(&s->vdev, VRING_MAX, virtio_blk_handle_output);
     s->data_plane_started = false;
 
     s->qdev = dev;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [Qemu-devel] [RFC v9 07/27] virtio-blk: Put dataplane code into its own directory
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  0 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, Khoa Huynh, Paolo Bonzini, Asias He

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/event-poll.h |   79 +++++++++++++++++++
 hw/dataplane/vring.h      |  191 +++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio-blk.c           |  149 ++++++++---------------------------
 3 files changed, 304 insertions(+), 115 deletions(-)
 create mode 100644 hw/dataplane/event-poll.h
 create mode 100644 hw/dataplane/vring.h

diff --git a/hw/dataplane/event-poll.h b/hw/dataplane/event-poll.h
new file mode 100644
index 0000000..f38e969
--- /dev/null
+++ b/hw/dataplane/event-poll.h
@@ -0,0 +1,79 @@
+#ifndef EVENT_POLL_H
+#define EVENT_POLL_H
+
+#include <sys/epoll.h>
+#include "event_notifier.h"
+
+typedef struct EventHandler EventHandler;
+typedef void EventCallback(EventHandler *handler);
+struct EventHandler
+{
+    EventNotifier *notifier;    /* eventfd */
+    EventCallback *callback;    /* callback function */
+};
+
+typedef struct {
+    int epoll_fd;               /* epoll(2) file descriptor */
+} EventPoll;
+
+static void event_poll_init(EventPoll *poll)
+{
+    /* Create epoll file descriptor */
+    poll->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+    if (poll->epoll_fd < 0) {
+        fprintf(stderr, "epoll_create1 failed: %m\n");
+        exit(1);
+    }
+}
+
+static void event_poll_cleanup(EventPoll *poll)
+{
+    close(poll->epoll_fd);
+    poll->epoll_fd = -1;
+}
+
+/* Add an event notifier and its callback for polling */
+static void event_poll_add(EventPoll *poll, EventHandler *handler, EventNotifier *notifier, EventCallback *callback)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = handler,
+    };
+    handler->notifier = notifier;
+    handler->callback = callback;
+    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(notifier), &event) != 0) {
+        fprintf(stderr, "failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+/* Block until the next event and invoke its callback
+ *
+ * Signals must be masked, EINTR should never happen.  This is true for QEMU
+ * threads.
+ */
+static void event_poll(EventPoll *poll)
+{
+    EventHandler *handler;
+    struct epoll_event event;
+    int nevents;
+
+    /* Wait for the next event.  Only do one event per call to keep the
+     * function simple, this could be changed later. */
+    nevents = epoll_wait(poll->epoll_fd, &event, 1, -1);
+    if (unlikely(nevents != 1)) {
+        fprintf(stderr, "epoll_wait failed: %m\n");
+        exit(1); /* should never happen */
+    }
+
+    /* Find out which event handler has become active */
+    handler = event.data.ptr;
+
+    /* Clear the eventfd */
+    event_notifier_test_and_clear(handler->notifier);
+
+    /* Handle the event */
+    handler->callback(handler);
+}
+
+#endif /* EVENT_POLL_H */
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
new file mode 100644
index 0000000..7099a99
--- /dev/null
+++ b/hw/dataplane/vring.h
@@ -0,0 +1,191 @@
+#ifndef VRING_H
+#define VRING_H
+
+#include <linux/virtio_ring.h>
+#include "qemu-common.h"
+
+typedef struct {
+    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
+    struct vring vr;                /* virtqueue vring mapped to host memory */
+    __u16 last_avail_idx;           /* last processed avail ring index */
+    __u16 last_used_idx;            /* last processed used ring index */
+} Vring;
+
+static inline unsigned int vring_get_num(Vring *vring)
+{
+    return vring->vr.num;
+}
+
+/* Map target physical address to host address
+ */
+static inline void *phys_to_host(Vring *vring, target_phys_addr_t phys)
+{
+    /* Adjust for 3.6-4 GB PCI memory range */
+    if (phys >= 0x100000000) {
+        phys -= 0x100000000 - 0xe0000000;
+    } else if (phys >= 0xe0000000) {
+        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
+        exit(1);
+    }
+    return vring->phys_mem_zero_host_ptr + phys;
+}
+
+/* Setup for cheap target physical to host address conversion
+ *
+ * This is a hack for direct access to guest memory, we're not really allowed
+ * to do this.
+ */
+static void setup_phys_to_host(Vring *vring)
+{
+    target_phys_addr_t len = 4096; /* RAM is really much larger but we cheat */
+    vring->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
+    if (!vring->phys_mem_zero_host_ptr) {
+        fprintf(stderr, "setup_phys_to_host failed\n");
+        exit(1);
+    }
+}
+
+/* Map the guest's vring to host memory
+ *
+ * This is not allowed but we know the ring won't move.
+ */
+static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
+{
+    setup_phys_to_host(vring);
+
+    vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
+               phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
+
+    vring->last_avail_idx = vring->vr.avail->idx;
+    vring->last_used_idx = vring->vr.used->idx;
+
+    fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
+            virtio_queue_get_ring_addr(vdev, n),
+            vring->vr.desc, vring->vr.avail, vring->vr.used);
+}
+
+/* This looks in the virtqueue for the first available buffer and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output descriptors followed by some number of input descriptors,
+ * it's actually two iovecs, but we pack them into one and note how many of
+ * each there were.
+ *
+ * This function returns the descriptor number found, or vring->vr.num (which
+ * is never a valid descriptor number) if none was found.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+static unsigned int vring_pop(Vring *vring,
+		      struct iovec iov[], unsigned int iov_size,
+		      unsigned int *out_num, unsigned int *in_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head, found = 0, num = vring->vr.num;
+    __u16 avail_idx, last_avail_idx;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vring->last_avail_idx;
+    avail_idx = vring->vr.avail->idx;
+
+	if (unlikely((__u16)(avail_idx - last_avail_idx) > num)) {
+		fprintf(stderr, "Guest moved avail index from %u to %u\n",
+		        last_avail_idx, avail_idx);
+		exit(1);
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (avail_idx == last_avail_idx)
+		return num;
+
+	/* Only get avail ring entries after they have been exposed by guest. */
+	__sync_synchronize(); /* smp_rmb() */
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	head = vring->vr.avail->ring[last_avail_idx % num];
+
+	/* If their number is silly, that's an error. */
+	if (unlikely(head >= num)) {
+		fprintf(stderr, "Guest says index %u > %u is available\n",
+		        head, num);
+		exit(1);
+	}
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+
+	i = head;
+	do {
+		if (unlikely(i >= num)) {
+			fprintf(stderr, "Desc index is %u > %u, head = %u\n",
+			        i, num, head);
+            exit(1);
+		}
+		if (unlikely(++found > num)) {
+			fprintf(stderr, "Loop detected: last one at %u "
+			        "vq size %u head %u\n",
+			        i, num, head);
+            exit(1);
+		}
+        desc = vring->vr.desc[i];
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+/*			ret = get_indirect(dev, vq, iov, iov_size,
+					   out_num, in_num,
+					   log, log_num, &desc);
+			if (unlikely(ret < 0)) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return ret;
+			}
+			continue; */
+            fprintf(stderr, "virtio-blk indirect vring not supported\n");
+            exit(1);
+		}
+
+        iov->iov_base = phys_to_host(vring, desc.addr);
+        iov->iov_len  = desc.len;
+        iov++;
+
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += 1;
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (unlikely(*in_num)) {
+				fprintf(stderr, "Descriptor has out after in: "
+				        "idx %d\n", i);
+                exit(1);
+			}
+			*out_num += 1;
+		}
+        i = desc.next;
+	} while (desc.flags & VRING_DESC_F_NEXT);
+
+	/* On success, increment avail index. */
+	vring->last_avail_idx++;
+	return head;
+}
+
+/* After we've used one of their buffers, we tell them about it.
+ *
+ * Stolen from linux-2.6/drivers/vhost/vhost.c.
+ */
+static __attribute__((unused)) void vring_push(Vring *vring, unsigned int head, int len)
+{
+	struct vring_used_elem *used;
+
+	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
+	 * next entry in that used ring. */
+	used = &vring->vr.used->ring[vring->last_used_idx % vring->vr.num];
+    used->id = head;
+    used->len = len;
+
+	/* Make sure buffer is written before we update index. */
+	__sync_synchronize(); /* smp_wmb() */
+
+    vring->vr.used->idx = ++vring->last_used_idx;
+}
+
+#endif /* VRING_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 99654f1..2c1cce8 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -11,26 +11,21 @@
  *
  */
 
-#include <sys/epoll.h>
-#include <sys/eventfd.h>
 #include <libaio.h>
-#include <linux/virtio_ring.h>
 #include "qemu-common.h"
 #include "qemu-thread.h"
 #include "qemu-error.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
+#include "hw/dataplane/event-poll.h"
+#include "hw/dataplane/vring.h"
+#include "kvm.h"
 
 enum {
-    SEG_MAX = 126, /* maximum number of I/O segments */
+    SEG_MAX = 126,                  /* maximum number of I/O segments */
+    VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
 };
 
-typedef struct
-{
-    EventNotifier *notifier;        /* eventfd */
-    void (*handler)(void);          /* handler function */
-} EventHandler;
-
 typedef struct VirtIOBlock
 {
     VirtIODevice vdev;
@@ -44,15 +39,13 @@ typedef struct VirtIOBlock
     bool data_plane_started;
     QemuThread data_plane_thread;
 
-    struct vring vring;
+    Vring vring;                    /* virtqueue vring */
 
-    int epoll_fd;                   /* epoll(2) file descriptor */
+    EventPoll event_poll;           /* event poller */
     io_context_t io_ctx;            /* Linux AIO context */
     EventNotifier io_notifier;      /* Linux AIO eventfd */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
-
-    void *phys_mem_zero_host_ptr;   /* host pointer to guest RAM */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -60,138 +53,64 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-/* Map target physical address to host address
- */
-static inline void *phys_to_host(VirtIOBlock *s, target_phys_addr_t phys)
+static void handle_io(EventHandler *handler)
 {
-    /* Adjust for 3.6-4 GB PCI memory range */
-    if (phys >= 0x100000000) {
-        phys -= 0x100000000 - 0xe0000000;
-    } else if (phys >= 0xe0000000) {
-        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
-        exit(1);
-    }
-    return s->phys_mem_zero_host_ptr + phys;
+    fprintf(stderr, "io completion happened\n");
 }
 
-/* Setup for cheap target physical to host address conversion
- *
- * This is a hack for direct access to guest memory, we're not really allowed
- * to do this.
- */
-static void setup_phys_to_host(VirtIOBlock *s)
+static void handle_notify(EventHandler *handler)
 {
-    target_phys_addr_t len = 4096; /* RAM is really much larger but we cheat */
-    s->phys_mem_zero_host_ptr = cpu_physical_memory_map(0, &len, 0);
-    if (!s->phys_mem_zero_host_ptr) {
-        fprintf(stderr, "setup_phys_to_host failed\n");
-        exit(1);
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, notify_handler);
+    struct iovec iov[VRING_MAX];
+    unsigned int out_num, in_num;
+    int head;
+
+    head = vring_pop(&s->vring, iov, ARRAY_SIZE(iov), &out_num, &in_num);
+    if (unlikely(head >= vring_get_num(&s->vring))) {
+        fprintf(stderr, "false alarm, nothing on vring\n");
+        return;
     }
-}
 
-/* Map the guest's vring to host memory
- *
- * This is not allowed but we know the ring won't move.
- */
-static void map_vring(struct vring *vring, VirtIOBlock *s, VirtIODevice *vdev, int n)
-{
-    vring->num = virtio_queue_get_num(vdev, n);
-    vring->desc = phys_to_host(s, virtio_queue_get_desc_addr(vdev, n));
-    vring->avail = phys_to_host(s, virtio_queue_get_avail_addr(vdev, n));
-    vring->used = phys_to_host(s, virtio_queue_get_used_addr(vdev, n));
-
-    fprintf(stderr, "virtio-blk vring physical=%#lx desc=%p avail=%p used=%p\n",
-            virtio_queue_get_ring_addr(vdev, n),
-            vring->desc, vring->avail, vring->used);
-}
-
-static void handle_io(void)
-{
-    fprintf(stderr, "io completion happened\n");
-}
-
-static void handle_notify(void)
-{
-    fprintf(stderr, "virtqueue notify happened\n");
+    fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
 }
 
 static void *data_plane_thread(void *opaque)
 {
     VirtIOBlock *s = opaque;
-    struct epoll_event event;
-    int nevents;
-    EventHandler *event_handler;
-
-    /* Signals are masked, EINTR should never happen */
 
     for (;;) {
-        /* Wait for the next event.  Only do one event per call to keep the
-         * function simple, this could be changed later. */
-        nevents = epoll_wait(s->epoll_fd, &event, 1, -1);
-        if (unlikely(nevents != 1)) {
-            fprintf(stderr, "epoll_wait failed: %m\n");
-            continue; /* should never happen */
-        }
-
-        /* Find out which event handler has become active */
-        event_handler = event.data.ptr;
-
-        /* Clear the eventfd */
-        event_notifier_test_and_clear(event_handler->notifier);
-
-        /* Handle the event */
-        event_handler->handler();
+        event_poll(&s->event_poll);
     }
     return NULL;
 }
 
-static void add_event_handler(int epoll_fd, EventHandler *event_handler)
-{
-    struct epoll_event event = {
-        .events = EPOLLIN,
-        .data.ptr = event_handler,
-    };
-    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(event_handler->notifier), &event) != 0) {
-        fprintf(stderr, "virtio-blk failed to add event handler to epoll: %m\n");
-        exit(1);
-    }
-}
-
 static void data_plane_start(VirtIOBlock *s)
 {
-    setup_phys_to_host(s);
-    map_vring(&s->vring, s, &s->vdev, 0);
-
-    /* Create epoll file descriptor */
-    s->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
-    if (s->epoll_fd < 0) {
-        fprintf(stderr, "epoll_create1 failed: %m\n");
-        return; /* TODO error handling */
-    }
+    vring_setup(&s->vring, &s->vdev, 0);
+
+    event_poll_init(&s->event_poll);
 
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier\n");
-        return; /* TODO error handling */
+        fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
+        exit(1);
     }
 
-    s->notify_handler.notifier = virtio_queue_get_host_notifier(s->vq),
-    s->notify_handler.handler = handle_notify;
-    add_event_handler(s->epoll_fd, &s->notify_handler);
+    event_poll_add(&s->event_poll, &s->notify_handler,
+                   virtio_queue_get_host_notifier(s->vq),
+                   handle_notify);
 
     /* Create aio context */
     if (io_setup(SEG_MAX, &s->io_ctx) != 0) {
         fprintf(stderr, "virtio-blk io_setup failed\n");
-        return; /* TODO error handling */
+        exit(1);
     }
 
     if (event_notifier_init(&s->io_notifier, 0) != 0) {
         fprintf(stderr, "virtio-blk io event notifier creation failed\n");
-        return; /* TODO error handling */
+        exit(1);
     }
 
-    s->io_handler.notifier = &s->io_notifier;
-    s->io_handler.handler = handle_io;
-    add_event_handler(s->epoll_fd, &s->io_handler);
+    event_poll_add(&s->event_poll, &s->io_handler, &s->io_notifier, handle_io);
 
     qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
 
@@ -209,7 +128,7 @@ static void data_plane_stop(VirtIOBlock *s)
 
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
-    close(s->epoll_fd);
+    event_poll_cleanup(&s->event_poll);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
@@ -317,7 +236,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
-    s->vq = virtio_add_queue(&s->vdev, SEG_MAX + 2, virtio_blk_handle_output);
+    s->vq = virtio_add_queue(&s->vdev, VRING_MAX, virtio_blk_handle_output);
     s->data_plane_started = false;
 
     s->qdev = dev;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 90+ messages in thread
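Stepping back from the diff: the event-poll abstraction introduced in this patch amounts to registering an eventfd with epoll, stashing a handler pointer in `event.data.ptr`, and clearing the eventfd before dispatching the callback. A minimal standalone sketch of the same pattern (illustrative names only, not the patch's API):

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

/* One handler per eventfd, mirroring the patch's EventHandler idea */
typedef struct EventHandler EventHandler;
typedef void EventCallback(EventHandler *handler);
struct EventHandler {
    int fd;                  /* eventfd file descriptor */
    EventCallback *callback; /* invoked when the fd becomes readable */
};

/* Register a handler with epoll; event.data.ptr carries the handler */
static int event_poll_add(int epoll_fd, EventHandler *handler)
{
    struct epoll_event event = {
        .events = EPOLLIN,
        .data.ptr = handler,
    };
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, handler->fd, &event);
}

/* Wait for one event, clear the eventfd, then dispatch the callback */
static void event_poll_once(int epoll_fd)
{
    struct epoll_event event;
    uint64_t counter;

    if (epoll_wait(epoll_fd, &event, 1, -1) != 1) {
        return;
    }
    EventHandler *handler = event.data.ptr;
    if (read(handler->fd, &counter, sizeof(counter)) < 0) { /* clear eventfd */
        return;
    }
    handler->callback(handler);
}

static int fired;
static void on_notify(EventHandler *handler)
{
    (void)handler;
    fired = 1;
}

int demo_event_poll(void)
{
    int epoll_fd = epoll_create1(EPOLL_CLOEXEC);
    EventHandler handler = { .fd = eventfd(0, 0), .callback = on_notify };
    uint64_t one = 1;

    event_poll_add(epoll_fd, &handler);
    if (write(handler.fd, &one, sizeof(one)) != sizeof(one)) { /* simulate a virtqueue kick */
        return 0;
    }
    event_poll_once(epoll_fd);

    close(handler.fd);
    close(epoll_fd);
    return fired;
}
```

The dedicated data-plane thread in the patch then reduces to a loop over such a poll-once call.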

* [RFC v9 08/27] virtio-blk: Read requests from the vring
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/vring.h |    8 +++++--
 hw/virtio-blk.c      |   62 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 7099a99..b07d4f6 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -76,7 +76,7 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
  * Stolen from linux-2.6/drivers/vhost/vhost.c.
  */
 static unsigned int vring_pop(Vring *vring,
-		      struct iovec iov[], unsigned int iov_size,
+		      struct iovec iov[], struct iovec *iov_end,
 		      unsigned int *out_num, unsigned int *in_num)
 {
 	struct vring_desc desc;
@@ -138,10 +138,14 @@ static unsigned int vring_pop(Vring *vring,
 				return ret;
 			}
 			continue; */
-            fprintf(stderr, "virtio-blk indirect vring not supported\n");
+            fprintf(stderr, "Indirect vring not supported\n");
             exit(1);
 		}
 
+        if (iov >= iov_end) {
+            fprintf(stderr, "Not enough vring iovecs\n");
+            exit(1);
+        }
         iov->iov_base = phys_to_host(vring, desc.addr);
         iov->iov_len  = desc.len;
         iov++;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 2c1cce8..91f1bab 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -24,6 +24,7 @@
 enum {
     SEG_MAX = 126,                  /* maximum number of I/O segments */
     VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
+    REQ_MAX = VRING_MAX / 2,        /* maximum number of requests in the vring */
 };
 
 typedef struct VirtIOBlock
@@ -58,20 +59,63 @@ static void handle_io(EventHandler *handler)
     fprintf(stderr, "io completion happened\n");
 }
 
+static void process_request(struct iovec iov[], unsigned int out_num, unsigned int in_num)
+{
+    /* Virtio block requests look like this: */
+    struct virtio_blk_outhdr *outhdr; /* iov[0] */
+    /* data[]                            ... */
+    struct virtio_blk_inhdr *inhdr;   /* iov[out_num + in_num - 1] */
+
+    if (unlikely(out_num == 0 || in_num == 0 ||
+                iov[0].iov_len != sizeof *outhdr ||
+                iov[out_num + in_num - 1].iov_len != sizeof *inhdr)) {
+        fprintf(stderr, "virtio-blk invalid request\n");
+        exit(1);
+    }
+
+    outhdr = iov[0].iov_base;
+    inhdr = iov[out_num + in_num - 1].iov_base;
+
+    fprintf(stderr, "virtio-blk request type=%#x sector=%#lx\n",
+            outhdr->type, outhdr->sector);
+}
+
 static void handle_notify(EventHandler *handler)
 {
     VirtIOBlock *s = container_of(handler, VirtIOBlock, notify_handler);
-    struct iovec iov[VRING_MAX];
-    unsigned int out_num, in_num;
-    int head;
 
-    head = vring_pop(&s->vring, iov, ARRAY_SIZE(iov), &out_num, &in_num);
-    if (unlikely(head >= vring_get_num(&s->vring))) {
-        fprintf(stderr, "false alarm, nothing on vring\n");
-        return;
-    }
+    /* There is one array of iovecs into which all new requests are extracted
+     * from the vring.  Requests are read from the vring and the translated
+     * descriptors are written to the iovecs array.  The iovecs do not have to
+     * persist across handle_notify() calls because the kernel copies the
+     * iovecs on io_submit().
+     *
+     * Handling io_submit() EAGAIN may require storing the requests across
+     * handle_notify() calls until the kernel has sufficient resources to
+     * accept more I/O.  This is not implemented yet.
+     */
+    struct iovec iovec[VRING_MAX];
+    struct iovec *iov, *end = &iovec[VRING_MAX];
+
+    /* When a request is read from the vring, the index of the first descriptor
+     * (aka head) is returned so that the completed request can be pushed onto
+     * the vring later.
+     *
+     * The number of hypervisor read-only iovecs is out_num.  The number of
+     * hypervisor write-only iovecs is in_num.
+     */
+    unsigned int head, out_num = 0, in_num = 0;
+
+    for (iov = iovec; ; iov += out_num + in_num) {
+        head = vring_pop(&s->vring, iov, end, &out_num, &in_num);
+        if (head >= vring_get_num(&s->vring)) {
+            break; /* no more requests */
+        }
+
+        fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
 
-    fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
+        process_request(iov, out_num, in_num);
+    }
 }
 
 static void *data_plane_thread(void *opaque)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread
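The request framing that `process_request()` checks above — a read-only header in `iov[0]`, data buffers in between, and a small writable status footer last — can be sketched as a standalone validator (struct layouts abbreviated from the virtio-blk spec; names here are illustrative, not the patch's):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/uio.h>

/* Abbreviated virtio-blk framing (see the virtio specification) */
struct blk_outhdr {          /* stands in for struct virtio_blk_outhdr */
    uint32_t type;           /* request type, e.g. read or write */
    uint32_t ioprio;
    uint64_t sector;         /* starting sector */
};

struct blk_inhdr {           /* stands in for struct virtio_blk_inhdr */
    uint8_t status;          /* written by the device on completion */
};

/* Returns 1 if the iovec array is a well-formed virtio-blk request:
 * at least one out and one in descriptor, header first, footer last. */
static int request_is_valid(struct iovec iov[], unsigned int out_num,
                            unsigned int in_num)
{
    if (out_num == 0 || in_num == 0) {
        return 0;
    }
    if (iov[0].iov_len != sizeof(struct blk_outhdr)) {
        return 0;
    }
    if (iov[out_num + in_num - 1].iov_len != sizeof(struct blk_inhdr)) {
        return 0;
    }
    return 1;
}
```

The patch exits on a malformed request instead of returning, but the checks are the same.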


* [RFC v9 09/27] virtio-blk: Add Linux AIO queue
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Requests read from the vring will be placed in a queue where they can be
merged as necessary.  Once all requests have been read from the vring,
the queue can be submitted.
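The freelist-plus-queue scheme can be sketched independently of libaio — below, a plain struct stands in for `struct iocb` and the submit step just reports the batch size instead of calling `io_submit()`; all names are illustrative, not the patch's API:

```c
#include <assert.h>
#include <stdlib.h>

/* A request slot stands in for the patch's struct iocb */
typedef struct { long long offset; } Request;

typedef struct {
    unsigned int maxreqs;    /* capacity of freelist and queue */
    Request **freelist;      /* slots available for new requests */
    unsigned int freelist_idx;
    Request **queue;         /* requests batched up for one submission */
    unsigned int queue_idx;
} IOQueue;

static void ioq_init(IOQueue *ioq, unsigned int maxreqs)
{
    ioq->maxreqs = maxreqs;
    ioq->freelist = calloc(maxreqs, sizeof(Request *));
    ioq->queue = calloc(maxreqs, sizeof(Request *));
    ioq->freelist_idx = 0;
    ioq->queue_idx = 0;
    for (unsigned int i = 0; i < maxreqs; i++) {
        ioq->freelist[ioq->freelist_idx++] = malloc(sizeof(Request));
    }
}

/* Take a slot from the freelist and append it to the pending batch */
static Request *ioq_get(IOQueue *ioq)
{
    assert(ioq->freelist_idx > 0);
    Request *req = ioq->freelist[--ioq->freelist_idx];
    ioq->queue[ioq->queue_idx++] = req;
    return req;
}

/* "Submit" the whole batch in one go (io_submit() in the patch), then reset */
static int ioq_submit(IOQueue *ioq)
{
    int count = ioq->queue_idx;
    ioq->queue_idx = 0;
    return count;
}

/* On completion a slot goes back to the freelist */
static void ioq_put(IOQueue *ioq, Request *req)
{
    ioq->freelist[ioq->freelist_idx++] = req;
}
```

A freelist is needed because completions can arrive in any order, so slots cannot simply be recycled in FIFO order.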

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/ioq.h |  104 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio-blk.c    |   33 ++++++++---------
 2 files changed, 120 insertions(+), 17 deletions(-)
 create mode 100644 hw/dataplane/ioq.h

diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
new file mode 100644
index 0000000..26ca307
--- /dev/null
+++ b/hw/dataplane/ioq.h
@@ -0,0 +1,104 @@
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+typedef struct {
+    int fd;                         /* file descriptor */
+    unsigned int maxreqs;           /* max length of freelist and queue */
+
+    io_context_t io_ctx;            /* Linux AIO context */
+    EventNotifier notifier;         /* Linux AIO eventfd */
+
+    /* Requests can complete in any order so a free list is necessary to manage
+     * available iocbs.
+     */
+    struct iocb **freelist;         /* free iocbs */
+    unsigned int freelist_idx;
+
+    /* Multiple requests are queued up before submitting them all in one go */
+    struct iocb **queue;            /* queued iocbs */
+    unsigned int queue_idx;
+} IOQueue;
+
+static void ioq_init(IOQueue *ioq, int fd, unsigned int maxreqs)
+{
+    ioq->fd = fd;
+    ioq->maxreqs = maxreqs;
+
+    if (io_setup(maxreqs, &ioq->io_ctx) != 0) {
+        fprintf(stderr, "ioq io_setup failed\n");
+        exit(1);
+    }
+
+    if (event_notifier_init(&ioq->notifier, 0) != 0) {
+        fprintf(stderr, "ioq io event notifier creation failed\n");
+        exit(1);
+    }
+
+    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * maxreqs);
+    ioq->freelist_idx = 0;
+
+    ioq->queue = g_malloc0(sizeof ioq->queue[0] * maxreqs);
+    ioq->queue_idx = 0;
+}
+
+static void ioq_cleanup(IOQueue *ioq)
+{
+    g_free(ioq->freelist);
+    g_free(ioq->queue);
+
+    event_notifier_cleanup(&ioq->notifier);
+    io_destroy(ioq->io_ctx);
+}
+
+static EventNotifier *ioq_get_notifier(IOQueue *ioq)
+{
+    return &ioq->notifier;
+}
+
+static struct iocb *ioq_get_iocb(IOQueue *ioq)
+{
+    if (unlikely(ioq->freelist_idx == 0)) {
+        fprintf(stderr, "ioq underflow\n");
+        exit(1);
+    }
+    struct iocb *iocb = ioq->freelist[--ioq->freelist_idx];
+    ioq->queue[ioq->queue_idx++] = iocb;
+    return iocb;
+}
+
+static __attribute__((unused)) void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
+{
+    if (unlikely(ioq->freelist_idx == ioq->maxreqs)) {
+        fprintf(stderr, "ioq overflow\n");
+        exit(1);
+    }
+    ioq->freelist[ioq->freelist_idx++] = iocb;
+}
+
+static __attribute__((unused)) void ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigned int count, long long offset)
+{
+    struct iocb *iocb = ioq_get_iocb(ioq);
+
+    if (read) {
+        io_prep_preadv(iocb, ioq->fd, iov, count, offset);
+    } else {
+        io_prep_pwritev(iocb, ioq->fd, iov, count, offset);
+    }
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+}
+
+static __attribute__((unused)) void ioq_fdsync(IOQueue *ioq)
+{
+    struct iocb *iocb = ioq_get_iocb(ioq);
+
+    io_prep_fdsync(iocb, ioq->fd);
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+}
+
+static __attribute__((unused)) int ioq_submit(IOQueue *ioq)
+{
+    int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
+    ioq->queue_idx = 0; /* reset */
+    return rc;
+}
+
+#endif /* IO_QUEUE_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 91f1bab..5e1ed79 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -13,12 +13,14 @@
 
 #include <libaio.h>
 #include "qemu-common.h"
+#include "block_int.h"
 #include "qemu-thread.h"
 #include "qemu-error.h"
 #include "blockdev.h"
 #include "virtio-blk.h"
 #include "hw/dataplane/event-poll.h"
 #include "hw/dataplane/vring.h"
+#include "hw/dataplane/ioq.h"
 #include "kvm.h"
 
 enum {
@@ -42,9 +44,9 @@ typedef struct VirtIOBlock
 
     Vring vring;                    /* virtqueue vring */
 
+    IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
+
     EventPoll event_poll;           /* event poller */
-    io_context_t io_ctx;            /* Linux AIO context */
-    EventNotifier io_notifier;      /* Linux AIO eventfd */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
 } VirtIOBlock;
@@ -128,6 +130,14 @@ static void *data_plane_thread(void *opaque)
     return NULL;
 }
 
+/* Normally the block driver passes down the fd, there's no way to get it from
+ * above.
+ */
+static int get_raw_posix_fd_hack(VirtIOBlock *s)
+{
+    return *(int*)s->bs->file->opaque;
+}
+
 static void data_plane_start(VirtIOBlock *s)
 {
     vring_setup(&s->vring, &s->vdev, 0);
@@ -138,23 +148,13 @@ static void data_plane_start(VirtIOBlock *s)
         fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
         exit(1);
     }
-
     event_poll_add(&s->event_poll, &s->notify_handler,
                    virtio_queue_get_host_notifier(s->vq),
                    handle_notify);
 
-    /* Create aio context */
-    if (io_setup(SEG_MAX, &s->io_ctx) != 0) {
-        fprintf(stderr, "virtio-blk io_setup failed\n");
-        exit(1);
-    }
-
-    if (event_notifier_init(&s->io_notifier, 0) != 0) {
-        fprintf(stderr, "virtio-blk io event notifier creation failed\n");
-        exit(1);
-    }
-
-    event_poll_add(&s->event_poll, &s->io_handler, &s->io_notifier, handle_io);
+    ioq_init(&s->ioqueue, get_raw_posix_fd_hack(s), REQ_MAX);
+    /* TODO populate ioqueue freelist */
+    event_poll_add(&s->event_poll, &s->io_handler, ioq_get_notifier(&s->ioqueue), handle_io);
 
     qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
 
@@ -167,8 +167,7 @@ static void data_plane_stop(VirtIOBlock *s)
 
     /* TODO stop data plane thread */
 
-    event_notifier_cleanup(&s->io_notifier);
-    io_destroy(s->io_ctx);
+    ioq_cleanup(&s->ioqueue);
 
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
-- 
1.7.10.4


* [RFC v9 10/27] virtio-blk: Stop data plane thread cleanly
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  0 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/event-poll.h |   79 ++++++++++++++++-------
 hw/dataplane/ioq.h        |   65 +++++++++++++------
 hw/dataplane/vring.h      |    6 +-
 hw/virtio-blk.c           |  154 +++++++++++++++++++++++++++++++++++++++------
 4 files changed, 243 insertions(+), 61 deletions(-)

diff --git a/hw/dataplane/event-poll.h b/hw/dataplane/event-poll.h
index f38e969..acd85e1 100644
--- a/hw/dataplane/event-poll.h
+++ b/hw/dataplane/event-poll.h
@@ -5,17 +5,40 @@
 #include "event_notifier.h"
 
 typedef struct EventHandler EventHandler;
-typedef void EventCallback(EventHandler *handler);
+typedef bool EventCallback(EventHandler *handler);
 struct EventHandler
 {
-    EventNotifier *notifier;    /* eventfd */
-    EventCallback *callback;    /* callback function */
+    EventNotifier *notifier;        /* eventfd */
+    EventCallback *callback;        /* callback function */
 };
 
 typedef struct {
-    int epoll_fd;               /* epoll(2) file descriptor */
+    int epoll_fd;                   /* epoll(2) file descriptor */
+    EventNotifier stop_notifier;    /* stop poll notifier */
+    EventHandler stop_handler;      /* stop poll handler */
 } EventPoll;
 
+/* Add an event notifier and its callback for polling */
+static void event_poll_add(EventPoll *poll, EventHandler *handler, EventNotifier *notifier, EventCallback *callback)
+{
+    struct epoll_event event = {
+        .events = EPOLLIN,
+        .data.ptr = handler,
+    };
+    handler->notifier = notifier;
+    handler->callback = callback;
+    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(notifier), &event) != 0) {
+        fprintf(stderr, "failed to add event handler to epoll: %m\n");
+        exit(1);
+    }
+}
+
+/* Event callback for stopping the event_poll_run() loop */
+static bool handle_stop(EventHandler *handler)
+{
+    return false; /* stop event loop */
+}
+
 static void event_poll_init(EventPoll *poll)
 {
     /* Create epoll file descriptor */
@@ -24,35 +47,29 @@ static void event_poll_init(EventPoll *poll)
         fprintf(stderr, "epoll_create1 failed: %m\n");
         exit(1);
     }
+
+    /* Set up stop notifier */
+    if (event_notifier_init(&poll->stop_notifier, 0) < 0) {
+        fprintf(stderr, "failed to init stop notifier\n");
+        exit(1);
+    }
+    event_poll_add(poll, &poll->stop_handler,
+                   &poll->stop_notifier, handle_stop);
 }
 
 static void event_poll_cleanup(EventPoll *poll)
 {
+    event_notifier_cleanup(&poll->stop_notifier);
     close(poll->epoll_fd);
     poll->epoll_fd = -1;
 }
 
-/* Add an event notifier and its callback for polling */
-static void event_poll_add(EventPoll *poll, EventHandler *handler, EventNotifier *notifier, EventCallback *callback)
-{
-    struct epoll_event event = {
-        .events = EPOLLIN,
-        .data.ptr = handler,
-    };
-    handler->notifier = notifier;
-    handler->callback = callback;
-    if (epoll_ctl(poll->epoll_fd, EPOLL_CTL_ADD, event_notifier_get_fd(notifier), &event) != 0) {
-        fprintf(stderr, "failed to add event handler to epoll: %m\n");
-        exit(1);
-    }
-}
-
 /* Block until the next event and invoke its callback
  *
  * Signals must be masked, EINTR should never happen.  This is true for QEMU
  * threads.
  */
-static void event_poll(EventPoll *poll)
+static bool event_poll(EventPoll *poll)
 {
     EventHandler *handler;
     struct epoll_event event;
@@ -73,7 +90,27 @@ static void event_poll(EventPoll *poll)
     event_notifier_test_and_clear(handler->notifier);
 
     /* Handle the event */
-    handler->callback(handler);
+    return handler->callback(handler);
+}
+
+static void event_poll_run(EventPoll *poll)
+{
+    while (event_poll(poll)) {
+        /* do nothing */
+    }
+}
+
+/* Stop the event_poll_run() loop
+ *
+ * This function can be used from another thread.
+ */
+static void event_poll_stop(EventPoll *poll)
+{
+    uint64_t dummy = 1;
+    int eventfd = event_notifier_get_fd(&poll->stop_notifier);
+    ssize_t unused __attribute__((unused));
+
+    unused = write(eventfd, &dummy, sizeof dummy);
 }
 
 #endif /* EVENT_POLL_H */
diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
index 26ca307..7200e87 100644
--- a/hw/dataplane/ioq.h
+++ b/hw/dataplane/ioq.h
@@ -3,10 +3,10 @@
 
 typedef struct {
     int fd;                         /* file descriptor */
-    unsigned int maxreqs;           /* max length of freelist and queue */
+    unsigned int max_reqs;           /* max length of freelist and queue */
 
     io_context_t io_ctx;            /* Linux AIO context */
-    EventNotifier notifier;         /* Linux AIO eventfd */
+    EventNotifier io_notifier;      /* Linux AIO eventfd */
 
     /* Requests can complete in any order so a free list is necessary to manage
      * available iocbs.
@@ -19,25 +19,28 @@ typedef struct {
     unsigned int queue_idx;
 } IOQueue;
 
-static void ioq_init(IOQueue *ioq, int fd, unsigned int maxreqs)
+static void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs)
 {
+    int rc;
+
     ioq->fd = fd;
-    ioq->maxreqs = maxreqs;
+    ioq->max_reqs = max_reqs;
 
-    if (io_setup(maxreqs, &ioq->io_ctx) != 0) {
-        fprintf(stderr, "ioq io_setup failed\n");
+    memset(&ioq->io_ctx, 0, sizeof ioq->io_ctx);
+    if ((rc = io_setup(max_reqs, &ioq->io_ctx)) != 0) {
+        fprintf(stderr, "ioq io_setup failed %d\n", rc);
         exit(1);
     }
 
-    if (event_notifier_init(&ioq->notifier, 0) != 0) {
-        fprintf(stderr, "ioq io event notifier creation failed\n");
+    if ((rc = event_notifier_init(&ioq->io_notifier, 0)) != 0) {
+        fprintf(stderr, "ioq io event notifier creation failed %d\n", rc);
         exit(1);
     }
 
-    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * maxreqs);
+    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * max_reqs);
     ioq->freelist_idx = 0;
 
-    ioq->queue = g_malloc0(sizeof ioq->queue[0] * maxreqs);
+    ioq->queue = g_malloc0(sizeof ioq->queue[0] * max_reqs);
     ioq->queue_idx = 0;
 }
 
@@ -46,13 +49,13 @@ static void ioq_cleanup(IOQueue *ioq)
     g_free(ioq->freelist);
     g_free(ioq->queue);
 
-    event_notifier_cleanup(&ioq->notifier);
+    event_notifier_cleanup(&ioq->io_notifier);
     io_destroy(ioq->io_ctx);
 }
 
 static EventNotifier *ioq_get_notifier(IOQueue *ioq)
 {
-    return &ioq->notifier;
+    return &ioq->io_notifier;
 }
 
 static struct iocb *ioq_get_iocb(IOQueue *ioq)
@@ -63,18 +66,19 @@ static struct iocb *ioq_get_iocb(IOQueue *ioq)
     }
     struct iocb *iocb = ioq->freelist[--ioq->freelist_idx];
     ioq->queue[ioq->queue_idx++] = iocb;
+    return iocb;
 }
 
-static __attribute__((unused)) void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
+static void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
 {
-    if (unlikely(ioq->freelist_idx == ioq->maxreqs)) {
+    if (unlikely(ioq->freelist_idx == ioq->max_reqs)) {
         fprintf(stderr, "ioq overflow\n");
         exit(1);
     }
     ioq->freelist[ioq->freelist_idx++] = iocb;
 }
 
-static __attribute__((unused)) void ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigned int count, long long offset)
+static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigned int count, long long offset)
 {
     struct iocb *iocb = ioq_get_iocb(ioq);
 
@@ -83,22 +87,45 @@ static __attribute__((unused)) void ioq_rdwr(IOQueue *ioq, bool read, struct iov
     } else {
         io_prep_pwritev(iocb, ioq->fd, iov, count, offset);
     }
-    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
 }
 
-static __attribute__((unused)) void ioq_fdsync(IOQueue *ioq)
+static struct iocb *ioq_fdsync(IOQueue *ioq)
 {
     struct iocb *iocb = ioq_get_iocb(ioq);
 
     io_prep_fdsync(iocb, ioq->fd);
-    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
 }
 
-static __attribute__((unused)) int ioq_submit(IOQueue *ioq)
+static int ioq_submit(IOQueue *ioq)
 {
     int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
     ioq->queue_idx = 0; /* reset */
     return rc;
 }
 
+typedef void IOQueueCompletion(struct iocb *iocb, ssize_t ret, void *opaque);
+static int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion, void *opaque)
+{
+    struct io_event events[ioq->max_reqs];
+    int nevents, i;
+
+    nevents = io_getevents(ioq->io_ctx, 0, ioq->max_reqs, events, NULL);
+    if (unlikely(nevents < 0)) {
+        fprintf(stderr, "io_getevents failed %d\n", nevents);
+        exit(1);
+    }
+
+    for (i = 0; i < nevents; i++) {
+        ssize_t ret = ((uint64_t)events[i].res2 << 32) | events[i].res;
+
+        completion(events[i].obj, ret, opaque);
+        ioq_put_iocb(ioq, events[i].obj);
+    }
+    return nevents;
+}
+
 #endif /* IO_QUEUE_H */
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index b07d4f6..70675e5 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -56,8 +56,8 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
     vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
                phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
 
-    vring->last_avail_idx = vring->vr.avail->idx;
-    vring->last_used_idx = vring->vr.used->idx;
+    vring->last_avail_idx = 0;
+    vring->last_used_idx = 0;
 
     fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
             virtio_queue_get_ring_addr(vdev, n),
@@ -176,7 +176,7 @@ static unsigned int vring_pop(Vring *vring,
  *
  * Stolen from linux-2.6/drivers/vhost/vhost.c.
  */
-static __attribute__((unused)) void vring_push(Vring *vring, unsigned int head, int len)
+static void vring_push(Vring *vring, unsigned int head, int len)
 {
 	struct vring_used_elem *used;
 
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 5e1ed79..52ea601 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -29,8 +29,13 @@ enum {
     REQ_MAX = VRING_MAX / 2,        /* maximum number of requests in the vring */
 };
 
-typedef struct VirtIOBlock
-{
+typedef struct {
+    struct iocb iocb;               /* Linux AIO control block */
+    unsigned char *status;          /* virtio block status code */
+    unsigned int head;              /* vring descriptor index */
+} VirtIOBlockRequest;
+
+typedef struct {
     VirtIODevice vdev;
     BlockDriverState *bs;
     VirtQueue *vq;
@@ -44,11 +49,12 @@ typedef struct VirtIOBlock
 
     Vring vring;                    /* virtqueue vring */
 
-    IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
-
     EventPoll event_poll;           /* event poller */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
+
+    IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
+    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the queue */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -56,12 +62,40 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-static void handle_io(EventHandler *handler)
+static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
 {
-    fprintf(stderr, "io completion happened\n");
+    VirtIOBlock *s = opaque;
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    int len;
+
+    if (likely(ret >= 0)) {
+        *req->status = VIRTIO_BLK_S_OK;
+        len = ret;
+    } else {
+        *req->status = VIRTIO_BLK_S_IOERR;
+        len = 0;
+    }
+
+    /* According to the virtio specification len should be the number of bytes
+     * written to, but for virtio-blk it seems to be the number of bytes
+     * transferred plus the status bytes.
+     */
+    vring_push(&s->vring, req->head, len + sizeof req->status);
 }
 
-static void process_request(struct iovec iov[], unsigned int out_num, unsigned int in_num)
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        /* TODO is this thread-safe and can it be done faster? */
+        virtio_irq(s->vq);
+    }
+
+    return true;
+}
+
+static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
 {
     /* Virtio block requests look like this: */
     struct virtio_blk_outhdr *outhdr; /* iov[0] */
@@ -78,11 +112,54 @@ static void process_request(struct iovec iov[], unsigned int out_num, unsigned i
     outhdr = iov[0].iov_base;
     inhdr = iov[out_num + in_num - 1].iov_base;
 
+    /*
     fprintf(stderr, "virtio-blk request type=%#x sector=%#lx\n",
             outhdr->type, outhdr->sector);
+    */
+
+    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
+        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
+        exit(1);
+    }
+
+    struct iocb *iocb;
+    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    case VIRTIO_BLK_T_IN:
+        if (unlikely(out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid read request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, true, &iov[1], in_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
+        break;
+
+    case VIRTIO_BLK_T_OUT:
+        if (unlikely(in_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid write request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
+        break;
+
+    case VIRTIO_BLK_T_FLUSH:
+        if (unlikely(in_num != 1 || out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid flush request\n");
+            exit(1);
+        }
+        iocb = ioq_fdsync(ioq);
+        break;
+
+    default:
+        fprintf(stderr, "virtio-blk multiple request type bits set\n");
+        exit(1);
+    }
+
+    /* Fill in virtio block metadata needed for completion */
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    req->head = head;
+    req->status = &inhdr->status;
 }
 
-static void handle_notify(EventHandler *handler)
+static bool handle_notify(EventHandler *handler)
 {
     VirtIOBlock *s = container_of(handler, VirtIOBlock, notify_handler);
 
@@ -114,19 +191,29 @@ static void handle_notify(EventHandler *handler)
             break; /* no more requests */
         }
 
-        fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
+        /*
+        fprintf(stderr, "out_num=%u in_num=%u head=%u\n", out_num, in_num, head);
+        */
 
-        process_request(iov, out_num, in_num);
+        process_request(&s->ioqueue, iov, out_num, in_num, head);
     }
+
+    /* Submit requests, if any */
+    if (likely(iov != iovec)) {
+        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
+            fprintf(stderr, "ioq_submit failed\n");
+            exit(1);
+        }
+    }
+
+    return true;
 }
 
 static void *data_plane_thread(void *opaque)
 {
     VirtIOBlock *s = opaque;
 
-    for (;;) {
-        event_poll(&s->event_poll);
-    }
+    event_poll_run(&s->event_poll);
     return NULL;
 }
 
@@ -140,10 +227,13 @@ static int get_raw_posix_fd_hack(VirtIOBlock *s)
 
 static void data_plane_start(VirtIOBlock *s)
 {
+    int i;
+
     vring_setup(&s->vring, &s->vdev, 0);
 
     event_poll_init(&s->event_poll);
 
+    /* Set up virtqueue notify */
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
         exit(1);
@@ -152,8 +242,11 @@ static void data_plane_start(VirtIOBlock *s)
                    virtio_queue_get_host_notifier(s->vq),
                    handle_notify);
 
+    /* Set up ioqueue */
     ioq_init(&s->ioqueue, get_raw_posix_fd_hack(s), REQ_MAX);
-    /* TODO populate ioqueue freelist */
+    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
+        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
+    }
     event_poll_add(&s->event_poll, &s->io_handler, ioq_get_notifier(&s->ioqueue), handle_io);
 
     qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
@@ -165,7 +258,9 @@ static void data_plane_stop(VirtIOBlock *s)
 {
     s->data_plane_started = false;
 
-    /* TODO stop data plane thread */
+    /* Tell data plane thread to stop and then wait for it to return */
+    event_poll_stop(&s->event_poll);
+    pthread_join(s->data_plane_thread.thread, NULL);
 
     ioq_cleanup(&s->ioqueue);
 
@@ -183,6 +278,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
         return;
     }
 
+    /*
+    fprintf(stderr, "virtio_blk_set_status %#x\n", val);
+    */
+
     if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
         data_plane_start(s);
     } else {
@@ -190,11 +289,29 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
     }
 }
 
+static void virtio_blk_reset(VirtIODevice *vdev)
+{
+    virtio_blk_set_status(vdev, 0);
+}
+
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
-    fprintf(stderr, "virtio_blk_handle_output: should never get here, "
-                    "data plane thread should process requests\n");
-    exit(1);
+    VirtIOBlock *s = to_virtio_blk(vdev);
+
+    if (s->data_plane_started) {
+        fprintf(stderr, "virtio_blk_handle_output: should never get here, "
+                        "data plane thread should process requests\n");
+        exit(1);
+    }
+
+    /* Linux seems to notify before the driver comes up.  This needs more
+     * investigation.  Just use a hack for now.
+     */
+    virtio_blk_set_status(vdev, VIRTIO_CONFIG_S_DRIVER_OK); /* start the thread */
+
+    /* Now kick the thread */
+    uint64_t dummy = 1;
+    ssize_t unused __attribute__((unused)) = write(event_notifier_get_fd(virtio_queue_get_host_notifier(s->vq)), &dummy, sizeof dummy);
 }
 
 /* coalesce internal state, copy to pci i/o region 0
@@ -273,6 +390,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
     s->vdev.set_status = virtio_blk_set_status;
+    s->vdev.reset = virtio_blk_reset;
     s->bs = conf->bs;
     s->conf = conf;
     s->serial = *serial;
-- 
1.7.10.4


-    EventNotifier notifier;         /* Linux AIO eventfd */
+    EventNotifier io_notifier;      /* Linux AIO eventfd */
 
     /* Requests can complete in any order so a free list is necessary to manage
      * available iocbs.
@@ -19,25 +19,28 @@ typedef struct {
     unsigned int queue_idx;
 } IOQueue;
 
-static void ioq_init(IOQueue *ioq, int fd, unsigned int maxreqs)
+static void ioq_init(IOQueue *ioq, int fd, unsigned int max_reqs)
 {
+    int rc;
+
     ioq->fd = fd;
-    ioq->maxreqs = maxreqs;
+    ioq->max_reqs = max_reqs;
 
-    if (io_setup(maxreqs, &ioq->io_ctx) != 0) {
-        fprintf(stderr, "ioq io_setup failed\n");
+    memset(&ioq->io_ctx, 0, sizeof ioq->io_ctx);
+    if ((rc = io_setup(max_reqs, &ioq->io_ctx)) != 0) {
+        fprintf(stderr, "ioq io_setup failed %d\n", rc);
         exit(1);
     }
 
-    if (event_notifier_init(&ioq->notifier, 0) != 0) {
-        fprintf(stderr, "ioq io event notifier creation failed\n");
+    if ((rc = event_notifier_init(&ioq->io_notifier, 0)) != 0) {
+        fprintf(stderr, "ioq io event notifier creation failed %d\n", rc);
         exit(1);
     }
 
-    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * maxreqs);
+    ioq->freelist = g_malloc0(sizeof ioq->freelist[0] * max_reqs);
     ioq->freelist_idx = 0;
 
-    ioq->queue = g_malloc0(sizeof ioq->queue[0] * maxreqs);
+    ioq->queue = g_malloc0(sizeof ioq->queue[0] * max_reqs);
     ioq->queue_idx = 0;
 }
 
@@ -46,13 +49,13 @@ static void ioq_cleanup(IOQueue *ioq)
     g_free(ioq->freelist);
     g_free(ioq->queue);
 
-    event_notifier_cleanup(&ioq->notifier);
+    event_notifier_cleanup(&ioq->io_notifier);
     io_destroy(ioq->io_ctx);
 }
 
 static EventNotifier *ioq_get_notifier(IOQueue *ioq)
 {
-    return &ioq->notifier;
+    return &ioq->io_notifier;
 }
 
 static struct iocb *ioq_get_iocb(IOQueue *ioq)
@@ -63,18 +66,19 @@ static struct iocb *ioq_get_iocb(IOQueue *ioq)
     }
     struct iocb *iocb = ioq->freelist[--ioq->freelist_idx];
     ioq->queue[ioq->queue_idx++] = iocb;
+    return iocb;
 }
 
-static __attribute__((unused)) void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
+static void ioq_put_iocb(IOQueue *ioq, struct iocb *iocb)
 {
-    if (unlikely(ioq->freelist_idx == ioq->maxreqs)) {
+    if (unlikely(ioq->freelist_idx == ioq->max_reqs)) {
         fprintf(stderr, "ioq overflow\n");
         exit(1);
     }
     ioq->freelist[ioq->freelist_idx++] = iocb;
 }
 
-static __attribute__((unused)) void ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigned int count, long long offset)
+static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigned int count, long long offset)
 {
     struct iocb *iocb = ioq_get_iocb(ioq);
 
@@ -83,22 +87,45 @@ static __attribute__((unused)) void ioq_rdwr(IOQueue *ioq, bool read, struct iov
     } else {
         io_prep_pwritev(iocb, ioq->fd, iov, count, offset);
     }
-    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
 }
 
-static __attribute__((unused)) void ioq_fdsync(IOQueue *ioq)
+static struct iocb *ioq_fdsync(IOQueue *ioq)
 {
     struct iocb *iocb = ioq_get_iocb(ioq);
 
     io_prep_fdsync(iocb, ioq->fd);
-    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->notifier));
+    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
+    return iocb;
 }
 
-static __attribute__((unused)) int ioq_submit(IOQueue *ioq)
+static int ioq_submit(IOQueue *ioq)
 {
     int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
     ioq->queue_idx = 0; /* reset */
     return rc;
 }
 
+typedef void IOQueueCompletion(struct iocb *iocb, ssize_t ret, void *opaque);
+static int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion, void *opaque)
+{
+    struct io_event events[ioq->max_reqs];
+    int nevents, i;
+
+    nevents = io_getevents(ioq->io_ctx, 0, ioq->max_reqs, events, NULL);
+    if (unlikely(nevents < 0)) {
+        fprintf(stderr, "io_getevents failed %d\n", nevents);
+        exit(1);
+    }
+
+    for (i = 0; i < nevents; i++) {
+        ssize_t ret = ((uint64_t)events[i].res2 << 32) | events[i].res;
+
+        completion(events[i].obj, ret, opaque);
+        ioq_put_iocb(ioq, events[i].obj);
+    }
+    return nevents;
+}
+
 #endif /* IO_QUEUE_H */
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index b07d4f6..70675e5 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -56,8 +56,8 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
     vring_init(&vring->vr, virtio_queue_get_num(vdev, n),
                phys_to_host(vring, virtio_queue_get_ring_addr(vdev, n)), 4096);
 
-    vring->last_avail_idx = vring->vr.avail->idx;
-    vring->last_used_idx = vring->vr.used->idx;
+    vring->last_avail_idx = 0;
+    vring->last_used_idx = 0;
 
     fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
             virtio_queue_get_ring_addr(vdev, n),
@@ -176,7 +176,7 @@ static unsigned int vring_pop(Vring *vring,
  *
  * Stolen from linux-2.6/drivers/vhost/vhost.c.
  */
-static __attribute__((unused)) void vring_push(Vring *vring, unsigned int head, int len)
+static void vring_push(Vring *vring, unsigned int head, int len)
 {
 	struct vring_used_elem *used;
 
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 5e1ed79..52ea601 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -29,8 +29,13 @@ enum {
     REQ_MAX = VRING_MAX / 2,        /* maximum number of requests in the vring */
 };
 
-typedef struct VirtIOBlock
-{
+typedef struct {
+    struct iocb iocb;               /* Linux AIO control block */
+    unsigned char *status;          /* virtio block status code */
+    unsigned int head;              /* vring descriptor index */
+} VirtIOBlockRequest;
+
+typedef struct {
     VirtIODevice vdev;
     BlockDriverState *bs;
     VirtQueue *vq;
@@ -44,11 +49,12 @@ typedef struct VirtIOBlock
 
     Vring vring;                    /* virtqueue vring */
 
-    IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
-
     EventPoll event_poll;           /* event poller */
     EventHandler io_handler;        /* Linux AIO completion handler */
     EventHandler notify_handler;    /* virtqueue notify handler */
+
+    IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
+    VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the queue */
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -56,12 +62,40 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
-static void handle_io(EventHandler *handler)
+static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
 {
-    fprintf(stderr, "io completion happened\n");
+    VirtIOBlock *s = opaque;
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    int len;
+
+    if (likely(ret >= 0)) {
+        *req->status = VIRTIO_BLK_S_OK;
+        len = ret;
+    } else {
+        *req->status = VIRTIO_BLK_S_IOERR;
+        len = 0;
+    }
+
+    /* According to the virtio specification len should be the number of bytes
+     * written, but for virtio-blk it seems to be the number of bytes
+     * transferred plus the status byte.
+     */
+    vring_push(&s->vring, req->head, len + sizeof *req->status);
 }
 
-static void process_request(struct iovec iov[], unsigned int out_num, unsigned int in_num)
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        /* TODO is this thread-safe and can it be done faster? */
+        virtio_irq(s->vq);
+    }
+
+    return true;
+}
+
+static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
 {
     /* Virtio block requests look like this: */
     struct virtio_blk_outhdr *outhdr; /* iov[0] */
@@ -78,11 +112,54 @@ static void process_request(struct iovec iov[], unsigned int out_num, unsigned i
     outhdr = iov[0].iov_base;
     inhdr = iov[out_num + in_num - 1].iov_base;
 
+    /*
     fprintf(stderr, "virtio-blk request type=%#x sector=%#lx\n",
             outhdr->type, outhdr->sector);
+    */
+
+    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
+        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
+        exit(1);
+    }
+
+    struct iocb *iocb;
+    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    case VIRTIO_BLK_T_IN:
+        if (unlikely(out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid read request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, true, &iov[1], in_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
+        break;
+
+    case VIRTIO_BLK_T_OUT:
+        if (unlikely(in_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid write request\n");
+            exit(1);
+        }
+        iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
+        break;
+
+    case VIRTIO_BLK_T_FLUSH:
+        if (unlikely(in_num != 1 || out_num != 1)) {
+            fprintf(stderr, "virtio-blk invalid flush request\n");
+            exit(1);
+        }
+        iocb = ioq_fdsync(ioq);
+        break;
+
+    default:
+        fprintf(stderr, "virtio-blk multiple request type bits set\n");
+        exit(1);
+    }
+
+    /* Fill in virtio block metadata needed for completion */
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+    req->head = head;
+    req->status = &inhdr->status;
 }
 
-static void handle_notify(EventHandler *handler)
+static bool handle_notify(EventHandler *handler)
 {
     VirtIOBlock *s = container_of(handler, VirtIOBlock, notify_handler);
 
@@ -114,19 +191,29 @@ static void handle_notify(EventHandler *handler)
             break; /* no more requests */
         }
 
-        fprintf(stderr, "head=%u out_num=%u in_num=%u\n", head, out_num, in_num);
+        /*
+        fprintf(stderr, "out_num=%u in_num=%u head=%u\n", out_num, in_num, head);
+        */
 
-        process_request(iov, out_num, in_num);
+        process_request(&s->ioqueue, iov, out_num, in_num, head);
     }
+
+    /* Submit requests, if any */
+    if (likely(iov != iovec)) {
+        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
+            fprintf(stderr, "ioq_submit failed\n");
+            exit(1);
+        }
+    }
+
+    return true;
 }
 
 static void *data_plane_thread(void *opaque)
 {
     VirtIOBlock *s = opaque;
 
-    for (;;) {
-        event_poll(&s->event_poll);
-    }
+    event_poll_run(&s->event_poll);
     return NULL;
 }
 
@@ -140,10 +227,13 @@ static int get_raw_posix_fd_hack(VirtIOBlock *s)
 
 static void data_plane_start(VirtIOBlock *s)
 {
+    int i;
+
     vring_setup(&s->vring, &s->vdev, 0);
 
     event_poll_init(&s->event_poll);
 
+    /* Set up virtqueue notify */
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
         exit(1);
@@ -152,8 +242,11 @@ static void data_plane_start(VirtIOBlock *s)
                    virtio_queue_get_host_notifier(s->vq),
                    handle_notify);
 
+    /* Set up ioqueue */
     ioq_init(&s->ioqueue, get_raw_posix_fd_hack(s), REQ_MAX);
-    /* TODO populate ioqueue freelist */
+    for (i = 0; i < ARRAY_SIZE(s->requests); i++) {
+        ioq_put_iocb(&s->ioqueue, &s->requests[i].iocb);
+    }
     event_poll_add(&s->event_poll, &s->io_handler, ioq_get_notifier(&s->ioqueue), handle_io);
 
     qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
@@ -165,7 +258,9 @@ static void data_plane_stop(VirtIOBlock *s)
 {
     s->data_plane_started = false;
 
-    /* TODO stop data plane thread */
+    /* Tell data plane thread to stop and then wait for it to return */
+    event_poll_stop(&s->event_poll);
+    pthread_join(s->data_plane_thread.thread, NULL);
 
     ioq_cleanup(&s->ioqueue);
 
@@ -183,6 +278,10 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
         return;
     }
 
+    /*
+    fprintf(stderr, "virtio_blk_set_status %#x\n", val);
+    */
+
     if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
         data_plane_start(s);
     } else {
@@ -190,11 +289,29 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
     }
 }
 
+static void virtio_blk_reset(VirtIODevice *vdev)
+{
+    virtio_blk_set_status(vdev, 0);
+}
+
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
-    fprintf(stderr, "virtio_blk_handle_output: should never get here, "
-                    "data plane thread should process requests\n");
-    exit(1);
+    VirtIOBlock *s = to_virtio_blk(vdev);
+
+    if (s->data_plane_started) {
+        fprintf(stderr, "virtio_blk_handle_output: should never get here, "
+                        "data plane thread should process requests\n");
+        exit(1);
+    }
+
+    /* Linux seems to notify before the driver comes up.  This needs more
+     * investigation.  Just use a hack for now.
+     */
+    virtio_blk_set_status(vdev, VIRTIO_CONFIG_S_DRIVER_OK); /* start the thread */
+
+    /* Now kick the thread */
+    uint64_t dummy = 1;
+    ssize_t unused __attribute__((unused)) = write(event_notifier_get_fd(virtio_queue_get_host_notifier(s->vq)), &dummy, sizeof dummy);
 }
 
 /* coalesce internal state, copy to pci i/o region 0
@@ -273,6 +390,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
     s->vdev.set_status = virtio_blk_set_status;
+    s->vdev.reset = virtio_blk_reset;
     s->bs = conf->bs;
     s->conf = conf;
     s->serial = *serial;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [RFC v9 11/27] virtio-blk: Indirect vring and flush support
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

RHEL6 and other new guest kernels use indirect vring descriptors to
increase the number of requests that can be batched.  This fundamentally
changes vring from a scheme that requires fixed resources to something
more dynamic (although there is still an absolute maximum number of
descriptors).  Cope with indirect vrings by taking on as many requests
as we can in one go and then postponing the remaining requests until the
first batch completes.

It would be possible to switch to dynamic resource management so iovec
and iocb structs are malloced.  This would allow the entire ring to be
processed even with indirect descriptors, but would probably hit a
bottleneck when io_submit refuses to queue more requests.  Therefore,
stick with the simpler scheme for now.

Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on
all files.  In particular, an O_DIRECT opened file on ext4 does not
support Linux AIO fdsync.  Work around this by performing fdatasync()
synchronously for now.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/ioq.h   |   18 ++++-----
 hw/dataplane/vring.h |  103 +++++++++++++++++++++++++++++++++++++++++++-------
 hw/virtio-blk.c      |   75 ++++++++++++++++++++++--------------
 3 files changed, 144 insertions(+), 52 deletions(-)

diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
index 7200e87..d1545d6 100644
--- a/hw/dataplane/ioq.h
+++ b/hw/dataplane/ioq.h
@@ -3,7 +3,7 @@
 
 typedef struct {
     int fd;                         /* file descriptor */
-    unsigned int max_reqs;           /* max length of freelist and queue */
+    unsigned int max_reqs;          /* max length of freelist and queue */
 
     io_context_t io_ctx;            /* Linux AIO context */
     EventNotifier io_notifier;      /* Linux AIO eventfd */
@@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigne
     return iocb;
 }
 
-static struct iocb *ioq_fdsync(IOQueue *ioq)
-{
-    struct iocb *iocb = ioq_get_iocb(ioq);
-
-    io_prep_fdsync(iocb, ioq->fd);
-    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
-    return iocb;
-}
-
 static int ioq_submit(IOQueue *ioq)
 {
     int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
+    if (unlikely(rc < 0)) {
+        unsigned int i;
+        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
+        for (i = 0; i < ioq->queue_idx; i++) {
+            fprintf(stderr, "[%u] type=%#x fd=%d\n", i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
+        }
+    }
     ioq->queue_idx = 0; /* reset */
     return rc;
 }
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 70675e5..3eab4b4 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
             vring->vr.desc, vring->vr.avail, vring->vr.used);
 }
 
+static bool vring_more_avail(Vring *vring)
+{
+	return vring->vr.avail->idx != vring->last_avail_idx;
+}
+
+/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
+static bool get_indirect(Vring *vring,
+			struct iovec iov[], struct iovec *iov_end,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vring_desc *indirect)
+{
+	struct vring_desc desc;
+	unsigned int i = 0, count, found = 0;
+
+	/* Sanity check */
+	if (unlikely(indirect->len % sizeof desc)) {
+		fprintf(stderr, "Invalid length in indirect descriptor: "
+		       "len 0x%llx not multiple of 0x%zx\n",
+		       (unsigned long long)indirect->len,
+		       sizeof desc);
+		exit(1);
+	}
+
+	count = indirect->len / sizeof desc;
+	/* Buffers are chained via a 16 bit next field, so
+	 * we can have at most 2^16 of these. */
+	if (unlikely(count > USHRT_MAX + 1)) {
+		fprintf(stderr, "Indirect buffer length too big: %d\n",
+		       indirect->len);
+        exit(1);
+	}
+
+    /* Point to translate indirect desc chain */
+    indirect = phys_to_host(vring, indirect->addr);
+
+	/* We will use the result as an address to read from, so most
+	 * architectures only need a compiler barrier here. */
+	__sync_synchronize(); /* read_barrier_depends(); */
+
+	do {
+		if (unlikely(++found > count)) {
+			fprintf(stderr, "Loop detected: last one at %u "
+			       "indirect size %u\n",
+			       i, count);
+			exit(1);
+		}
+
+        desc = *indirect++;
+		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+			fprintf(stderr, "Nested indirect descriptor\n");
+            exit(1);
+		}
+
+        /* Stop for now if there are not enough iovecs available. */
+        if (iov >= iov_end) {
+            return false;
+        }
+
+        iov->iov_base = phys_to_host(vring, desc.addr);
+        iov->iov_len  = desc.len;
+        iov++;
+
+		/* If this is an input descriptor, increment that count. */
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			*in_num += 1;
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (unlikely(*in_num)) {
+				fprintf(stderr, "Indirect descriptor "
+				       "has out after in: idx %d\n", i);
+                exit(1);
+			}
+			*out_num += 1;
+		}
+        i = desc.next;
+	} while (desc.flags & VRING_DESC_F_NEXT);
+    return true;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
@@ -129,23 +209,20 @@ static unsigned int vring_pop(Vring *vring,
 		}
         desc = vring->vr.desc[i];
 		if (desc.flags & VRING_DESC_F_INDIRECT) {
-/*			ret = get_indirect(dev, vq, iov, iov_size,
-					   out_num, in_num,
-					   log, log_num, &desc);
-			if (unlikely(ret < 0)) {
-				vq_err(vq, "Failure detected "
-				       "in indirect descriptor at idx %d\n", i);
-				return ret;
-			}
-			continue; */
-            fprintf(stderr, "Indirect vring not supported\n");
-            exit(1);
+			if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
+                return num; /* not enough iovecs, stop for now */
+            }
+            continue;
 		}
 
+        /* If there are not enough iovecs left, stop for now.  The caller
+         * should check if there are more descs available once they have dealt
+         * with the current set.
+         */
         if (iov >= iov_end) {
-            fprintf(stderr, "Not enough vring iovecs\n");
-            exit(1);
+            return num;
         }
+
         iov->iov_base = phys_to_host(vring, desc.addr);
         iov->iov_len  = desc.len;
         iov++;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 52ea601..591eace 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -62,6 +62,14 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
+/* Normally the block driver passes down the fd, there's no way to get it from
+ * above.
+ */
+static int get_raw_posix_fd_hack(VirtIOBlock *s)
+{
+    return *(int*)s->bs->file->opaque;
+}
+
 static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
 {
     VirtIOBlock *s = opaque;
@@ -83,18 +91,6 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
     vring_push(&s->vring, req->head, len + sizeof *req->status);
 }
 
-static bool handle_io(EventHandler *handler)
-{
-    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
-
-    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
-        /* TODO is this thread-safe and can it be done faster? */
-        virtio_irq(s->vq);
-    }
-
-    return true;
-}
-
 static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
 {
     /* Virtio block requests look like this: */
@@ -117,13 +113,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
             outhdr->type, outhdr->sector);
     */
 
-    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
+    /* TODO Linux sets the barrier bit even when not advertised! */
+    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
+
+    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
         fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
         exit(1);
     }
 
     struct iocb *iocb;
-    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
     case VIRTIO_BLK_T_IN:
         if (unlikely(out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid read request\n");
@@ -145,8 +144,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
             fprintf(stderr, "virtio-blk invalid flush request\n");
             exit(1);
         }
-        iocb = ioq_fdsync(ioq);
-        break;
+
+        /* TODO fdsync is not supported by all backends, do it synchronously here! */
+        {
+            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
+            fdatasync(get_raw_posix_fd_hack(s));
+            inhdr->status = VIRTIO_BLK_S_OK;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            virtio_irq(s->vq);
+        }
+        return;
 
     default:
         fprintf(stderr, "virtio-blk multiple request type bits set\n");
@@ -199,11 +206,29 @@ static bool handle_notify(EventHandler *handler)
     }
 
     /* Submit requests, if any */
-    if (likely(iov != iovec)) {
-        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
-            fprintf(stderr, "ioq_submit failed\n");
-            exit(1);
-        }
+    int rc = ioq_submit(&s->ioqueue);
+    if (unlikely(rc < 0)) {
+        fprintf(stderr, "ioq_submit failed %d\n", rc);
+        exit(1);
+    }
+    return true;
+}
+
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        /* TODO is this thread-safe and can it be done faster? */
+        virtio_irq(s->vq);
+    }
+
+    /* If there were more requests than iovecs, the vring will not be empty yet
+     * so check again.  There should now be enough resources to process more
+     * requests.
+     */
+    if (vring_more_avail(&s->vring)) {
+        return handle_notify(&s->notify_handler);
     }
 
     return true;
@@ -217,14 +242,6 @@ static void *data_plane_thread(void *opaque)
     return NULL;
 }
 
-/* Normally the block driver passes down the fd, there's no way to get it from
- * above.
- */
-static int get_raw_posix_fd_hack(VirtIOBlock *s)
-{
-    return *(int*)s->bs->file->opaque;
-}
-
 static void data_plane_start(VirtIOBlock *s)
 {
     int i;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [Qemu-devel] [RFC v9 11/27] virtio-blk: Indirect vring and flush support
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  0 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, Khoa Huynh, Paolo Bonzini, Asias He

+                return num; /* not enough iovecs, stop for now */
+            }
+            continue;
 		}
 
+        /* If there are not enough iovecs left, stop for now.  The caller
+         * should check if there are more descs available once they have dealt
+         * with the current set.
+         */
         if (iov >= iov_end) {
-            fprintf(stderr, "Not enough vring iovecs\n");
-            exit(1);
+            return num;
         }
+
         iov->iov_base = phys_to_host(vring, desc.addr);
         iov->iov_len  = desc.len;
         iov++;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 52ea601..591eace 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -62,6 +62,14 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
     return (VirtIOBlock *)vdev;
 }
 
+/* Normally the block driver passes down the fd, there's no way to get it from
+ * above.
+ */
+static int get_raw_posix_fd_hack(VirtIOBlock *s)
+{
+    return *(int*)s->bs->file->opaque;
+}
+
 static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
 {
     VirtIOBlock *s = opaque;
@@ -83,18 +91,6 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
     vring_push(&s->vring, req->head, len + sizeof req->status);
 }
 
-static bool handle_io(EventHandler *handler)
-{
-    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
-
-    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
-        /* TODO is this thread-safe and can it be done faster? */
-        virtio_irq(s->vq);
-    }
-
-    return true;
-}
-
 static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
 {
     /* Virtio block requests look like this: */
@@ -117,13 +113,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
             outhdr->type, outhdr->sector);
     */
 
-    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
+    /* TODO Linux sets the barrier bit even when not advertised! */
+    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
+
+    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
         fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
         exit(1);
     }
 
     struct iocb *iocb;
-    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
     case VIRTIO_BLK_T_IN:
         if (unlikely(out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid read request\n");
@@ -145,8 +144,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
             fprintf(stderr, "virtio-blk invalid flush request\n");
             exit(1);
         }
-        iocb = ioq_fdsync(ioq);
-        break;
+
+        /* TODO fdsync is not supported by all backends, do it synchronously here! */
+        {
+            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
+            fdatasync(get_raw_posix_fd_hack(s));
+            inhdr->status = VIRTIO_BLK_S_OK;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            virtio_irq(s->vq);
+        }
+        return;
 
     default:
         fprintf(stderr, "virtio-blk multiple request type bits set\n");
@@ -199,11 +206,29 @@ static bool handle_notify(EventHandler *handler)
     }
 
     /* Submit requests, if any */
-    if (likely(iov != iovec)) {
-        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
-            fprintf(stderr, "ioq_submit failed\n");
-            exit(1);
-        }
+    int rc = ioq_submit(&s->ioqueue);
+    if (unlikely(rc < 0)) {
+        fprintf(stderr, "ioq_submit failed %d\n", rc);
+        exit(1);
+    }
+    return true;
+}
+
+static bool handle_io(EventHandler *handler)
+{
+    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
+
+    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
+        /* TODO is this thread-safe and can it be done faster? */
+        virtio_irq(s->vq);
+    }
+
+    /* If there were more requests than iovecs, the vring will not be empty yet
+     * so check again.  There should now be enough resources to process more
+     * requests.
+     */
+    if (vring_more_avail(&s->vring)) {
+        return handle_notify(&s->notify_handler);
     }
 
     return true;
@@ -217,14 +242,6 @@ static void *data_plane_thread(void *opaque)
     return NULL;
 }
 
-/* Normally the block driver passes down the fd, there's no way to get it from
- * above.
- */
-static int get_raw_posix_fd_hack(VirtIOBlock *s)
-{
-    return *(int*)s->bs->file->opaque;
-}
-
 static void data_plane_start(VirtIOBlock *s)
 {
     int i;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [RFC v9 12/27] virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/vring.h |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 3eab4b4..44ef4a9 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -1,6 +1,11 @@
 #ifndef VRING_H
 #define VRING_H
 
+/* Some virtio_ring.h files use BUG_ON() */
+#ifndef BUG_ON
+#define BUG_ON(x)
+#endif
+
 #include <linux/virtio_ring.h>
 #include "qemu-common.h"
 
-- 
1.7.10.4




* [RFC v9 13/27] virtio-blk: Increase max requests for indirect vring
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

With indirect vring descriptors, one can no longer assume that the
maximum number of requests is VRING_MAX / 2 (outhdr and inhdr).  Now a
single indirect descriptor can contain the outhdr and inhdr so max
requests becomes VRING_MAX.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 591eace..7ae3c56 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -26,7 +26,9 @@
 enum {
     SEG_MAX = 126,                  /* maximum number of I/O segments */
     VRING_MAX = SEG_MAX + 2,        /* maximum number of vring descriptors */
-    REQ_MAX = VRING_MAX / 2,        /* maximum number of requests in the vring */
+    REQ_MAX = VRING_MAX,            /* maximum number of requests in the vring,
+                                     * is VRING_MAX / 2 with traditional and
+                                     * VRING_MAX with indirect descriptors */
 };
 
 typedef struct {
-- 
1.7.10.4




* [RFC v9 14/27] virtio-blk: Use pthreads instead of qemu-thread
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Using qemu-thread.h seemed like a nice idea but it has two limitations:

1. QEMU needs to be built with --enable-io-thread
2. qemu-kvm doesn't build with --enable-io-thread

For now just copy the pthread_create() code straight into virtio-blk.c.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 7ae3c56..1616be5 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -11,6 +11,7 @@
  *
  */
 
+#include <pthread.h>
 #include <libaio.h>
 #include "qemu-common.h"
 #include "block_int.h"
@@ -47,7 +48,7 @@ typedef struct {
     DeviceState *qdev;
 
     bool data_plane_started;
-    QemuThread data_plane_thread;
+    pthread_t data_plane_thread;
 
     Vring vring;                    /* virtqueue vring */
 
@@ -268,7 +269,16 @@ static void data_plane_start(VirtIOBlock *s)
     }
     event_poll_add(&s->event_poll, &s->io_handler, ioq_get_notifier(&s->ioqueue), handle_io);
 
-    qemu_thread_create(&s->data_plane_thread, data_plane_thread, s, QEMU_THREAD_JOINABLE);
+    /* Create data plane thread */
+    sigset_t set, oldset;
+    sigfillset(&set);
+    pthread_sigmask(SIG_SETMASK, &set, &oldset);
+    if (pthread_create(&s->data_plane_thread, NULL, data_plane_thread, s) != 0)
+    {
+        fprintf(stderr, "pthread create failed: %m\n");
+        exit(1);
+    }
+    pthread_sigmask(SIG_SETMASK, &oldset, NULL);
 
     s->data_plane_started = true;
 }
@@ -279,7 +289,7 @@ static void data_plane_stop(VirtIOBlock *s)
 
     /* Tell data plane thread to stop and then wait for it to return */
     event_poll_stop(&s->event_poll);
-    pthread_join(s->data_plane_thread.thread, NULL);
+    pthread_join(s->data_plane_thread, NULL);
 
     ioq_cleanup(&s->ioqueue);
 
-- 
1.7.10.4




* [RFC v9 15/27] notifier: Add a function to set the notifier
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Although past users only needed to test and clear event notifiers, it is
useful to be able to set them too.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 event_notifier.c |    7 +++++++
 event_notifier.h |    1 +
 2 files changed, 8 insertions(+)

diff --git a/event_notifier.c b/event_notifier.c
index 0b82981..006adc5 100644
--- a/event_notifier.c
+++ b/event_notifier.c
@@ -59,3 +59,10 @@ int event_notifier_test(EventNotifier *e)
     }
     return r == sizeof(value);
 }
+
+int event_notifier_set(EventNotifier *e)
+{
+    uint64_t value = 1;
+    int r = write(e->fd, &value, sizeof(value));
+    return r == sizeof(value);
+}
diff --git a/event_notifier.h b/event_notifier.h
index 886222c..46a22f8 100644
--- a/event_notifier.h
+++ b/event_notifier.h
@@ -24,5 +24,6 @@ void event_notifier_cleanup(EventNotifier *);
 int event_notifier_get_fd(EventNotifier *);
 int event_notifier_test_and_clear(EventNotifier *);
 int event_notifier_test(EventNotifier *);
+int event_notifier_set(EventNotifier *);
 
 #endif
-- 
1.7.10.4




* [RFC v9 16/27] virtio-blk: Kick data plane thread using event notifier set
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

---
 hw/virtio-blk.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 1616be5..d75c187 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -339,8 +339,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     virtio_blk_set_status(vdev, VIRTIO_CONFIG_S_DRIVER_OK); /* start the thread */
 
     /* Now kick the thread */
-    uint64_t dummy = 1;
-    ssize_t unused __attribute__((unused)) = write(event_notifier_get_fd(virtio_queue_get_host_notifier(s->vq)), &dummy, sizeof dummy);
+    event_notifier_set(virtio_queue_get_host_notifier(s->vq));
 }
 
 /* coalesce internal state, copy to pci i/o region 0
-- 
1.7.10.4




* [RFC v9 17/27] virtio-blk: Use guest notifier to raise interrupts
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

The data plane thread isn't allowed to call virtio_irq() directly
because that function is not thread-safe.  Use the guest notifier just
like virtio-net to handle IRQs.

When MSI-X is in use and the vector is unmasked, the guest notifier
directly sets the IRQ inside the host kernel.  If the vector is masked,
then QEMU's iothread needs to take note of the IRQ.  If MSI-X is not in
use, then QEMU's iothread handles the IRQ and this will be slower than
synchronously calling virtio_irq() from the data plane thread.
---
 hw/virtio-blk.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index d75c187..bdff68a 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -73,6 +73,18 @@ static int get_raw_posix_fd_hack(VirtIOBlock *s)
     return *(int*)s->bs->file->opaque;
 }
 
+/* Raise an interrupt to signal guest, if necessary */
+static void virtio_blk_notify_guest(VirtIOBlock *s)
+{
+    /* Always notify when queue is empty (when feature acknowledge) */
+	if ((s->vring.vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) &&
+	    (s->vring.vr.avail->idx != s->vring.last_avail_idx ||
+        !(s->vdev.guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY))))
+		return;
+
+    event_notifier_set(virtio_queue_get_guest_notifier(s->vq));
+}
+
 static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
 {
     VirtIOBlock *s = opaque;
@@ -154,7 +166,7 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
             fdatasync(get_raw_posix_fd_hack(s));
             inhdr->status = VIRTIO_BLK_S_OK;
             vring_push(&s->vring, head, sizeof *inhdr);
-            virtio_irq(s->vq);
+            virtio_blk_notify_guest(s);
         }
         return;
 
@@ -222,8 +234,7 @@ static bool handle_io(EventHandler *handler)
     VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
 
     if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
-        /* TODO is this thread-safe and can it be done faster? */
-        virtio_irq(s->vq);
+        virtio_blk_notify_guest(s);
     }
 
     /* If there were more requests than iovecs, the vring will not be empty yet
@@ -251,11 +262,17 @@ static void data_plane_start(VirtIOBlock *s)
 
     vring_setup(&s->vring, &s->vdev, 0);
 
+    /* Set up guest notifier (irq) */
+    if (s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, true) != 0) {
+        fprintf(stderr, "virtio-blk failed to set guest notifier, ensure -enable-kvm is set\n");
+        exit(1);
+    }
+
     event_poll_init(&s->event_poll);
 
     /* Set up virtqueue notify */
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
+        fprintf(stderr, "virtio-blk failed to set host notifier\n");
         exit(1);
     }
     event_poll_add(&s->event_poll, &s->notify_handler,
@@ -296,6 +313,9 @@ static void data_plane_stop(VirtIOBlock *s)
     s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, false);
 
     event_poll_cleanup(&s->event_poll);
+
+    /* Clean up guest notifier (irq) */
+    s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, false);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
-- 
1.7.10.4




* [RFC v9 18/27] virtio-blk: Call ioctl() directly instead of irqfd
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Optimize for the MSI-X enabled and vector unmasked case where it is
possible to issue the KVM ioctl() directly instead of using irqfd.

This patch introduces a new virtio binding function which tries to
notify in a thread-safe way.  If this is not possible, the function
returns false.  Virtio block then knows to use irqfd as a fallback.
---
 hw/msix.c       |   17 +++++++++++++++++
 hw/msix.h       |    1 +
 hw/virtio-blk.c |   10 ++++++++--
 hw/virtio-pci.c |    8 ++++++++
 hw/virtio.c     |    9 +++++++++
 hw/virtio.h     |    3 +++
 6 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 7955221..3308604 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -503,6 +503,23 @@ void msix_notify(PCIDevice *dev, unsigned vector)
     stl_le_phys(address, data);
 }
 
+bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector)
+{
+    if (unlikely(vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])) {
+        return false;
+    }
+    if (unlikely(msix_is_masked(dev, vector))) {
+        return false;
+    }
+#ifdef KVM_CAP_IRQCHIP
+    if (likely(kvm_enabled() && kvm_irqchip_in_kernel())) {
+        kvm_set_irq(dev->msix_irq_entries[vector].gsi, 1, NULL);
+        return true;
+    }
+#endif
+    return false;
+}
+
 void msix_reset(PCIDevice *dev)
 {
     if (!(dev->cap_present & QEMU_PCI_CAP_MSIX))
diff --git a/hw/msix.h b/hw/msix.h
index a8661e1..99fb08f 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -26,6 +26,7 @@ void msix_vector_unuse(PCIDevice *dev, unsigned vector);
 void msix_unuse_all_vectors(PCIDevice *dev);
 
 void msix_notify(PCIDevice *dev, unsigned vector);
+bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index bdff68a..efeffa0 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -82,6 +82,12 @@ static void virtio_blk_notify_guest(VirtIOBlock *s)
         !(s->vdev.guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY))))
 		return;
 
+    /* Try to issue the ioctl() directly for speed */
+    if (likely(virtio_queue_try_notify_from_thread(s->vq))) {
+        return;
+    }
+
+    /* If the fast path didn't work, use irqfd */
     event_notifier_set(virtio_queue_get_guest_notifier(s->vq));
 }
 
@@ -263,7 +269,7 @@ static void data_plane_start(VirtIOBlock *s)
     vring_setup(&s->vring, &s->vdev, 0);
 
     /* Set up guest notifier (irq) */
-    if (s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, true) != 0) {
+    if (s->vdev.binding->set_guest_notifiers(s->vdev.binding_opaque, true) != 0) {
         fprintf(stderr, "virtio-blk failed to set guest notifier, ensure -enable-kvm is set\n");
         exit(1);
     }
@@ -315,7 +321,7 @@ static void data_plane_stop(VirtIOBlock *s)
     event_poll_cleanup(&s->event_poll);
 
     /* Clean up guest notifier (irq) */
-    s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, false);
+    s->vdev.binding->set_guest_notifiers(s->vdev.binding_opaque, false);
 }
 
 static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index f1e13af..03512b3 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -106,6 +106,13 @@ static void virtio_pci_notify(void *opaque, uint16_t vector)
         qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
+static bool virtio_pci_try_notify_from_thread(void *opaque, uint16_t vector)
+{
+    VirtIOPCIProxy *proxy = opaque;
+    return msix_enabled(&proxy->pci_dev) &&
+           msix_try_notify_from_thread(&proxy->pci_dev, vector);
+}
+
 static void virtio_pci_save_config(void * opaque, QEMUFile *f)
 {
     VirtIOPCIProxy *proxy = opaque;
@@ -707,6 +714,7 @@ static void virtio_pci_vmstate_change(void *opaque, bool running)
 
 static const VirtIOBindings virtio_pci_bindings = {
     .notify = virtio_pci_notify,
+    .try_notify_from_thread = virtio_pci_try_notify_from_thread,
     .save_config = virtio_pci_save_config,
     .load_config = virtio_pci_load_config,
     .save_queue = virtio_pci_save_queue,
diff --git a/hw/virtio.c b/hw/virtio.c
index 064aecf..a1d1a8a 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -689,6 +689,15 @@ static inline int vring_need_event(uint16_t event, uint16_t new, uint16_t old)
 	return (uint16_t)(new - event - 1) < (uint16_t)(new - old);
 }
 
+bool virtio_queue_try_notify_from_thread(VirtQueue *vq)
+{
+    VirtIODevice *vdev = vq->vdev;
+    if (likely(vdev->binding->try_notify_from_thread)) {
+        return vdev->binding->try_notify_from_thread(vdev->binding_opaque, vq->vector);
+    }
+    return false;
+}
+
 static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
diff --git a/hw/virtio.h b/hw/virtio.h
index 400c092..2cdf2be 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -93,6 +93,7 @@ typedef struct VirtQueueElement
 
 typedef struct {
     void (*notify)(void * opaque, uint16_t vector);
+    bool (*try_notify_from_thread)(void * opaque, uint16_t vector);
     void (*save_config)(void * opaque, QEMUFile *f);
     void (*save_queue)(void * opaque, int n, QEMUFile *f);
     int (*load_config)(void * opaque, QEMUFile *f);
@@ -160,6 +161,8 @@ void virtio_cleanup(VirtIODevice *vdev);
 
 void virtio_notify_config(VirtIODevice *vdev);
 
+bool virtio_queue_try_notify_from_thread(VirtQueue *vq);
+
 void virtio_queue_set_notification(VirtQueue *vq, int enable);
 
 int virtio_queue_ready(VirtQueue *vq);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread
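The guarded fast path this patch adds to hw/msix.c reduces to: refuse any vector that is out of range, unused, or masked, and take the direct kvm_set_irq() route only when an in-kernel irqchip is present; everything else falls back to irqfd. A self-contained sketch of that decision logic (the types and names below are illustrative stand-ins, not the QEMU APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the checks in msix_try_notify_from_thread(): anything that
 * cannot be done safely from a non-vcpu thread returns false so the
 * caller falls back to irqfd. FakeMsixDev is illustrative only. */
typedef struct {
    unsigned nr_vectors;
    bool used[4];
    bool masked[4];
    bool irqchip_in_kernel;
} FakeMsixDev;

static int ioctl_notifies;  /* counts simulated kvm_set_irq() calls */

static bool try_notify_from_thread(FakeMsixDev *dev, unsigned vector)
{
    if (vector >= dev->nr_vectors || !dev->used[vector]) {
        return false;               /* vector not configured */
    }
    if (dev->masked[vector]) {
        return false;               /* masked: must go through irqfd */
    }
    if (dev->irqchip_in_kernel) {
        ioctl_notifies++;           /* stands in for the kvm_set_irq() ioctl */
        return true;
    }
    return false;                   /* userspace irqchip: no thread-safe path */
}
```

The caller pattern mirrors virtio_blk_notify_guest(): try this first, and only touch the event notifier when it returns false.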

* [RFC v9 19/27] virtio-blk: Disable guest->host notifies while processing vring
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

---
 hw/dataplane/vring.h |   28 +++++++++++++++++++++++-----
 hw/virtio-blk.c      |   47 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 44ef4a9..cdd4d4a 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -69,11 +69,29 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
             vring->vr.desc, vring->vr.avail, vring->vr.used);
 }
 
+/* Are there more descriptors available? */
 static bool vring_more_avail(Vring *vring)
 {
 	return vring->vr.avail->idx != vring->last_avail_idx;
 }
 
+/* Hint to disable guest->host notifies */
+static void vring_disable_cb(Vring *vring)
+{
+    vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+}
+
+/* Re-enable guest->host notifies
+ *
+ * Returns false if there are more descriptors in the ring.
+ */
+static bool vring_enable_cb(Vring *vring)
+{
+    vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    __sync_synchronize(); /* mb() */
+    return !vring_more_avail(vring);
+}
+
 /* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
 static bool get_indirect(Vring *vring,
 			struct iovec iov[], struct iovec *iov_end,
@@ -160,7 +178,7 @@ static bool get_indirect(Vring *vring,
  *
  * Stolen from linux-2.6/drivers/vhost/vhost.c.
  */
-static unsigned int vring_pop(Vring *vring,
+static int vring_pop(Vring *vring,
 		      struct iovec iov[], struct iovec *iov_end,
 		      unsigned int *out_num, unsigned int *in_num)
 {
@@ -178,9 +196,9 @@ static unsigned int vring_pop(Vring *vring,
 		exit(1);
 	}
 
-	/* If there's nothing new since last we looked, return invalid. */
+	/* If there's nothing new since last we looked. */
 	if (avail_idx == last_avail_idx)
-		return num;
+		return -EAGAIN;
 
 	/* Only get avail ring entries after they have been exposed by guest. */
 	__sync_synchronize(); /* smp_rmb() */
@@ -215,7 +233,7 @@ static unsigned int vring_pop(Vring *vring,
         desc = vring->vr.desc[i];
 		if (desc.flags & VRING_DESC_F_INDIRECT) {
 			if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
-                return num; /* not enough iovecs, stop for now */
+                return -ENOBUFS; /* not enough iovecs, stop for now */
             }
             continue;
 		}
@@ -225,7 +243,7 @@ static unsigned int vring_pop(Vring *vring,
          * with the current set.
          */
         if (iov >= iov_end) {
-            return num;
+            return -ENOBUFS;
         }
 
         iov->iov_base = phys_to_host(vring, desc.addr);
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index efeffa0..f67fdb7 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -202,7 +202,8 @@ static bool handle_notify(EventHandler *handler)
      * accept more I/O.  This is not implemented yet.
      */
     struct iovec iovec[VRING_MAX];
-    struct iovec *iov, *end = &iovec[VRING_MAX];
+    struct iovec *end = &iovec[VRING_MAX];
+    struct iovec *iov = iovec;
 
     /* When a request is read from the vring, the index of the first descriptor
      * (aka head) is returned so that the completed request can be pushed onto
@@ -211,19 +212,41 @@ static bool handle_notify(EventHandler *handler)
      * The number of hypervisor read-only iovecs is out_num.  The number of
      * hypervisor write-only iovecs is in_num.
      */
-    unsigned int head, out_num = 0, in_num = 0;
+    int head;
+    unsigned int out_num = 0, in_num = 0;
 
-    for (iov = iovec; ; iov += out_num + in_num) {
-        head = vring_pop(&s->vring, iov, end, &out_num, &in_num);
-        if (head >= vring_get_num(&s->vring)) {
-            break; /* no more requests */
-        }
+    for (;;) {
+        /* Disable guest->host notifies to avoid unnecessary vmexits */
+        vring_disable_cb(&s->vring);
+
+        for (;;) {
+            head = vring_pop(&s->vring, iov, end, &out_num, &in_num);
+            if (head < 0) {
+                break; /* no more requests */
+            }
 
-        /*
-        fprintf(stderr, "out_num=%u in_num=%u head=%u\n", out_num, in_num, head);
-        */
+            /*
+            fprintf(stderr, "out_num=%u in_num=%u head=%d\n", out_num, in_num, head);
+            */
 
-        process_request(&s->ioqueue, iov, out_num, in_num, head);
+            process_request(&s->ioqueue, iov, out_num, in_num, head);
+            iov += out_num + in_num;
+        }
+
+        if (likely(head == -EAGAIN)) { /* vring emptied */
+            /* Re-enable guest->host notifies and stop processing the vring.
+             * But if the guest has snuck in more descriptors, keep processing.
+             */
+            if (likely(vring_enable_cb(&s->vring))) {
+                break;
+            }
+        } else { /* head == -ENOBUFS, cannot continue since iovecs[] is depleted */
+            /* Since there are no iovecs[] left, stop processing for now.  Do
+             * not re-enable guest->host notifies since the I/O completion
+             * handler knows to check for more vring descriptors anyway.
+             */
+            break;
+        }
     }
 
     /* Submit requests, if any */
@@ -247,7 +270,7 @@ static bool handle_io(EventHandler *handler)
      * so check again.  There should now be enough resources to process more
      * requests.
      */
-    if (vring_more_avail(&s->vring)) {
+    if (unlikely(vring_more_avail(&s->vring))) {
         return handle_notify(&s->notify_handler);
     }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread
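The handle_notify() loop in this patch follows the classic notification-suppression pattern: disable guest->host notifies, drain the ring, re-enable, then re-check for descriptors that raced in between the enable and the check. The memory barrier in vring_enable_cb() is what makes the re-check safe. A minimal single-threaded model of that control flow (the names and globals are illustrative, not the dataplane code itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the notify-suppression loop: flags, avail_idx and
 * last_avail_idx stand in for the vring's used->flags, avail->idx and
 * the device-side shadow index. */
enum { NO_NOTIFY = 1 };

static uint16_t flags;
static uint16_t avail_idx, last_avail_idx;
static int processed;

static bool more_avail(void) { return avail_idx != last_avail_idx; }

static void disable_cb(void) { flags |= NO_NOTIFY; }

static bool enable_cb(void)
{
    flags &= ~NO_NOTIFY;
    __sync_synchronize();   /* mb(): publish the flag before re-reading avail_idx */
    return !more_avail();   /* false => guest snuck in more descriptors */
}

static void drain_ring(void)
{
    for (;;) {
        disable_cb();               /* avoid needless vmexits while draining */
        while (more_avail()) {      /* vring_pop() until it reports -EAGAIN */
            last_avail_idx++;
            processed++;
        }
        if (enable_cb()) {          /* re-enable, stop unless work raced in */
            break;
        }
    }
}
```

In the real code the inner loop also bails out with -ENOBUFS when the iovec array is exhausted, in which case notifies stay disabled because the I/O completion handler re-checks the ring anyway.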

* [RFC v9 20/27] virtio-blk: Add ioscheduler to detect mergable requests
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

---
 hw/dataplane/iosched.h |   78 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio-blk.c        |    5 ++++
 2 files changed, 83 insertions(+)
 create mode 100644 hw/dataplane/iosched.h

diff --git a/hw/dataplane/iosched.h b/hw/dataplane/iosched.h
new file mode 100644
index 0000000..12ebccc
--- /dev/null
+++ b/hw/dataplane/iosched.h
@@ -0,0 +1,78 @@
+#ifndef IOSCHED_H
+#define IOSCHED_H
+
+#include "hw/dataplane/ioq.h"
+
+typedef struct {
+    unsigned long iocbs;
+    unsigned long merges;
+    unsigned long sched_calls;
+} IOSched;
+
+static int iocb_cmp(const void *a, const void *b)
+{
+    const struct iocb *iocb_a = a;
+    const struct iocb *iocb_b = b;
+
+    /*
+     * Note that we can't simply subtract req2->sector from req1->sector
+     * here as that could overflow the return value.
+     */
+    if (iocb_a->u.c.offset > iocb_b->u.c.offset) {
+        return 1;
+    } else if (iocb_a->u.c.offset < iocb_b->u.c.offset) {
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+static size_t iocb_nbytes(struct iocb *iocb)
+{
+    struct iovec *iov = iocb->u.c.buf;
+    size_t nbytes = 0;
+    size_t i;
+    for (i = 0; i < iocb->u.c.nbytes; i++) {
+        nbytes += iov->iov_len;
+        iov++;
+    }
+    return nbytes;
+}
+
+static void iosched_init(IOSched *iosched)
+{
+    memset(iosched, 0, sizeof *iosched);
+}
+
+static void iosched_print_stats(IOSched *iosched)
+{
+    fprintf(stderr, "iocbs = %lu merges = %lu sched_calls = %lu\n",
+            iosched->iocbs, iosched->merges, iosched->sched_calls);
+    memset(iosched, 0, sizeof *iosched);
+}
+
+static void iosched(IOSched *iosched, struct iocb *unsorted[], unsigned int count)
+{
+    struct iocb *sorted[count];
+    struct iocb *last;
+    unsigned int i;
+
+    if ((++iosched->sched_calls % 1000) == 0) {
+        iosched_print_stats(iosched);
+    }
+
+    memcpy(sorted, unsorted, sizeof sorted);
+    qsort(sorted, count, sizeof sorted[0], iocb_cmp);
+
+    iosched->iocbs += count;
+    last = sorted[0];
+    for (i = 1; i < count; i++) {
+        if (last->aio_lio_opcode == sorted[i]->aio_lio_opcode &&
+            last->u.c.offset + iocb_nbytes(last) == sorted[i]->u.c.offset) {
+            iosched->merges++;
+        }
+        last = sorted[i];
+    }
+}
+
+#endif /* IOSCHED_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index f67fdb7..75cb0f2 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -22,6 +22,7 @@
 #include "hw/dataplane/event-poll.h"
 #include "hw/dataplane/vring.h"
 #include "hw/dataplane/ioq.h"
+#include "hw/dataplane/iosched.h"
 #include "kvm.h"
 
 enum {
@@ -57,6 +58,7 @@ typedef struct {
     EventHandler notify_handler;    /* virtqueue notify handler */
 
     IOQueue ioqueue;                /* Linux AIO queue (should really be per dataplane thread) */
+    IOSched iosched;                /* I/O scheduler */
     VirtIOBlockRequest requests[REQ_MAX]; /* pool of requests, managed by the queue */
 } VirtIOBlock;
 
@@ -249,6 +251,8 @@ static bool handle_notify(EventHandler *handler)
         }
     }
 
+    iosched(&s->iosched, s->ioqueue.queue, s->ioqueue.queue_idx);
+
     /* Submit requests, if any */
     int rc = ioq_submit(&s->ioqueue);
     if (unlikely(rc < 0)) {
@@ -289,6 +293,7 @@ static void data_plane_start(VirtIOBlock *s)
 {
     int i;
 
+    iosched_init(&s->iosched);
     vring_setup(&s->vring, &s->vdev, 0);
 
     /* Set up guest notifier (irq) */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread
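The scheduler in this patch only counts merge opportunities: it sorts the pending iocbs by offset and looks for neighbours where one request ends exactly where the next begins, with the same opcode. A standalone sketch of that detection pass (Req is a simplified stand-in for the libaio iocb):

```c
#include <assert.h>
#include <stdlib.h>

/* Merge-detection sketch: sort by offset, then count adjacent pairs where
 * one request ends exactly where the next starts and the operation matches. */
typedef struct {
    long long offset;
    size_t len;
    int op;                 /* read or write opcode */
} Req;

static int req_cmp(const void *a, const void *b)
{
    const Req *ra = a, *rb = b;
    /* Compare instead of subtracting: the difference could overflow int. */
    return (ra->offset > rb->offset) - (ra->offset < rb->offset);
}

static unsigned int count_mergeable(Req *reqs, unsigned int n)
{
    unsigned int i, merges = 0;

    qsort(reqs, n, sizeof reqs[0], req_cmp);
    for (i = 1; i < n; i++) {
        if (reqs[i - 1].op == reqs[i].op &&
            reqs[i - 1].offset + (long long)reqs[i - 1].len == reqs[i].offset) {
            merges++;
        }
    }
    return merges;
}
```

Patch 21 then turns this counting pass into an actual merge by folding each adjacent pair into one request.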

* [RFC v9 21/27] virtio-blk: Add basic request merging
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

This commit adds an I/O scheduler that sorts requests and merges
adjacent requests of the same operation type (read/write).  The code is
ugly and not well factored, but it does merge successfully.
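
The merge step itself amounts to appending the second request's iovecs to the first when the offsets line up. A hedged sketch of one such merge with a simplified request type (illustrative only, not the actual merge_func):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Sketch of merging two adjacent same-type requests: when b starts where
 * a ends, b's iovecs are appended to a and b is dropped. MergeReq is a
 * simplified, illustrative request type with a fixed-size iovec array. */
typedef struct {
    long long offset;
    struct iovec iov[8];
    unsigned int iovcnt;
    int op;
} MergeReq;

/* Returns 1 and folds b into a on success, 0 if the pair is not mergeable. */
static int try_merge(MergeReq *a, const MergeReq *b)
{
    size_t a_len = 0;
    unsigned int i;

    for (i = 0; i < a->iovcnt; i++) {
        a_len += a->iov[i].iov_len;     /* total bytes in request a */
    }
    if (a->op != b->op ||
        a->offset + (long long)a_len != b->offset ||
        a->iovcnt + b->iovcnt > 8) {
        return 0;
    }
    memcpy(&a->iov[a->iovcnt], b->iov, b->iovcnt * sizeof b->iov[0]);
    a->iovcnt += b->iovcnt;
    return 1;
}
```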
---
 hw/dataplane/ioq.h     |    3 +-
 hw/dataplane/iosched.h |   51 +++++++++++++++++---------
 hw/dataplane/vring.h   |    4 +--
 hw/virtio-blk.c        |   93 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
index d1545d6..72e5fd6 100644
--- a/hw/dataplane/ioq.h
+++ b/hw/dataplane/ioq.h
@@ -96,7 +96,7 @@ static int ioq_submit(IOQueue *ioq)
     int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
     if (unlikely(rc < 0)) {
         unsigned int i;
-        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
+        fprintf(stderr, "io_submit failed io_ctx=%#lx nr=%d iovecs=%p rc=%d\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue, rc);
         for (i = 0; i < ioq->queue_idx; i++) {
             fprintf(stderr, "[%u] type=%#x fd=%d\n", i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
         }
@@ -121,7 +121,6 @@ static int ioq_run_completion(IOQueue *ioq, IOQueueCompletion *completion, void
         ssize_t ret = ((uint64_t)events[i].res2 << 32) | events[i].res;
 
         completion(events[i].obj, ret, opaque);
-        ioq_put_iocb(ioq, events[i].obj);
     }
     return nevents;
 }
diff --git a/hw/dataplane/iosched.h b/hw/dataplane/iosched.h
index 12ebccc..39da73c 100644
--- a/hw/dataplane/iosched.h
+++ b/hw/dataplane/iosched.h
@@ -9,6 +9,8 @@ typedef struct {
     unsigned long sched_calls;
 } IOSched;
 
+typedef void MergeFunc(struct iocb *a, struct iocb *b);
+
 static int iocb_cmp(const void *a, const void *b)
 {
     const struct iocb *iocb_a = a;
@@ -29,10 +31,10 @@ static int iocb_cmp(const void *a, const void *b)
 
 static size_t iocb_nbytes(struct iocb *iocb)
 {
-    struct iovec *iov = iocb->u.c.buf;
+    const struct iovec *iov = iocb->u.v.vec;
     size_t nbytes = 0;
     size_t i;
-    for (i = 0; i < iocb->u.c.nbytes; i++) {
+    for (i = 0; i < iocb->u.v.nr; i++) {
         nbytes += iov->iov_len;
         iov++;
     }
@@ -44,35 +46,52 @@ static void iosched_init(IOSched *iosched)
     memset(iosched, 0, sizeof *iosched);
 }
 
-static void iosched_print_stats(IOSched *iosched)
+static __attribute__((unused)) void iosched_print_stats(IOSched *iosched)
 {
     fprintf(stderr, "iocbs = %lu merges = %lu sched_calls = %lu\n",
             iosched->iocbs, iosched->merges, iosched->sched_calls);
     memset(iosched, 0, sizeof *iosched);
 }
 
-static void iosched(IOSched *iosched, struct iocb *unsorted[], unsigned int count)
+static void iosched(IOSched *iosched, struct iocb *unsorted[], unsigned int *count, MergeFunc merge_func)
 {
-    struct iocb *sorted[count];
-    struct iocb *last;
-    unsigned int i;
+    struct iocb *sorted[*count];
+    unsigned int merges = 0;
+    unsigned int i, j;
 
+    /*
     if ((++iosched->sched_calls % 1000) == 0) {
         iosched_print_stats(iosched);
     }
+    */
+
+    if (!*count) {
+        return;
+    }
 
     memcpy(sorted, unsorted, sizeof sorted);
-    qsort(sorted, count, sizeof sorted[0], iocb_cmp);
-
-    iosched->iocbs += count;
-    last = sorted[0];
-    for (i = 1; i < count; i++) {
-        if (last->aio_lio_opcode == sorted[i]->aio_lio_opcode &&
-            last->u.c.offset + iocb_nbytes(last) == sorted[i]->u.c.offset) {
-            iosched->merges++;
+    qsort(sorted, *count, sizeof sorted[0], iocb_cmp);
+
+    unsorted[0] = sorted[0];
+    j = 1;
+    for (i = 1; i < *count; i++) {
+        struct iocb *last = sorted[i - 1];
+        struct iocb *cur = sorted[i];
+
+        if (last->aio_lio_opcode == cur->aio_lio_opcode &&
+            last->u.c.offset + iocb_nbytes(last) == cur->u.c.offset) {
+            merge_func(last, cur);
+            merges++;
+
+            unsorted[j - 1] = cur;
+        } else {
+            unsorted[j++] = cur;
         }
-        last = sorted[i];
     }
+
+    iosched->merges += merges;
+    iosched->iocbs += *count;
+    *count = j;
 }
 
 #endif /* IOSCHED_H */
diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index cdd4d4a..bbf8c86 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -29,7 +29,7 @@ static inline void *phys_to_host(Vring *vring, target_phys_addr_t phys)
     if (phys >= 0x100000000) {
         phys -= 0x100000000 - 0xe0000000;
     } else if (phys >= 0xe0000000) {
-        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
+        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", (unsigned long)phys);
         exit(1);
     }
     return vring->phys_mem_zero_host_ptr + phys;
@@ -65,7 +65,7 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
     vring->last_used_idx = 0;
 
     fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
-            virtio_queue_get_ring_addr(vdev, n),
+            (unsigned long)virtio_queue_get_ring_addr(vdev, n),
             vring->vr.desc, vring->vr.avail, vring->vr.used);
 }
 
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 75cb0f2..9131a7a 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -33,11 +33,29 @@ enum {
                                      * VRING_MAX with indirect descriptors */
 };
 
-typedef struct {
+/** I/O request
+ *
+ * Most I/O requests need to know the vring index (head) and the completion
+ * status byte that must be filled in to tell the guest whether or not the
+ * request succeeded.
+ *
+ * The iovec array pointed to by the iocb is valid only before ioq_submit() is
+ * called.  After that, neither the kernel nor userspace needs to access the
+ * iovec anymore and the memory is no longer owned by this VirtIOBlockRequest.
+ *
+ * Requests can be merged together by the I/O scheduler.  When this happens,
+ * the next_merged field is used to link the requests and only the first
+ * request's iocb is used.  Merged requests require memory allocation for the
+ * iovec array and must be freed appropriately.
+ */
+typedef struct VirtIOBlockRequest VirtIOBlockRequest;
+struct VirtIOBlockRequest {
     struct iocb iocb;               /* Linux AIO control block */
     unsigned char *status;          /* virtio block status code */
     unsigned int head;              /* vring descriptor index */
-} VirtIOBlockRequest;
+    int len;                        /* number of I/O bytes, only used for merged reqs */
+    VirtIOBlockRequest *next_merged;/* next merged iocb or NULL */
+};
 
 typedef struct {
     VirtIODevice vdev;
@@ -93,15 +111,17 @@ static void virtio_blk_notify_guest(VirtIOBlock *s)
     event_notifier_set(virtio_queue_get_guest_notifier(s->vq));
 }
 
-static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
+static void complete_one_request(VirtIOBlockRequest *req, VirtIOBlock *s, ssize_t ret)
 {
-    VirtIOBlock *s = opaque;
-    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
     int len;
 
     if (likely(ret >= 0)) {
         *req->status = VIRTIO_BLK_S_OK;
-        len = ret;
+
+        /* Merged requests know their part of the length, single requests can
+         * just use the return value.
+         */
+        len = unlikely(req->len) ? req->len : ret;
     } else {
         *req->status = VIRTIO_BLK_S_IOERR;
         len = 0;
@@ -114,6 +134,59 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
     vring_push(&s->vring, req->head, len + sizeof req->status);
 }
 
+static bool is_request_merged(VirtIOBlockRequest *req)
+{
+    return req->next_merged;
+}
+
+static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
+{
+    VirtIOBlock *s = opaque;
+    VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
+
+    /* Free the iovec now, it isn't needed */
+    if (unlikely(is_request_merged(req))) {
+        g_free((void*)iocb->u.v.vec);
+    }
+
+    while (req) {
+        complete_one_request(req, s, ret);
+
+        VirtIOBlockRequest *next = req->next_merged;
+        ioq_put_iocb(&s->ioqueue, &req->iocb);
+        req = next;
+    }
+}
+
+static void merge_request(struct iocb *iocb_a, struct iocb *iocb_b)
+{
+    /* Repeated merging could be made more efficient using realloc, but this
+     * approach keeps it simple. */
+
+    VirtIOBlockRequest *req_a = container_of(iocb_a, VirtIOBlockRequest, iocb);
+    VirtIOBlockRequest *req_b = container_of(iocb_b, VirtIOBlockRequest, iocb);
+    struct iovec *iovec = g_malloc((iocb_a->u.v.nr + iocb_b->u.v.nr) * sizeof iovec[0]);
+
+    memcpy(iovec, iocb_a->u.v.vec, iocb_a->u.v.nr * sizeof iovec[0]);
+    memcpy(iovec + iocb_a->u.v.nr, iocb_b->u.v.vec, iocb_b->u.v.nr * sizeof iovec[0]);
+
+    if (is_request_merged(req_a)) {
+        /* Free the old merged iovec */
+        g_free((void*)iocb_a->u.v.vec);
+    } else {
+        /* Stash the request length */
+        req_a->len = iocb_nbytes(iocb_a);
+    }
+
+    iocb_b->u.v.vec = iovec;
+    req_b->len = iocb_nbytes(iocb_b);
+    req_b->next_merged = req_a;
+    /*
+    fprintf(stderr, "merged %p (%u) and %p (%u), %u iovecs in total\n",
+            req_a, iocb_a->u.v.nr, req_b, iocb_b->u.v.nr, iocb_a->u.v.nr + iocb_b->u.v.nr);
+    */
+}
+
 static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
 {
     /* Virtio block requests look like this: */
@@ -187,6 +260,8 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
     VirtIOBlockRequest *req = container_of(iocb, VirtIOBlockRequest, iocb);
     req->head = head;
     req->status = &inhdr->status;
+    req->len = 0;
+    req->next_merged = NULL;
 }
 
 static bool handle_notify(EventHandler *handler)
@@ -251,7 +326,7 @@ static bool handle_notify(EventHandler *handler)
         }
     }
 
-    iosched(&s->iosched, s->ioqueue.queue, s->ioqueue.queue_idx);
+    iosched(&s->iosched, s->ioqueue.queue, &s->ioqueue.queue_idx, merge_request);
 
     /* Submit requests, if any */
     int rc = ioq_submit(&s->ioqueue);
@@ -298,7 +373,7 @@ static void data_plane_start(VirtIOBlock *s)
 
     /* Set up guest notifier (irq) */
     if (s->vdev.binding->set_guest_notifiers(s->vdev.binding_opaque, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set guest notifier, ensure -enable-kvm is set\n");
+        fprintf(stderr, "virtio-blk failed to set guest notifier\n");
         exit(1);
     }
 
@@ -306,7 +381,7 @@ static void data_plane_start(VirtIOBlock *s)
 
     /* Set up virtqueue notify */
     if (s->vdev.binding->set_host_notifier(s->vdev.binding_opaque, 0, true) != 0) {
-        fprintf(stderr, "virtio-blk failed to set host notifier\n");
+        fprintf(stderr, "virtio-blk failed to set host notifier, ensure -enable-kvm is set\n");
         exit(1);
     }
     event_poll_add(&s->event_poll, &s->notify_handler,
-- 
1.7.10.4



* [RFC v9 22/27] virtio-blk: Fix request merging
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Khoa Huynh <khoa@us.ibm.com> discovered that request merging is broken.
The merged iocb's iovec count (u.v.nr) is never updated, and its offset
still points into the middle of the merged range instead of its start.

This patch fixes request merging.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 9131a7a..51807b5 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -178,13 +178,17 @@ static void merge_request(struct iocb *iocb_a, struct iocb *iocb_b)
         req_a->len = iocb_nbytes(iocb_a);
     }
 
-    iocb_b->u.v.vec = iovec;
-    req_b->len = iocb_nbytes(iocb_b);
-    req_b->next_merged = req_a;
     /*
     fprintf(stderr, "merged %p (%u) and %p (%u), %u iovecs in total\n",
             req_a, iocb_a->u.v.nr, req_b, iocb_b->u.v.nr, iocb_a->u.v.nr + iocb_b->u.v.nr);
     */
+
+    iocb_b->u.v.vec = iovec;
+    iocb_b->u.v.nr += iocb_a->u.v.nr;
+    iocb_b->u.v.offset = iocb_a->u.v.offset;
+
+    req_b->len = iocb_nbytes(iocb_b);
+    req_b->next_merged = req_a;
 }
 
 static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
-- 
1.7.10.4



* [RFC v9 23/27] virtio-blk: Stub out SCSI commands
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |   25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 51807b5..8734029 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -215,14 +215,8 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
 
     /* TODO Linux sets the barrier bit even when not advertised! */
     uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
-
-    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
-        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
-        exit(1);
-    }
-
     struct iocb *iocb;
-    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD | VIRTIO_BLK_T_FLUSH)) {
     case VIRTIO_BLK_T_IN:
         if (unlikely(out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid read request\n");
@@ -239,6 +233,21 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
         iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
         break;
 
+    case VIRTIO_BLK_T_SCSI_CMD:
+        if (unlikely(in_num == 0)) {
+            fprintf(stderr, "virtio-blk invalid SCSI command request\n");
+            exit(1);
+        }
+
+        /* TODO support SCSI commands */
+        {
+            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
+            inhdr->status = VIRTIO_BLK_S_UNSUPP;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            virtio_blk_notify_guest(s);
+        }
+        return;
+
     case VIRTIO_BLK_T_FLUSH:
         if (unlikely(in_num != 1 || out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid flush request\n");
@@ -256,7 +265,7 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
         return;
 
     default:
-        fprintf(stderr, "virtio-blk multiple request type bits set\n");
+        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
         exit(1);
     }
 
-- 
1.7.10.4



* [RFC v9 24/27] virtio-blk: fix incorrect length
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/virtio-blk.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 8734029..cff2298 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -131,7 +131,7 @@ static void complete_one_request(VirtIOBlockRequest *req, VirtIOBlock *s, ssize_
      * written to, but for virtio-blk it seems to be the number of bytes
      * transferred plus the status bytes.
      */
-    vring_push(&s->vring, req->head, len + sizeof req->status);
+    vring_push(&s->vring, req->head, len + sizeof(*req->status));
 }
 
 static bool is_request_merged(VirtIOBlockRequest *req)
-- 
1.7.10.4



* [RFC v9 25/27] msix: fix irqchip breakage in msix_try_notify_from_thread()
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Commit bd8b215bce453706c3951460cc7e6627ccb90314 removed #ifdef
KVM_CAP_IRQCHIP from hw/msix.c after it turned out <linux/kvm.h> is not
included since msix.o is built in libhw64/.  Do the same for
msix_try_notify_from_thread() since we do not have access to
<linux/kvm.h> here and hence KVM_CAP_IRQCHIP is not defined.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/msix.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 3308604..0ed1013 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -511,12 +511,10 @@ bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector)
     if (unlikely(msix_is_masked(dev, vector))) {
         return false;
     }
-#ifdef KVM_CAP_IRQCHIP
     if (likely(kvm_enabled() && kvm_irqchip_in_kernel())) {
         kvm_set_irq(dev->msix_irq_entries[vector].gsi, 1, NULL);
         return true;
     }
-#endif
     return false;
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread


* [RFC v9 26/27] msix: use upstream kvm_irqchip_set_irq()
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

Commit 9507e305ec54062fccc88fcf6fccf1898a7e7141 replaced kvm_set_irq()
with the upstream kvm_irqchip_set_irq() function, which takes an
explicit KVMState argument.  Switch msix_try_notify_from_thread() over
to the new API.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/msix.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/msix.c b/hw/msix.c
index 0ed1013..373017a 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -512,7 +512,7 @@ bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector)
         return false;
     }
     if (likely(kvm_enabled() && kvm_irqchip_in_kernel())) {
-        kvm_set_irq(dev->msix_irq_entries[vector].gsi, 1, NULL);
+        kvm_irqchip_set_irq(kvm_state, dev->msix_irq_entries[vector].gsi, 1);
         return true;
     }
     return false;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread


* [RFC v9 27/27] virtio-blk: add EVENT_IDX support to dataplane
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:07   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-18 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Michael S. Tsirkin, Asias He, Khoa Huynh, Stefan Hajnoczi

This patch adds support for the VIRTIO_RING_F_EVENT_IDX feature for
interrupt mitigation.  virtio-blk doesn't do anything fancy with it, so
we may not see a performance improvement, but supporting the feature
allows newer guest kernels to run successfully.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 hw/dataplane/vring.h |   65 ++++++++++++++++++++++++++++++++++++++++----------
 hw/virtio-blk.c      |   16 ++++++-------
 2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index bbf8c86..d939a22 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -14,6 +14,8 @@ typedef struct {
     struct vring vr;                /* virtqueue vring mapped to host memory */
     __u16 last_avail_idx;           /* last processed avail ring index */
     __u16 last_used_idx;            /* last processed used ring index */
+    uint16_t signalled_used;        /* EVENT_IDX state */
+    bool signalled_used_valid;
 } Vring;
 
 static inline unsigned int vring_get_num(Vring *vring)
@@ -63,6 +65,8 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
 
     vring->last_avail_idx = 0;
     vring->last_used_idx = 0;
+    vring->signalled_used = 0;
+    vring->signalled_used_valid = false;
 
     fprintf(stderr, "vring physical=%#lx desc=%p avail=%p used=%p\n",
             (unsigned long)virtio_queue_get_ring_addr(vdev, n),
@@ -75,21 +79,48 @@ static bool vring_more_avail(Vring *vring)
 	return vring->vr.avail->idx != vring->last_avail_idx;
 }
 
-/* Hint to disable guest->host notifies */
-static void vring_disable_cb(Vring *vring)
+/* Toggle guest->host notifies */
+static void vring_set_notification(VirtIODevice *vdev, Vring *vring, bool enable)
 {
-    vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+    if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+        if (enable) {
+            vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+        }
+    } else if (enable) {
+        vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    } else {
+        vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+    }
 }
 
-/* Re-enable guest->host notifies
- *
- * Returns false if there are more descriptors in the ring.
- */
-static bool vring_enable_cb(Vring *vring)
+/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
+static bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
 {
-    vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-    __sync_synchronize(); /* mb() */
-    return !vring_more_avail(vring);
+    uint16_t old, new;
+    bool v;
+    /* Flush out used index updates. This is paired
+     * with the barrier that the Guest executes when enabling
+     * interrupts. */
+    __sync_synchronize(); /* smp_mb() */
+
+    if ((vdev->guest_features & VIRTIO_F_NOTIFY_ON_EMPTY) &&
+        unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
+        return true;
+    }
+
+    if (!(vdev->guest_features & VIRTIO_RING_F_EVENT_IDX)) {
+        return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
+    }
+    old = vring->signalled_used;
+    v = vring->signalled_used_valid;
+    new = vring->signalled_used = vring->last_used_idx;
+    vring->signalled_used_valid = true;
+
+    if (unlikely(!v)) {
+        return true;
+    }
+
+    return vring_need_event(vring_used_event(&vring->vr), new, old);
 }
 
 /* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
@@ -178,7 +209,7 @@ static bool get_indirect(Vring *vring,
  *
  * Stolen from linux-2.6/drivers/vhost/vhost.c.
  */
-static int vring_pop(Vring *vring,
+static int vring_pop(VirtIODevice *vdev, Vring *vring,
 		      struct iovec iov[], struct iovec *iov_end,
 		      unsigned int *out_num, unsigned int *in_num)
 {
@@ -214,6 +245,10 @@ static int vring_pop(Vring *vring,
 		exit(1);
 	}
 
+	if (vdev->guest_features & (1 << VIRTIO_RING_F_EVENT_IDX)) {
+		vring_avail_event(&vring->vr) = vring->vr.avail->idx;
+	}
+
 	/* When we start there are none of either input nor output. */
 	*out_num = *in_num = 0;
 
@@ -279,6 +314,7 @@ static int vring_pop(Vring *vring,
 static void vring_push(Vring *vring, unsigned int head, int len)
 {
 	struct vring_used_elem *used;
+	uint16_t new;
 
 	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
 	 * next entry in that used ring. */
@@ -289,7 +325,10 @@ static void vring_push(Vring *vring, unsigned int head, int len)
 	/* Make sure buffer is written before we update index. */
 	__sync_synchronize(); /* smp_wmb() */
 
-    vring->vr.used->idx = ++vring->last_used_idx;
+	new = vring->vr.used->idx = ++vring->last_used_idx;
+	if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
+		vring->signalled_used_valid = false;
+	}
 }
 
 #endif /* VRING_H */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index cff2298..a3e3d8c 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -96,11 +96,9 @@ static int get_raw_posix_fd_hack(VirtIOBlock *s)
 /* Raise an interrupt to signal guest, if necessary */
 static void virtio_blk_notify_guest(VirtIOBlock *s)
 {
-    /* Always notify when queue is empty (when feature acknowledge) */
-	if ((s->vring.vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) &&
-	    (s->vring.vr.avail->idx != s->vring.last_avail_idx ||
-        !(s->vdev.guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY))))
-		return;
+    if (!vring_should_notify(&s->vdev, &s->vring)) {
+        return;
+    }
 
     /* Try to issue the ioctl() directly for speed */
     if (likely(virtio_queue_try_notify_from_thread(s->vq))) {
@@ -307,10 +305,10 @@ static bool handle_notify(EventHandler *handler)
 
     for (;;) {
         /* Disable guest->host notifies to avoid unnecessary vmexits */
-        vring_disable_cb(&s->vring);
+        vring_set_notification(&s->vdev, &s->vring, false);
 
         for (;;) {
-            head = vring_pop(&s->vring, iov, end, &out_num, &in_num);
+            head = vring_pop(&s->vdev, &s->vring, iov, end, &out_num, &in_num);
             if (head < 0) {
                 break; /* no more requests */
             }
@@ -327,7 +325,9 @@ static bool handle_notify(EventHandler *handler)
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
-            if (likely(vring_enable_cb(&s->vring))) {
+            vring_set_notification(&s->vdev, &s->vring, true);
+            __sync_synchronize(); /* smp_mb() */
+            if (!vring_more_avail(&s->vring)) {
                 break;
             }
         } else { /* head == -ENOBUFS, cannot continue since iovecs[] is depleted */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 90+ messages in thread


* Re: [RFC v9 18/27] virtio-blk: Call ioctl() directly instead of irqfd
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:40     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 15:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, kvm, qemu-devel, Khoa Huynh,
	Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 04:07:45PM +0100, Stefan Hajnoczi wrote:
> Optimize for the MSI-X enabled and vector unmasked case where it is
> possible to issue the KVM ioctl() directly instead of using irqfd.

Why? Is an ioctl faster?

> This patch introduces a new virtio binding function which tries to
> notify in a thread-safe way.  If this is not possible, the function
> returns false.  Virtio block then knows to use irqfd as a fallback.
> ---
>  hw/msix.c       |   17 +++++++++++++++++
>  hw/msix.h       |    1 +
>  hw/virtio-blk.c |   10 ++++++++--
>  hw/virtio-pci.c |    8 ++++++++
>  hw/virtio.c     |    9 +++++++++
>  hw/virtio.h     |    3 +++
>  6 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/msix.c b/hw/msix.c
> index 7955221..3308604 100644
> --- a/hw/msix.c
> +++ b/hw/msix.c
> @@ -503,6 +503,23 @@ void msix_notify(PCIDevice *dev, unsigned vector)
>      stl_le_phys(address, data);
>  }
>  
> +bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector)
> +{
> +    if (unlikely(vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])) {
> +        return false;
> +    }
> +    if (unlikely(msix_is_masked(dev, vector))) {
> +        return false;
> +    }
> +#ifdef KVM_CAP_IRQCHIP
> +    if (likely(kvm_enabled() && kvm_irqchip_in_kernel())) {
> +        kvm_set_irq(dev->msix_irq_entries[vector].gsi, 1, NULL);
> +        return true;
> +    }
> +#endif
> +    return false;
> +}
> +
>  void msix_reset(PCIDevice *dev)
>  {
>      if (!(dev->cap_present & QEMU_PCI_CAP_MSIX))
> diff --git a/hw/msix.h b/hw/msix.h
> index a8661e1..99fb08f 100644
> --- a/hw/msix.h
> +++ b/hw/msix.h
> @@ -26,6 +26,7 @@ void msix_vector_unuse(PCIDevice *dev, unsigned vector);
>  void msix_unuse_all_vectors(PCIDevice *dev);
>  
>  void msix_notify(PCIDevice *dev, unsigned vector);
> +bool msix_try_notify_from_thread(PCIDevice *dev, unsigned vector);
>  
>  void msix_reset(PCIDevice *dev);
>  
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index bdff68a..efeffa0 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -82,6 +82,12 @@ static void virtio_blk_notify_guest(VirtIOBlock *s)
>          !(s->vdev.guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY))))
>  		return;
>  
> +    /* Try to issue the ioctl() directly for speed */
> +    if (likely(virtio_queue_try_notify_from_thread(s->vq))) {
> +        return;
> +    }
> +
> +    /* If the fast path didn't work, use irqfd */
>      event_notifier_set(virtio_queue_get_guest_notifier(s->vq));
>  }
>  
> @@ -263,7 +269,7 @@ static void data_plane_start(VirtIOBlock *s)
>      vring_setup(&s->vring, &s->vdev, 0);
>  
>      /* Set up guest notifier (irq) */
> -    if (s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, true) != 0) {
> +    if (s->vdev.binding->set_guest_notifiers(s->vdev.binding_opaque, true) != 0) {
>          fprintf(stderr, "virtio-blk failed to set guest notifier, ensure -enable-kvm is set\n");
>          exit(1);
>      }
> @@ -315,7 +321,7 @@ static void data_plane_stop(VirtIOBlock *s)
>      event_poll_cleanup(&s->event_poll);
>  
>      /* Clean up guest notifier (irq) */
> -    s->vdev.binding->set_guest_notifier(s->vdev.binding_opaque, 0, false);
> +    s->vdev.binding->set_guest_notifiers(s->vdev.binding_opaque, false);
>  }
>  
>  static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t val)
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index f1e13af..03512b3 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -106,6 +106,13 @@ static void virtio_pci_notify(void *opaque, uint16_t vector)
>          qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
>  }
>  
> +static bool virtio_pci_try_notify_from_thread(void *opaque, uint16_t vector)
> +{
> +    VirtIOPCIProxy *proxy = opaque;
> +    return msix_enabled(&proxy->pci_dev) &&
> +           msix_try_notify_from_thread(&proxy->pci_dev, vector);
> +}
> +
>  static void virtio_pci_save_config(void * opaque, QEMUFile *f)
>  {
>      VirtIOPCIProxy *proxy = opaque;
> @@ -707,6 +714,7 @@ static void virtio_pci_vmstate_change(void *opaque, bool running)
>  
>  static const VirtIOBindings virtio_pci_bindings = {
>      .notify = virtio_pci_notify,
> +    .try_notify_from_thread = virtio_pci_try_notify_from_thread,
>      .save_config = virtio_pci_save_config,
>      .load_config = virtio_pci_load_config,
>      .save_queue = virtio_pci_save_queue,
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 064aecf..a1d1a8a 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -689,6 +689,15 @@ static inline int vring_need_event(uint16_t event, uint16_t new, uint16_t old)
>  	return (uint16_t)(new - event - 1) < (uint16_t)(new - old);
>  }
>  
> +bool virtio_queue_try_notify_from_thread(VirtQueue *vq)
> +{
> +    VirtIODevice *vdev = vq->vdev;
> +    if (likely(vdev->binding->try_notify_from_thread)) {
> +        return vdev->binding->try_notify_from_thread(vdev->binding_opaque, vq->vector);
> +    }
> +    return false;
> +}
> +
>  static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      uint16_t old, new;
> diff --git a/hw/virtio.h b/hw/virtio.h
> index 400c092..2cdf2be 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -93,6 +93,7 @@ typedef struct VirtQueueElement
>  
>  typedef struct {
>      void (*notify)(void * opaque, uint16_t vector);
> +    bool (*try_notify_from_thread)(void * opaque, uint16_t vector);
>      void (*save_config)(void * opaque, QEMUFile *f);
>      void (*save_queue)(void * opaque, int n, QEMUFile *f);
>      int (*load_config)(void * opaque, QEMUFile *f);
> @@ -160,6 +161,8 @@ void virtio_cleanup(VirtIODevice *vdev);
>  
>  void virtio_notify_config(VirtIODevice *vdev);
>  
> +bool virtio_queue_try_notify_from_thread(VirtQueue *vq);
> +
>  void virtio_queue_set_notification(VirtQueue *vq, int enable);
>  
>  int virtio_queue_ready(VirtQueue *vq);
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread


* Re: [RFC v9 00/27] virtio: virtio-blk data plane
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:43   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 15:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, kvm, qemu-devel, Khoa Huynh,
	Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 04:07:27PM +0100, Stefan Hajnoczi wrote:
> This series implements a dedicated thread for virtio-blk processing using Linux
> AIO for raw image files only.  It is based on qemu-kvm.git a0bc8c3 and somewhat
> old but I wanted to share it on the list since it has been mentioned on mailing
> lists and IRC recently.
> 
> These patches can be used for benchmarking and discussion about how to improve
> block performance.  Paolo Bonzini has also worked in this area and might want
> to share his patches.
> 
> The basic approach is:
> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>    signalling when the guest kicks the virtqueue.
> 2. Requests are processed without going through the QEMU block layer using
>    Linux AIO directly.
> 3. Completion interrupts are injected via ioctl from the dedicated thread.
> 
> The series also contains request merging as a bdrv_aio_multiwrite() equivalent.
> This was only to get a comparison against the QEMU block layer and I would drop
> it for other types of analysis.
> 
> The effect of this series is that O_DIRECT Linux AIO on raw files can bypass
> the QEMU global mutex and block layer.  This means higher performance.

Do you have any numbers at all?

> A cleaned up version of this approach could be added to QEMU as a raw O_DIRECT
> Linux AIO fast path.  Image file formats, protocols, and other block layer
> features are not supported by virtio-blk-data-plane.
> 
> Git repo:
> http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane
> 
> Stefan Hajnoczi (27):
>   virtio-blk: Remove virtqueue request handling code
>   virtio-blk: Set up host notifier for data plane
>   virtio-blk: Data plane thread event loop
>   virtio-blk: Map vring
>   virtio-blk: Do cheapest possible memory mapping
>   virtio-blk: Take PCI memory range into account
>   virtio-blk: Put dataplane code into its own directory
>   virtio-blk: Read requests from the vring
>   virtio-blk: Add Linux AIO queue
>   virtio-blk: Stop data plane thread cleanly
>   virtio-blk: Indirect vring and flush support
>   virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
>   virtio-blk: Increase max requests for indirect vring
>   virtio-blk: Use pthreads instead of qemu-thread
>   notifier: Add a function to set the notifier
>   virtio-blk: Kick data plane thread using event notifier set
>   virtio-blk: Use guest notifier to raise interrupts
>   virtio-blk: Call ioctl() directly instead of irqfd
>   virtio-blk: Disable guest->host notifies while processing vring
>   virtio-blk: Add ioscheduler to detect mergable requests
>   virtio-blk: Add basic request merging
>   virtio-blk: Fix request merging
>   virtio-blk: Stub out SCSI commands
>   virtio-blk: fix incorrect length
>   msix: fix irqchip breakage in msix_try_notify_from_thread()
>   msix: use upstream kvm_irqchip_set_irq()
>   virtio-blk: add EVENT_IDX support to dataplane
> 
>  event_notifier.c          |    7 +
>  event_notifier.h          |    1 +
>  hw/dataplane/event-poll.h |  116 +++++++
>  hw/dataplane/ioq.h        |  128 ++++++++
>  hw/dataplane/iosched.h    |   97 ++++++
>  hw/dataplane/vring.h      |  334 ++++++++++++++++++++
>  hw/msix.c                 |   15 +
>  hw/msix.h                 |    1 +
>  hw/virtio-blk.c           |  753 +++++++++++++++++++++------------------------
>  hw/virtio-pci.c           |    8 +
>  hw/virtio.c               |    9 +
>  hw/virtio.h               |    3 +
>  12 files changed, 1074 insertions(+), 398 deletions(-)
>  create mode 100644 hw/dataplane/event-poll.h
>  create mode 100644 hw/dataplane/ioq.h
>  create mode 100644 hw/dataplane/iosched.h
>  create mode 100644 hw/dataplane/vring.h
> 
> -- 
> 1.7.10.4


* Re: [RFC v9 00/27] virtio: virtio-blk data plane
  2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 15:49   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 15:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, kvm, qemu-devel, Khoa Huynh,
	Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 04:07:27PM +0100, Stefan Hajnoczi wrote:
> This series implements a dedicated thread for virtio-blk processing using Linux
> AIO for raw image files only.  It is based on qemu-kvm.git a0bc8c3 and somewhat
> old but I wanted to share it on the list since it has been mentioned on mailing
> lists and IRC recently.

BTW, are there any bugfixes here that upstream needs?
I could not tell.

-- 
MST

* Re: [RFC v9 00/27] virtio: virtio-blk data plane
  2012-07-18 15:43   ` [Qemu-devel] " Michael S. Tsirkin
@ 2012-07-18 16:18     ` Khoa Huynh
  -1 siblings, 0 replies; 90+ messages in thread
From: Khoa Huynh @ 2012-07-18 16:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, kvm, qemu-devel,
	Paolo Bonzini, Asias He

[-- Attachment #1: Type: text/plain, Size: 5320 bytes --]


"Michael S. Tsirkin" <mst@redhat.com> wrote on 07/18/2012 10:43:23 AM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
> Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, Anthony Liguori/
> Austin/IBM@IBMUS, Kevin Wolf <kwolf@redhat.com>, Paolo Bonzini
> <pbonzini@redhat.com>, Asias He <asias@redhat.com>, Khoa Huynh/
> Austin/IBM@IBMUS
> Date: 07/18/2012 10:46 AM
> Subject: Re: [RFC v9 00/27] virtio: virtio-blk data plane
>
> On Wed, Jul 18, 2012 at 04:07:27PM +0100, Stefan Hajnoczi wrote:
> > This series implements a dedicated thread for virtio-blk processing using
> > Linux AIO for raw image files only.  It is based on qemu-kvm.git a0bc8c3
> > and somewhat old but I wanted to share it on the list since it has been
> > mentioned on mailing lists and IRC recently.
> >
> > These patches can be used for benchmarking and discussion about how to
> > improve block performance.  Paolo Bonzini has also worked in this area
> > and might want to share his patches.
> >
> > The basic approach is:
> > 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >    signalling when the guest kicks the virtqueue.
> > 2. Requests are processed without going through the QEMU block layer
> >    using Linux AIO directly.
> > 3. Completion interrupts are injected via ioctl from the dedicated
> >    thread.
> >
> > The series also contains request merging as a bdrv_aio_multiwrite()
> > equivalent.  This was only to get a comparison against the QEMU block
> > layer and I would drop it for other types of analysis.
> >
> > The effect of this series is that O_DIRECT Linux AIO on raw files can
> > bypass the QEMU global mutex and block layer.  This means higher
> > performance.
>
> Do you have any numbers at all?

Yes, we do have a lot of data for this data-plane patch set.  I can send
you detailed charts if you like, but generally, we run into a performance
bottleneck with the existing qemu due to the qemu global mutex, and thus,
could only get to about 140,000 IOPS for a single guest (at least on my
setup).  With this data-plane patch set, we bypass this bottleneck and
have been able to achieve more than 600,000 IOPS for a single guest, and
an aggregate 1.33 million IOPS with 4 guests on a single host.

Just for reference, VMware has claimed that they could get 300,000 IOPS
for a single VM and 1 million IOPS with 6 VMs on a single vSphere 5.0
host.  So we definitely need something like this for KVM to be competitive
with VMware and other hypervisors.  Of course, this would also help
satisfy the high I/O rate requirements for BigData and other
data-intensive applications or benchmarks running on KVM.

Thanks,
-Khoa

>
> > A cleaned up version of this approach could be added to QEMU as a raw
> > O_DIRECT Linux AIO fast path.  Image file formats, protocols, and other
> > block layer features are not supported by virtio-blk-data-plane.
> >
> > Git repo:
> > http://repo.or.cz/w/qemu-kvm/stefanha.git/shortlog/refs/heads/virtio-blk-data-plane
> >
> > Stefan Hajnoczi (27):
> >   virtio-blk: Remove virtqueue request handling code
> >   virtio-blk: Set up host notifier for data plane
> >   virtio-blk: Data plane thread event loop
> >   virtio-blk: Map vring
> >   virtio-blk: Do cheapest possible memory mapping
> >   virtio-blk: Take PCI memory range into account
> >   virtio-blk: Put dataplane code into its own directory
> >   virtio-blk: Read requests from the vring
> >   virtio-blk: Add Linux AIO queue
> >   virtio-blk: Stop data plane thread cleanly
> >   virtio-blk: Indirect vring and flush support
> >   virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
> >   virtio-blk: Increase max requests for indirect vring
> >   virtio-blk: Use pthreads instead of qemu-thread
> >   notifier: Add a function to set the notifier
> >   virtio-blk: Kick data plane thread using event notifier set
> >   virtio-blk: Use guest notifier to raise interrupts
> >   virtio-blk: Call ioctl() directly instead of irqfd
> >   virtio-blk: Disable guest->host notifies while processing vring
> >   virtio-blk: Add ioscheduler to detect mergable requests
> >   virtio-blk: Add basic request merging
> >   virtio-blk: Fix request merging
> >   virtio-blk: Stub out SCSI commands
> >   virtio-blk: fix incorrect length
> >   msix: fix irqchip breakage in msix_try_notify_from_thread()
> >   msix: use upstream kvm_irqchip_set_irq()
> >   virtio-blk: add EVENT_IDX support to dataplane
> >
> >  event_notifier.c          |    7 +
> >  event_notifier.h          |    1 +
> >  hw/dataplane/event-poll.h |  116 +++++++
> >  hw/dataplane/ioq.h        |  128 ++++++++
> >  hw/dataplane/iosched.h    |   97 ++++++
> >  hw/dataplane/vring.h      |  334 ++++++++++++++++++++
> >  hw/msix.c                 |   15 +
> >  hw/msix.h                 |    1 +
> >  hw/virtio-blk.c           |  753 +++++++++++++++++++++------------------------
> >  hw/virtio-pci.c           |    8 +
> >  hw/virtio.c               |    9 +
> >  hw/virtio.h               |    3 +
> >  12 files changed, 1074 insertions(+), 398 deletions(-)
> >  create mode 100644 hw/dataplane/event-poll.h
> >  create mode 100644 hw/dataplane/ioq.h
> >  create mode 100644 hw/dataplane/iosched.h
> >  create mode 100644 hw/dataplane/vring.h
> >
> > --
> > 1.7.10.4
>


* Re: [RFC v9 11/27] virtio-blk: Indirect vring and flush support
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 18:28     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 18:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Asias He, Khoa Huynh

On Wed, Jul 18, 2012 at 04:07:38PM +0100, Stefan Hajnoczi wrote:
> RHEL6 and other new guest kernels use indirect vring descriptors to
> increase the number of requests that can be batched.  This fundamentally
> changes vring from a scheme that requires fixed resources to something
> more dynamic (although there is still an absolute maximum number of
> descriptors).  Cope with indirect vrings by taking on as many requests
> as we can in one go and then postponing the remaining requests until the
> first batch completes.
> 
> It would be possible to switch to dynamic resource management so iovec
> and iocb structs are malloced.  This would allow the entire ring to be
> processed even with indirect descriptors, but would probably hit a
> bottleneck when io_submit refuses to queue more requests.  Therefore,
> stick with the simpler scheme for now.
> 
> Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on
> all files.  In particular, an O_DIRECT opened file on ext4 does not
> support Linux AIO fdsync.  Work around this by performing fdatasync()
> synchronously for now.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  hw/dataplane/ioq.h   |   18 ++++-----
>  hw/dataplane/vring.h |  103 +++++++++++++++++++++++++++++++++++++++++++-------
>  hw/virtio-blk.c      |   75 ++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 52 deletions(-)
> 
> diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
> index 7200e87..d1545d6 100644
> --- a/hw/dataplane/ioq.h
> +++ b/hw/dataplane/ioq.h
> @@ -3,7 +3,7 @@
>  
>  typedef struct {
>      int fd;                         /* file descriptor */
> -    unsigned int max_reqs;           /* max length of freelist and queue */
> +    unsigned int max_reqs;          /* max length of freelist and queue */
>  
>      io_context_t io_ctx;            /* Linux AIO context */
>      EventNotifier io_notifier;      /* Linux AIO eventfd */
> @@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigne
>      return iocb;
>  }
>  
> -static struct iocb *ioq_fdsync(IOQueue *ioq)
> -{
> -    struct iocb *iocb = ioq_get_iocb(ioq);
> -
> -    io_prep_fdsync(iocb, ioq->fd);
> -    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
> -    return iocb;
> -}
> -
>  static int ioq_submit(IOQueue *ioq)
>  {
>      int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +    if (unlikely(rc < 0)) {
> +        unsigned int i;
> +        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +        for (i = 0; i < ioq->queue_idx; i++) {
> +            fprintf(stderr, "[%u] type=%#x fd=%d\n", i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
> +        }
> +    }
>      ioq->queue_idx = 0; /* reset */
>      return rc;
>  }
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> index 70675e5..3eab4b4 100644
> --- a/hw/dataplane/vring.h
> +++ b/hw/dataplane/vring.h
> @@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
>              vring->vr.desc, vring->vr.avail, vring->vr.used);
>  }
>  
> +static bool vring_more_avail(Vring *vring)
> +{
> +	return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */
> +static bool get_indirect(Vring *vring,
> +			struct iovec iov[], struct iovec *iov_end,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vring_desc *indirect)
> +{
> +	struct vring_desc desc;
> +	unsigned int i = 0, count, found = 0;
> +
> +	/* Sanity check */
> +	if (unlikely(indirect->len % sizeof desc)) {
> +		fprintf(stderr, "Invalid length in indirect descriptor: "
> +		       "len 0x%llx not multiple of 0x%zx\n",
> +		       (unsigned long long)indirect->len,
> +		       sizeof desc);
> +		exit(1);
> +	}
> +
> +	count = indirect->len / sizeof desc;
> +	/* Buffers are chained via a 16 bit next field, so
> +	 * we can have at most 2^16 of these. */
> +	if (unlikely(count > USHRT_MAX + 1)) {
> +		fprintf(stderr, "Indirect buffer length too big: %d\n",
> +		       indirect->len);
> +        exit(1);
> +	}
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = phys_to_host(vring, indirect->addr);
> +
> +	/* We will use the result as an address to read from, so most
> +	 * architectures only need a compiler barrier here. */
> +	__sync_synchronize(); /* read_barrier_depends(); */


QEMU has its own barriers now, please use them.

> +
> +	do {
> +		if (unlikely(++found > count)) {
> +			fprintf(stderr, "Loop detected: last one at %u "
> +			       "indirect size %u\n",
> +			       i, count);
> +			exit(1);
> +		}
> +
> +        desc = *indirect++;
> +		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +			fprintf(stderr, "Nested indirect descriptor\n");
> +            exit(1);
> +		}
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return false;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE) {
> +			*in_num += 1;
> +		} else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (unlikely(*in_num)) {
> +				fprintf(stderr, "Indirect descriptor "
> +				       "has out after in: idx %d\n", i);
> +                exit(1);
> +			}
> +			*out_num += 1;
> +		}
> +        i = desc.next;
> +	} while (desc.flags & VRING_DESC_F_NEXT);
> +    return true;
> +}
> +
>  /* This looks in the virtqueue and for the first available buffer, and converts
>   * it to an iovec for convenient access.  Since descriptors consist of some
>   * number of output then some number of input descriptors, it's actually two
> @@ -129,23 +209,20 @@ static unsigned int vring_pop(Vring *vring,
>  		}
>          desc = vring->vr.desc[i];
>  		if (desc.flags & VRING_DESC_F_INDIRECT) {
> -/*			ret = get_indirect(dev, vq, iov, iov_size,
> -					   out_num, in_num,
> -					   log, log_num, &desc);
> -			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> -				return ret;
> -			}
> -			continue; */
> -            fprintf(stderr, "Indirect vring not supported\n");
> -            exit(1);
> +			if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
> +                return num; /* not enough iovecs, stop for now */
> +            }
> +            continue;
>  		}
>  
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
>          if (iov >= iov_end) {
> -            fprintf(stderr, "Not enough vring iovecs\n");
> -            exit(1);
> +            return num;
>          }
> +
>          iov->iov_base = phys_to_host(vring, desc.addr);
>          iov->iov_len  = desc.len;
>          iov++;
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 52ea601..591eace 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -62,6 +62,14 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
>      return (VirtIOBlock *)vdev;
>  }
>  
> +/* Normally the block driver passes down the fd, there's no way to get it from
> + * above.
> + */
> +static int get_raw_posix_fd_hack(VirtIOBlock *s)
> +{
> +    return *(int*)s->bs->file->opaque;
> +}
> +
>  static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>  {
>      VirtIOBlock *s = opaque;
> @@ -83,18 +91,6 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>      vring_push(&s->vring, req->head, len + sizeof req->status);
>  }
>  
> -static bool handle_io(EventHandler *handler)
> -{
> -    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> -
> -    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> -        /* TODO is this thread-safe and can it be done faster? */
> -        virtio_irq(s->vq);
> -    }
> -
> -    return true;
> -}
> -
>  static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
>  {
>      /* Virtio block requests look like this: */
> @@ -117,13 +113,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              outhdr->type, outhdr->sector);
>      */
>  
> -    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
> +    /* TODO Linux sets the barrier bit even when not advertised! */
> +    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
> +
> +    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
>          fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
>          exit(1);
>      }
>  
>      struct iocb *iocb;
> -    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
> +    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
>      case VIRTIO_BLK_T_IN:
>          if (unlikely(out_num != 1)) {
>              fprintf(stderr, "virtio-blk invalid read request\n");
> @@ -145,8 +144,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              fprintf(stderr, "virtio-blk invalid flush request\n");
>              exit(1);
>          }
> -        iocb = ioq_fdsync(ioq);
> -        break;
> +
> +        /* TODO fdsync is not supported by all backends, do it synchronously here! */
> +        {
> +            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
> +            fdatasync(get_raw_posix_fd_hack(s));
> +            inhdr->status = VIRTIO_BLK_S_OK;
> +            vring_push(&s->vring, head, sizeof *inhdr);
> +            virtio_irq(s->vq);
> +        }
> +        return;
>  
>      default:
>          fprintf(stderr, "virtio-blk multiple request type bits set\n");
> @@ -199,11 +206,29 @@ static bool handle_notify(EventHandler *handler)
>      }
>  
>      /* Submit requests, if any */
> -    if (likely(iov != iovec)) {
> -        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
> -            fprintf(stderr, "ioq_submit failed\n");
> -            exit(1);
> -        }
> +    int rc = ioq_submit(&s->ioqueue);
> +    if (unlikely(rc < 0)) {
> +        fprintf(stderr, "ioq_submit failed %d\n", rc);
> +        exit(1);
> +    }
> +    return true;
> +}
> +
> +static bool handle_io(EventHandler *handler)
> +{
> +    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> +
> +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> +        /* TODO is this thread-safe and can it be done faster? */
> +        virtio_irq(s->vq);
> +    }
> +
> +    /* If there were more requests than iovecs, the vring will not be empty yet
> +     * so check again.  There should now be enough resources to process more
> +     * requests.
> +     */
> +    if (vring_more_avail(&s->vring)) {
> +        return handle_notify(&s->notify_handler);
>      }
>  
>      return true;
> @@ -217,14 +242,6 @@ static void *data_plane_thread(void *opaque)
>      return NULL;
>  }
>  
> -/* Normally the block driver passes down the fd, there's no way to get it from
> - * above.
> - */
> -static int get_raw_posix_fd_hack(VirtIOBlock *s)
> -{
> -    return *(int*)s->bs->file->opaque;
> -}
> -
>  static void data_plane_start(VirtIOBlock *s)
>  {
>      int i;
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 06/27] virtio-blk: Take PCI memory range into account
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 18:29     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 18:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, kvm, qemu-devel, Khoa Huynh,
	Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 04:07:33PM +0100, Stefan Hajnoczi wrote:
> Support >4 GB physical memory accesses.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Need some sane APIs, this is just too scary.

> ---
>  hw/virtio-blk.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index abd9386..99654f1 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -64,6 +64,13 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
>   */
>  static inline void *phys_to_host(VirtIOBlock *s, target_phys_addr_t phys)
>  {
> +    /* Adjust for 3.6-4 GB PCI memory range */
> +    if (phys >= 0x100000000) {
> +        phys -= 0x100000000 - 0xe0000000;
> +    } else if (phys >= 0xe0000000) {
> +        fprintf(stderr, "phys_to_host bad physical address in PCI range %#lx\n", phys);
> +        exit(1);
> +    }
>      return s->phys_mem_zero_host_ptr + phys;
>  }
>  
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 11/27] virtio-blk: Indirect vring and flush support
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 19:02     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Asias He, Khoa Huynh

On Wed, Jul 18, 2012 at 04:07:38PM +0100, Stefan Hajnoczi wrote:
> RHEL6 and other new guest kernels use indirect vring descriptors to
> increase the number of requests that can be batched.  This fundamentally
> changes vring from a scheme that requires fixed resources to something
> more dynamic (although there is still an absolute maximum number of
> descriptors).  Cope with indirect vrings by taking on as many requests
> as we can in one go and then postponing the remaining requests until the
> first batch completes.
> 
> It would be possible to switch to dynamic resource management so iovec
> and iocb structs are malloced.  This would allow the entire ring to be
> processed even with indirect descriptors, but would probably hit a
> bottleneck when io_submit refuses to queue more requests.  Therefore,
> stick with the simpler scheme for now.
> 
> Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on
> all files.  In particular, an O_DIRECT opened file on ext4 does not
> support Linux AIO fdsync.  Work around this by performing fdatasync()
> synchronously for now.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  hw/dataplane/ioq.h   |   18 ++++-----
>  hw/dataplane/vring.h |  103 +++++++++++++++++++++++++++++++++++++++++++-------
>  hw/virtio-blk.c      |   75 ++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 52 deletions(-)
> 
> diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
> index 7200e87..d1545d6 100644
> --- a/hw/dataplane/ioq.h
> +++ b/hw/dataplane/ioq.h
> @@ -3,7 +3,7 @@
>  
>  typedef struct {
>      int fd;                         /* file descriptor */
> -    unsigned int max_reqs;           /* max length of freelist and queue */
> +    unsigned int max_reqs;          /* max length of freelist and queue */
>  
>      io_context_t io_ctx;            /* Linux AIO context */
>      EventNotifier io_notifier;      /* Linux AIO eventfd */
> @@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigne
>      return iocb;
>  }
>  
> -static struct iocb *ioq_fdsync(IOQueue *ioq)
> -{
> -    struct iocb *iocb = ioq_get_iocb(ioq);
> -
> -    io_prep_fdsync(iocb, ioq->fd);
> -    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
> -    return iocb;
> -}
> -
>  static int ioq_submit(IOQueue *ioq)
>  {
>      int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +    if (unlikely(rc < 0)) {
> +        unsigned int i;
> +        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +        for (i = 0; i < ioq->queue_idx; i++) {
> +            fprintf(stderr, "[%u] type=%#x fd=%d\n", i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
> +        }
> +    }
>      ioq->queue_idx = 0; /* reset */
>      return rc;
>  }
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> index 70675e5..3eab4b4 100644
> --- a/hw/dataplane/vring.h
> +++ b/hw/dataplane/vring.h
> @@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
>              vring->vr.desc, vring->vr.avail, vring->vr.used);
>  }
>  
> +static bool vring_more_avail(Vring *vring)
> +{
> +	return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */

So add a Red Hat copyright, please.

> +static bool get_indirect(Vring *vring,
> +			struct iovec iov[], struct iovec *iov_end,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vring_desc *indirect)
> +{
> +	struct vring_desc desc;
> +	unsigned int i = 0, count, found = 0;
> +
> +	/* Sanity check */
> +	if (unlikely(indirect->len % sizeof desc)) {
> +		fprintf(stderr, "Invalid length in indirect descriptor: "
> +		       "len 0x%llx not multiple of 0x%zx\n",
> +		       (unsigned long long)indirect->len,
> +		       sizeof desc);
> +		exit(1);
> +	}
> +
> +	count = indirect->len / sizeof desc;
> +	/* Buffers are chained via a 16 bit next field, so
> +	 * we can have at most 2^16 of these. */
> +	if (unlikely(count > USHRT_MAX + 1)) {
> +		fprintf(stderr, "Indirect buffer length too big: %d\n",
> +		       indirect->len);
> +        exit(1);
> +	}
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = phys_to_host(vring, indirect->addr);
> +
> +	/* We will use the result as an address to read from, so most
> +	 * architectures only need a compiler barrier here. */
> +	__sync_synchronize(); /* read_barrier_depends(); */
> +
> +	do {
> +		if (unlikely(++found > count)) {
> +			fprintf(stderr, "Loop detected: last one at %u "
> +			       "indirect size %u\n",
> +			       i, count);
> +			exit(1);
> +		}
> +
> +        desc = *indirect++;
> +		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +			fprintf(stderr, "Nested indirect descriptor\n");
> +            exit(1);
> +		}
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return false;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE) {
> +			*in_num += 1;
> +		} else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (unlikely(*in_num)) {
> +				fprintf(stderr, "Indirect descriptor "
> +				       "has out after in: idx %d\n", i);
> +                exit(1);
> +			}
> +			*out_num += 1;
> +		}
> +        i = desc.next;
> +	} while (desc.flags & VRING_DESC_F_NEXT);
> +    return true;
> +}
> +
>  /* This looks in the virtqueue and for the first available buffer, and converts
>   * it to an iovec for convenient access.  Since descriptors consist of some
>   * number of output then some number of input descriptors, it's actually two
> @@ -129,23 +209,20 @@ static unsigned int vring_pop(Vring *vring,
>  		}
>          desc = vring->vr.desc[i];
>  		if (desc.flags & VRING_DESC_F_INDIRECT) {
> -/*			ret = get_indirect(dev, vq, iov, iov_size,
> -					   out_num, in_num,
> -					   log, log_num, &desc);
> -			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> -				return ret;
> -			}
> -			continue; */
> -            fprintf(stderr, "Indirect vring not supported\n");
> -            exit(1);
> +			if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
> +                return num; /* not enough iovecs, stop for now */
> +            }
> +            continue;
>  		}
>  
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
>          if (iov >= iov_end) {
> -            fprintf(stderr, "Not enough vring iovecs\n");
> -            exit(1);
> +            return num;
>          }
> +
>          iov->iov_base = phys_to_host(vring, desc.addr);
>          iov->iov_len  = desc.len;
>          iov++;
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 52ea601..591eace 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -62,6 +62,14 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
>      return (VirtIOBlock *)vdev;
>  }
>  
> +/* Normally the block driver passes down the fd, there's no way to get it from
> + * above.
> + */
> +static int get_raw_posix_fd_hack(VirtIOBlock *s)
> +{
> +    return *(int*)s->bs->file->opaque;
> +}
> +
>  static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>  {
>      VirtIOBlock *s = opaque;
> @@ -83,18 +91,6 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>      vring_push(&s->vring, req->head, len + sizeof req->status);
>  }
>  
> -static bool handle_io(EventHandler *handler)
> -{
> -    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> -
> -    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> -        /* TODO is this thread-safe and can it be done faster? */
> -        virtio_irq(s->vq);
> -    }
> -
> -    return true;
> -}
> -
>  static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
>  {
>      /* Virtio block requests look like this: */
> @@ -117,13 +113,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              outhdr->type, outhdr->sector);
>      */
>  
> -    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
> +    /* TODO Linux sets the barrier bit even when not advertised! */
> +    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
> +
> +    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
>          fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
>          exit(1);
>      }
>  
>      struct iocb *iocb;
> -    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
> +    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
>      case VIRTIO_BLK_T_IN:
>          if (unlikely(out_num != 1)) {
>              fprintf(stderr, "virtio-blk invalid read request\n");
> @@ -145,8 +144,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              fprintf(stderr, "virtio-blk invalid flush request\n");
>              exit(1);
>          }
> -        iocb = ioq_fdsync(ioq);
> -        break;
> +
> +        /* TODO fdsync is not supported by all backends, do it synchronously here! */
> +        {
> +            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
> +            fdatasync(get_raw_posix_fd_hack(s));
> +            inhdr->status = VIRTIO_BLK_S_OK;
> +            vring_push(&s->vring, head, sizeof *inhdr);
> +            virtio_irq(s->vq);
> +        }
> +        return;
>  
>      default:
>          fprintf(stderr, "virtio-blk multiple request type bits set\n");
> @@ -199,11 +206,29 @@ static bool handle_notify(EventHandler *handler)
>      }
>  
>      /* Submit requests, if any */
> -    if (likely(iov != iovec)) {
> -        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
> -            fprintf(stderr, "ioq_submit failed\n");
> -            exit(1);
> -        }
> +    int rc = ioq_submit(&s->ioqueue);
> +    if (unlikely(rc < 0)) {
> +        fprintf(stderr, "ioq_submit failed %d\n", rc);
> +        exit(1);
> +    }
> +    return true;
> +}
> +
> +static bool handle_io(EventHandler *handler)
> +{
> +    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> +
> +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> +        /* TODO is this thread-safe and can it be done faster? */
> +        virtio_irq(s->vq);
> +    }
> +
> +    /* If there were more requests than iovecs, the vring will not be empty yet
> +     * so check again.  There should now be enough resources to process more
> +     * requests.
> +     */
> +    if (vring_more_avail(&s->vring)) {
> +        return handle_notify(&s->notify_handler);
>      }
>  
>      return true;
> @@ -217,14 +242,6 @@ static void *data_plane_thread(void *opaque)
>      return NULL;
>  }
>  
> -/* Normally the block driver passes down the fd, there's no way to get it from
> - * above.
> - */
> -static int get_raw_posix_fd_hack(VirtIOBlock *s)
> -{
> -    return *(int*)s->bs->file->opaque;
> -}
> -
>  static void data_plane_start(VirtIOBlock *s)
>  {
>      int i;
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Qemu-devel] [RFC v9 11/27] virtio-blk: Indirect vring and flush support
@ 2012-07-18 19:02     ` Michael S. Tsirkin
  0 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, kvm, qemu-devel, Khoa Huynh,
	Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 04:07:38PM +0100, Stefan Hajnoczi wrote:
> RHEL6 and other new guest kernels use indirect vring descriptors to
> increase the number of requests that can be batched.  This fundamentally
> changes vring from a scheme that requires fixed resources to something
> more dynamic (although there is still an absolute maximum number of
> descriptors).  Cope with indirect vrings by taking on as many requests
> as we can in one go and then postponing the remaining requests until the
> first batch completes.
> 
> It would be possible to switch to dynamic resource management so iovec
> and iocb structs are malloced.  This would allow the entire ring to be
> processed even with indirect descriptors, but would probably hit a
> bottleneck when io_submit refuses to queue more requests.  Therefore,
> stick with the simpler scheme for now.
> 
> Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on
> all files.  In particular, an O_DIRECT opened file on ext4 does not
> support Linux AIO fdsync.  Work around this by performing fdatasync()
> synchronously for now.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  hw/dataplane/ioq.h   |   18 ++++-----
>  hw/dataplane/vring.h |  103 +++++++++++++++++++++++++++++++++++++++++++-------
>  hw/virtio-blk.c      |   75 ++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 52 deletions(-)
> 
> diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
> index 7200e87..d1545d6 100644
> --- a/hw/dataplane/ioq.h
> +++ b/hw/dataplane/ioq.h
> @@ -3,7 +3,7 @@
>  
>  typedef struct {
>      int fd;                         /* file descriptor */
> -    unsigned int max_reqs;           /* max length of freelist and queue */
> +    unsigned int max_reqs;          /* max length of freelist and queue */
>  
>      io_context_t io_ctx;            /* Linux AIO context */
>      EventNotifier io_notifier;      /* Linux AIO eventfd */
> @@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigne
>      return iocb;
>  }
>  
> -static struct iocb *ioq_fdsync(IOQueue *ioq)
> -{
> -    struct iocb *iocb = ioq_get_iocb(ioq);
> -
> -    io_prep_fdsync(iocb, ioq->fd);
> -    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
> -    return iocb;
> -}
> -
>  static int ioq_submit(IOQueue *ioq)
>  {
>      int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +    if (unlikely(rc < 0)) {
> +        unsigned int i;
> +        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n", (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
> +        for (i = 0; i < ioq->queue_idx; i++) {
> +            fprintf(stderr, "[%u] type=%#x fd=%d\n", i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
> +        }
> +    }
>      ioq->queue_idx = 0; /* reset */
>      return rc;
>  }
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> index 70675e5..3eab4b4 100644
> --- a/hw/dataplane/vring.h
> +++ b/hw/dataplane/vring.h
> @@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
>              vring->vr.desc, vring->vr.avail, vring->vr.used);
>  }
>  
> +static bool vring_more_avail(Vring *vring)
> +{
> +	return vring->vr.avail->idx != vring->last_avail_idx;
> +}
> +
> +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */

So add a Red Hat copyright pls.

> +static bool get_indirect(Vring *vring,
> +			struct iovec iov[], struct iovec *iov_end,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vring_desc *indirect)
> +{
> +	struct vring_desc desc;
> +	unsigned int i = 0, count, found = 0;
> +
> +	/* Sanity check */
> +	if (unlikely(indirect->len % sizeof desc)) {
> +		fprintf(stderr, "Invalid length in indirect descriptor: "
> +		       "len 0x%llx not multiple of 0x%zx\n",
> +		       (unsigned long long)indirect->len,
> +		       sizeof desc);
> +		exit(1);
> +	}
> +
> +	count = indirect->len / sizeof desc;
> +	/* Buffers are chained via a 16 bit next field, so
> +	 * we can have at most 2^16 of these. */
> +	if (unlikely(count > USHRT_MAX + 1)) {
> +		fprintf(stderr, "Indirect buffer length too big: %d\n",
> +		       indirect->len);
> +        exit(1);
> +	}
> +
> +    /* Point to translate indirect desc chain */
> +    indirect = phys_to_host(vring, indirect->addr);
> +
> +	/* We will use the result as an address to read from, so most
> +	 * architectures only need a compiler barrier here. */
> +	__sync_synchronize(); /* read_barrier_depends(); */
> +
> +	do {
> +		if (unlikely(++found > count)) {
> +			fprintf(stderr, "Loop detected: last one at %u "
> +			       "indirect size %u\n",
> +			       i, count);
> +			exit(1);
> +		}
> +
> +        desc = *indirect++;
> +		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
> +			fprintf(stderr, "Nested indirect descriptor\n");
> +            exit(1);
> +		}
> +
> +        /* Stop for now if there are not enough iovecs available. */
> +        if (iov >= iov_end) {
> +            return false;
> +        }
> +
> +        iov->iov_base = phys_to_host(vring, desc.addr);
> +        iov->iov_len  = desc.len;
> +        iov++;
> +
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE) {
> +			*in_num += 1;
> +		} else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (unlikely(*in_num)) {
> +				fprintf(stderr, "Indirect descriptor "
> +				       "has out after in: idx %d\n", i);
> +                exit(1);
> +			}
> +			*out_num += 1;
> +		}
> +        i = desc.next;
> +	} while (desc.flags & VRING_DESC_F_NEXT);
> +    return true;
> +}
> +
>  /* This looks in the virtqueue and for the first available buffer, and converts
>   * it to an iovec for convenient access.  Since descriptors consist of some
>   * number of output then some number of input descriptors, it's actually two
> @@ -129,23 +209,20 @@ static unsigned int vring_pop(Vring *vring,
>  		}
>          desc = vring->vr.desc[i];
>  		if (desc.flags & VRING_DESC_F_INDIRECT) {
> -/*			ret = get_indirect(dev, vq, iov, iov_size,
> -					   out_num, in_num,
> -					   log, log_num, &desc);
> -			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> -				return ret;
> -			}
> -			continue; */
> -            fprintf(stderr, "Indirect vring not supported\n");
> -            exit(1);
> +			if (!get_indirect(vring, iov, iov_end, out_num, in_num, &desc)) {
> +                return num; /* not enough iovecs, stop for now */
> +            }
> +            continue;
>  		}
>  
> +        /* If there are not enough iovecs left, stop for now.  The caller
> +         * should check if there are more descs available once they have dealt
> +         * with the current set.
> +         */
>          if (iov >= iov_end) {
> -            fprintf(stderr, "Not enough vring iovecs\n");
> -            exit(1);
> +            return num;
>          }
> +
>          iov->iov_base = phys_to_host(vring, desc.addr);
>          iov->iov_len  = desc.len;
>          iov++;
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 52ea601..591eace 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -62,6 +62,14 @@ static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
>      return (VirtIOBlock *)vdev;
>  }
>  
> +/* Normally the block driver passes down the fd, there's no way to get it from
> + * above.
> + */
> +static int get_raw_posix_fd_hack(VirtIOBlock *s)
> +{
> +    return *(int*)s->bs->file->opaque;
> +}
> +
>  static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>  {
>      VirtIOBlock *s = opaque;
> @@ -83,18 +91,6 @@ static void complete_request(struct iocb *iocb, ssize_t ret, void *opaque)
>      vring_push(&s->vring, req->head, len + sizeof req->status);
>  }
>  
> -static bool handle_io(EventHandler *handler)
> -{
> -    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> -
> -    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> -        /* TODO is this thread-safe and can it be done faster? */
> -        virtio_irq(s->vq);
> -    }
> -
> -    return true;
> -}
> -
>  static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
>  {
>      /* Virtio block requests look like this: */
> @@ -117,13 +113,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              outhdr->type, outhdr->sector);
>      */
>  
> -    if (unlikely(outhdr->type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
> +    /* TODO Linux sets the barrier bit even when not advertised! */
> +    uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
> +
> +    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
>          fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
>          exit(1);
>      }
>  
>      struct iocb *iocb;
> -    switch (outhdr->type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
> +    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
>      case VIRTIO_BLK_T_IN:
>          if (unlikely(out_num != 1)) {
>              fprintf(stderr, "virtio-blk invalid read request\n");
> @@ -145,8 +144,16 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>              fprintf(stderr, "virtio-blk invalid flush request\n");
>              exit(1);
>          }
> -        iocb = ioq_fdsync(ioq);
> -        break;
> +
> +        /* TODO fdsync is not supported by all backends, do it synchronously here! */
> +        {
> +            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
> +            fdatasync(get_raw_posix_fd_hack(s));
> +            inhdr->status = VIRTIO_BLK_S_OK;
> +            vring_push(&s->vring, head, sizeof *inhdr);
> +            virtio_irq(s->vq);
> +        }
> +        return;
>  
>      default:
>          fprintf(stderr, "virtio-blk multiple request type bits set\n");
> @@ -199,11 +206,29 @@ static bool handle_notify(EventHandler *handler)
>      }
>  
>      /* Submit requests, if any */
> -    if (likely(iov != iovec)) {
> -        if (unlikely(ioq_submit(&s->ioqueue) < 0)) {
> -            fprintf(stderr, "ioq_submit failed\n");
> -            exit(1);
> -        }
> +    int rc = ioq_submit(&s->ioqueue);
> +    if (unlikely(rc < 0)) {
> +        fprintf(stderr, "ioq_submit failed %d\n", rc);
> +        exit(1);
> +    }
> +    return true;
> +}
> +
> +static bool handle_io(EventHandler *handler)
> +{
> +    VirtIOBlock *s = container_of(handler, VirtIOBlock, io_handler);
> +
> +    if (ioq_run_completion(&s->ioqueue, complete_request, s) > 0) {
> +        /* TODO is this thread-safe and can it be done faster? */
> +        virtio_irq(s->vq);
> +    }
> +
> +    /* If there were more requests than iovecs, the vring will not be empty yet
> +     * so check again.  There should now be enough resources to process more
> +     * requests.
> +     */
> +    if (vring_more_avail(&s->vring)) {
> +        return handle_notify(&s->notify_handler);
>      }
>  
>      return true;
> @@ -217,14 +242,6 @@ static void *data_plane_thread(void *opaque)
>      return NULL;
>  }
>  
> -/* Normally the block driver passes down the fd, there's no way to get it from
> - * above.
> - */
> -static int get_raw_posix_fd_hack(VirtIOBlock *s)
> -{
> -    return *(int*)s->bs->file->opaque;
> -}
> -
>  static void data_plane_start(VirtIOBlock *s)
>  {
>      int i;
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 12/27] virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 19:03     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Asias He, Khoa Huynh

On Wed, Jul 18, 2012 at 04:07:39PM +0100, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  hw/dataplane/vring.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
> index 3eab4b4..44ef4a9 100644
> --- a/hw/dataplane/vring.h
> +++ b/hw/dataplane/vring.h
> @@ -1,6 +1,11 @@
>  #ifndef VRING_H
>  #define VRING_H
>  
> +/* Some virtio_ring.h files use BUG_ON() */

It's a bug then. Do we really need to work around broken systems?
If yes let's just ship our own headers ...

> +#ifndef BUG_ON
> +#define BUG_ON(x)
> +#endif
> +
>  #include <linux/virtio_ring.h>
>  #include "qemu-common.h"
>  
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 22/27] virtio-blk: Fix request merging
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 19:04     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Asias He, Khoa Huynh

On Wed, Jul 18, 2012 at 04:07:49PM +0100, Stefan Hajnoczi wrote:
> Khoa Huynh <khoa@us.ibm.com> discovered that request merging is broken.
> The merged iocb is not updated to reflect the total number of iovecs and
> the offset is also outdated.
> 
> This patch fixes request merging.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

So all these fixups need to be folded in making it correct first time.

> ---
>  hw/virtio-blk.c |   10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 9131a7a..51807b5 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -178,13 +178,17 @@ static void merge_request(struct iocb *iocb_a, struct iocb *iocb_b)
>          req_a->len = iocb_nbytes(iocb_a);
>      }
>  
> -    iocb_b->u.v.vec = iovec;
> -    req_b->len = iocb_nbytes(iocb_b);
> -    req_b->next_merged = req_a;
>      /*
>      fprintf(stderr, "merged %p (%u) and %p (%u), %u iovecs in total\n",
>              req_a, iocb_a->u.v.nr, req_b, iocb_b->u.v.nr, iocb_a->u.v.nr + iocb_b->u.v.nr);
>      */
> +
> +    iocb_b->u.v.vec = iovec;
> +    iocb_b->u.v.nr += iocb_a->u.v.nr;
> +    iocb_b->u.v.offset = iocb_a->u.v.offset;
> +
> +    req_b->len = iocb_nbytes(iocb_b);
> +    req_b->next_merged = req_a;
>  }
>  
>  static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 23/27] virtio-blk: Stub out SCSI commands
  2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-18 19:05     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, kvm, Anthony Liguori, Kevin Wolf, Paolo Bonzini,
	Asias He, Khoa Huynh

On Wed, Jul 18, 2012 at 04:07:50PM +0100, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Why?

> ---
>  hw/virtio-blk.c |   25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 51807b5..8734029 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -215,14 +215,8 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>  
>      /* TODO Linux sets the barrier bit even when not advertised! */
>      uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
> -
> -    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
> -        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
> -        exit(1);
> -    }
> -
>      struct iocb *iocb;
> -    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
> +    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD | VIRTIO_BLK_T_FLUSH)) {
>      case VIRTIO_BLK_T_IN:
>          if (unlikely(out_num != 1)) {
>              fprintf(stderr, "virtio-blk invalid read request\n");
> @@ -239,6 +233,21 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>          iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
>          break;
>  
> +    case VIRTIO_BLK_T_SCSI_CMD:
> +        if (unlikely(in_num == 0)) {
> +            fprintf(stderr, "virtio-blk invalid SCSI command request\n");
> +            exit(1);
> +        }
> +
> +        /* TODO support SCSI commands */
> +        {
> +            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
> +            inhdr->status = VIRTIO_BLK_S_UNSUPP;
> +            vring_push(&s->vring, head, sizeof *inhdr);
> +            virtio_blk_notify_guest(s);
> +        }
> +        return;
> +
>      case VIRTIO_BLK_T_FLUSH:
>          if (unlikely(in_num != 1 || out_num != 1)) {
>              fprintf(stderr, "virtio-blk invalid flush request\n");
> @@ -256,7 +265,7 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
>          return;
>  
>      default:
> -        fprintf(stderr, "virtio-blk multiple request type bits set\n");
> +        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
>          exit(1);
>      }
>  
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Qemu-devel] [RFC v9 18/27] virtio-blk: Call ioctl() directly instead of irqfd
  2012-07-18 15:40     ` [Qemu-devel] " Michael S. Tsirkin
@ 2012-07-19  9:11       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-19  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Kevin Wolf, Anthony Liguori, kvm, qemu-devel,
	Khoa Huynh, Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 4:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jul 18, 2012 at 04:07:45PM +0100, Stefan Hajnoczi wrote:
>> Optimize for the MSI-X enabled and vector unmasked case where it is
>> possible to issue the KVM ioctl() directly instead of using irqfd.
>
> Why? Is an ioctl faster?

I have no benchmark results comparing irqfd against direct ioctl.  It
would be interesting to know if this "optimization" is worthwhile and
how much of a win it is.

The reasoning is that the irqfd code path signals an eventfd and then
kvm.ko's poll handler injects the interrupt.  The ioctl calls straight
into kvm.ko and skips the signalling/poll step.

Stefan

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC v9 06/27] virtio-blk: Take PCI memory range into account
  2012-07-18 18:29     ` [Qemu-devel] " Michael S. Tsirkin
@ 2012-07-19  9:14       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-19  9:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, kvm, qemu-devel,
	Khoa Huynh, Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 7:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jul 18, 2012 at 04:07:33PM +0100, Stefan Hajnoczi wrote:
>> Support >4 GB physical memory accesses.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>
> Need some sane APIs, this is just too scary.

Yes, this prototype has (at least) two layering violations:
1. Bypassing QEMU's memory subsystem because it isn't thread-safe.
2. Bypassing the block layer and extracting the underlying fd from a
raw-posix file, allowing us to do our own Linux AIO.

Stefan


* Re: [Qemu-devel] [RFC v9 06/27] virtio-blk: Take PCI memory range into account
  2012-07-19  9:14       ` [Qemu-devel] " Stefan Hajnoczi
@ 2012-07-19  9:16         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-19  9:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Kevin Wolf, Anthony Liguori, kvm, qemu-devel,
	Khoa Huynh, Paolo Bonzini, Asias He

On Thu, Jul 19, 2012 at 10:14 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Jul 18, 2012 at 7:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, Jul 18, 2012 at 04:07:33PM +0100, Stefan Hajnoczi wrote:
>>> Support >4 GB physical memory accesses.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>
>> Need some sane APIs, this is just too scary.
>
> Yes, this prototype has (at least) two layering violations:
> 1. Bypassing QEMU's memory subsystem because it isn't thread-safe.

Maybe we can use MemoryListener to fix this, although we need to take
care of thread-safety.

Stefan


* Re: [Qemu-devel] [RFC v9 18/27] virtio-blk: Call ioctl() directly instead of irqfd
  2012-07-19  9:11       ` Stefan Hajnoczi
@ 2012-07-19  9:19         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2012-07-19  9:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Kevin Wolf, Anthony Liguori, kvm, qemu-devel,
	Khoa Huynh, Paolo Bonzini, Asias He

On Thu, Jul 19, 2012 at 10:11:49AM +0100, Stefan Hajnoczi wrote:
> On Wed, Jul 18, 2012 at 4:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jul 18, 2012 at 04:07:45PM +0100, Stefan Hajnoczi wrote:
> >> Optimize for the MSI-X enabled and vector unmasked case where it is
> >> possible to issue the KVM ioctl() directly instead of using irqfd.
> >
> > Why? Is an ioctl faster?
> 
> I have no benchmark results comparing irqfd against direct ioctl.  It
> would be interesting to know if this "optimization" is worthwhile and
> how much of a win it is.
> 
> The reasoning is that the irqfd code path signals an eventfd and then
> kvm.ko's poll handler injects the interrupt.  The ioctl calls straight
> into kvm.ko and skips the signalling/poll step.
> 
> Stefan

Polling is done in kernel so at least for MSI it's just a function call.
In fact ATM irqfd is more optimized.  Maybe it's faster for level IRQs
but do we really care?

-- 
MST


* Re: [Qemu-devel] [RFC v9 06/27] virtio-blk: Take PCI memory range into account
  2012-07-19  9:16         ` Stefan Hajnoczi
@ 2012-07-19  9:29           ` Avi Kivity
  -1 siblings, 0 replies; 90+ messages in thread
From: Avi Kivity @ 2012-07-19  9:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Kevin Wolf, Anthony Liguori,
	kvm, qemu-devel, Khoa Huynh, Paolo Bonzini, Asias He

On 07/19/2012 12:16 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 19, 2012 at 10:14 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Wed, Jul 18, 2012 at 7:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Wed, Jul 18, 2012 at 04:07:33PM +0100, Stefan Hajnoczi wrote:
>>>> Support >4 GB physical memory accesses.
>>>>
>>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>>
>>> Need some sane APIs, this is just too scary.
>>
>> Yes, this prototype has (at least) two layering violations:
>> 1. Bypassing QEMU's memory subsystem because it isn't thread-safe.
> 
> Maybe we can use MemoryListener to fix this, although we need to take
> care of thread-safety.

Better to fix up the memory core.  It should be relatively easy to
rcu-ify it, as it already works by building a new memory map instead of
updating it in place.  Currently it has one root pointer; all that is
needed is to build the new map into a temporary pointer, then use
rcu_assign_pointer() to atomically switch the memory map.


-- 
error compiling committee.c: too many arguments to function




* Re: [Qemu-devel] [RFC v9 00/27] virtio: virtio-blk data plane
  2012-07-18 15:49   ` [Qemu-devel] " Michael S. Tsirkin
@ 2012-07-19  9:48     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 90+ messages in thread
From: Stefan Hajnoczi @ 2012-07-19  9:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Kevin Wolf, Anthony Liguori, kvm, qemu-devel,
	Khoa Huynh, Paolo Bonzini, Asias He

On Wed, Jul 18, 2012 at 4:49 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jul 18, 2012 at 04:07:27PM +0100, Stefan Hajnoczi wrote:
>> This series implements a dedicated thread for virtio-blk processing using Linux
>> AIO for raw image files only.  It is based on qemu-kvm.git a0bc8c3 and somewhat
>> old but I wanted to share it on the list since it has been mentioned on mailing
>> lists and IRC recently.
>
> BTW are there any bugfixes here upstream needs?
> I could not tell.

Thanks for your review.  I agree with the cleanup comments, which are
all necessary to get this into qemu.git shape.

Regarding bugfixes, the small bugfixes included in this series only
fix the dataplane code.  Upstream is unaffected.

Stefan


end of thread, other threads:[~2012-07-19  9:48 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-18 15:07 [RFC v9 00/27] virtio: virtio-blk data plane Stefan Hajnoczi
2012-07-18 15:07 ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 01/27] virtio-blk: Remove virtqueue request handling code Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 02/27] virtio-blk: Set up host notifier for data plane Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 03/27] virtio-blk: Data plane thread event loop Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 04/27] virtio-blk: Map vring Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 05/27] virtio-blk: Do cheapest possible memory mapping Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 06/27] virtio-blk: Take PCI memory range into account Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 18:29   ` Michael S. Tsirkin
2012-07-18 18:29     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-19  9:14     ` Stefan Hajnoczi
2012-07-19  9:14       ` [Qemu-devel] " Stefan Hajnoczi
2012-07-19  9:16       ` Stefan Hajnoczi
2012-07-19  9:16         ` Stefan Hajnoczi
2012-07-19  9:29         ` Avi Kivity
2012-07-19  9:29           ` Avi Kivity
2012-07-18 15:07 ` [RFC v9 07/27] virtio-blk: Put dataplane code into its own directory Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 08/27] virtio-blk: Read requests from the vring Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 09/27] virtio-blk: Add Linux AIO queue Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 10/27] virtio-blk: Stop data plane thread cleanly Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 11/27] virtio-blk: Indirect vring and flush support Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 18:28   ` Michael S. Tsirkin
2012-07-18 18:28     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 19:02   ` Michael S. Tsirkin
2012-07-18 19:02     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 15:07 ` [RFC v9 12/27] virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 19:03   ` Michael S. Tsirkin
2012-07-18 19:03     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 15:07 ` [RFC v9 13/27] virtio-blk: Increase max requests for indirect vring Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 14/27] virtio-blk: Use pthreads instead of qemu-thread Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 15/27] notifier: Add a function to set the notifier Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 16/27] virtio-blk: Kick data plane thread using event notifier set Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 17/27] virtio-blk: Use guest notifier to raise interrupts Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 18/27] virtio-blk: Call ioctl() directly instead of irqfd Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:40   ` Michael S. Tsirkin
2012-07-18 15:40     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-19  9:11     ` Stefan Hajnoczi
2012-07-19  9:11       ` Stefan Hajnoczi
2012-07-19  9:19       ` Michael S. Tsirkin
2012-07-19  9:19         ` Michael S. Tsirkin
2012-07-18 15:07 ` [RFC v9 19/27] virtio-blk: Disable guest->host notifies while processing vring Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 20/27] virtio-blk: Add ioscheduler to detect mergable requests Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 21/27] virtio-blk: Add basic request merging Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 22/27] virtio-blk: Fix " Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 19:04   ` Michael S. Tsirkin
2012-07-18 19:04     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 15:07 ` [RFC v9 23/27] virtio-blk: Stub out SCSI commands Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 19:05   ` Michael S. Tsirkin
2012-07-18 19:05     ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 15:07 ` [RFC v9 24/27] virtio-blk: fix incorrect length Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 25/27] msix: fix irqchip breakage in msix_try_notify_from_thread() Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 26/27] msix: use upstream kvm_irqchip_set_irq() Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:07 ` [RFC v9 27/27] virtio-blk: add EVENT_IDX support to dataplane Stefan Hajnoczi
2012-07-18 15:07   ` [Qemu-devel] " Stefan Hajnoczi
2012-07-18 15:43 ` [RFC v9 00/27] virtio: virtio-blk data plane Michael S. Tsirkin
2012-07-18 15:43   ` [Qemu-devel] " Michael S. Tsirkin
2012-07-18 16:18   ` Khoa Huynh
2012-07-18 16:18     ` [Qemu-devel] " Khoa Huynh
2012-07-18 16:41   ` Khoa Huynh
2012-07-18 16:41     ` [Qemu-devel] " Khoa Huynh
2012-07-18 15:49 ` Michael S. Tsirkin
2012-07-18 15:49   ` [Qemu-devel] " Michael S. Tsirkin
2012-07-19  9:48   ` Stefan Hajnoczi
2012-07-19  9:48     ` Stefan Hajnoczi
