[PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
@ 2011-12-21  1:00 Minchan Kim
  2011-12-21  1:00 ` [PATCH 1/6] block: add bio_map_sg Minchan Kim
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Minchan Kim

This patch is follow-up of Christohp Hellwig's work
[RFC: ->make_request support for virtio-blk].
http://thread.gmane.org/gmane.linux.kernel/1199763

Quote from hch
"This patchset allows the virtio-blk driver to support much higher IOP
rates which can be driven out of modern PCI-e flash devices.  At this
point it really is just a RFC due to various issues."

I fixed race bug and add batch I/O for enhancing sequential I/O,
FLUSH/FUA emulation.

I tested this patch on fusion I/O device by aio-stress.
Result is following as.

Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
Test repeated by 20 times
Guest I/O scheduler : CFQ
Host I/O scheduler : NOOP

            Request            	BIO(patch 1-4)          BIO-batch(patch 1-6)
         (MB/s)  stddev 	(MB/s)  stddev  	(MB/s)  stddev
w        737.820 4.063  	613.735 31.605  	730.288 24.854
rw       208.754 20.450 	314.630 37.352  	317.831 41.719
r        770.974 2.340  	347.483 51.370  	750.324 8.280
rr       250.391 16.910 	350.053 29.986  	325.976 24.846

This patch enhances ramdom I/O performance compared to request-based I/O path.
It's still RFC so welcome to any comment and review.

Christoph Hellwig (3):
  block: add bio_map_sg
  virtio: support unlocked queue kick
  virtio-blk: remove the unused list of pending requests

Minchan Kim (3):
  virtio-blk: implement ->make_request
  virtio-blk: Support batch I/O for enhancing sequential IO
  virtio-blk: Emulate Flush/FUA

 block/blk-merge.c            |   63 ++++
 drivers/block/virtio_blk.c   |  690 ++++++++++++++++++++++++++++++++++++++----
 drivers/virtio/virtio_ring.c |   33 ++-
 include/linux/blkdev.h       |    2 +
 include/linux/virtio.h       |   21 ++
 5 files changed, 737 insertions(+), 72 deletions(-)

-- 
1.7.6.4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/6] block: add bio_map_sg
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-21  1:00 ` [PATCH 2/6] virtio: support unlocked queue kick Minchan Kim
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Christoph Hellwig, Minchan Kim

From: Christoph Hellwig <hch@lst.de>

Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
This helper is useful for any driver that wants to create a scatterlist
from its ->make_request method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 block/blk-merge.c      |   63 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |    2 +
 2 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index cfcc37c..a8ac944 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -199,6 +199,69 @@ new_segment:
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/*
+ * map a bio to a scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold bio->bi_phys_segments entries
+ */
+int bio_map_sg(struct request_queue *q, struct bio *bio,
+		  struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+	unsigned long i;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	bvprv = NULL;
+	sg = NULL;
+	bio_for_each_segment(bvec, bio, i) {
+		int nbytes = bvec->bv_len;
+
+		if (bvprv && cluster) {
+			if (sg->length + nbytes > queue_max_segment_size(q))
+				goto new_segment;
+
+			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+				goto new_segment;
+
+			sg->length += nbytes;
+		} else {
+new_segment:
+			if (!sg)
+				sg = sglist;
+			else {
+				/*
+				 * If the driver previously mapped a shorter
+				 * list, we could see a termination bit
+				 * prematurely unless it fully inits the sg
+				 * table on each mapping. We KNOW that there
+				 * must be more entries here or the driver
+				 * would be buggy, so force clear the
+				 * termination bit to avoid doing a full
+				 * sg_init_table() in drivers for each command.
+				 */
+				sg->page_link &= ~0x02;
+				sg = sg_next(sg);
+			}
+
+			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
+			nsegs++;
+		}
+		bvprv = bvec;
+	} /* segments in bio */
+
+	if (sg)
+		sg_mark_end(sg);
+
+	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+	return nsegs;
+}
+EXPORT_SYMBOL(bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
 				    struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 94acd81..7ad8e89 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,6 +853,8 @@ extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+extern int bio_map_sg(struct request_queue *q, struct bio *bio,
+		struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/6] virtio: support unlocked queue kick
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
  2011-12-21  1:00 ` [PATCH 1/6] block: add bio_map_sg Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-21  1:00 ` [PATCH 3/6] virtio-blk: remove the unused list of pending requests Minchan Kim
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Christoph Hellwig, Minchan Kim

From: Christoph Hellwig <hch@lst.de>

Split virtqueue_kick to be able to do the actual notification outside the
lock protecting the virtqueue.  This patch was originally done by
Stefan Hajnoczi, but I can't find the original one anymore and had to
recreated it from memory.  Pointers to the original or corrections for
the commit message are welcome.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 drivers/virtio/virtio_ring.c |   33 ++++++++++++++++++++++++++-------
 include/linux/virtio.h       |   21 +++++++++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..c5f0458 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -238,10 +238,12 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
-void virtqueue_kick(struct virtqueue *_vq)
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	u16 new, old;
+	bool needs_kick;
+
 	START_USE(vq);
 	/* Descriptors and available array need to be set before we expose the
 	 * new available array entries. */
@@ -254,13 +256,30 @@ void virtqueue_kick(struct virtqueue *_vq)
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
-	if (vq->event ?
-	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
-	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-		/* Prod other side to tell it about changes. */
-		vq->notify(&vq->vq);
-
+	if (vq->event) {
+		needs_kick = vring_need_event(vring_avail_event(&vq->vring),
+					      new, old);
+	} else {
+		needs_kick = (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY));
+	}
 	END_USE(vq);
+	return needs_kick;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
+
+void virtqueue_notify(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	/* Prod other side to tell it about changes. */
+	vq->notify(_vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_notify);
+
+void virtqueue_kick(struct virtqueue *vq)
+{
+	if (virtqueue_kick_prepare(vq))
+		virtqueue_notify(vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick);
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 4c069d8..722a35d 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -90,6 +90,27 @@ static inline int virtqueue_add_buf(struct virtqueue *vq,
 
 void virtqueue_kick(struct virtqueue *vq);
 
+/**
+ * virtqueue_kick_prepare - first half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * Instead of virtqueue_kick(), you can do:
+ *	if (virtqueue_kick_prepare(vq))
+ *		virtqueue_notify(vq);
+ *
+ * This is sometimes useful because the virtqueue_kick_prepare() needs
+ * to be serialized, but the actual virtqueue_notify() call does not.
+ */
+bool virtqueue_kick_prepare(struct virtqueue *vq);
+
+/**
+ * virtqueue_notify - second half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * This does not need to be serialized.
+ */
+void virtqueue_notify(struct virtqueue *vq);
+
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void virtqueue_disable_cb(struct virtqueue *vq);
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/6] virtio-blk: remove the unused list of pending requests
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
  2011-12-21  1:00 ` [PATCH 1/6] block: add bio_map_sg Minchan Kim
  2011-12-21  1:00 ` [PATCH 2/6] virtio: support unlocked queue kick Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-21  1:00 ` [PATCH 4/6] virtio-blk: implement ->make_request Minchan Kim
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Christoph Hellwig, Minchan Kim

From: Christoph Hellwig <hch@lst.de>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 drivers/block/virtio_blk.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4d0b70a..26d4443 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -28,9 +28,6 @@ struct virtio_blk
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
 
-	/* Request tracking. */
-	struct list_head reqs;
-
 	mempool_t *pool;
 
 	/* Process context for config space updates */
@@ -48,7 +45,6 @@ struct virtio_blk
 
 struct virtblk_req
 {
-	struct list_head list;
 	struct request *req;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
@@ -92,7 +88,6 @@ static void blk_done(struct virtqueue *vq)
 		}
 
 		__blk_end_request_all(vbr->req, error);
-		list_del(&vbr->list);
 		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
@@ -177,7 +172,6 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		return false;
 	}
 
-	list_add_tail(&vbr->list, &vblk->reqs);
 	return true;
 }
 
@@ -383,7 +377,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_free_index;
 	}
 
-	INIT_LIST_HEAD(&vblk->reqs);
 	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
@@ -544,9 +537,6 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 
 	flush_work(&vblk->config_work);
 
-	/* Nothing should be pending. */
-	BUG_ON(!list_empty(&vblk->reqs));
-
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/6] virtio-blk: implement ->make_request
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (2 preceding siblings ...)
  2011-12-21  1:00 ` [PATCH 3/6] virtio-blk: remove the unused list of pending requests Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-22 12:20   ` Stefan Hajnoczi
  2011-12-21  1:00 ` [PATCH 5/6] virtio-blk: Support batch I/O for enhancing sequential IO Minchan Kim
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Minchan Kim, Christoph Hellwig, Minchan Kim

Change I/O path from request-based to bio-based for virtio-blk.

This is required for high IOPs devices which get slowed down to 1/5th of
the native speed by all the locking, memory allocation and other overhead
in the request based I/O path.

But it still supports request-based IO path for scsi ioctl but it's just
used for ioctl, not file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 drivers/block/virtio_blk.c |  303 ++++++++++++++++++++++++++++++++++++--------
 1 files changed, 247 insertions(+), 56 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 26d4443..4e476d6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -12,6 +12,7 @@
 #include <linux/idr.h>
 
 #define PART_BITS 4
+static int use_make_request = 1;
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
@@ -24,6 +25,7 @@ struct virtio_blk
 
 	struct virtio_device *vdev;
 	struct virtqueue *vq;
+	wait_queue_head_t queue_wait;
 
 	/* The disk structure for the kernel. */
 	struct gendisk *disk;
@@ -38,61 +40,124 @@ struct virtio_blk
 
 	/* Ida index - used to track minor number allocations. */
 	int index;
-
-	/* Scatterlist: can be too big for stack. */
-	struct scatterlist sg[/*sg_elems*/];
 };
 
 struct virtblk_req
 {
-	struct request *req;
+	void *private;
+	struct virtblk_req *next;
+
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
+	u8 kind;
+#define VIRTIO_BLK_REQUEST	0x00
+#define VIRTIO_BLK_BIO		0x01
 	u8 status;
+
+	struct scatterlist sg[];
 };
 
+static struct virtblk_req *alloc_virtblk_req(struct virtio_blk *vblk,
+		gfp_t gfp_mask)
+{
+	struct virtblk_req *vbr;
+
+	vbr = mempool_alloc(vblk->pool, gfp_mask);
+	if (vbr)
+		sg_init_table(vbr->sg, vblk->sg_elems);
+
+	return vbr;
+}
+
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+	switch (vbr->status) {
+	case VIRTIO_BLK_S_OK:
+		return 0;
+	case VIRTIO_BLK_S_UNSUPP:
+		return -ENOTTY;
+	default:
+		return -EIO;
+	}
+}
+
+static void virtblk_request_done(struct virtio_blk *vblk,
+		struct virtblk_req *vbr)
+{
+	struct request *req = vbr->private;
+	int error = virtblk_result(vbr);
+
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		req->resid_len = vbr->in_hdr.residual;
+		req->sense_len = vbr->in_hdr.sense_len;
+		req->errors = vbr->in_hdr.errors;
+	}
+	else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+		printk("REQ_TYPE_SPECIAL done\n");
+		req->errors = (error != 0);
+	}
+
+	__blk_end_request_all(req, error);
+	mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_bio_done(struct virtio_blk *vblk,
+		struct virtblk_req *vbr)
+{
+	bio_endio(vbr->private, virtblk_result(vbr));
+	mempool_free(vbr, vblk->pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
-	struct virtblk_req *vbr;
+	struct virtblk_req *vbr, *head = NULL, *tail = NULL;
 	unsigned int len;
 	unsigned long flags;
 
 	spin_lock_irqsave(&vblk->lock, flags);
 	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		int error;
-
-		switch (vbr->status) {
-		case VIRTIO_BLK_S_OK:
-			error = 0;
+		switch (vbr->kind) {
+		case VIRTIO_BLK_REQUEST:
+			virtblk_request_done(vblk, vbr);
+			/*
+			 * In case queue is stopped waiting
+			 * for more buffers.
+			 */
+        		blk_start_queue(vblk->disk->queue);
 			break;
-		case VIRTIO_BLK_S_UNSUPP:
-			error = -ENOTTY;
+		case VIRTIO_BLK_BIO:
+			if (head) {
+				tail->next = vbr;
+				tail = vbr;
+			} else {
+				tail = head = vbr;
+			}
 			break;
 		default:
-			error = -EIO;
-			break;
+			BUG();
 		}
 
-		switch (vbr->req->cmd_type) {
-		case REQ_TYPE_BLOCK_PC:
-			vbr->req->resid_len = vbr->in_hdr.residual;
-			vbr->req->sense_len = vbr->in_hdr.sense_len;
-			vbr->req->errors = vbr->in_hdr.errors;
-			break;
-		case REQ_TYPE_SPECIAL:
-			vbr->req->errors = (error != 0);
+	}
+
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	wake_up(&vblk->queue_wait);
+	/*
+	 * Process completions after freeing up space in the virtqueue and
+	 * dropping the lock.
+	 */
+	while (head) {
+		vbr = head;
+		head = head->next;
+
+		switch (vbr->kind) {
+		case VIRTIO_BLK_BIO:
+			virtblk_bio_done(vblk, vbr);
 			break;
 		default:
-			break;
+			BUG();
 		}
-
-		__blk_end_request_all(vbr->req, error);
-		mempool_free(vbr, vblk->pool);
 	}
-	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
-	spin_unlock_irqrestore(&vblk->lock, flags);
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -101,33 +166,29 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	unsigned long num, out = 0, in = 0;
 	struct virtblk_req *vbr;
 
-	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+	vbr = alloc_virtblk_req(vblk, GFP_ATOMIC);
 	if (!vbr)
-		/* When another request finishes we'll try again. */
 		return false;
 
-	vbr->req = req;
+	vbr->private = req;
+	vbr->next = NULL;
+	vbr->kind = VIRTIO_BLK_REQUEST;
 
 	if (req->cmd_flags & REQ_FLUSH) {
 		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
 		vbr->out_hdr.sector = 0;
-		vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+		vbr->out_hdr.ioprio = req_get_ioprio(req);
 	} else {
 		switch (req->cmd_type) {
-		case REQ_TYPE_FS:
-			vbr->out_hdr.type = 0;
-			vbr->out_hdr.sector = blk_rq_pos(vbr->req);
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
-			break;
 		case REQ_TYPE_BLOCK_PC:
 			vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
 			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.ioprio = req_get_ioprio(req);
 			break;
 		case REQ_TYPE_SPECIAL:
 			vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID;
 			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.ioprio = req_get_ioprio(req);
 			break;
 		default:
 			/* We don't put anything else in the queue. */
@@ -135,7 +196,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
 	/*
 	 * If this is a packet command we need a couple of additional headers.
@@ -143,22 +204,23 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 	 * block, and before the normal inhdr we put the sense data and the
 	 * inhdr with additional status information before the normal inhdr.
 	 */
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
-		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
+		sg_set_buf(&vbr->sg[out++], req->cmd, req->cmd_len);
 
-	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
+	num = blk_rq_map_sg(q, req, vbr->sg + out);
 
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
-		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
+	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vbr->sg[num + out + in++], req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
+		sg_set_buf(&vbr->sg[num + out + in++], &vbr->in_hdr,
 			   sizeof(vbr->in_hdr));
 	}
 
-	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
 		   sizeof(vbr->status));
 
 	if (num) {
-		if (rq_data_dir(vbr->req) == WRITE) {
+		if (rq_data_dir(req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
 			out += num;
 		} else {
@@ -167,7 +229,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		}
 	}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr) < 0) {
+	if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr) < 0) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
@@ -198,6 +260,133 @@ static void do_virtblk_request(struct request_queue *q)
 		virtqueue_kick(vblk->vq);
 }
 
+struct virtblk_plug_cb {
+	struct blk_plug_cb cb;
+	struct virtio_blk *vblk;
+};
+
+static void virtblk_unplug(struct blk_plug_cb *bcb)
+{
+	struct virtblk_plug_cb *cb =
+		container_of(bcb, struct virtblk_plug_cb, cb);
+
+	virtqueue_notify(cb->vblk->vq);
+	kfree(cb);
+}
+
+static bool virtblk_plugged(struct virtio_blk *vblk)
+{
+	struct blk_plug *plug = current->plug;
+	struct virtblk_plug_cb *cb;
+
+	if (!plug)
+		return false;
+
+	list_for_each_entry(cb, &plug->cb_list, cb.list) {
+		if (cb->cb.callback == virtblk_unplug && cb->vblk == vblk)
+			return true;
+	}
+
+	/* Not currently on the callback list */
+	cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+	if (!cb)
+		return false;
+
+	cb->vblk = vblk;
+	cb->cb.callback = virtblk_unplug;
+	list_add(&cb->cb.list, &plug->cb_list);
+	return true;
+}
+
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+	struct virtblk_req *vbr, unsigned long out, unsigned long in)
+{
+	DEFINE_WAIT(wait);
+	bool retry, notify;
+
+	for (;;) {
+		prepare_to_wait(&vblk->queue_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(&vblk->lock);
+		if (virtqueue_add_buf(vblk->vq, vbr->sg,
+			out, in, vbr) < 0) {
+			retry = true;
+		} else {
+			retry = false;
+		}
+		notify = virtqueue_kick_prepare(vblk->vq);
+		spin_unlock_irq(&vblk->lock);
+
+		if (notify)
+			virtqueue_notify(vblk->vq);
+
+		if (!retry)
+			break;
+		schedule();
+	}
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	unsigned long num, out = 0, in = 0;
+	struct virtblk_req *vbr;
+	bool retry, notify;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+	if (!vbr) {
+		bio_endio(bio, -ENOMEM);
+		return;
+	}
+
+	vbr->private = bio;
+	vbr->next = NULL;
+	vbr->kind = VIRTIO_BLK_BIO;
+
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	num = bio_map_sg(q, bio, vbr->sg + out);
+
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
+	}
+
+	spin_lock_irq(&vblk->lock);
+	if (virtqueue_add_buf(vblk->vq, vbr->sg,
+		out, in, vbr) < 0) {
+		retry = true;
+	} else {
+		retry = false;
+	}
+
+	notify = virtqueue_kick_prepare(vblk->vq);
+	spin_unlock_irq(&vblk->lock);
+
+	if (notify && !virtblk_plugged(vblk))
+		virtqueue_notify(vblk->vq);
+
+	if (retry)
+		virtblk_add_buf_wait(vblk, vbr, out, in);
+}
+
 /* return id (s/n) string for *disk to *id_str
  */
 static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -208,7 +397,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
 	int err;
 
 	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
-			   GFP_KERNEL);
+			GFP_KERNEL);
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
 
@@ -370,17 +559,16 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 
 	/* We need an extra sg elements at head and tail. */
 	sg_elems += 2;
-	vdev->priv = vblk = kmalloc(sizeof(*vblk) +
-				    sizeof(vblk->sg[0]) * sg_elems, GFP_KERNEL);
+	vdev->priv = vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
 	if (!vblk) {
 		err = -ENOMEM;
 		goto out_free_index;
 	}
 
+	init_waitqueue_head(&vblk->queue_wait);
 	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
-	sg_init_table(vblk->sg, vblk->sg_elems);
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
 
 	/* We expect one virtqueue, for output. */
@@ -390,7 +578,9 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_free_vblk;
 	}
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	vblk->pool = mempool_create_kmalloc_pool(1,
+		sizeof(struct virtblk_req) +
+		sizeof(struct scatterlist) * sg_elems);
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -409,6 +599,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		goto out_put_disk;
 	}
 
+	blk_queue_make_request(q, virtblk_make_request);
 	q->queuedata = vblk;
 
 	if (index < 26) {
@@ -432,7 +623,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->index = index;
 
 	/* configure queue flush support */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_make_request)
 		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/6] virtio-blk: Support batch I/O for enhancing sequential IO
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (3 preceding siblings ...)
  2011-12-21  1:00 ` [PATCH 4/6] virtio-blk: implement ->make_request Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-21  1:00 ` [PATCH 6/6] virtio-blk: Emulate Flush/FUA Minchan Kim
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Minchan Kim, Minchan Kim

BIO-based path has a disadvantage which it's not good to sequential
stream because it cannot merge BIO while reuqest can do it.

This patch makes per-cpu BIO for batch I/O.
If this request is contiguous with previous's one, this request would
be merged with previous one on batch queue.
If non-contiguous I/O issue or pass 1ms, batch queue would be drained.

Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 drivers/block/virtio_blk.c |  366 +++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 331 insertions(+), 35 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4e476d6..e32c69e 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -19,6 +19,28 @@ static DEFINE_IDA(vd_index_ida);
 
 struct workqueue_struct *virtblk_wq;
 
+#define BIO_QUEUE_MAX	32
+
+struct per_cpu_bio
+{
+	struct bio *bios[BIO_QUEUE_MAX];
+	int idx;			/* current index */
+	struct virtio_blk *vblk;
+	struct request_queue *q;
+	struct delayed_work dwork;
+	unsigned int segments; 		/* the number of accumulated segement */
+	bool seq_mode;			/* sequential mode */
+	sector_t next_offset;		/*
+					 * next expected sector offset
+					 * for becoming sequential mode
+					 */
+};
+
+struct bio_queue
+{
+	struct per_cpu_bio __percpu *pcbio;
+};
+
 struct virtio_blk
 {
 	spinlock_t lock;
@@ -38,6 +60,9 @@ struct virtio_blk
 	/* What host tells us, plus 2 for header & tailer. */
 	unsigned int sg_elems;
 
+	/* bio queue for batch IO */
+	struct bio_queue bq;
+
 	/* Ida index - used to track minor number allocations. */
 	int index;
 };
@@ -57,6 +82,8 @@ struct virtblk_req
 	struct scatterlist sg[];
 };
 
+static void wait_virtq_flush(struct virtio_blk *vblk);
+
 static struct virtblk_req *alloc_virtblk_req(struct virtio_blk *vblk,
 		gfp_t gfp_mask)
 {
@@ -93,7 +120,6 @@ static void virtblk_request_done(struct virtio_blk *vblk,
 		req->errors = vbr->in_hdr.errors;
 	}
 	else if (req->cmd_type == REQ_TYPE_SPECIAL) {
-		printk("REQ_TYPE_SPECIAL done\n");
 		req->errors = (error != 0);
 	}
 
@@ -104,7 +130,15 @@ static void virtblk_request_done(struct virtio_blk *vblk,
 static void virtblk_bio_done(struct virtio_blk *vblk,
 		struct virtblk_req *vbr)
 {
-	bio_endio(vbr->private, virtblk_result(vbr));
+	struct bio *bio;
+	bio = vbr->private;
+
+	while(bio) {
+		struct bio *free_bio = bio;
+		bio = bio->bi_next;
+		bio_endio(free_bio, virtblk_result(vbr));
+	}
+
 	mempool_free(vbr, vblk->pool);
 }
 
@@ -298,52 +332,220 @@ static bool virtblk_plugged(struct virtio_blk *vblk)
 	return true;
 }
 
-static void virtblk_add_buf_wait(struct virtio_blk *vblk,
-	struct virtblk_req *vbr, unsigned long out, unsigned long in)
+bool seq_bio(struct bio *bio, struct per_cpu_bio __percpu *pcbio)
 {
-	DEFINE_WAIT(wait);
-	bool retry, notify;
+	struct bio *last_bio;
+	int index = pcbio->idx - 1;
 
-	for (;;) {
-		prepare_to_wait(&vblk->queue_wait, &wait,
-				TASK_UNINTERRUPTIBLE);
+	BUG_ON(index < 0 || index > BIO_QUEUE_MAX);
+	last_bio = pcbio->bios[index];
+
+	if (last_bio->bi_rw != bio->bi_rw)
+		return false;
+
+	if ((last_bio->bi_sector + (last_bio->bi_size >> 9)) ==
+				bio->bi_sector)
+		return true;
+
+	return false;
+}
+
+int add_pcbio_to_vq(struct per_cpu_bio __percpu *pcbio,
+		struct virtio_blk *vblk, struct request_queue *q,
+		int *notify)
+{
+	int i;
+	unsigned long num = 0, out = 0, in = 0;
+	bool retry;
+	struct virtblk_req *vbr;
+	struct bio *bio;
+
+	vbr = alloc_virtblk_req(vblk, GFP_ATOMIC);
+	if (!vbr)
+		return 1;
+
+	vbr->private = NULL;
+	vbr->next = NULL;
+	vbr->kind = VIRTIO_BLK_BIO;
+
+	bio = pcbio->bios[0];
+	BUG_ON(!bio);
+
+	vbr->out_hdr.type = 0;
+	vbr->out_hdr.sector = bio->bi_sector;
+	vbr->out_hdr.ioprio = bio_prio(bio);
+
+	sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
-		spin_lock_irq(&vblk->lock);
-		if (virtqueue_add_buf(vblk->vq, vbr->sg,
-			out, in, vbr) < 0) {
-			retry = true;
+	for ( i = 0; i < pcbio->idx; i++) {
+		struct bio *prev;
+		bio = pcbio->bios[i];
+
+		BUG_ON(!bio);
+		num += bio_map_sg(q, bio, vbr->sg + out + num);
+		BUG_ON(num > (vblk->sg_elems - 2));
+
+		prev = vbr->private;
+		if (prev)
+			bio->bi_next = prev;
+		vbr->private = bio;
+	}
+
+	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (bio->bi_rw & REQ_WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
 		} else {
-			retry = false;
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
 		}
-		notify = virtqueue_kick_prepare(vblk->vq);
-		spin_unlock_irq(&vblk->lock);
+	}
+
+	spin_lock_irq(&vblk->lock);
+	if (virtqueue_add_buf(vblk->vq, vbr->sg,
+		out, in, vbr) < 0) {
+		struct bio *bio, *next_bio;
 
-		if (notify)
-			virtqueue_notify(vblk->vq);
+		retry = true;
 
-		if (!retry)
-			break;
-		schedule();
+		bio = vbr->private;
+		while(bio) {
+			next_bio = bio->bi_next;
+			bio->bi_next = NULL;
+			bio = next_bio;
+		}
+
+		mempool_free(vbr, vblk->pool);
+
+	} else {
+
+		for ( i = 0; i < pcbio->idx; i++) {
+			pcbio->bios[i] = NULL;
+		}
+
+		pcbio->idx = 0;
+		pcbio->segments = 0;
+
+		retry = false;
 	}
-	finish_wait(&vblk->queue_wait, &wait);
+
+	*notify |= virtqueue_kick_prepare(vblk->vq);
+	spin_unlock_irq(&vblk->lock);
+
+	return retry;
 }
 
-static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+/*
+ * Return 0 if it is successful flush
+ * This function might be able to don't flush so caller
+ * should retry it.
+ */
+int try_flush_pcb(struct per_cpu_bio __percpu *pcbio)
 {
-	struct virtio_blk *vblk = q->queuedata;
-	unsigned long num, out = 0, in = 0;
-	struct virtblk_req *vbr;
-	bool retry, notify;
+	int notify = 0;
 
-	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
-	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+	if (!pcbio->idx)
+		return 0;
 
-	vbr = alloc_virtblk_req(vblk, GFP_NOIO);
-	if (!vbr) {
-		bio_endio(bio, -ENOMEM);
-		return;
+	if (add_pcbio_to_vq(pcbio, pcbio->vblk, pcbio->q, &notify)) {
+		virtqueue_notify(pcbio->vblk->vq);
+		return 1;
 	}
 
+	if (notify && !virtblk_plugged(pcbio->vblk))
+		virtqueue_notify(pcbio->vblk->vq);
+
+	return 0;
+}
+
+static void virtblk_delay_q_flush(struct work_struct *work)
+{
+	struct per_cpu_bio __percpu *pcbio =
+		container_of(work, struct per_cpu_bio, dwork.work);
+
+	while(try_flush_pcb(pcbio))
+		wait_virtq_flush(pcbio->vblk);
+}
+
+void wait_virtq_flush(struct virtio_blk *vblk)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&vblk->queue_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+	schedule();
+	finish_wait(&vblk->queue_wait, &wait);
+}
+
+void add_bio_to_pcbio(struct bio *bio, struct per_cpu_bio __percpu *pcbio)
+{
+	BUG_ON(pcbio->idx >= BIO_QUEUE_MAX);
+
+	pcbio->bios[pcbio->idx++] = bio;
+	pcbio->segments += bio->bi_phys_segments;
+	/*
+	 * If this bio is first bio on queue, start timer to flush
+	 * bio within 1ms.
+	 */
+	if (pcbio->idx == 1)
+		queue_delayed_work_on(smp_processor_id(),
+			virtblk_wq, &pcbio->dwork,
+			msecs_to_jiffies(1));
+}
+
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+        struct virtblk_req *vbr, unsigned long out, unsigned long in)
+{
+        DEFINE_WAIT(wait);
+        bool retry, notify;
+
+        for (;;) {
+                prepare_to_wait(&vblk->queue_wait, &wait,
+                                TASK_UNINTERRUPTIBLE);
+
+                spin_lock_irq(&vblk->lock);
+                if (virtqueue_add_buf(vblk->vq, vbr->sg,
+                        out, in, vbr) < 0) {
+                        retry = true;
+                } else {
+                        retry = false;
+                }
+                notify = virtqueue_kick_prepare(vblk->vq);
+                spin_unlock_irq(&vblk->lock);
+
+                if (notify)
+                        virtqueue_notify(vblk->vq);
+
+                if (!retry)
+                        break;
+                schedule();
+        }
+        finish_wait(&vblk->queue_wait, &wait);
+}
+
+bool full_segment(struct per_cpu_bio __percpu *pcbio, struct bio *bio,
+		unsigned int max)
+{
+	bool full;
+	full = (pcbio->segments + bio->bi_phys_segments) > max;
+
+	return full;
+}
+
+int add_bio_to_vq(struct bio *bio, struct virtio_blk *vblk,
+		struct request_queue *q)
+{
+	int notify;
+	bool retry;
+	unsigned long num, out = 0, in = 0;
+	struct virtblk_req *vbr = alloc_virtblk_req(vblk, GFP_KERNEL);
+
+	if (!vbr)
+		return 1;
+
 	vbr->private = bio;
 	vbr->next = NULL;
 	vbr->kind = VIRTIO_BLK_BIO;
@@ -357,7 +559,7 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
 	num = bio_map_sg(q, bio, vbr->sg + out);
 
 	sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
+			sizeof(vbr->status));
 
 	if (num) {
 		if (bio->bi_rw & REQ_WRITE) {
@@ -371,7 +573,7 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
 
 	spin_lock_irq(&vblk->lock);
 	if (virtqueue_add_buf(vblk->vq, vbr->sg,
-		out, in, vbr) < 0) {
+				out, in, vbr) < 0) {
 		retry = true;
 	} else {
 		retry = false;
@@ -385,6 +587,75 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
 
 	if (retry)
 		virtblk_add_buf_wait(vblk, vbr, out, in);
+	return 0;
+}
+
+bool seq_mode(struct per_cpu_bio __percpu *pcbio, struct bio *bio)
+{
+	if (pcbio->seq_mode == false)
+		return false;
+
+	if (pcbio->idx == 0)
+		return true;
+
+	return seq_bio(bio, pcbio);
+}
+
+void reset_seq_mode(struct per_cpu_bio __percpu *pcbio, struct bio *bio)
+{
+	if (bio->bi_sector == pcbio->next_offset)
+		pcbio->seq_mode = true;
+	else
+		pcbio->seq_mode = false;
+
+	pcbio->next_offset = bio->bi_sector + (bio->bi_size >> 9);
+}
+
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct virtio_blk *vblk = q->queuedata;
+	struct per_cpu_bio __percpu *pcbio;
+
+	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+retry:
+	preempt_disable();
+	pcbio = this_cpu_ptr(vblk->bq.pcbio);
+
+	if (seq_mode(pcbio, bio)) {
+		if (pcbio->idx >= BIO_QUEUE_MAX ||
+			full_segment(pcbio, bio, vblk->sg_elems -2)) {
+			if (try_flush_pcb(pcbio)) {
+				preempt_enable();
+				wait_virtq_flush(pcbio->vblk);
+				goto retry;
+			}
+
+			cancel_delayed_work(&pcbio->dwork);
+		}
+
+		add_bio_to_pcbio(bio, pcbio);
+	}
+	else {
+		while(try_flush_pcb(pcbio)) {
+			preempt_enable();
+			wait_virtq_flush(pcbio->vblk);
+			preempt_disable();
+			pcbio = this_cpu_ptr(vblk->bq.pcbio);
+		}
+
+		cancel_delayed_work(&pcbio->dwork);
+		reset_seq_mode(pcbio, bio);
+		preempt_enable();
+
+		while (add_bio_to_vq(bio, vblk, q))
+			wait_virtq_flush(pcbio->vblk);
+
+		preempt_disable();
+	}
+
+	preempt_enable();
 }
 
 /* return id (s/n) string for *disk to *id_str
@@ -532,6 +803,26 @@ static void virtblk_config_changed(struct virtio_device *vdev)
 	queue_work(virtblk_wq, &vblk->config_work);
 }
 
+void setup_per_cpu_bio(struct virtio_blk *vblk, struct request_queue *q)
+{
+	int cpu;
+
+	struct bio_queue *bq = &vblk->bq;
+	bq->pcbio = alloc_percpu(struct per_cpu_bio);
+	for_each_possible_cpu(cpu) {
+		struct per_cpu_bio __percpu *pcbio =
+					per_cpu_ptr(bq->pcbio, cpu);
+		pcbio->q = q;
+		pcbio->vblk = vblk;
+		pcbio->idx = 0;
+		pcbio->segments = 0;
+		pcbio->seq_mode = false;
+		pcbio->next_offset = 0;
+		memset(pcbio->bios, 0, BIO_QUEUE_MAX);
+		INIT_DELAYED_WORK(&pcbio->dwork, virtblk_delay_q_flush);
+	}
+}
+
 static int __devinit virtblk_probe(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk;
@@ -571,6 +862,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->sg_elems = sg_elems;
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
 
+	memset(&vblk->bq, 0, sizeof(struct bio_queue));
+
 	/* We expect one virtqueue, for output. */
 	vblk->vq = virtio_find_single_vq(vdev, blk_done, "requests");
 	if (IS_ERR(vblk->vq)) {
@@ -602,6 +895,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	blk_queue_make_request(q, virtblk_make_request);
 	q->queuedata = vblk;
 
+	setup_per_cpu_bio(vblk, q);
+
 	if (index < 26) {
 		sprintf(vblk->disk->disk_name, "vd%c", 'a' + index % 26);
 	} else if (index < (26 + 1) * 26) {
@@ -736,6 +1031,7 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 	put_disk(vblk->disk);
 	mempool_destroy(vblk->pool);
 	vdev->config->del_vqs(vdev);
+	free_percpu(vblk->bq.pcbio);
 	kfree(vblk);
 	ida_simple_remove(&vd_index_ida, index);
 }
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/6] virtio-blk: Emulate Flush/FUA
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (4 preceding siblings ...)
  2011-12-21  1:00 ` [PATCH 5/6] virtio-blk: Support batch I/O for enhancing sequential IO Minchan Kim
@ 2011-12-21  1:00 ` Minchan Kim
  2011-12-21  5:08 ` [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Rusty Russell
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  1:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Minchan Kim, Minchan Kim

This patch emulates flush/fua on virtio-blk and pass xfstest on ext4.
But it needs more reviews.

Signed-off-by: Minchan Kim <minchan@redhat.com>
---
 drivers/block/virtio_blk.c |   89 ++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e32c69e..6721b9d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -12,7 +12,6 @@
 #include <linux/idr.h>
 
 #define PART_BITS 4
-static int use_make_request = 1;
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
@@ -77,6 +76,7 @@ struct virtblk_req
 	u8 kind;
 #define VIRTIO_BLK_REQUEST	0x00
 #define VIRTIO_BLK_BIO		0x01
+#define VIRTIO_BLK_BIO_FLUSH	0x02
 	u8 status;
 
 	struct scatterlist sg[];
@@ -160,6 +160,9 @@ static void blk_done(struct virtqueue *vq)
 			 */
         		blk_start_queue(vblk->disk->queue);
 			break;
+		case VIRTIO_BLK_BIO_FLUSH:
+			complete(vbr->private);
+			break;
 		case VIRTIO_BLK_BIO:
 			if (head) {
 				tail->next = vbr;
@@ -526,6 +529,59 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
         finish_wait(&vblk->queue_wait, &wait);
 }
 
+static int virtblk_flush(struct virtio_blk *vblk,
+			struct virtblk_req *vbr, struct bio *bio)
+{
+	int error;
+	bool retry, notify;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	vbr->private = &done;
+	vbr->next = NULL;
+	vbr->kind = VIRTIO_BLK_BIO_FLUSH;
+
+	vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
+	vbr->out_hdr.sector = 0;
+	if (bio)
+		vbr->out_hdr.ioprio = bio_prio(bio);
+	else
+		vbr->out_hdr.ioprio = 0;
+
+	sg_set_buf(&vbr->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
+	sg_set_buf(&vbr->sg[1], &vbr->status, sizeof(vbr->status));
+
+	spin_lock_irq(&vblk->lock);
+	if (virtqueue_add_buf(vblk->vq, vbr->sg, 1, 1, vbr) < 0) {
+		retry = true;
+	} else {
+		retry = false;
+	}
+
+	notify = virtqueue_kick_prepare(vblk->vq);
+	spin_unlock_irq(&vblk->lock);
+
+	if (notify && !virtblk_plugged(vblk))
+		virtqueue_notify(vblk->vq);
+
+	if (retry)
+		virtblk_add_buf_wait(vblk, vbr, 1, 1);
+
+	wait_for_completion(&done);
+	error = virtblk_result(vbr);
+	return error;
+}
+
+void bq_flush(struct bio_queue *bq)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct per_cpu_bio __percpu *pcbio = per_cpu_ptr(bq->pcbio, cpu);
+		queue_work_on(cpu,
+			virtblk_wq, &pcbio->dwork.work);
+		flush_work_sync(&pcbio->dwork.work);
+	}
+}
+
 bool full_segment(struct per_cpu_bio __percpu *pcbio, struct bio *bio,
 		unsigned int max)
 {
@@ -616,9 +672,36 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
 {
 	struct virtio_blk *vblk = q->queuedata;
 	struct per_cpu_bio __percpu *pcbio;
+	bool pre_flush, post_flush;
 
 	BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
-	BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+	pre_flush = bio->bi_rw & REQ_FLUSH;
+	post_flush = bio->bi_rw & REQ_FUA;
+
+	if (pre_flush) {
+		struct virtblk_req *dummy_vbr;
+		bq_flush(&vblk->bq);
+
+		dummy_vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+		virtblk_flush(vblk, dummy_vbr, NULL);
+		mempool_free(dummy_vbr, vblk->pool);
+
+		if (bio->bi_sector && post_flush) {
+			int error;
+			struct virtblk_req *vbr;
+			vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+			error = virtblk_flush(vblk, vbr, bio);
+			mempool_free(vbr, vblk->pool);
+
+			dummy_vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+			virtblk_flush(vblk, dummy_vbr, NULL);
+			mempool_free(dummy_vbr, vblk->pool);
+
+			bio_endio(bio, error);
+			return;
+		}
+	}
 retry:
 	preempt_disable();
 	pcbio = this_cpu_ptr(vblk->bq.pcbio);
@@ -918,7 +1001,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->index = index;
 
 	/* configure queue flush support */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_make_request)
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
 		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (5 preceding siblings ...)
  2011-12-21  1:00 ` [PATCH 6/6] virtio-blk: Emulate Flush/FUA Minchan Kim
@ 2011-12-21  5:08 ` Rusty Russell
  2011-12-21  5:56   ` Minchan Kim
  2011-12-21  8:28 ` Sasha Levin
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2011-12-21  5:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Minchan Kim

On Wed, 21 Dec 2011 10:00:48 +0900, Minchan Kim <minchan@kernel.org> wrote:
> This patch is follow-up of Christohp Hellwig's work
> [RFC: ->make_request support for virtio-blk].
> http://thread.gmane.org/gmane.linux.kernel/1199763
> 
> Quote from hch
> "This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues."
> 
> I fixed race bug and add batch I/O for enhancing sequential I/O,
> FLUSH/FUA emulation.
> 
> I tested this patch on fusion I/O device by aio-stress.
> Result is following as.
> 
> Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> Test repeated by 20 times
> Guest I/O scheduler : CFQ
> Host I/O scheduler : NOOP
> 
>             Request            	BIO(patch 1-4)          BIO-batch(patch 1-6)
>          (MB/s)  stddev 	(MB/s)  stddev  	(MB/s)  stddev
> w        737.820 4.063  	613.735 31.605  	730.288 24.854
> rw       208.754 20.450 	314.630 37.352  	317.831 41.719
> r        770.974 2.340  	347.483 51.370  	750.324 8.280
> rr       250.391 16.910 	350.053 29.986  	325.976 24.846

So, you dropped w and r down 2%, but rw and rr up 40%.

If I knew what the various rows were, I'd have something intelligent to
say, I'm sure :)

I can find the source to aio-stress, but no obvious clues.

Help!
Rusty.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  5:08 ` [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Rusty Russell
@ 2011-12-21  5:56   ` Minchan Kim
  0 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  5:56 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig

Hi Rusty,

On Wed, Dec 21, 2011 at 03:38:03PM +1030, Rusty Russell wrote:
> On Wed, 21 Dec 2011 10:00:48 +0900, Minchan Kim <minchan@kernel.org> wrote:
> > This patch is follow-up of Christohp Hellwig's work
> > [RFC: ->make_request support for virtio-blk].
> > http://thread.gmane.org/gmane.linux.kernel/1199763
> > 
> > Quote from hch
> > "This patchset allows the virtio-blk driver to support much higher IOP
> > rates which can be driven out of modern PCI-e flash devices.  At this
> > point it really is just a RFC due to various issues."
> > 
> > I fixed race bug and add batch I/O for enhancing sequential I/O,
> > FLUSH/FUA emulation.
> > 
> > I tested this patch on fusion I/O device by aio-stress.
> > Result is following as.
> > 
> > Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> > Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> > Test repeated by 20 times
> > Guest I/O scheduler : CFQ
> > Host I/O scheduler : NOOP
> > 
> >             Request            	BIO(patch 1-4)          BIO-batch(patch 1-6)
> >          (MB/s)  stddev 	(MB/s)  stddev  	(MB/s)  stddev
> > w        737.820 4.063  	613.735 31.605  	730.288 24.854
> > rw       208.754 20.450 	314.630 37.352  	317.831 41.719
> > r        770.974 2.340  	347.483 51.370  	750.324 8.280
> > rr       250.391 16.910 	350.053 29.986  	325.976 24.846
> 
> So, you dropped w and r down 2%, but rw and rr up 40%.
> 
> If I knew what the various rows were, I'd have something intelligent to
> say, I'm sure :)

Sorry for missing that.

w: Sequential Write
rw: Random Write
r: Sequential Read
rr: Random Read

> 
> I can find the source to aio-stress, but no obvious clues.
> 
> Help!
> Rusty.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  8:28 ` Sasha Levin
@ 2011-12-21  8:17   ` Minchan Kim
  0 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-21  8:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

Hi Sasha!

On Wed, Dec 21, 2011 at 10:28:52AM +0200, Sasha Levin wrote:
> On Wed, 2011-12-21 at 10:00 +0900, Minchan Kim wrote:
> > This patch is follow-up of Christohp Hellwig's work
> > [RFC: ->make_request support for virtio-blk].
> > http://thread.gmane.org/gmane.linux.kernel/1199763
> > 
> > Quote from hch
> > "This patchset allows the virtio-blk driver to support much higher IOP
> > rates which can be driven out of modern PCI-e flash devices.  At this
> > point it really is just a RFC due to various issues."
> > 
> > I fixed race bug and add batch I/O for enhancing sequential I/O,
> > FLUSH/FUA emulation.
> > 
> > I tested this patch on fusion I/O device by aio-stress.
> > Result is following as.
> > 
> > Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> > Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> > Test repeated by 20 times
> > Guest I/O scheduler : CFQ
> > Host I/O scheduler : NOOP
> > 
> >             Request            	BIO(patch 1-4)          BIO-batch(patch 1-6)
> >          (MB/s)  stddev 	(MB/s)  stddev  	(MB/s)  stddev
> > w        737.820 4.063  	613.735 31.605  	730.288 24.854
> > rw       208.754 20.450 	314.630 37.352  	317.831 41.719
> > r        770.974 2.340  	347.483 51.370  	750.324 8.280
> > rr       250.391 16.910 	350.053 29.986  	325.976 24.846
> > 
> > This patch enhances ramdom I/O performance compared to request-based I/O path.
> > It's still RFC so welcome to any comment and review.
> 
> I did a benchmark against a /dev/shm device instead of an actual storage
> to get rid of any artifacts which are caused by the storage itself, and
> saw that while there was a nice improvement across the board, the hit
> against sequential read and write was quite significant.

Hmm, it seems bandwidth test of sequential is bad but io test is still good.
I don't know how it is possbile that iops is better but bandwidth is bad.
Anyway, it seems sequential write bw is severe but I think it could be
better if test makes many lock overhead in vblk->lock because this patch
is started from the lock overhead.

Thanks for the testing, Sasha!

> 
> I ran the tests with fio running in KVM tool against a 2G file located
> in /dev/shm. Here is a summary of the results:
> 
> Before:
> write_iops_seq
>   write: io=1409.8MB, bw=144217KB/s, iops=36054 , runt= 10010msec
> write_bw_seq
>   write: io=7700.0MB, bw=1323.5MB/s, iops=1323 , runt=  5818msec
> read_iops_seq
>   read : io=1453.7MB, bw=148672KB/s, iops=37168 , runt= 10012msec
> read_bw_seq
>   read : io=7700.0MB, bw=1882.7MB/s, iops=1882 , runt=  4090msec
> write_iops_rand
>   write: io=1266.4MB, bw=129479KB/s, iops=32369 , runt= 10015msec
> write_bw_rand
>   write: io=7539.0MB, bw=1106.1MB/s, iops=1106 , runt=  6811msec
> read_iops_rand
>   read : io=1373.3MB, bw=140475KB/s, iops=35118 , runt= 10010msec
> read_bw_rand
>   read : io=7539.0MB, bw=1314.4MB/s, iops=1314 , runt=  5736msec
> readwrite_iops_seq
>   read : io=726172KB, bw=72292KB/s, iops=18072 , runt= 10045msec
>   write: io=726460KB, bw=72321KB/s, iops=18080 , runt= 10045msec
> readwrite_bw_seq
>   read : io=3856.0MB, bw=779574KB/s, iops=761 , runt=  5065msec
>   write: io=3844.0MB, bw=777148KB/s, iops=758 , runt=  5065msec
> readwrite_iops_rand
>   read : io=701780KB, bw=70094KB/s, iops=17523 , runt= 10012msec
>   write: io=706120KB, bw=70527KB/s, iops=17631 , runt= 10012msec
> readwrite_bw_rand
>   read : io=3705.0MB, bw=601446KB/s, iops=587 , runt=  6308msec
>   write: io=3834.0MB, bw=622387KB/s, iops=607 , runt=  6308msec
> 
> After:
> write_iops_seq
>   write: io=1591.4MB, bw=162626KB/s, iops=40656 , runt= 10020msec
> write_bw_seq
>   write: io=7700.0MB, bw=1276.4MB/s, iops=1276 , runt=  6033msec
> read_iops_seq
>   read : io=1615.7MB, bw=164680KB/s, iops=41170 , runt= 10046msec
> read_bw_seq
>   read : io=7700.0MB, bw=1407.1MB/s, iops=1407 , runt=  5469msec
> write_iops_rand
>   write: io=1243.1MB, bw=126304KB/s, iops=31575 , runt= 10085msec
> write_bw_rand
>   write: io=7539.0MB, bw=1206.3MB/s, iops=1206 , runt=  6250msec
> read_iops_rand
>   read : io=1533.1MB, bw=156795KB/s, iops=39198 , runt= 10018msec
> read_bw_rand
>   read : io=7539.0MB, bw=1413.7MB/s, iops=1413 , runt=  5333msec
> readwrite_iops_seq
>   read : io=819124KB, bw=81790KB/s, iops=20447 , runt= 10015msec
>   write: io=823136KB, bw=82190KB/s, iops=20547 , runt= 10015msec
> readwrite_bw_seq
>   read : io=3913.0MB, bw=704946KB/s, iops=688 , runt=  5684msec
>   write: io=3787.0MB, bw=682246KB/s, iops=666 , runt=  5684msec
> readwrite_iops_rand
>   read : io=802148KB, bw=80159KB/s, iops=20039 , runt= 10007msec
>   write: io=801192KB, bw=80063KB/s, iops=20015 , runt= 10007msec
> readwrite_bw_rand
>   read : io=3731.0MB, bw=677762KB/s, iops=661 , runt=  5637msec
>   write: io=3808.0MB, bw=691750KB/s, iops=675 , runt=  5637msec
> 
> -- 
> 
> Sasha.
> 
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (6 preceding siblings ...)
  2011-12-21  5:08 ` [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Rusty Russell
@ 2011-12-21  8:28 ` Sasha Levin
  2011-12-21  8:17   ` Minchan Kim
  2011-12-21 19:11 ` Vivek Goyal
  2011-12-22 12:57 ` Stefan Hajnoczi
  9 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2011-12-21  8:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

On Wed, 2011-12-21 at 10:00 +0900, Minchan Kim wrote:
> This patch is follow-up of Christohp Hellwig's work
> [RFC: ->make_request support for virtio-blk].
> http://thread.gmane.org/gmane.linux.kernel/1199763
> 
> Quote from hch
> "This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues."
> 
> I fixed race bug and add batch I/O for enhancing sequential I/O,
> FLUSH/FUA emulation.
> 
> I tested this patch on fusion I/O device by aio-stress.
> Result is following as.
> 
> Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> Test repeated by 20 times
> Guest I/O scheduler : CFQ
> Host I/O scheduler : NOOP
> 
>             Request            	BIO(patch 1-4)          BIO-batch(patch 1-6)
>          (MB/s)  stddev 	(MB/s)  stddev  	(MB/s)  stddev
> w        737.820 4.063  	613.735 31.605  	730.288 24.854
> rw       208.754 20.450 	314.630 37.352  	317.831 41.719
> r        770.974 2.340  	347.483 51.370  	750.324 8.280
> rr       250.391 16.910 	350.053 29.986  	325.976 24.846
> 
> This patch enhances ramdom I/O performance compared to request-based I/O path.
> It's still RFC so welcome to any comment and review.

I did a benchmark against a /dev/shm device instead of an actual storage
to get rid of any artifacts which are caused by the storage itself, and
saw that while there was a nice improvement across the board, the hit
against sequential read and write was quite significant.

I ran the tests with fio running in KVM tool against a 2G file located
in /dev/shm. Here is a summary of the results:

Before:
write_iops_seq
  write: io=1409.8MB, bw=144217KB/s, iops=36054 , runt= 10010msec
write_bw_seq
  write: io=7700.0MB, bw=1323.5MB/s, iops=1323 , runt=  5818msec
read_iops_seq
  read : io=1453.7MB, bw=148672KB/s, iops=37168 , runt= 10012msec
read_bw_seq
  read : io=7700.0MB, bw=1882.7MB/s, iops=1882 , runt=  4090msec
write_iops_rand
  write: io=1266.4MB, bw=129479KB/s, iops=32369 , runt= 10015msec
write_bw_rand
  write: io=7539.0MB, bw=1106.1MB/s, iops=1106 , runt=  6811msec
read_iops_rand
  read : io=1373.3MB, bw=140475KB/s, iops=35118 , runt= 10010msec
read_bw_rand
  read : io=7539.0MB, bw=1314.4MB/s, iops=1314 , runt=  5736msec
readwrite_iops_seq
  read : io=726172KB, bw=72292KB/s, iops=18072 , runt= 10045msec
  write: io=726460KB, bw=72321KB/s, iops=18080 , runt= 10045msec
readwrite_bw_seq
  read : io=3856.0MB, bw=779574KB/s, iops=761 , runt=  5065msec
  write: io=3844.0MB, bw=777148KB/s, iops=758 , runt=  5065msec
readwrite_iops_rand
  read : io=701780KB, bw=70094KB/s, iops=17523 , runt= 10012msec
  write: io=706120KB, bw=70527KB/s, iops=17631 , runt= 10012msec
readwrite_bw_rand
  read : io=3705.0MB, bw=601446KB/s, iops=587 , runt=  6308msec
  write: io=3834.0MB, bw=622387KB/s, iops=607 , runt=  6308msec

After:
write_iops_seq
  write: io=1591.4MB, bw=162626KB/s, iops=40656 , runt= 10020msec
write_bw_seq
  write: io=7700.0MB, bw=1276.4MB/s, iops=1276 , runt=  6033msec
read_iops_seq
  read : io=1615.7MB, bw=164680KB/s, iops=41170 , runt= 10046msec
read_bw_seq
  read : io=7700.0MB, bw=1407.1MB/s, iops=1407 , runt=  5469msec
write_iops_rand
  write: io=1243.1MB, bw=126304KB/s, iops=31575 , runt= 10085msec
write_bw_rand
  write: io=7539.0MB, bw=1206.3MB/s, iops=1206 , runt=  6250msec
read_iops_rand
  read : io=1533.1MB, bw=156795KB/s, iops=39198 , runt= 10018msec
read_bw_rand
  read : io=7539.0MB, bw=1413.7MB/s, iops=1413 , runt=  5333msec
readwrite_iops_seq
  read : io=819124KB, bw=81790KB/s, iops=20447 , runt= 10015msec
  write: io=823136KB, bw=82190KB/s, iops=20547 , runt= 10015msec
readwrite_bw_seq
  read : io=3913.0MB, bw=704946KB/s, iops=688 , runt=  5684msec
  write: io=3787.0MB, bw=682246KB/s, iops=666 , runt=  5684msec
readwrite_iops_rand
  read : io=802148KB, bw=80159KB/s, iops=20039 , runt= 10007msec
  write: io=801192KB, bw=80063KB/s, iops=20015 , runt= 10007msec
readwrite_bw_rand
  read : io=3731.0MB, bw=677762KB/s, iops=661 , runt=  5637msec
  write: io=3808.0MB, bw=691750KB/s, iops=675 , runt=  5637msec

-- 

Sasha.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (7 preceding siblings ...)
  2011-12-21  8:28 ` Sasha Levin
@ 2011-12-21 19:11 ` Vivek Goyal
  2011-12-22  1:05   ` Minchan Kim
  2011-12-22 12:57 ` Stefan Hajnoczi
  9 siblings, 1 reply; 27+ messages in thread
From: Vivek Goyal @ 2011-12-21 19:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig, Minchan Kim

On Wed, Dec 21, 2011 at 10:00:48AM +0900, Minchan Kim wrote:
> This patch is follow-up of Christohp Hellwig's work
> [RFC: ->make_request support for virtio-blk].
> http://thread.gmane.org/gmane.linux.kernel/1199763
> 
> Quote from hch
> "This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues."
> 
> I fixed race bug and add batch I/O for enhancing sequential I/O,
> FLUSH/FUA emulation.
> 
> I tested this patch on fusion I/O device by aio-stress.
> Result is following as.
> 
> Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> Test repeated by 20 times
> Guest I/O scheduler : CFQ
> Host I/O scheduler : NOOP

May be using deadline or noop in guest is better to benchmark against
PCI-E based flash.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21 19:11 ` Vivek Goyal
@ 2011-12-22  1:05   ` Minchan Kim
  2011-12-22 15:45     ` Vivek Goyal
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-12-22  1:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

Hi Vivek,

On Wed, Dec 21, 2011 at 02:11:17PM -0500, Vivek Goyal wrote:
> On Wed, Dec 21, 2011 at 10:00:48AM +0900, Minchan Kim wrote:
> > This patch is follow-up of Christohp Hellwig's work
> > [RFC: ->make_request support for virtio-blk].
> > http://thread.gmane.org/gmane.linux.kernel/1199763
> > 
> > Quote from hch
> > "This patchset allows the virtio-blk driver to support much higher IOP
> > rates which can be driven out of modern PCI-e flash devices.  At this
> > point it really is just a RFC due to various issues."
> > 
> > I fixed race bug and add batch I/O for enhancing sequential I/O,
> > FLUSH/FUA emulation.
> > 
> > I tested this patch on fusion I/O device by aio-stress.
> > Result is following as.
> > 
> > Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> > Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> > Test repeated by 20 times
> > Guest I/O scheduler : CFQ
> > Host I/O scheduler : NOOP
> 
> May be using deadline or noop in guest is better to benchmark against
> PCI-E based flash.

Good suggestion.
I tested it by deadline on guest side.

The result is not good.
Although gap is within noise, Batch BIO's random performance is regressed
compared to CFQ. 

            Request                Batch BIO

         (MB/s)  stddev          (MB/s)  stddev
w        787.030 31.494 w        748.714 68.490
rw       216.044 29.734 rw       216.977 40.635
r        771.765 3.327  r        771.107 4.299
rr       280.096 25.135 rr       258.067 43.916

I did some small test for only Batch BIO with deadline and cfq.
to see I/O scheduler's effect.
I think result is very strange, deadline :149MB, CFQ : 87M
Because Batio BIO patch uses make_request_fn instead of request_rfn.
So I think we should not affect by I/O scheduler.(I mean we issue I/O 
before I/O scheduler handles it)

What do you think about it?
Do I miss something?


1) deadline
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (149.40 MB/s) 128.00 MB in 0.86s
thread 0 random read totals (149.22 MB/s) 128.00 MB in 0.86s


2) cfq
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (87.21 MB/s) 128.00 MB in 1.47s
thread 0 random read totals (87.15 MB/s) 128.00 MB in 1.47s


> 
> Thanks
> Vivek

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/6] virtio-blk: implement ->make_request
  2011-12-21  1:00 ` [PATCH 4/6] virtio-blk: implement ->make_request Minchan Kim
@ 2011-12-22 12:20   ` Stefan Hajnoczi
  2011-12-22 20:28     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Hajnoczi @ 2011-12-22 12:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig, Christoph Hellwig, Minchan Kim

On Wed, Dec 21, 2011 at 1:00 AM, Minchan Kim <minchan@kernel.org> wrote:
> +static void virtblk_add_buf_wait(struct virtio_blk *vblk,
> +       struct virtblk_req *vbr, unsigned long out, unsigned long in)
> +{
> +       DEFINE_WAIT(wait);
> +       bool retry, notify;
> +
> +       for (;;) {
> +               prepare_to_wait(&vblk->queue_wait, &wait,
> +                               TASK_UNINTERRUPTIBLE);
> +
> +               spin_lock_irq(&vblk->lock);
> +               if (virtqueue_add_buf(vblk->vq, vbr->sg,
> +                       out, in, vbr) < 0) {
> +                       retry = true;
> +               } else {
> +                       retry = false;
> +               }
> +               notify = virtqueue_kick_prepare(vblk->vq);
> +               spin_unlock_irq(&vblk->lock);
> +
> +               if (notify)

virtblk_make_request() checks that the queue is not plugged.  Do we
need to do that here too?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
                   ` (8 preceding siblings ...)
  2011-12-21 19:11 ` Vivek Goyal
@ 2011-12-22 12:57 ` Stefan Hajnoczi
  2011-12-22 23:41   ` Minchan Kim
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Hajnoczi @ 2011-12-22 12:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

On Wed, Dec 21, 2011 at 1:00 AM, Minchan Kim <minchan@kernel.org> wrote:
> This patch is follow-up of Christohp Hellwig's work
> [RFC: ->make_request support for virtio-blk].
> http://thread.gmane.org/gmane.linux.kernel/1199763
>
> Quote from hch
> "This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues."

Basic question to make sure I understood this series: does this patch
bypass the guest I/O scheduler (but then you added custom batching
code into virtio_blk.c)?

If you're stumped by the performance perhaps compare blktraces of the
request approach vs the bio approach.  We're probably performing I/O
more CPU-efficiently but the I/O pattern itself is worse.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-22  1:05   ` Minchan Kim
@ 2011-12-22 15:45     ` Vivek Goyal
  2011-12-22 23:26       ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Vivek Goyal @ 2011-12-22 15:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

On Thu, Dec 22, 2011 at 10:05:38AM +0900, Minchan Kim wrote:

[..]
> > May be using deadline or noop in guest is better to benchmark against
> > PCI-E based flash.
> 
> Good suggestion.
> I tested it by deadline on guest side.
> 
> The result is not good.
> Although gap is within noise, Batch BIO's random performance is regressed
> compared to CFQ. 
> 
>             Request                Batch BIO
> 
>          (MB/s)  stddev          (MB/s)  stddev
> w        787.030 31.494 w        748.714 68.490
> rw       216.044 29.734 rw       216.977 40.635
> r        771.765 3.327  r        771.107 4.299
> rr       280.096 25.135 rr       258.067 43.916
> 
> I did some small test for only Batch BIO with deadline and cfq.
> to see I/O scheduler's effect.
> I think result is very strange, deadline :149MB, CFQ : 87M
> Because Batio BIO patch uses make_request_fn instead of request_rfn.
> So I think we should not affect by I/O scheduler.(I mean we issue I/O 
> before I/O scheduler handles it)
> 
> What do you think about it?
> Do I miss something?

This indeed is very strange. In case of bio based drivers, changing IO
scheduler on the queue should not change anything.

Trying running blktrace on the vda devie and see if you notice something
odd.

Also you seem to be reporting contracdicting results for batch bio.

Initially you mention that random IO is regressing with deadline as
comapred to CFQ. (It dropped from 325.976 MB/s to 258.067 MB/s).

In this second test you are reporting that CFQ performs badly as
compared to deadline. (149MB/s vs 87MB/s).

Two contradicting results?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/6] virtio-blk: implement ->make_request
  2011-12-22 12:20   ` Stefan Hajnoczi
@ 2011-12-22 20:28     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2011-12-22 20:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Minchan Kim, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel, Christoph Hellwig,
	Christoph Hellwig, Minchan Kim

On Thu, Dec 22, 2011 at 12:20:01PM +0000, Stefan Hajnoczi wrote:
> virtblk_make_request() checks that the queue is not plugged.  Do we
> need to do that here too?

biot based drivers don't have a queue that could be plugged.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-22 15:45     ` Vivek Goyal
@ 2011-12-22 23:26       ` Minchan Kim
  0 siblings, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2011-12-22 23:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig

On Thu, Dec 22, 2011 at 10:45:06AM -0500, Vivek Goyal wrote:
> On Thu, Dec 22, 2011 at 10:05:38AM +0900, Minchan Kim wrote:
> 
> [..]
> > > May be using deadline or noop in guest is better to benchmark against
> > > PCI-E based flash.
> > 
> > Good suggestion.
> > I tested it by deadline on guest side.
> > 
> > The result is not good.
> > Although gap is within noise, Batch BIO's random performance is regressed
> > compared to CFQ. 
> > 
> >             Request                Batch BIO
> > 
> >          (MB/s)  stddev          (MB/s)  stddev
> > w        787.030 31.494 w        748.714 68.490
> > rw       216.044 29.734 rw       216.977 40.635
> > r        771.765 3.327  r        771.107 4.299
> > rr       280.096 25.135 rr       258.067 43.916
> > 
> > I did some small test for only Batch BIO with deadline and cfq.
> > to see I/O scheduler's effect.
> > I think result is very strange, deadline :149MB, CFQ : 87M
> > Because Batio BIO patch uses make_request_fn instead of request_rfn.
> > So I think we should not affect by I/O scheduler.(I mean we issue I/O 
> > before I/O scheduler handles it)
> > 
> > What do you think about it?
> > Do I miss something?
> 
> This indeed is very strange. In case of bio based drivers, changing IO
> scheduler on the queue should not change anything.
> 
> Trying running blktrace on the vda devie and see if you notice something
> odd.
> 
> Also you seem to be reporting contracdicting results for batch bio.
> 
> Initially you mention that random IO is regressing with deadline as
> comapred to CFQ. (It dropped from 325.976 MB/s to 258.067 MB/s).
> 
> In this second test you are reporting that CFQ performs badly as
> compared to deadline. (149MB/s vs 87MB/s).

First test is with 64 thread with 512M and second test is only 1 thread with 128M
for finishing quick.
I think so result is very stange.

This is data on test of deadline.
In summary, data is very fluctuated.

Sometime it's very stable but sometime data is very fluctuated like this.
I don't know it's aio-stress problem or fusion I/O stuff problem

1)
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (151.71 MB/s) 128.00 MB in 0.84s
thread 0 random read totals (151.55 MB/s) 128.00 MB in 0.84s

2)
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (201.04 MB/s) 128.00 MB in 0.64s
thread 0 random read totals (200.76 MB/s) 128.00 MB in 0.64s

3)
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (135.31 MB/s) 128.00 MB in 0.95s
thread 0 random read totals (135.19 MB/s) 128.00 MB in 0.95s

4)
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (116.93 MB/s) 128.00 MB in 1.09s
thread 0 random read totals (116.82 MB/s) 128.00 MB in 1.10s

5)
[root@RHEL-6 ~]# ./aio-stress -c 1 -t 1 -s 128 -r 8 -O -o 3 -d 512 /dev/vda
num_thread 1
adding stage random read
starting with random read
file size 128MB, record size 8KB, depth 512, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
Running single thread version 
random read on /dev/vda (130.79 MB/s) 128.00 MB in 0.98s
thread 0 random read totals (130.65 MB/s) 128.00 MB in 0.98s

> 
> Two contradicting results?

Hmm, I need test in /tmp/ to prevent it.

> 
> Thanks
> Vivek

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-22 12:57 ` Stefan Hajnoczi
@ 2011-12-22 23:41   ` Minchan Kim
  2012-01-01 16:45     ` Stefan Hajnoczi
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2011-12-22 23:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig, Vivek Goyal

On Thu, Dec 22, 2011 at 12:57:40PM +0000, Stefan Hajnoczi wrote:
> On Wed, Dec 21, 2011 at 1:00 AM, Minchan Kim <minchan@kernel.org> wrote:
> > This patch is follow-up of Christohp Hellwig's work
> > [RFC: ->make_request support for virtio-blk].
> > http://thread.gmane.org/gmane.linux.kernel/1199763
> >
> > Quote from hch
> > "This patchset allows the virtio-blk driver to support much higher IOP
> > rates which can be driven out of modern PCI-e flash devices.  At this
> > point it really is just a RFC due to various issues."
> 
> Basic question to make sure I understood this series: does this patch
> bypass the guest I/O scheduler (but then you added custom batching
> code into virtio_blk.c)?

Right.

> 
> If you're stumped by the performance perhaps compare blktraces of the
> request approach vs the bio approach.  We're probably performing I/O
> more CPU-efficiently but the I/O pattern itself is worse.

You mean I/O scheduler have many techniques to do well in I/O pattern?
That's what I want to discuss in this RFC.

I guess request layer have many techniques proved during long time
to do well I/O but BIO-based drvier ignores them for just reducing locking
overhead. Of course, we can add such techniques to BIO-batch driver like 
custom-batch in this series. But it needs lots of work, is really duplication,
and will have a problem on maintenance.

I would like to listen opinions whether this direction is good or bad.

> 
> Stefan

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2011-12-22 23:41   ` Minchan Kim
@ 2012-01-01 16:45     ` Stefan Hajnoczi
  2012-01-02  7:48       ` Dor Laor
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Stefan Hajnoczi @ 2012-01-01 16:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm,
	linux-kernel, Christoph Hellwig, Vivek Goyal

On Thu, Dec 22, 2011 at 11:41 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Thu, Dec 22, 2011 at 12:57:40PM +0000, Stefan Hajnoczi wrote:
>> On Wed, Dec 21, 2011 at 1:00 AM, Minchan Kim <minchan@kernel.org> wrote:
>> If you're stumped by the performance perhaps compare blktraces of the
>> request approach vs the bio approach.  We're probably performing I/O
>> more CPU-efficiently but the I/O pattern itself is worse.
>
> You mean I/O scheduler have many techniques to do well in I/O pattern?
> That's what I want to discuss in this RFC.
>
> I guess request layer have many techniques proved during long time
> to do well I/O but BIO-based drvier ignores them for just reducing locking
> overhead. Of course, we can add such techniques to BIO-batch driver like
> custom-batch in this series. But it needs lots of work, is really duplication,
> and will have a problem on maintenance.
>
> I would like to listen opinions whether this direction is good or bad.

This series is a good platform for performance analysis but not
something that should be merged IMO.  As you said it duplicates work
that I/O schedulers and the request-based block layer do.  If other
drivers start taking this approach too then the duplication will be
proliferated.

The value of this series is that you have a prototype to benchmark and
understand the bottlenecks in virtio-blk and the block layer better.
The results do not should that bypassing the I/O scheduler is always a
win.  The fact that you added batching suggests there is some benefit
to what the request-based code path does.  So find out what's good
about the request-based code path and how to get the best of both
worlds.

By the way, drivers for solid-state devices can set QUEUE_FLAG_NONROT
to hint that seek time optimizations may be sub-optimal.  NBD and
other virtual/pseudo device drivers set this flag.  Should virtio-blk
set it and how does it affect performance?

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-01 16:45     ` Stefan Hajnoczi
@ 2012-01-02  7:48       ` Dor Laor
  2012-01-02 16:12       ` Paolo Bonzini
  2012-01-02 16:18       ` Christoph Hellwig
  2 siblings, 0 replies; 27+ messages in thread
From: Dor Laor @ 2012-01-02  7:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Minchan Kim, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel, Christoph Hellwig,
	Vivek Goyal

On 01/01/2012 06:45 PM, Stefan Hajnoczi wrote:
> On Thu, Dec 22, 2011 at 11:41 PM, Minchan Kim<minchan@kernel.org>  wrote:
>> On Thu, Dec 22, 2011 at 12:57:40PM +0000, Stefan Hajnoczi wrote:
>>> On Wed, Dec 21, 2011 at 1:00 AM, Minchan Kim<minchan@kernel.org>  wrote:
>>> If you're stumped by the performance perhaps compare blktraces of the
>>> request approach vs the bio approach.  We're probably performing I/O
>>> more CPU-efficiently but the I/O pattern itself is worse.
>>
>> You mean I/O scheduler have many techniques to do well in I/O pattern?
>> That's what I want to discuss in this RFC.
>>
>> I guess request layer have many techniques proved during long time
>> to do well I/O but BIO-based drvier ignores them for just reducing locking
>> overhead. Of course, we can add such techniques to BIO-batch driver like
>> custom-batch in this series. But it needs lots of work, is really duplication,
>> and will have a problem on maintenance.
>>
>> I would like to listen opinions whether this direction is good or bad.
>
> This series is a good platform for performance analysis but not
> something that should be merged IMO.  As you said it duplicates work
> that I/O schedulers and the request-based block layer do.  If other
> drivers start taking this approach too then the duplication will be
> proliferated.
>
> The value of this series is that you have a prototype to benchmark and
> understand the bottlenecks in virtio-blk and the block layer better.
> The results do not should that bypassing the I/O scheduler is always a
> win.  The fact that you added batching suggests there is some benefit
> to what the request-based code path does.  So find out what's good
> about the request-based code path and how to get the best of both
> worlds.
>
> By the way, drivers for solid-state devices can set QUEUE_FLAG_NONROT
> to hint that seek time optimizations may be sub-optimal.  NBD and
> other virtual/pseudo device drivers set this flag.  Should virtio-blk
> set it and how does it affect performance?

Seems logical to me. If the underlying backing storage of the host is 
SSD or some remote fast SAN server we need such a flag. Even in the case 
of standard local storage, the host will still do the seek time 
optimization so no need to do them twice.

>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-01 16:45     ` Stefan Hajnoczi
  2012-01-02  7:48       ` Dor Laor
@ 2012-01-02 16:12       ` Paolo Bonzini
  2012-01-02 16:15         ` Christoph Hellwig
  2012-01-02 16:18       ` Christoph Hellwig
  2 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2012-01-02 16:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Minchan Kim, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel, Christoph Hellwig,
	Vivek Goyal

On 01/01/2012 05:45 PM, Stefan Hajnoczi wrote:
> By the way, drivers for solid-state devices can set QUEUE_FLAG_NONROT
> to hint that seek time optimizations may be sub-optimal.  NBD and
> other virtual/pseudo device drivers set this flag.  Should virtio-blk
> set it and how does it affect performance?

By itself is not a good idea in general.

When QEMU uses O_DIRECT, the guest should not use QUEUE_FLAG_NONROT 
unless it is active for the host disk as well.  (In doubt, as is the 
case for remote hosts accessed over NFS, I would also avoid NONROT and 
allow more coalescing).

When QEMU doesn't use O_DIRECT, instead, using QUEUE_FLAG_NONROT and 
leaving optimizations to the host may make some sense.

In Xen, the back-end driver is bio-based, so the scenario is like QEMU 
with O_DIRECT.  I remember seeing worse performance when switching the 
front-end to either QUEUE_FLAG_NONROT or the noop scheduler.  This was 
with RHEL5 (2.6.18), but it might still be true in more recent kernels, 
modulo benchmarking of course.  Still, the current in-tree xen-blkfront 
driver does use QUEUE_FLAG_NONROT unconditionally, more precisely its 
synonym QUEUE_FLAG_VIRT.

Still, if benchmarking confirms this theory, QEMU could expose a hint 
via a feature bit.  The default could be simply "use QUEUE_FLAG_NONROT 
iff not using O_DIRECT", or it could be more complicated with help from 
sysfs.

Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-02 16:12       ` Paolo Bonzini
@ 2012-01-02 16:15         ` Christoph Hellwig
  2012-01-02 16:18           ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2012-01-02 16:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Minchan Kim, Rusty Russell, Chris Wright,
	Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Christoph Hellwig, Vivek Goyal

On Mon, Jan 02, 2012 at 05:12:00PM +0100, Paolo Bonzini wrote:
> On 01/01/2012 05:45 PM, Stefan Hajnoczi wrote:
> >By the way, drivers for solid-state devices can set QUEUE_FLAG_NONROT
> >to hint that seek time optimizations may be sub-optimal.  NBD and
> >other virtual/pseudo device drivers set this flag.  Should virtio-blk
> >set it and how does it affect performance?
> 
> By itself is not a good idea in general.
> 
> When QEMU uses O_DIRECT, the guest should not use QUEUE_FLAG_NONROT
> unless it is active for the host disk as well.  (In doubt, as is the
> case for remote hosts accessed over NFS, I would also avoid NONROT
> and allow more coalescing).

Do we have any benchmark numbers where QUEUE_FLAG_NONROT makes a
difference?  I tried a few times, and the only constant measureable
thing was that it regressed performance when used for rotating devices
in a few benchmarks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-01 16:45     ` Stefan Hajnoczi
  2012-01-02  7:48       ` Dor Laor
  2012-01-02 16:12       ` Paolo Bonzini
@ 2012-01-02 16:18       ` Christoph Hellwig
  2012-01-02 16:21         ` Avi Kivity
  2 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2012-01-02 16:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Minchan Kim, Rusty Russell, Chris Wright, Jens Axboe,
	Stefan Hajnoczi, kvm, linux-kernel, Christoph Hellwig,
	Vivek Goyal

On Sun, Jan 01, 2012 at 04:45:42PM +0000, Stefan Hajnoczi wrote:
> win.  The fact that you added batching suggests there is some benefit
> to what the request-based code path does.  So find out what's good
> about the request-based code path and how to get the best of both
> worlds.

Batching pretty much always is a winner.  The maximum bio size is small
enough that we'll frequently see multiple contiguos bios.  Because of
that the Md layer fo example uses the same kind of batching.  I've tried
to make this more general by passing a bio list to ->make_request and
make the on-stack plugging work on bios, but in the timeslice I had
available for that I didn't manage to actually make it work.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-02 16:15         ` Christoph Hellwig
@ 2012-01-02 16:18           ` Paolo Bonzini
  2012-01-02 16:23             ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2012-01-02 16:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stefan Hajnoczi, Minchan Kim, Rusty Russell, Chris Wright,
	Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel, Vivek Goyal

On 01/02/2012 05:15 PM, Christoph Hellwig wrote:
> >  When QEMU uses O_DIRECT, the guest should not use QUEUE_FLAG_NONROT
> >  unless it is active for the host disk as well.  (In doubt, as is the
> >  case for remote hosts accessed over NFS, I would also avoid NONROT
> >  and allow more coalescing).
>
> Do we have any benchmark numbers where QUEUE_FLAG_NONROT makes a
> difference?

Not that I know of.

> I tried a few times, and the only constant measureable
> thing was that it regressed performance when used for rotating devices
> in a few benchmarks.

Were you trying with cache=none or writeback?  For cache=none, that's 
exactly what I'd expect.  cache=writeback could be more interesting...

Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-02 16:18       ` Christoph Hellwig
@ 2012-01-02 16:21         ` Avi Kivity
  0 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2012-01-02 16:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stefan Hajnoczi, Minchan Kim, Rusty Russell, Chris Wright,
	Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel, Vivek Goyal

On 01/02/2012 06:18 PM, Christoph Hellwig wrote:
> On Sun, Jan 01, 2012 at 04:45:42PM +0000, Stefan Hajnoczi wrote:
> > win.  The fact that you added batching suggests there is some benefit
> > to what the request-based code path does.  So find out what's good
> > about the request-based code path and how to get the best of both
> > worlds.
>
> Batching pretty much always is a winner.  The maximum bio size is small
> enough that we'll frequently see multiple contiguos bios.  

Maybe the maximum bio size should be increased then; not that I disagree
with your conclusion.

> Because of
> that the Md layer fo example uses the same kind of batching.  I've tried
> to make this more general by passing a bio list to ->make_request and
> make the on-stack plugging work on bios, but in the timeslice I had
> available for that I didn't manage to actually make it work.
>
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO
  2012-01-02 16:18           ` Paolo Bonzini
@ 2012-01-02 16:23             ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2012-01-02 16:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christoph Hellwig, Stefan Hajnoczi, Minchan Kim, Rusty Russell,
	Chris Wright, Jens Axboe, Stefan Hajnoczi, kvm, linux-kernel,
	Vivek Goyal

On Mon, Jan 02, 2012 at 05:18:13PM +0100, Paolo Bonzini wrote:
> >I tried a few times, and the only constant measureable
> >thing was that it regressed performance when used for rotating devices
> >in a few benchmarks.
> 
> Were you trying with cache=none or writeback?  For cache=none,
> that's exactly what I'd expect.  cache=writeback could be more
> interesting...

cache=none - cache=writeback isn't something people should ever use
except for read-only or extremly read-mostly workloads.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-01-02 16:23 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-21  1:00 [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Minchan Kim
2011-12-21  1:00 ` [PATCH 1/6] block: add bio_map_sg Minchan Kim
2011-12-21  1:00 ` [PATCH 2/6] virtio: support unlocked queue kick Minchan Kim
2011-12-21  1:00 ` [PATCH 3/6] virtio-blk: remove the unused list of pending requests Minchan Kim
2011-12-21  1:00 ` [PATCH 4/6] virtio-blk: implement ->make_request Minchan Kim
2011-12-22 12:20   ` Stefan Hajnoczi
2011-12-22 20:28     ` Christoph Hellwig
2011-12-21  1:00 ` [PATCH 5/6] virtio-blk: Support batch I/O for enhancing sequential IO Minchan Kim
2011-12-21  1:00 ` [PATCH 6/6] virtio-blk: Emulate Flush/FUA Minchan Kim
2011-12-21  5:08 ` [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO Rusty Russell
2011-12-21  5:56   ` Minchan Kim
2011-12-21  8:28 ` Sasha Levin
2011-12-21  8:17   ` Minchan Kim
2011-12-21 19:11 ` Vivek Goyal
2011-12-22  1:05   ` Minchan Kim
2011-12-22 15:45     ` Vivek Goyal
2011-12-22 23:26       ` Minchan Kim
2011-12-22 12:57 ` Stefan Hajnoczi
2011-12-22 23:41   ` Minchan Kim
2012-01-01 16:45     ` Stefan Hajnoczi
2012-01-02  7:48       ` Dor Laor
2012-01-02 16:12       ` Paolo Bonzini
2012-01-02 16:15         ` Christoph Hellwig
2012-01-02 16:18           ` Paolo Bonzini
2012-01-02 16:23             ` Christoph Hellwig
2012-01-02 16:18       ` Christoph Hellwig
2012-01-02 16:21         ` Avi Kivity

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.