* [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
@ 2014-09-11 23:57 Arianna Avanzini
  2014-09-11 23:57 ` [PATCH RFC v2 1/5] xen, blkfront: port to the multi-queue block layer API Arianna Avanzini
                   ` (11 more replies)
  0 siblings, 12 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

Hello,

this patchset adds support for the multi-queue block layer API to the Xen PV
block drivers, letting the frontend and backend share and use multiple I/O
rings. It is the result of my internship for GNOME's Outreach Program for
Women ([1]), in which I was mentored by Konrad Rzeszutek Wilk.

In the backend driver, the patchset implements the retrieval of information
about the block layer API currently in use for a given device and, if that
API is the multi-queue one, about the number of available submission queues.
This information is then advertised to the frontend driver via XenStore.
The frontend can use this information to allocate and grant multiple I/O
rings and to advertise the final number to the backend so that it will be
able to map them.
The patchset has been tested with fio's IOmeter emulation on a four-core
machine with a null_blk device (some results are available here: [2]).
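
As a rough sketch of the negotiation described above, the frontend side could
look like the snippet below. It is only an illustration: the XenStore key
names ("multi-queue-max-queues", "multi-queue-num-queues"), the helper name
and the policy of capping the ring count at the number of online CPUs are
assumptions made for this example, not necessarily what the series does.

static int blkfront_negotiate_nr_rings(struct blkfront_info *info)
{
	unsigned int backend_max;

	/* The backend advertises how many rings it is able to map. */
	if (xenbus_scanf(XBT_NIL, info->xbdev->otherend,
			 "multi-queue-max-queues", "%u", &backend_max) != 1)
		backend_max = 1;	/* non-multi-queue-capable backend */

	/* Illustrative policy: at most one ring per online CPU. */
	info->nr_rings = min_t(unsigned int, backend_max, num_online_cpus());

	/* The frontend writes back the number of rings it will grant. */
	return xenbus_printf(XBT_NIL, info->xbdev->nodename,
			     "multi-queue-num-queues", "%u", info->nr_rings);
}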

Compared to the first version of this RFC patchset ([3]), the series has
undergone the following changes (since the structure of the patchset itself
has changed, I'm summarizing them here).

. Now the use of the multi-queue API replaces that of the request queue API,
  as indicated by Christoph Hellwig.
. Patch 0003 from the previous patchset has been split into two patches, the
  first introducing in the frontend actual support for multiple block rings,
  the second adding support to negotiate the number of I/O rings with the
  backend, as suggested by David Vrabel.
. Patch 0004 from the previous patchset has been split into two patches, the
  first introducing in the backend support for multiple block rings, the second
  adding support to negotiate the number of I/O rings, as suggested by David
  Vrabel.
. Added the BLK_MQ_F_SG_MERGE and BLK_MQ_F_SHOULD_SORT flags to the frontend
  driver's initialization as suggested by Christoph Hellwig.
. Removed empty/useless definition of the init_hctx and complete hooks, as
  pointed out by Christoph Hellwig.
. Removed useless debug printk()s from code added in xen-blkfront, as indicated
  by David Vrabel.
. Added return of an actual error code in the blk_mq_init_queue() failure path
  in xlvbd_init_blk_queue(), as suggested by Christoph Hellwig.
. Fixed coding style issue in blkfront_queue_rq() as suggested by Christoph
  Hellwig.

. Added support for the migration of a multi-queue-capable domU to a host with
  non-multi-queue-capable devices.
. Fixed locking issues in the interrupt path, avoiding grabbing the io_lock
  twice when calling blk_mq_start_stopped_hw_queues().
. Fixed wrong use of the return value of blk_mq_init_queue().
. Dropped the use of the ternary operator in the macros that compute the
  number of per-ring requests and grants: they now use the max() macro.

Any comments or suggestions are more than welcome.
Thank you,
Arianna

[1] http://goo.gl/bcvHMh
[2] http://goo.gl/O8RlLL
[3] http://lkml.org/lkml/2014/8/22/158

Arianna Avanzini (5):
  xen, blkfront: port to the multi-queue block layer API
  xen, blkfront: introduce support for multiple block rings
  xen, blkfront: negotiate the number of block rings with the backend
  xen, blkback: introduce support for multiple block rings
  xen, blkback: negotiate the number of block rings with the frontend

 drivers/block/xen-blkback/blkback.c | 377 ++++++++-------
 drivers/block/xen-blkback/common.h  | 110 +++--
 drivers/block/xen-blkback/xenbus.c  | 472 +++++++++++++------
 drivers/block/xen-blkfront.c        | 894 +++++++++++++++++++++---------------
 4 files changed, 1122 insertions(+), 731 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH RFC v2 1/5] xen, blkfront: port to the multi-queue block layer API
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
  2014-09-11 23:57 ` [PATCH RFC v2 1/5] xen, blkfront: port to the multi-queue block layer API Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-09-13 19:29   ` Christoph Hellwig
                     ` (3 more replies)
  2014-09-11 23:57 ` [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings Arianna Avanzini
                   ` (9 subsequent siblings)
  11 siblings, 4 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

This commit introduces support for the multi-queue block layer API,
and at the same time removes the existing request_queue API support.
The changes are only structural, and the number of supported hardware
contexts is forced to one.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkfront.c | 171 ++++++++++++++++++++-----------------------
 1 file changed, 80 insertions(+), 91 deletions(-)
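
For orientation before the diff, the core of the conversion boils down to
registering a blk_mq_ops with a queue_rq handler and allocating a tag set.
The condensed sketch below mirrors the hunks that follow; the helper name is
made up for the example (in the patch this code lives inside
xlvbd_init_blk_queue()), and the surrounding queue setup is omitted.

static struct blk_mq_ops blkfront_mq_ops = {
	.queue_rq  = blkfront_queue_rq,	/* queue one request per call */
	.map_queue = blk_mq_map_queue,	/* default sw -> hw queue mapping */
};

/* Hypothetical helper mirroring the tag-set setup in the diff below. */
static int xlvbd_alloc_blk_mq_queue(struct blkfront_info *info)
{
	struct request_queue *rq;
	int ret;

	memset(&info->tag_set, 0, sizeof(info->tag_set));
	info->tag_set.ops = &blkfront_mq_ops;
	info->tag_set.nr_hw_queues = 1;			/* single hw context for now */
	info->tag_set.queue_depth = BLK_RING_SIZE;	/* bounded by the Xen I/O ring */
	info->tag_set.numa_node = NUMA_NO_NODE;
	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
	info->tag_set.driver_data = info;

	ret = blk_mq_alloc_tag_set(&info->tag_set);
	if (ret)
		return ret;

	rq = blk_mq_init_queue(&info->tag_set);
	if (IS_ERR(rq)) {
		blk_mq_free_tag_set(&info->tag_set);
		return PTR_ERR(rq);
	}
	rq->queuedata = info;
	info->rq = rq;
	return 0;
}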

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5deb235..109add6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -37,6 +37,7 @@
 
 #include <linux/interrupt.h>
 #include <linux/blkdev.h>
+#include <linux/blk-mq.h>
 #include <linux/hdreg.h>
 #include <linux/cdrom.h>
 #include <linux/module.h>
@@ -134,6 +135,8 @@ struct blkfront_info
 	unsigned int feature_persistent:1;
 	unsigned int max_indirect_segments;
 	int is_ready;
+	/* Block layer tags. */
+	struct blk_mq_tag_set tag_set;
 };
 
 static unsigned int nr_minors;
@@ -582,66 +585,69 @@ static inline void flush_requests(struct blkfront_info *info)
 		notify_remote_via_irq(info->irq);
 }
 
-/*
- * do_blkif_request
- *  read a block; request is in a request queue
- */
-static void do_blkif_request(struct request_queue *rq)
+static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
-	struct blkfront_info *info = NULL;
-	struct request *req;
-	int queued;
-
-	pr_debug("Entered do_blkif_request\n");
-
-	queued = 0;
-
-	while ((req = blk_peek_request(rq)) != NULL) {
-		info = req->rq_disk->private_data;
+	struct blkfront_info *info = req->rq_disk->private_data;
 
-		if (RING_FULL(&info->ring))
-			goto wait;
+	spin_lock_irq(&info->io_lock);
+	if (RING_FULL(&info->ring))
+		goto wait;
 
-		blk_start_request(req);
+	if ((req->cmd_type != REQ_TYPE_FS) ||
+			((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
+			 !info->flush_op)) {
+		req->errors = -EIO;
+		blk_mq_complete_request(req);
+		spin_unlock_irq(&info->io_lock);
+		return BLK_MQ_RQ_QUEUE_ERROR;
+	}
 
-		if ((req->cmd_type != REQ_TYPE_FS) ||
-		    ((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
-		    !info->flush_op)) {
-			__blk_end_request_all(req, -EIO);
-			continue;
-		}
+	if (blkif_queue_request(req)) {
+		blk_mq_requeue_request(req);
+		goto wait;
+	}
 
-		pr_debug("do_blk_req %p: cmd %p, sec %lx, "
-			 "(%u/%u) [%s]\n",
-			 req, req->cmd, (unsigned long)blk_rq_pos(req),
-			 blk_rq_cur_sectors(req), blk_rq_sectors(req),
-			 rq_data_dir(req) ? "write" : "read");
+	flush_requests(info);
+	spin_unlock_irq(&info->io_lock);
+	return BLK_MQ_RQ_QUEUE_OK;
 
-		if (blkif_queue_request(req)) {
-			blk_requeue_request(rq, req);
 wait:
-			/* Avoid pointless unplugs. */
-			blk_stop_queue(rq);
-			break;
-		}
-
-		queued++;
-	}
-
-	if (queued != 0)
-		flush_requests(info);
+	/* Avoid pointless unplugs. */
+	blk_mq_stop_hw_queue(hctx);
+	spin_unlock_irq(&info->io_lock);
+	return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
+static struct blk_mq_ops blkfront_mq_ops = {
+	.queue_rq = blkfront_queue_rq,
+	.map_queue = blk_mq_map_queue,
+};
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 				unsigned int physical_sector_size,
 				unsigned int segments)
 {
 	struct request_queue *rq;
 	struct blkfront_info *info = gd->private_data;
+	int ret;
+
+	memset(&info->tag_set, 0, sizeof(info->tag_set));
+	info->tag_set.ops = &blkfront_mq_ops;
+	info->tag_set.nr_hw_queues = 1;
+	info->tag_set.queue_depth = BLK_RING_SIZE;
+	info->tag_set.numa_node = NUMA_NO_NODE;
+	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	info->tag_set.cmd_size = 0;
+	info->tag_set.driver_data = info;
 
-	rq = blk_init_queue(do_blkif_request, &info->io_lock);
-	if (rq == NULL)
-		return -1;
+	if ((ret = blk_mq_alloc_tag_set(&info->tag_set)))
+		return ret;
+	rq = blk_mq_init_queue(&info->tag_set);
+	if (IS_ERR(rq)) {
+		blk_mq_free_tag_set(&info->tag_set);
+		return PTR_ERR(rq);
+	}
+	rq->queuedata = info;
 
 	queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
 
@@ -871,7 +877,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	spin_lock_irqsave(&info->io_lock, flags);
 
 	/* No more blkif_request(). */
-	blk_stop_queue(info->rq);
+	blk_mq_stop_hw_queues(info->rq);
 
 	/* No more gnttab callback work. */
 	gnttab_cancel_free_callback(&info->callback);
@@ -887,30 +893,32 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	xlbd_release_minors(minor, nr_minors);
 
 	blk_cleanup_queue(info->rq);
+	blk_mq_free_tag_set(&info->tag_set);
 	info->rq = NULL;
 
 	put_disk(info->gd);
 	info->gd = NULL;
 }
 
-static void kick_pending_request_queues(struct blkfront_info *info)
+static void kick_pending_request_queues(struct blkfront_info *info,
+					unsigned long *flags)
 {
 	if (!RING_FULL(&info->ring)) {
-		/* Re-enable calldowns. */
-		blk_start_queue(info->rq);
-		/* Kick things off immediately. */
-		do_blkif_request(info->rq);
+		spin_unlock_irqrestore(&info->io_lock, *flags);
+		blk_mq_start_stopped_hw_queues(info->rq, 0);
+		spin_lock_irqsave(&info->io_lock, *flags);
 	}
 }
 
 static void blkif_restart_queue(struct work_struct *work)
 {
 	struct blkfront_info *info = container_of(work, struct blkfront_info, work);
+	unsigned long flags;
 
-	spin_lock_irq(&info->io_lock);
+	spin_lock_irqsave(&info->io_lock, flags);
 	if (info->connected == BLKIF_STATE_CONNECTED)
-		kick_pending_request_queues(info);
-	spin_unlock_irq(&info->io_lock);
+		kick_pending_request_queues(info, &flags);
+	spin_unlock_irqrestore(&info->io_lock, flags);
 }
 
 static void blkif_free(struct blkfront_info *info, int suspend)
@@ -925,7 +933,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
 	/* No more blkif_request(). */
 	if (info->rq)
-		blk_stop_queue(info->rq);
+		blk_mq_stop_hw_queues(info->rq);
 
 	/* Remove all persistent grants */
 	if (!list_empty(&info->grants)) {
@@ -1150,37 +1158,37 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 			continue;
 		}
 
-		error = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
+		error = req->errors = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
 		switch (bret->operation) {
 		case BLKIF_OP_DISCARD:
 			if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
 				struct request_queue *rq = info->rq;
 				printk(KERN_WARNING "blkfront: %s: %s op failed\n",
 					   info->gd->disk_name, op_name(bret->operation));
-				error = -EOPNOTSUPP;
+				error = req->errors = -EOPNOTSUPP;
 				info->feature_discard = 0;
 				info->feature_secdiscard = 0;
 				queue_flag_clear(QUEUE_FLAG_DISCARD, rq);
 				queue_flag_clear(QUEUE_FLAG_SECDISCARD, rq);
 			}
-			__blk_end_request_all(req, error);
+			blk_mq_complete_request(req);
 			break;
 		case BLKIF_OP_FLUSH_DISKCACHE:
 		case BLKIF_OP_WRITE_BARRIER:
 			if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
 				printk(KERN_WARNING "blkfront: %s: %s op failed\n",
 				       info->gd->disk_name, op_name(bret->operation));
-				error = -EOPNOTSUPP;
+				error = req->errors = -EOPNOTSUPP;
 			}
 			if (unlikely(bret->status == BLKIF_RSP_ERROR &&
 				     info->shadow[id].req.u.rw.nr_segments == 0)) {
 				printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
 				       info->gd->disk_name, op_name(bret->operation));
-				error = -EOPNOTSUPP;
+				error = req->errors = -EOPNOTSUPP;
 			}
 			if (unlikely(error)) {
 				if (error == -EOPNOTSUPP)
-					error = 0;
+					error = req->errors = 0;
 				info->feature_flush = 0;
 				info->flush_op = 0;
 				xlvbd_flush(info);
@@ -1192,7 +1200,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				dev_dbg(&info->xbdev->dev, "Bad return from blkdev data "
 					"request: %x\n", bret->status);
 
-			__blk_end_request_all(req, error);
+			blk_mq_complete_request(req);
 			break;
 		default:
 			BUG();
@@ -1209,7 +1217,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	} else
 		info->ring.sring->rsp_event = i + 1;
 
-	kick_pending_request_queues(info);
+	blkif_restart_queue_callback(info);
 
 	spin_unlock_irqrestore(&info->io_lock, flags);
 
@@ -1439,6 +1447,7 @@ static int blkif_recover(struct blkfront_info *info)
 	struct bio *bio, *cloned_bio;
 	struct bio_list bio_list, merge_bio;
 	unsigned int segs, offset;
+	unsigned long flags;
 	int pending, size;
 	struct split_bio *split_bio;
 	struct list_head requests;
@@ -1492,45 +1501,24 @@ static int blkif_recover(struct blkfront_info *info)
 
 	kfree(copy);
 
-	/*
-	 * Empty the queue, this is important because we might have
-	 * requests in the queue with more segments than what we
-	 * can handle now.
-	 */
-	spin_lock_irq(&info->io_lock);
-	while ((req = blk_fetch_request(info->rq)) != NULL) {
-		if (req->cmd_flags &
-		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
-			list_add(&req->queuelist, &requests);
-			continue;
-		}
-		merge_bio.head = req->bio;
-		merge_bio.tail = req->biotail;
-		bio_list_merge(&bio_list, &merge_bio);
-		req->bio = NULL;
-		if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
-			pr_alert("diskcache flush request found!\n");
-		__blk_put_request(info->rq, req);
-	}
-	spin_unlock_irq(&info->io_lock);
-
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
-	spin_lock_irq(&info->io_lock);
+	spin_lock_irqsave(&info->io_lock, flags);
 
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
 	/* Kick any other new requests queued since we resumed */
-	kick_pending_request_queues(info);
+	kick_pending_request_queues(info, &flags);
 
 	list_for_each_entry_safe(req, n, &requests, queuelist) {
 		/* Requeue pending requests (flush or discard) */
 		list_del_init(&req->queuelist);
 		BUG_ON(req->nr_phys_segments > segs);
-		blk_requeue_request(info->rq, req);
+		blk_mq_requeue_request(req);
 	}
-	spin_unlock_irq(&info->io_lock);
+
+	spin_unlock_irqrestore(&info->io_lock, flags);
 
 	while ((bio = bio_list_pop(&bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
@@ -1741,6 +1729,7 @@ static void blkfront_connect(struct blkfront_info *info)
 {
 	unsigned long long sectors;
 	unsigned long sector_size;
+	unsigned long flags;
 	unsigned int physical_sector_size;
 	unsigned int binfo;
 	int err;
@@ -1865,10 +1854,10 @@ static void blkfront_connect(struct blkfront_info *info)
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
 	/* Kick pending requests. */
-	spin_lock_irq(&info->io_lock);
+	spin_lock_irqsave(&info->io_lock, flags);
 	info->connected = BLKIF_STATE_CONNECTED;
-	kick_pending_request_queues(info);
-	spin_unlock_irq(&info->io_lock);
+	kick_pending_request_queues(info, &flags);
+	spin_unlock_irqrestore(&info->io_lock, flags);
 
 	add_disk(info->gd);
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (2 preceding siblings ...)
  2014-09-11 23:57 ` [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-09-11 23:57 ` [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend Arianna Avanzini
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

This commit introduces actual support for multiple block rings in
xen-blkfront. The number of block rings to be used is still forced
to one.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkfront.c | 710 +++++++++++++++++++++++++------------------
 1 file changed, 410 insertions(+), 300 deletions(-)
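
The heart of this change is a new per-ring structure that each blk-mq
hardware context points at, so that queue_rq can operate on its own ring and
lock. The following condensed sketch (most fields trimmed) reflects the
pattern used in the hunks below; it is an excerpt-style summary, not a
standalone implementation.

/* Condensed from the diff below; most fields omitted for brevity. */
struct blkfront_ring_info {
	spinlock_t io_lock;			/* per-ring lock, was per-device */
	struct blkif_front_ring ring;		/* one Xen I/O ring per hw queue */
	struct blkfront_info *info;		/* back-pointer to the device */
	unsigned int hctx_index;
};

/* Bind each hardware context to its ring when blk-mq initializes it. */
static int blkfront_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			      unsigned int index)
{
	struct blkfront_info *info = data;

	hctx->driver_data = &info->rinfo[index];
	info->rinfo[index].hctx_index = index;
	return 0;
}

/* queue_rq then retrieves its ring directly from the hardware context. */
static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
	struct blkfront_ring_info *rinfo = hctx->driver_data;
	/* ... take rinfo->io_lock and queue the request onto rinfo->ring ... */
	return BLK_MQ_RQ_QUEUE_OK;
}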

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 109add6..9282df1 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -102,30 +102,44 @@ MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default
 #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
 
 /*
+ * Data structure keeping per-ring info. A blkfront_info structure is always
+ * associated with one or more blkfront_ring_info.
+ */
+struct blkfront_ring_info
+{
+	spinlock_t io_lock;
+	int ring_ref;
+	struct blkif_front_ring ring;
+	unsigned int evtchn, irq;
+	struct blk_shadow shadow[BLK_RING_SIZE];
+	unsigned long shadow_free;
+
+	struct work_struct work;
+	struct gnttab_free_callback callback;
+	struct list_head grants;
+	struct list_head indirect_pages;
+	unsigned int persistent_gnts_c;
+
+	struct blkfront_info *info;
+	unsigned int hctx_index;
+};
+
+/*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
  * hang in private_data off the gendisk structure. We may end up
  * putting all kinds of interesting stuff here :-)
  */
 struct blkfront_info
 {
-	spinlock_t io_lock;
 	struct mutex mutex;
 	struct xenbus_device *xbdev;
 	struct gendisk *gd;
 	int vdevice;
 	blkif_vdev_t handle;
 	enum blkif_state connected;
-	int ring_ref;
-	struct blkif_front_ring ring;
-	unsigned int evtchn, irq;
+	unsigned int nr_rings;
+	struct blkfront_ring_info *rinfo;
 	struct request_queue *rq;
-	struct work_struct work;
-	struct gnttab_free_callback callback;
-	struct blk_shadow shadow[BLK_RING_SIZE];
-	struct list_head grants;
-	struct list_head indirect_pages;
-	unsigned int persistent_gnts_c;
-	unsigned long shadow_free;
 	unsigned int feature_flush;
 	unsigned int flush_op;
 	unsigned int feature_discard:1;
@@ -169,32 +183,35 @@ static DEFINE_SPINLOCK(minor_lock);
 #define INDIRECT_GREFS(_segs) \
 	((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
 
-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_gather_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo,
+				   unsigned int segs);
 
-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
-	unsigned long free = info->shadow_free;
+	unsigned long free = rinfo->shadow_free;
 	BUG_ON(free >= BLK_RING_SIZE);
-	info->shadow_free = info->shadow[free].req.u.rw.id;
-	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
+	rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+	rinfo->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
 	return free;
 }
 
-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
 			       unsigned long id)
 {
-	if (info->shadow[id].req.u.rw.id != id)
+	if (rinfo->shadow[id].req.u.rw.id != id)
 		return -EINVAL;
-	if (info->shadow[id].request == NULL)
+	if (rinfo->shadow[id].request == NULL)
 		return -EINVAL;
-	info->shadow[id].req.u.rw.id  = info->shadow_free;
-	info->shadow[id].request = NULL;
-	info->shadow_free = id;
+	rinfo->shadow[id].req.u.rw.id  = rinfo->shadow_free;
+	rinfo->shadow[id].request = NULL;
+	rinfo->shadow_free = id;
 	return 0;
 }
 
-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 {
+	struct blkfront_info *info = rinfo->info;
 	struct page *granted_page;
 	struct grant *gnt_list_entry, *n;
 	int i = 0;
@@ -214,7 +231,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
 		}
 
 		gnt_list_entry->gref = GRANT_INVALID_REF;
-		list_add(&gnt_list_entry->node, &info->grants);
+		list_add(&gnt_list_entry->node, &rinfo->grants);
 		i++;
 	}
 
@@ -222,7 +239,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
 
 out_of_memory:
 	list_for_each_entry_safe(gnt_list_entry, n,
-	                         &info->grants, node) {
+	                         &rinfo->grants, node) {
 		list_del(&gnt_list_entry->node);
 		if (info->feature_persistent)
 			__free_page(pfn_to_page(gnt_list_entry->pfn));
@@ -235,31 +252,31 @@ out_of_memory:
 
 static struct grant *get_grant(grant_ref_t *gref_head,
                                unsigned long pfn,
-                               struct blkfront_info *info)
+                               struct blkfront_ring_info *rinfo)
 {
 	struct grant *gnt_list_entry;
 	unsigned long buffer_mfn;
 
-	BUG_ON(list_empty(&info->grants));
-	gnt_list_entry = list_first_entry(&info->grants, struct grant,
+	BUG_ON(list_empty(&rinfo->grants));
+	gnt_list_entry = list_first_entry(&rinfo->grants, struct grant,
 	                                  node);
 	list_del(&gnt_list_entry->node);
 
 	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
-		info->persistent_gnts_c--;
+		rinfo->persistent_gnts_c--;
 		return gnt_list_entry;
 	}
 
 	/* Assign a gref to this page */
 	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
 	BUG_ON(gnt_list_entry->gref == -ENOSPC);
-	if (!info->feature_persistent) {
+	if (!rinfo->info->feature_persistent) {
 		BUG_ON(!pfn);
 		gnt_list_entry->pfn = pfn;
 	}
 	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
 	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
-	                                info->xbdev->otherend_id,
+	                                rinfo->info->xbdev->otherend_id,
 	                                buffer_mfn, 0);
 	return gnt_list_entry;
 }
@@ -330,8 +347,8 @@ static void xlbd_release_minors(unsigned int minor, unsigned int nr)
 
 static void blkif_restart_queue_callback(void *arg)
 {
-	struct blkfront_info *info = (struct blkfront_info *)arg;
-	schedule_work(&info->work);
+	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)arg;
+	schedule_work(&rinfo->work);
 }
 
 static int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
@@ -388,10 +405,13 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
  * and writes are handled as expected.
  *
  * @req: a request struct
+ * @ring_idx: index of the ring the request is to be inserted in
  */
-static int blkif_queue_request(struct request *req)
+static int blkif_queue_request(struct request *req, unsigned int ring_idx)
 {
 	struct blkfront_info *info = req->rq_disk->private_data;
+	struct blkfront_ring_info *rinfo = &info->rinfo[ring_idx];
+	struct blkif_front_ring *ring = &info->rinfo[ring_idx].ring;
 	struct blkif_request *ring_req;
 	unsigned long id;
 	unsigned int fsect, lsect;
@@ -421,15 +441,15 @@ static int blkif_queue_request(struct request *req)
 		max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
 
 	/* Check if we have enough grants to allocate a requests */
-	if (info->persistent_gnts_c < max_grefs) {
+	if (rinfo->persistent_gnts_c < max_grefs) {
 		new_persistent_gnts = 1;
 		if (gnttab_alloc_grant_references(
-		    max_grefs - info->persistent_gnts_c,
+		    max_grefs - rinfo->persistent_gnts_c,
 		    &gref_head) < 0) {
 			gnttab_request_free_callback(
-				&info->callback,
+				&rinfo->callback,
 				blkif_restart_queue_callback,
-				info,
+				rinfo,
 				max_grefs);
 			return 1;
 		}
@@ -437,9 +457,9 @@ static int blkif_queue_request(struct request *req)
 		new_persistent_gnts = 0;
 
 	/* Fill out a communications ring structure. */
-	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
-	id = get_id_from_freelist(info);
-	info->shadow[id].request = req;
+	ring_req = RING_GET_REQUEST(ring, ring->req_prod_pvt);
+	id = get_id_from_freelist(rinfo);
+	rinfo->shadow[id].request = req;
 
 	if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
 		ring_req->operation = BLKIF_OP_DISCARD;
@@ -455,7 +475,7 @@ static int blkif_queue_request(struct request *req)
 		       req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
 		BUG_ON(info->max_indirect_segments &&
 		       req->nr_phys_segments > info->max_indirect_segments);
-		nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
+		nseg = blk_rq_map_sg(req->q, req, rinfo->shadow[id].sg);
 		ring_req->u.rw.id = id;
 		if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			/*
@@ -486,7 +506,7 @@ static int blkif_queue_request(struct request *req)
 			}
 			ring_req->u.rw.nr_segments = nseg;
 		}
-		for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+		for_each_sg(rinfo->shadow[id].sg, sg, nseg, i) {
 			fsect = sg->offset >> 9;
 			lsect = fsect + (sg->length >> 9) - 1;
 
@@ -502,22 +522,22 @@ static int blkif_queue_request(struct request *req)
 					struct page *indirect_page;
 
 					/* Fetch a pre-allocated page to use for indirect grefs */
-					BUG_ON(list_empty(&info->indirect_pages));
-					indirect_page = list_first_entry(&info->indirect_pages,
+					BUG_ON(list_empty(&rinfo->indirect_pages));
+					indirect_page = list_first_entry(&rinfo->indirect_pages,
 					                                 struct page, lru);
 					list_del(&indirect_page->lru);
 					pfn = page_to_pfn(indirect_page);
 				}
-				gnt_list_entry = get_grant(&gref_head, pfn, info);
-				info->shadow[id].indirect_grants[n] = gnt_list_entry;
+				gnt_list_entry = get_grant(&gref_head, pfn, rinfo);
+				rinfo->shadow[id].indirect_grants[n] = gnt_list_entry;
 				segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
 				ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
 			}
 
-			gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
+			gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), rinfo);
 			ref = gnt_list_entry->gref;
 
-			info->shadow[id].grants_used[i] = gnt_list_entry;
+			rinfo->shadow[id].grants_used[i] = gnt_list_entry;
 
 			if (rq_data_dir(req) && info->feature_persistent) {
 				char *bvec_data;
@@ -563,10 +583,10 @@ static int blkif_queue_request(struct request *req)
 			kunmap_atomic(segments);
 	}
 
-	info->ring.req_prod_pvt++;
+	ring->req_prod_pvt++;
 
 	/* Keep a private copy so we can reissue requests when recovering. */
-	info->shadow[id].req = *ring_req;
+	rinfo->shadow[id].req = *ring_req;
 
 	if (new_persistent_gnts)
 		gnttab_free_grant_references(gref_head);
@@ -575,22 +595,26 @@ static int blkif_queue_request(struct request *req)
 }
 
 
-static inline void flush_requests(struct blkfront_info *info)
+static inline void flush_requests(struct blkfront_info *info,
+				  unsigned int ring_idx)
 {
+	struct blkfront_ring_info *rinfo = &info->rinfo[ring_idx];
 	int notify;
 
-	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
+	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&rinfo->ring, notify);
 
 	if (notify)
-		notify_remote_via_irq(info->irq);
+		notify_remote_via_irq(rinfo->irq);
 }
 
 static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
-	struct blkfront_info *info = req->rq_disk->private_data;
+	struct blkfront_ring_info *rinfo =
+			(struct blkfront_ring_info *)hctx->driver_data;
+	struct blkfront_info *info = rinfo->info;
 
-	spin_lock_irq(&info->io_lock);
-	if (RING_FULL(&info->ring))
+	spin_lock_irq(&rinfo->io_lock);
+	if (RING_FULL(&rinfo->ring))
 		goto wait;
 
 	if ((req->cmd_type != REQ_TYPE_FS) ||
@@ -598,28 +622,40 @@ static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 			 !info->flush_op)) {
 		req->errors = -EIO;
 		blk_mq_complete_request(req);
-		spin_unlock_irq(&info->io_lock);
+		spin_unlock_irq(&rinfo->io_lock);
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (blkif_queue_request(req)) {
+	if (blkif_queue_request(req, rinfo->hctx_index)) {
 		blk_mq_requeue_request(req);
 		goto wait;
 	}
 
-	flush_requests(info);
-	spin_unlock_irq(&info->io_lock);
+	flush_requests(info, rinfo->hctx_index);
+	spin_unlock_irq(&rinfo->io_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
 
 wait:
 	/* Avoid pointless unplugs. */
 	blk_mq_stop_hw_queue(hctx);
-	spin_unlock_irq(&info->io_lock);
+	spin_unlock_irq(&rinfo->io_lock);
 	return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
+static int blkfront_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+			      unsigned int index)
+{
+	struct blkfront_info *info = (struct blkfront_info *)data;
+
+	hctx->driver_data = &info->rinfo[index];
+	info->rinfo[index].hctx_index = index;
+
+	return 0;
+}
+
 static struct blk_mq_ops blkfront_mq_ops = {
 	.queue_rq = blkfront_queue_rq,
+	.init_hctx = blkfront_init_hctx,
 	.map_queue = blk_mq_map_queue,
 };
 
@@ -870,21 +906,23 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 {
 	unsigned int minor, nr_minors;
 	unsigned long flags;
+	int i;
 
 	if (info->rq == NULL)
 		return;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	/* No more blkif_request() and gnttab callback work. */
 
-	/* No more blkif_request(). */
 	blk_mq_stop_hw_queues(info->rq);
-
-	/* No more gnttab callback work. */
-	gnttab_cancel_free_callback(&info->callback);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		gnttab_cancel_free_callback(&info->rinfo[i].callback);
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
+	}
 
 	/* Flush gnttab callback work. Must be done with no locks held. */
-	flush_work(&info->work);
+	for (i = 0 ; i < info->nr_rings ; i++)
+		flush_work(&info->rinfo[i].work);
 
 	del_gendisk(info->gd);
 
@@ -900,92 +938,55 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	info->gd = NULL;
 }
 
-static void kick_pending_request_queues(struct blkfront_info *info,
+static void kick_pending_request_queues(struct blkfront_ring_info *rinfo,
 					unsigned long *flags)
 {
-	if (!RING_FULL(&info->ring)) {
-		spin_unlock_irqrestore(&info->io_lock, *flags);
-		blk_mq_start_stopped_hw_queues(info->rq, 0);
-		spin_lock_irqsave(&info->io_lock, *flags);
+	if (!RING_FULL(&rinfo->ring)) {
+		spin_unlock_irqrestore(&rinfo->io_lock, *flags);
+		blk_mq_start_stopped_hw_queues(rinfo->info->rq, 0);
+		spin_lock_irqsave(&rinfo->io_lock, *flags);
 	}
 }
 
 static void blkif_restart_queue(struct work_struct *work)
 {
-	struct blkfront_info *info = container_of(work, struct blkfront_info, work);
+	struct blkfront_ring_info *rinfo = container_of(work,
+				struct blkfront_ring_info, work);
+	struct blkfront_info *info = rinfo->info;
 	unsigned long flags;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	spin_lock_irqsave(&rinfo->io_lock, flags);
 	if (info->connected == BLKIF_STATE_CONNECTED)
-		kick_pending_request_queues(info, &flags);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+		kick_pending_request_queues(rinfo, &flags);
+	spin_unlock_irqrestore(&rinfo->io_lock, flags);
 }
 
-static void blkif_free(struct blkfront_info *info, int suspend)
+static void blkif_free_ring(struct blkfront_ring_info *rinfo,
+			    int persistent)
 {
 	struct grant *persistent_gnt;
-	struct grant *n;
 	int i, j, segs;
 
-	/* Prevent new requests being issued until we fix things up. */
-	spin_lock_irq(&info->io_lock);
-	info->connected = suspend ?
-		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
-	/* No more blkif_request(). */
-	if (info->rq)
-		blk_mq_stop_hw_queues(info->rq);
-
-	/* Remove all persistent grants */
-	if (!list_empty(&info->grants)) {
-		list_for_each_entry_safe(persistent_gnt, n,
-		                         &info->grants, node) {
-			list_del(&persistent_gnt->node);
-			if (persistent_gnt->gref != GRANT_INVALID_REF) {
-				gnttab_end_foreign_access(persistent_gnt->gref,
-				                          0, 0UL);
-				info->persistent_gnts_c--;
-			}
-			if (info->feature_persistent)
-				__free_page(pfn_to_page(persistent_gnt->pfn));
-			kfree(persistent_gnt);
-		}
-	}
-	BUG_ON(info->persistent_gnts_c != 0);
-
-	/*
-	 * Remove indirect pages, this only happens when using indirect
-	 * descriptors but not persistent grants
-	 */
-	if (!list_empty(&info->indirect_pages)) {
-		struct page *indirect_page, *n;
-
-		BUG_ON(info->feature_persistent);
-		list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
-			list_del(&indirect_page->lru);
-			__free_page(indirect_page);
-		}
-	}
-
 	for (i = 0; i < BLK_RING_SIZE; i++) {
 		/*
 		 * Clear persistent grants present in requests already
 		 * on the shared ring
 		 */
-		if (!info->shadow[i].request)
+		if (!rinfo->shadow[i].request)
 			goto free_shadow;
 
-		segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
-		       info->shadow[i].req.u.indirect.nr_segments :
-		       info->shadow[i].req.u.rw.nr_segments;
+		segs = rinfo->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+		       rinfo->shadow[i].req.u.indirect.nr_segments :
+		       rinfo->shadow[i].req.u.rw.nr_segments;
 		for (j = 0; j < segs; j++) {
-			persistent_gnt = info->shadow[i].grants_used[j];
+			persistent_gnt = rinfo->shadow[i].grants_used[j];
 			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
-			if (info->feature_persistent)
+			if (persistent)
 				__free_page(pfn_to_page(persistent_gnt->pfn));
 			kfree(persistent_gnt);
 		}
 
-		if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+		if (rinfo->shadow[i].req.operation != BLKIF_OP_INDIRECT)
 			/*
 			 * If this is not an indirect operation don't try to
 			 * free indirect segments
@@ -993,44 +994,101 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 			goto free_shadow;
 
 		for (j = 0; j < INDIRECT_GREFS(segs); j++) {
-			persistent_gnt = info->shadow[i].indirect_grants[j];
+			persistent_gnt = rinfo->shadow[i].indirect_grants[j];
 			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
 			__free_page(pfn_to_page(persistent_gnt->pfn));
 			kfree(persistent_gnt);
 		}
 
 free_shadow:
-		kfree(info->shadow[i].grants_used);
-		info->shadow[i].grants_used = NULL;
-		kfree(info->shadow[i].indirect_grants);
-		info->shadow[i].indirect_grants = NULL;
-		kfree(info->shadow[i].sg);
-		info->shadow[i].sg = NULL;
+		kfree(rinfo->shadow[i].grants_used);
+		rinfo->shadow[i].grants_used = NULL;
+		kfree(rinfo->shadow[i].indirect_grants);
+		rinfo->shadow[i].indirect_grants = NULL;
+		kfree(rinfo->shadow[i].sg);
+		rinfo->shadow[i].sg = NULL;
 	}
 
-	/* No more gnttab callback work. */
-	gnttab_cancel_free_callback(&info->callback);
-	spin_unlock_irq(&info->io_lock);
+}
 
-	/* Flush gnttab callback work. Must be done with no locks held. */
-	flush_work(&info->work);
+static void blkif_free(struct blkfront_info *info, int suspend)
+{
+	struct grant *persistent_gnt;
+	struct grant *n;
+	int i;
 
-	/* Free resources associated with old device channel. */
-	if (info->ring_ref != GRANT_INVALID_REF) {
-		gnttab_end_foreign_access(info->ring_ref, 0,
-					  (unsigned long)info->ring.sring);
-		info->ring_ref = GRANT_INVALID_REF;
-		info->ring.sring = NULL;
-	}
-	if (info->irq)
-		unbind_from_irqhandler(info->irq, info);
-	info->evtchn = info->irq = 0;
+	info->connected = suspend ?
+		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
+
+	/*
+	 * Prevent new requests being issued until we fix things up:
+	 * no more blkif_request() and no more gnttab callback work.
+	 */
+	if (info->rq) {
+		blk_mq_stop_hw_queues(info->rq);
+
+		for (i = 0 ; i < info->nr_rings ; i++) {
+			struct blkfront_ring_info *rinfo = &info->rinfo[i];
+
+			spin_lock_irq(&info->rinfo[i].io_lock);
+			/* Remove all persistent grants */
+			if (!list_empty(&rinfo->grants)) {
+				list_for_each_entry_safe(persistent_gnt, n,
+				                         &rinfo->grants, node) {
+					list_del(&persistent_gnt->node);
+					if (persistent_gnt->gref != GRANT_INVALID_REF) {
+						gnttab_end_foreign_access(persistent_gnt->gref,
+						                          0, 0UL);
+						rinfo->persistent_gnts_c--;
+					}
+					if (info->feature_persistent)
+						__free_page(pfn_to_page(persistent_gnt->pfn));
+					kfree(persistent_gnt);
+				}
+			}
+			BUG_ON(rinfo->persistent_gnts_c != 0);
+
+			/*
+			 * Remove indirect pages, this only happens when using indirect
+			 * descriptors but not persistent grants
+			 */
+			if (!list_empty(&rinfo->indirect_pages)) {
+				struct page *indirect_page, *n;
+
+				BUG_ON(info->feature_persistent);
+				list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
+					list_del(&indirect_page->lru);
+					__free_page(indirect_page);
+				}
+			}
 
+			blkif_free_ring(rinfo, info->feature_persistent);
+
+			gnttab_cancel_free_callback(&rinfo->callback);
+			spin_unlock_irq(&rinfo->io_lock);
+
+			/* Flush gnttab callback work. Must be done with no locks held. */
+			flush_work(&info->rinfo[i].work);
+
+			/* Free resources associated with old device channel. */
+			if (rinfo->ring_ref != GRANT_INVALID_REF) {
+				gnttab_end_foreign_access(rinfo->ring_ref, 0,
+							  (unsigned long)rinfo->ring.sring);
+				rinfo->ring_ref = GRANT_INVALID_REF;
+				rinfo->ring.sring = NULL;
+			}
+			if (rinfo->irq)
+				unbind_from_irqhandler(rinfo->irq, rinfo);
+			rinfo->evtchn = rinfo->irq = 0;
+		}
+	}
 }
 
-static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+static void blkif_completion(struct blk_shadow *s,
+			     struct blkfront_ring_info *rinfo,
 			     struct blkif_response *bret)
 {
+	struct blkfront_info *info = rinfo->info;
 	int i = 0;
 	struct scatterlist *sg;
 	char *bvec_data;
@@ -1071,8 +1129,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 			if (!info->feature_persistent)
 				pr_alert_ratelimited("backed has not unmapped grant: %u\n",
 						     s->grants_used[i]->gref);
-			list_add(&s->grants_used[i]->node, &info->grants);
-			info->persistent_gnts_c++;
+			list_add(&s->grants_used[i]->node, &rinfo->grants);
+			rinfo->persistent_gnts_c++;
 		} else {
 			/*
 			 * If the grant is not mapped by the backend we end the
@@ -1082,7 +1140,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 			 */
 			gnttab_end_foreign_access(s->grants_used[i]->gref, 0, 0UL);
 			s->grants_used[i]->gref = GRANT_INVALID_REF;
-			list_add_tail(&s->grants_used[i]->node, &info->grants);
+			list_add_tail(&s->grants_used[i]->node, &rinfo->grants);
 		}
 	}
 	if (s->req.operation == BLKIF_OP_INDIRECT) {
@@ -1091,8 +1149,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 				if (!info->feature_persistent)
 					pr_alert_ratelimited("backed has not unmapped grant: %u\n",
 							     s->indirect_grants[i]->gref);
-				list_add(&s->indirect_grants[i]->node, &info->grants);
-				info->persistent_gnts_c++;
+				list_add(&s->indirect_grants[i]->node, &rinfo->grants);
+				rinfo->persistent_gnts_c++;
 			} else {
 				struct page *indirect_page;
 
@@ -1102,9 +1160,9 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 				 * available pages for indirect grefs.
 				 */
 				indirect_page = pfn_to_page(s->indirect_grants[i]->pfn);
-				list_add(&indirect_page->lru, &info->indirect_pages);
+				list_add(&indirect_page->lru, &rinfo->indirect_pages);
 				s->indirect_grants[i]->gref = GRANT_INVALID_REF;
-				list_add_tail(&s->indirect_grants[i]->node, &info->grants);
+				list_add_tail(&s->indirect_grants[i]->node, &rinfo->grants);
 			}
 		}
 	}
@@ -1116,24 +1174,25 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	struct blkif_response *bret;
 	RING_IDX i, rp;
 	unsigned long flags;
-	struct blkfront_info *info = (struct blkfront_info *)dev_id;
+	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
+	struct blkfront_info *info = rinfo->info;
 	int error;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	spin_lock_irqsave(&rinfo->io_lock, flags);
 
 	if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
-		spin_unlock_irqrestore(&info->io_lock, flags);
+		spin_unlock_irqrestore(&rinfo->io_lock, flags);
 		return IRQ_HANDLED;
 	}
 
  again:
-	rp = info->ring.sring->rsp_prod;
+	rp = rinfo->ring.sring->rsp_prod;
 	rmb(); /* Ensure we see queued responses up to 'rp'. */
 
-	for (i = info->ring.rsp_cons; i != rp; i++) {
+	for (i = rinfo->ring.rsp_cons; i != rp; i++) {
 		unsigned long id;
 
-		bret = RING_GET_RESPONSE(&info->ring, i);
+		bret = RING_GET_RESPONSE(&rinfo->ring, i);
 		id   = bret->id;
 		/*
 		 * The backend has messed up and given us an id that we would
@@ -1147,12 +1206,12 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 			 * the id is busted. */
 			continue;
 		}
-		req  = info->shadow[id].request;
+		req  = rinfo->shadow[id].request;
 
 		if (bret->operation != BLKIF_OP_DISCARD)
-			blkif_completion(&info->shadow[id], info, bret);
+			blkif_completion(&rinfo->shadow[id], rinfo, bret);
 
-		if (add_id_to_freelist(info, id)) {
+		if (add_id_to_freelist(rinfo, id)) {
 			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
 			     info->gd->disk_name, op_name(bret->operation), id);
 			continue;
@@ -1181,7 +1240,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				error = req->errors = -EOPNOTSUPP;
 			}
 			if (unlikely(bret->status == BLKIF_RSP_ERROR &&
-				     info->shadow[id].req.u.rw.nr_segments == 0)) {
+				     rinfo->shadow[id].req.u.rw.nr_segments == 0)) {
 				printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
 				       info->gd->disk_name, op_name(bret->operation));
 				error = req->errors = -EOPNOTSUPP;
@@ -1207,31 +1266,31 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 		}
 	}
 
-	info->ring.rsp_cons = i;
+	rinfo->ring.rsp_cons = i;
 
-	if (i != info->ring.req_prod_pvt) {
+	if (i != rinfo->ring.req_prod_pvt) {
 		int more_to_do;
-		RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
+		RING_FINAL_CHECK_FOR_RESPONSES(&rinfo->ring, more_to_do);
 		if (more_to_do)
 			goto again;
 	} else
-		info->ring.sring->rsp_event = i + 1;
+		rinfo->ring.sring->rsp_event = i + 1;
 
-	blkif_restart_queue_callback(info);
+	blkif_restart_queue_callback(rinfo);
 
-	spin_unlock_irqrestore(&info->io_lock, flags);
+	spin_unlock_irqrestore(&rinfo->io_lock, flags);
 
 	return IRQ_HANDLED;
 }
 
 
 static int setup_blkring(struct xenbus_device *dev,
-			 struct blkfront_info *info)
+			 struct blkfront_ring_info *rinfo)
 {
 	struct blkif_sring *sring;
 	int err;
 
-	info->ring_ref = GRANT_INVALID_REF;
+	rinfo->ring_ref = GRANT_INVALID_REF;
 
 	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO | __GFP_HIGH);
 	if (!sring) {
@@ -1239,32 +1298,32 @@ static int setup_blkring(struct xenbus_device *dev,
 		return -ENOMEM;
 	}
 	SHARED_RING_INIT(sring);
-	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
+	FRONT_RING_INIT(&rinfo->ring, sring, PAGE_SIZE);
 
-	err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
+	err = xenbus_grant_ring(dev, virt_to_mfn(rinfo->ring.sring));
 	if (err < 0) {
 		free_page((unsigned long)sring);
-		info->ring.sring = NULL;
+		rinfo->ring.sring = NULL;
 		goto fail;
 	}
-	info->ring_ref = err;
+	rinfo->ring_ref = err;
 
-	err = xenbus_alloc_evtchn(dev, &info->evtchn);
+	err = xenbus_alloc_evtchn(dev, &rinfo->evtchn);
 	if (err)
 		goto fail;
 
-	err = bind_evtchn_to_irqhandler(info->evtchn, blkif_interrupt, 0,
-					"blkif", info);
+	err = bind_evtchn_to_irqhandler(rinfo->evtchn, blkif_interrupt, 0,
+					"blkif", rinfo);
 	if (err <= 0) {
 		xenbus_dev_fatal(dev, err,
 				 "bind_evtchn_to_irqhandler failed");
 		goto fail;
 	}
-	info->irq = err;
+	rinfo->irq = err;
 
 	return 0;
 fail:
-	blkif_free(info, 0);
+	blkif_free(rinfo->info, 0);
 	return err;
 }
 
@@ -1274,13 +1333,16 @@ static int talk_to_blkback(struct xenbus_device *dev,
 			   struct blkfront_info *info)
 {
 	const char *message = NULL;
+	char ring_ref_s[64] = "", evtchn_s[64] = "";
 	struct xenbus_transaction xbt;
-	int err;
+	int i, err;
 
-	/* Create shared ring, alloc event channel. */
-	err = setup_blkring(dev, info);
-	if (err)
-		goto out;
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		/* Create shared ring, alloc event channel. */
+		err = setup_blkring(dev, &info->rinfo[i]);
+		if (err)
+			goto out;
+	}
 
 again:
 	err = xenbus_transaction_start(&xbt);
@@ -1289,18 +1351,24 @@ again:
 		goto destroy_blkring;
 	}
 
-	err = xenbus_printf(xbt, dev->nodename,
-			    "ring-ref", "%u", info->ring_ref);
-	if (err) {
-		message = "writing ring-ref";
-		goto abort_transaction;
-	}
-	err = xenbus_printf(xbt, dev->nodename,
-			    "event-channel", "%u", info->evtchn);
-	if (err) {
-		message = "writing event-channel";
-		goto abort_transaction;
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		BUG_ON(i > 0);
+		snprintf(ring_ref_s, 64, "ring-ref");
+		snprintf(evtchn_s, 64, "event-channel");
+		err = xenbus_printf(xbt, dev->nodename,
+				    ring_ref_s, "%u", info->rinfo[i].ring_ref);
+		if (err) {
+			message = "writing ring-ref";
+			goto abort_transaction;
+		}
+		err = xenbus_printf(xbt, dev->nodename,
+				    evtchn_s, "%u", info->rinfo[i].evtchn);
+		if (err) {
+			message = "writing event-channel";
+			goto abort_transaction;
+		}
 	}
+
 	err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
 			    XEN_IO_PROTO_ABI_NATIVE);
 	if (err) {
@@ -1344,7 +1412,7 @@ again:
 static int blkfront_probe(struct xenbus_device *dev,
 			  const struct xenbus_device_id *id)
 {
-	int err, vdevice, i;
+	int err, vdevice, i, r;
 	struct blkfront_info *info;
 
 	/* FIXME: Use dynamic device id if this is not set. */
@@ -1396,23 +1464,36 @@ static int blkfront_probe(struct xenbus_device *dev,
 	}
 
 	mutex_init(&info->mutex);
-	spin_lock_init(&info->io_lock);
 	info->xbdev = dev;
 	info->vdevice = vdevice;
-	INIT_LIST_HEAD(&info->grants);
-	INIT_LIST_HEAD(&info->indirect_pages);
-	info->persistent_gnts_c = 0;
 	info->connected = BLKIF_STATE_DISCONNECTED;
-	INIT_WORK(&info->work, blkif_restart_queue);
-
-	for (i = 0; i < BLK_RING_SIZE; i++)
-		info->shadow[i].req.u.rw.id = i+1;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
 
 	/* Front end dir is a number, which is used as the id. */
 	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
 	dev_set_drvdata(&dev->dev, info);
 
+	/* Allocate the correct number of rings. */
+	info->nr_rings = 1;
+	pr_info("blkfront: %s: %d rings\n",
+		info->gd->disk_name, info->nr_rings);
+
+	info->rinfo = kzalloc(info->nr_rings *
+				sizeof(struct blkfront_ring_info),
+			      GFP_KERNEL);
+	for (r = 0 ; r < info->nr_rings ; r++) {
+		struct blkfront_ring_info *rinfo = &info->rinfo[r];
+
+		rinfo->info = info;
+		rinfo->persistent_gnts_c = 0;
+		INIT_LIST_HEAD(&rinfo->grants);
+		INIT_LIST_HEAD(&rinfo->indirect_pages);
+		INIT_WORK(&rinfo->work, blkif_restart_queue);
+		spin_lock_init(&rinfo->io_lock);
+		for (i = 0; i < BLK_RING_SIZE; i++)
+			rinfo->shadow[i].req.u.rw.id = i+1;
+		rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+	}
+
 	err = talk_to_blkback(dev, info);
 	if (err) {
 		kfree(info);
@@ -1438,88 +1519,100 @@ static void split_bio_end(struct bio *bio, int error)
 	bio_put(bio);
 }
 
-static int blkif_recover(struct blkfront_info *info)
+static int blkif_setup_shadow(struct blkfront_ring_info *rinfo,
+			      struct blk_shadow **copy)
 {
 	int i;
+
+	/* Stage 1: Make a safe copy of the shadow state. */
+	*copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
+		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
+	if (!*copy)
+		return -ENOMEM;
+
+	/* Stage 2: Set up free list. */
+	memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
+	for (i = 0; i < BLK_RING_SIZE; i++)
+		rinfo->shadow[i].req.u.rw.id = i+1;
+	rinfo->shadow_free = rinfo->ring.req_prod_pvt;
+	rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+
+	return 0;
+}
+
+static int blkif_recover(struct blkfront_info *info)
+{
+	int i, r;
 	struct request *req, *n;
 	struct blk_shadow *copy;
-	int rc;
+	int rc = 0;
 	struct bio *bio, *cloned_bio;
-	struct bio_list bio_list, merge_bio;
+	struct bio_list uninitialized_var(bio_list), merge_bio;
 	unsigned int segs, offset;
 	unsigned long flags;
 	int pending, size;
 	struct split_bio *split_bio;
 	struct list_head requests;
 
-	/* Stage 1: Make a safe copy of the shadow state. */
-	copy = kmemdup(info->shadow, sizeof(info->shadow),
-		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
-	if (!copy)
-		return -ENOMEM;
-
-	/* Stage 2: Set up free list. */
-	memset(&info->shadow, 0, sizeof(info->shadow));
-	for (i = 0; i < BLK_RING_SIZE; i++)
-		info->shadow[i].req.u.rw.id = i+1;
-	info->shadow_free = info->ring.req_prod_pvt;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+	segs = blkfront_gather_indirect(info);
 
-	rc = blkfront_setup_indirect(info);
-	if (rc) {
-		kfree(copy);
-		return rc;
-	}
+	for (r = 0 ; r < info->nr_rings ; r++) {
+		rc |= blkif_setup_shadow(&info->rinfo[r], &copy);
+		rc |= blkfront_setup_indirect(&info->rinfo[r], segs);
+		if (rc) {
+			kfree(copy);
+			return rc;
+		}
 
-	segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-	blk_queue_max_segments(info->rq, segs);
-	bio_list_init(&bio_list);
-	INIT_LIST_HEAD(&requests);
-	for (i = 0; i < BLK_RING_SIZE; i++) {
-		/* Not in use? */
-		if (!copy[i].request)
-			continue;
+		segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
+		blk_queue_max_segments(info->rq, segs);
+		bio_list_init(&bio_list);
+		INIT_LIST_HEAD(&requests);
+		for (i = 0; i < BLK_RING_SIZE; i++) {
+			/* Not in use? */
+			if (!copy[i].request)
+				continue;
 
-		/*
-		 * Get the bios in the request so we can re-queue them.
-		 */
-		if (copy[i].request->cmd_flags &
-		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
 			/*
-			 * Flush operations don't contain bios, so
-			 * we need to requeue the whole request
+			 * Get the bios in the request so we can re-queue them.
 			 */
-			list_add(&copy[i].request->queuelist, &requests);
-			continue;
+			if (copy[i].request->cmd_flags &
+			    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+				/*
+				 * Flush operations don't contain bios, so
+				 * we need to requeue the whole request
+				 */
+				list_add(&copy[i].request->queuelist, &requests);
+				continue;
+			}
+			merge_bio.head = copy[i].request->bio;
+			merge_bio.tail = copy[i].request->biotail;
+			bio_list_merge(&bio_list, &merge_bio);
+			copy[i].request->bio = NULL;
+			blk_put_request(copy[i].request);
 		}
-		merge_bio.head = copy[i].request->bio;
-		merge_bio.tail = copy[i].request->biotail;
-		bio_list_merge(&bio_list, &merge_bio);
-		copy[i].request->bio = NULL;
-		blk_put_request(copy[i].request);
+		kfree(copy);
 	}
 
-	kfree(copy);
-
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
-	spin_lock_irqsave(&info->io_lock, flags);
-
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	/* Kick any other new requests queued since we resumed */
-	kick_pending_request_queues(info, &flags);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		/* Kick any other new requests queued since we resumed */
+		kick_pending_request_queues(&info->rinfo[i], &flags);
 
-	list_for_each_entry_safe(req, n, &requests, queuelist) {
-		/* Requeue pending requests (flush or discard) */
-		list_del_init(&req->queuelist);
-		BUG_ON(req->nr_phys_segments > segs);
-		blk_mq_requeue_request(req);
+		list_for_each_entry_safe(req, n, &requests, queuelist) {
+			/* Requeue pending requests (flush or discard) */
+			list_del_init(&req->queuelist);
+			BUG_ON(req->nr_phys_segments > segs);
+			blk_mq_requeue_request(req);
+		}
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
 	}
 
-	spin_unlock_irqrestore(&info->io_lock, flags);
-
 	while ((bio = bio_list_pop(&bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
 		if (bio_segments(bio) > segs) {
@@ -1643,14 +1736,15 @@ static void blkfront_setup_discard(struct blkfront_info *info)
 		info->feature_secdiscard = !!discard_secure;
 }
 
-static int blkfront_setup_indirect(struct blkfront_info *info)
+
+static int blkfront_gather_indirect(struct blkfront_info *info)
 {
 	unsigned int indirect_segments, segs;
-	int err, i;
+	int err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+				"feature-max-indirect-segments", "%u",
+				&indirect_segments,
+				NULL);
 
-	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-			    "feature-max-indirect-segments", "%u", &indirect_segments,
-			    NULL);
 	if (err) {
 		info->max_indirect_segments = 0;
 		segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
@@ -1660,7 +1754,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		segs = info->max_indirect_segments;
 	}
 
-	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+	return segs;
+}
+
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo,
+				   unsigned int segs)
+{
+	struct blkfront_info *info = rinfo->info;
+	int err, i;
+
+	err = fill_grant_buffer(rinfo, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
 	if (err)
 		goto out_of_memory;
 
@@ -1672,31 +1775,31 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		 */
 		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
 
-		BUG_ON(!list_empty(&info->indirect_pages));
+		BUG_ON(!list_empty(&rinfo->indirect_pages));
 		for (i = 0; i < num; i++) {
 			struct page *indirect_page = alloc_page(GFP_NOIO);
 			if (!indirect_page)
 				goto out_of_memory;
-			list_add(&indirect_page->lru, &info->indirect_pages);
+			list_add(&indirect_page->lru, &rinfo->indirect_pages);
 		}
 	}
 
 	for (i = 0; i < BLK_RING_SIZE; i++) {
-		info->shadow[i].grants_used = kzalloc(
-			sizeof(info->shadow[i].grants_used[0]) * segs,
+		rinfo->shadow[i].grants_used = kzalloc(
+			sizeof(rinfo->shadow[i].grants_used[0]) * segs,
 			GFP_NOIO);
-		info->shadow[i].sg = kzalloc(sizeof(info->shadow[i].sg[0]) * segs, GFP_NOIO);
+		rinfo->shadow[i].sg = kzalloc(sizeof(rinfo->shadow[i].sg[0]) * segs, GFP_NOIO);
 		if (info->max_indirect_segments)
-			info->shadow[i].indirect_grants = kzalloc(
-				sizeof(info->shadow[i].indirect_grants[0]) *
+			rinfo->shadow[i].indirect_grants = kzalloc(
+				sizeof(rinfo->shadow[i].indirect_grants[0]) *
 				INDIRECT_GREFS(segs),
 				GFP_NOIO);
-		if ((info->shadow[i].grants_used == NULL) ||
-			(info->shadow[i].sg == NULL) ||
+		if ((rinfo->shadow[i].grants_used == NULL) ||
+			(rinfo->shadow[i].sg == NULL) ||
 		     (info->max_indirect_segments &&
-		     (info->shadow[i].indirect_grants == NULL)))
+		     (rinfo->shadow[i].indirect_grants == NULL)))
 			goto out_of_memory;
-		sg_init_table(info->shadow[i].sg, segs);
+		sg_init_table(rinfo->shadow[i].sg, segs);
 	}
 
 
@@ -1704,16 +1807,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 
 out_of_memory:
 	for (i = 0; i < BLK_RING_SIZE; i++) {
-		kfree(info->shadow[i].grants_used);
-		info->shadow[i].grants_used = NULL;
-		kfree(info->shadow[i].sg);
-		info->shadow[i].sg = NULL;
-		kfree(info->shadow[i].indirect_grants);
-		info->shadow[i].indirect_grants = NULL;
-	}
-	if (!list_empty(&info->indirect_pages)) {
+		kfree(rinfo->shadow[i].grants_used);
+		rinfo->shadow[i].grants_used = NULL;
+		kfree(rinfo->shadow[i].sg);
+		rinfo->shadow[i].sg = NULL;
+		kfree(rinfo->shadow[i].indirect_grants);
+		rinfo->shadow[i].indirect_grants = NULL;
+	}
+	if (!list_empty(&rinfo->indirect_pages)) {
 		struct page *indirect_page, *n;
-		list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
+		list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
 			list_del(&indirect_page->lru);
 			__free_page(indirect_page);
 		}
@@ -1732,7 +1835,8 @@ static void blkfront_connect(struct blkfront_info *info)
 	unsigned long flags;
 	unsigned int physical_sector_size;
 	unsigned int binfo;
-	int err;
+	unsigned int segs;
+	int i, err;
 	int barrier, flush, discard, persistent;
 
 	switch (info->connected) {
@@ -1836,11 +1940,14 @@ static void blkfront_connect(struct blkfront_info *info)
 	else
 		info->feature_persistent = persistent;
 
-	err = blkfront_setup_indirect(info);
-	if (err) {
-		xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
-				 info->xbdev->otherend);
-		return;
+	segs = blkfront_gather_indirect(info);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		err = blkfront_setup_indirect(&info->rinfo[i], segs);
+		if (err) {
+			xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
+					 info->xbdev->otherend);
+			return;
+		}
 	}
 
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
@@ -1853,11 +1960,14 @@ static void blkfront_connect(struct blkfront_info *info)
 
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
-	/* Kick pending requests. */
-	spin_lock_irqsave(&info->io_lock, flags);
 	info->connected = BLKIF_STATE_CONNECTED;
-	kick_pending_request_queues(info, &flags);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+
+	/* Kick pending requests. */
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		kick_pending_request_queues(&info->rinfo[i], &flags);
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
+	}
 
 	add_disk(info->gd);
 
-- 
2.1.0



* [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
  2014-09-11 23:57 ` [PATCH RFC v2 1/5] xen, blkfront: port to the the multi-queue block layer API Arianna Avanzini
  2014-09-11 23:57 ` Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-09-11 23:57 ` Arianna Avanzini
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, axboe, felipe.franciosi, avanzini.arianna

This commit introduces actual support for multiple block rings in
xen-blkfront. The number of block rings to be used is still forced
to one.
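
To make the per-ring split easier to follow, here is a minimal,
self-contained userspace sketch of the idea: per-device state keeps only
an array of per-ring structures, and the hardware-queue index chosen by
blk-mq selects which ring a request is issued on. The names ring_info,
dev_info and queue_request are illustrative stand-ins for the patch's
blkfront_ring_info, blkfront_info and blkfront_queue_rq(); this is only
a model of the data layout, not driver code, and the values are
placeholders.

#include <stdio.h>
#include <stdlib.h>

/* Forward declaration so the back-pointer below compiles cleanly. */
struct dev_info;

struct ring_info {              /* stands in for struct blkfront_ring_info */
	unsigned int index;     /* stands in for hctx_index */
	unsigned int ring_ref;
	unsigned int evtchn;
	struct dev_info *dev;   /* back-pointer, like rinfo->info */
};

struct dev_info {               /* stands in for struct blkfront_info */
	unsigned int nr_rings;
	struct ring_info *rinfo;
};

/* Like blkfront_queue_rq(): the hardware-queue index selects the ring. */
static void queue_request(struct dev_info *dev, unsigned int hctx_index,
			  int request_id)
{
	struct ring_info *rinfo = &dev->rinfo[hctx_index];

	printf("request %d -> ring %u (ring-ref %u, event-channel %u)\n",
	       request_id, rinfo->index, rinfo->ring_ref, rinfo->evtchn);
}

int main(void)
{
	struct dev_info dev = { .nr_rings = 4 };
	unsigned int i;

	dev.rinfo = calloc(dev.nr_rings, sizeof(*dev.rinfo));
	if (!dev.rinfo)
		return 1;

	for (i = 0; i < dev.nr_rings; i++) {
		dev.rinfo[i].index = i;
		dev.rinfo[i].ring_ref = 100 + i;  /* placeholder values */
		dev.rinfo[i].evtchn = 200 + i;
		dev.rinfo[i].dev = &dev;
	}

	/* Requests on different hardware queues land on different rings. */
	queue_request(&dev, 0, 1);
	queue_request(&dev, 3, 2);

	free(dev.rinfo);
	return 0;
}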

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkfront.c | 710 +++++++++++++++++++++++++------------------
 1 file changed, 410 insertions(+), 300 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 109add6..9282df1 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -102,30 +102,44 @@ MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default
 #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
 
 /*
+ * Data structure keeping per-ring info. A blkfront_info structure is always
+ * associated with one or more blkfront_ring_info.
+ */
+struct blkfront_ring_info
+{
+	spinlock_t io_lock;
+	int ring_ref;
+	struct blkif_front_ring ring;
+	unsigned int evtchn, irq;
+	struct blk_shadow shadow[BLK_RING_SIZE];
+	unsigned long shadow_free;
+
+	struct work_struct work;
+	struct gnttab_free_callback callback;
+	struct list_head grants;
+	struct list_head indirect_pages;
+	unsigned int persistent_gnts_c;
+
+	struct blkfront_info *info;
+	unsigned int hctx_index;
+};
+
+/*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
  * hang in private_data off the gendisk structure. We may end up
  * putting all kinds of interesting stuff here :-)
  */
 struct blkfront_info
 {
-	spinlock_t io_lock;
 	struct mutex mutex;
 	struct xenbus_device *xbdev;
 	struct gendisk *gd;
 	int vdevice;
 	blkif_vdev_t handle;
 	enum blkif_state connected;
-	int ring_ref;
-	struct blkif_front_ring ring;
-	unsigned int evtchn, irq;
+	unsigned int nr_rings;
+	struct blkfront_ring_info *rinfo;
 	struct request_queue *rq;
-	struct work_struct work;
-	struct gnttab_free_callback callback;
-	struct blk_shadow shadow[BLK_RING_SIZE];
-	struct list_head grants;
-	struct list_head indirect_pages;
-	unsigned int persistent_gnts_c;
-	unsigned long shadow_free;
 	unsigned int feature_flush;
 	unsigned int flush_op;
 	unsigned int feature_discard:1;
@@ -169,32 +183,35 @@ static DEFINE_SPINLOCK(minor_lock);
 #define INDIRECT_GREFS(_segs) \
 	((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
 
-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_gather_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo,
+				   unsigned int segs);
 
-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
-	unsigned long free = info->shadow_free;
+	unsigned long free = rinfo->shadow_free;
 	BUG_ON(free >= BLK_RING_SIZE);
-	info->shadow_free = info->shadow[free].req.u.rw.id;
-	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
+	rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+	rinfo->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
 	return free;
 }
 
-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
 			       unsigned long id)
 {
-	if (info->shadow[id].req.u.rw.id != id)
+	if (rinfo->shadow[id].req.u.rw.id != id)
 		return -EINVAL;
-	if (info->shadow[id].request == NULL)
+	if (rinfo->shadow[id].request == NULL)
 		return -EINVAL;
-	info->shadow[id].req.u.rw.id  = info->shadow_free;
-	info->shadow[id].request = NULL;
-	info->shadow_free = id;
+	rinfo->shadow[id].req.u.rw.id  = rinfo->shadow_free;
+	rinfo->shadow[id].request = NULL;
+	rinfo->shadow_free = id;
 	return 0;
 }
 
-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
 {
+	struct blkfront_info *info = rinfo->info;
 	struct page *granted_page;
 	struct grant *gnt_list_entry, *n;
 	int i = 0;
@@ -214,7 +231,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
 		}
 
 		gnt_list_entry->gref = GRANT_INVALID_REF;
-		list_add(&gnt_list_entry->node, &info->grants);
+		list_add(&gnt_list_entry->node, &rinfo->grants);
 		i++;
 	}
 
@@ -222,7 +239,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
 
 out_of_memory:
 	list_for_each_entry_safe(gnt_list_entry, n,
-	                         &info->grants, node) {
+	                         &rinfo->grants, node) {
 		list_del(&gnt_list_entry->node);
 		if (info->feature_persistent)
 			__free_page(pfn_to_page(gnt_list_entry->pfn));
@@ -235,31 +252,31 @@ out_of_memory:
 
 static struct grant *get_grant(grant_ref_t *gref_head,
                                unsigned long pfn,
-                               struct blkfront_info *info)
+                               struct blkfront_ring_info *rinfo)
 {
 	struct grant *gnt_list_entry;
 	unsigned long buffer_mfn;
 
-	BUG_ON(list_empty(&info->grants));
-	gnt_list_entry = list_first_entry(&info->grants, struct grant,
+	BUG_ON(list_empty(&rinfo->grants));
+	gnt_list_entry = list_first_entry(&rinfo->grants, struct grant,
 	                                  node);
 	list_del(&gnt_list_entry->node);
 
 	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
-		info->persistent_gnts_c--;
+		rinfo->persistent_gnts_c--;
 		return gnt_list_entry;
 	}
 
 	/* Assign a gref to this page */
 	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
 	BUG_ON(gnt_list_entry->gref == -ENOSPC);
-	if (!info->feature_persistent) {
+	if (!rinfo->info->feature_persistent) {
 		BUG_ON(!pfn);
 		gnt_list_entry->pfn = pfn;
 	}
 	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
 	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
-	                                info->xbdev->otherend_id,
+	                                rinfo->info->xbdev->otherend_id,
 	                                buffer_mfn, 0);
 	return gnt_list_entry;
 }
@@ -330,8 +347,8 @@ static void xlbd_release_minors(unsigned int minor, unsigned int nr)
 
 static void blkif_restart_queue_callback(void *arg)
 {
-	struct blkfront_info *info = (struct blkfront_info *)arg;
-	schedule_work(&info->work);
+	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)arg;
+	schedule_work(&rinfo->work);
 }
 
 static int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
@@ -388,10 +405,13 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
  * and writes are handled as expected.
  *
  * @req: a request struct
+ * @ring_idx: index of the ring the request is to be inserted in
  */
-static int blkif_queue_request(struct request *req)
+static int blkif_queue_request(struct request *req, unsigned int ring_idx)
 {
 	struct blkfront_info *info = req->rq_disk->private_data;
+	struct blkfront_ring_info *rinfo = &info->rinfo[ring_idx];
+	struct blkif_front_ring *ring = &info->rinfo[ring_idx].ring;
 	struct blkif_request *ring_req;
 	unsigned long id;
 	unsigned int fsect, lsect;
@@ -421,15 +441,15 @@ static int blkif_queue_request(struct request *req)
 		max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
 
 	/* Check if we have enough grants to allocate a requests */
-	if (info->persistent_gnts_c < max_grefs) {
+	if (rinfo->persistent_gnts_c < max_grefs) {
 		new_persistent_gnts = 1;
 		if (gnttab_alloc_grant_references(
-		    max_grefs - info->persistent_gnts_c,
+		    max_grefs - rinfo->persistent_gnts_c,
 		    &gref_head) < 0) {
 			gnttab_request_free_callback(
-				&info->callback,
+				&rinfo->callback,
 				blkif_restart_queue_callback,
-				info,
+				rinfo,
 				max_grefs);
 			return 1;
 		}
@@ -437,9 +457,9 @@ static int blkif_queue_request(struct request *req)
 		new_persistent_gnts = 0;
 
 	/* Fill out a communications ring structure. */
-	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
-	id = get_id_from_freelist(info);
-	info->shadow[id].request = req;
+	ring_req = RING_GET_REQUEST(ring, ring->req_prod_pvt);
+	id = get_id_from_freelist(rinfo);
+	rinfo->shadow[id].request = req;
 
 	if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
 		ring_req->operation = BLKIF_OP_DISCARD;
@@ -455,7 +475,7 @@ static int blkif_queue_request(struct request *req)
 		       req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
 		BUG_ON(info->max_indirect_segments &&
 		       req->nr_phys_segments > info->max_indirect_segments);
-		nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
+		nseg = blk_rq_map_sg(req->q, req, rinfo->shadow[id].sg);
 		ring_req->u.rw.id = id;
 		if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			/*
@@ -486,7 +506,7 @@ static int blkif_queue_request(struct request *req)
 			}
 			ring_req->u.rw.nr_segments = nseg;
 		}
-		for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+		for_each_sg(rinfo->shadow[id].sg, sg, nseg, i) {
 			fsect = sg->offset >> 9;
 			lsect = fsect + (sg->length >> 9) - 1;
 
@@ -502,22 +522,22 @@ static int blkif_queue_request(struct request *req)
 					struct page *indirect_page;
 
 					/* Fetch a pre-allocated page to use for indirect grefs */
-					BUG_ON(list_empty(&info->indirect_pages));
-					indirect_page = list_first_entry(&info->indirect_pages,
+					BUG_ON(list_empty(&rinfo->indirect_pages));
+					indirect_page = list_first_entry(&rinfo->indirect_pages,
 					                                 struct page, lru);
 					list_del(&indirect_page->lru);
 					pfn = page_to_pfn(indirect_page);
 				}
-				gnt_list_entry = get_grant(&gref_head, pfn, info);
-				info->shadow[id].indirect_grants[n] = gnt_list_entry;
+				gnt_list_entry = get_grant(&gref_head, pfn, rinfo);
+				rinfo->shadow[id].indirect_grants[n] = gnt_list_entry;
 				segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
 				ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
 			}
 
-			gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
+			gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), rinfo);
 			ref = gnt_list_entry->gref;
 
-			info->shadow[id].grants_used[i] = gnt_list_entry;
+			rinfo->shadow[id].grants_used[i] = gnt_list_entry;
 
 			if (rq_data_dir(req) && info->feature_persistent) {
 				char *bvec_data;
@@ -563,10 +583,10 @@ static int blkif_queue_request(struct request *req)
 			kunmap_atomic(segments);
 	}
 
-	info->ring.req_prod_pvt++;
+	ring->req_prod_pvt++;
 
 	/* Keep a private copy so we can reissue requests when recovering. */
-	info->shadow[id].req = *ring_req;
+	rinfo->shadow[id].req = *ring_req;
 
 	if (new_persistent_gnts)
 		gnttab_free_grant_references(gref_head);
@@ -575,22 +595,26 @@ static int blkif_queue_request(struct request *req)
 }
 
 
-static inline void flush_requests(struct blkfront_info *info)
+static inline void flush_requests(struct blkfront_info *info,
+				  unsigned int ring_idx)
 {
+	struct blkfront_ring_info *rinfo = &info->rinfo[ring_idx];
 	int notify;
 
-	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
+	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&rinfo->ring, notify);
 
 	if (notify)
-		notify_remote_via_irq(info->irq);
+		notify_remote_via_irq(rinfo->irq);
 }
 
 static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
-	struct blkfront_info *info = req->rq_disk->private_data;
+	struct blkfront_ring_info *rinfo =
+			(struct blkfront_ring_info *)hctx->driver_data;
+	struct blkfront_info *info = rinfo->info;
 
-	spin_lock_irq(&info->io_lock);
-	if (RING_FULL(&info->ring))
+	spin_lock_irq(&rinfo->io_lock);
+	if (RING_FULL(&rinfo->ring))
 		goto wait;
 
 	if ((req->cmd_type != REQ_TYPE_FS) ||
@@ -598,28 +622,40 @@ static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 			 !info->flush_op)) {
 		req->errors = -EIO;
 		blk_mq_complete_request(req);
-		spin_unlock_irq(&info->io_lock);
+		spin_unlock_irq(&rinfo->io_lock);
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (blkif_queue_request(req)) {
+	if (blkif_queue_request(req, rinfo->hctx_index)) {
 		blk_mq_requeue_request(req);
 		goto wait;
 	}
 
-	flush_requests(info);
-	spin_unlock_irq(&info->io_lock);
+	flush_requests(info, rinfo->hctx_index);
+	spin_unlock_irq(&rinfo->io_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
 
 wait:
 	/* Avoid pointless unplugs. */
 	blk_mq_stop_hw_queue(hctx);
-	spin_unlock_irq(&info->io_lock);
+	spin_unlock_irq(&rinfo->io_lock);
 	return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
+static int blkfront_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+			      unsigned int index)
+{
+	struct blkfront_info *info = (struct blkfront_info *)data;
+
+	hctx->driver_data = &info->rinfo[index];
+	info->rinfo[index].hctx_index = index;
+
+	return 0;
+}
+
 static struct blk_mq_ops blkfront_mq_ops = {
 	.queue_rq = blkfront_queue_rq,
+	.init_hctx = blkfront_init_hctx,
 	.map_queue = blk_mq_map_queue,
 };
 
@@ -870,21 +906,23 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 {
 	unsigned int minor, nr_minors;
 	unsigned long flags;
+	int i;
 
 	if (info->rq == NULL)
 		return;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	/* No more blkif_request() and gnttab callback work. */
 
-	/* No more blkif_request(). */
 	blk_mq_stop_hw_queues(info->rq);
-
-	/* No more gnttab callback work. */
-	gnttab_cancel_free_callback(&info->callback);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		gnttab_cancel_free_callback(&info->rinfo[i].callback);
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
+	}
 
 	/* Flush gnttab callback work. Must be done with no locks held. */
-	flush_work(&info->work);
+	for (i = 0 ; i < info->nr_rings ; i++)
+		flush_work(&info->rinfo[i].work);
 
 	del_gendisk(info->gd);
 
@@ -900,92 +938,55 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	info->gd = NULL;
 }
 
-static void kick_pending_request_queues(struct blkfront_info *info,
+static void kick_pending_request_queues(struct blkfront_ring_info *rinfo,
 					unsigned long *flags)
 {
-	if (!RING_FULL(&info->ring)) {
-		spin_unlock_irqrestore(&info->io_lock, *flags);
-		blk_mq_start_stopped_hw_queues(info->rq, 0);
-		spin_lock_irqsave(&info->io_lock, *flags);
+	if (!RING_FULL(&rinfo->ring)) {
+		spin_unlock_irqrestore(&rinfo->io_lock, *flags);
+		blk_mq_start_stopped_hw_queues(rinfo->info->rq, 0);
+		spin_lock_irqsave(&rinfo->io_lock, *flags);
 	}
 }
 
 static void blkif_restart_queue(struct work_struct *work)
 {
-	struct blkfront_info *info = container_of(work, struct blkfront_info, work);
+	struct blkfront_ring_info *rinfo = container_of(work,
+				struct blkfront_ring_info, work);
+	struct blkfront_info *info = rinfo->info;
 	unsigned long flags;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	spin_lock_irqsave(&rinfo->io_lock, flags);
 	if (info->connected == BLKIF_STATE_CONNECTED)
-		kick_pending_request_queues(info, &flags);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+		kick_pending_request_queues(rinfo, &flags);
+	spin_unlock_irqrestore(&rinfo->io_lock, flags);
 }
 
-static void blkif_free(struct blkfront_info *info, int suspend)
+static void blkif_free_ring(struct blkfront_ring_info *rinfo,
+			    int persistent)
 {
 	struct grant *persistent_gnt;
-	struct grant *n;
 	int i, j, segs;
 
-	/* Prevent new requests being issued until we fix things up. */
-	spin_lock_irq(&info->io_lock);
-	info->connected = suspend ?
-		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
-	/* No more blkif_request(). */
-	if (info->rq)
-		blk_mq_stop_hw_queues(info->rq);
-
-	/* Remove all persistent grants */
-	if (!list_empty(&info->grants)) {
-		list_for_each_entry_safe(persistent_gnt, n,
-		                         &info->grants, node) {
-			list_del(&persistent_gnt->node);
-			if (persistent_gnt->gref != GRANT_INVALID_REF) {
-				gnttab_end_foreign_access(persistent_gnt->gref,
-				                          0, 0UL);
-				info->persistent_gnts_c--;
-			}
-			if (info->feature_persistent)
-				__free_page(pfn_to_page(persistent_gnt->pfn));
-			kfree(persistent_gnt);
-		}
-	}
-	BUG_ON(info->persistent_gnts_c != 0);
-
-	/*
-	 * Remove indirect pages, this only happens when using indirect
-	 * descriptors but not persistent grants
-	 */
-	if (!list_empty(&info->indirect_pages)) {
-		struct page *indirect_page, *n;
-
-		BUG_ON(info->feature_persistent);
-		list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
-			list_del(&indirect_page->lru);
-			__free_page(indirect_page);
-		}
-	}
-
 	for (i = 0; i < BLK_RING_SIZE; i++) {
 		/*
 		 * Clear persistent grants present in requests already
 		 * on the shared ring
 		 */
-		if (!info->shadow[i].request)
+		if (!rinfo->shadow[i].request)
 			goto free_shadow;
 
-		segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
-		       info->shadow[i].req.u.indirect.nr_segments :
-		       info->shadow[i].req.u.rw.nr_segments;
+		segs = rinfo->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+		       rinfo->shadow[i].req.u.indirect.nr_segments :
+		       rinfo->shadow[i].req.u.rw.nr_segments;
 		for (j = 0; j < segs; j++) {
-			persistent_gnt = info->shadow[i].grants_used[j];
+			persistent_gnt = rinfo->shadow[i].grants_used[j];
 			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
-			if (info->feature_persistent)
+			if (persistent)
 				__free_page(pfn_to_page(persistent_gnt->pfn));
 			kfree(persistent_gnt);
 		}
 
-		if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+		if (rinfo->shadow[i].req.operation != BLKIF_OP_INDIRECT)
 			/*
 			 * If this is not an indirect operation don't try to
 			 * free indirect segments
@@ -993,44 +994,101 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 			goto free_shadow;
 
 		for (j = 0; j < INDIRECT_GREFS(segs); j++) {
-			persistent_gnt = info->shadow[i].indirect_grants[j];
+			persistent_gnt = rinfo->shadow[i].indirect_grants[j];
 			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
 			__free_page(pfn_to_page(persistent_gnt->pfn));
 			kfree(persistent_gnt);
 		}
 
 free_shadow:
-		kfree(info->shadow[i].grants_used);
-		info->shadow[i].grants_used = NULL;
-		kfree(info->shadow[i].indirect_grants);
-		info->shadow[i].indirect_grants = NULL;
-		kfree(info->shadow[i].sg);
-		info->shadow[i].sg = NULL;
+		kfree(rinfo->shadow[i].grants_used);
+		rinfo->shadow[i].grants_used = NULL;
+		kfree(rinfo->shadow[i].indirect_grants);
+		rinfo->shadow[i].indirect_grants = NULL;
+		kfree(rinfo->shadow[i].sg);
+		rinfo->shadow[i].sg = NULL;
 	}
 
-	/* No more gnttab callback work. */
-	gnttab_cancel_free_callback(&info->callback);
-	spin_unlock_irq(&info->io_lock);
+}
 
-	/* Flush gnttab callback work. Must be done with no locks held. */
-	flush_work(&info->work);
+static void blkif_free(struct blkfront_info *info, int suspend)
+{
+	struct grant *persistent_gnt;
+	struct grant *n;
+	int i;
 
-	/* Free resources associated with old device channel. */
-	if (info->ring_ref != GRANT_INVALID_REF) {
-		gnttab_end_foreign_access(info->ring_ref, 0,
-					  (unsigned long)info->ring.sring);
-		info->ring_ref = GRANT_INVALID_REF;
-		info->ring.sring = NULL;
-	}
-	if (info->irq)
-		unbind_from_irqhandler(info->irq, info);
-	info->evtchn = info->irq = 0;
+	info->connected = suspend ?
+		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
+
+	/*
+	 * Prevent new requests being issued until we fix things up:
+	 * no more blkif_request() and no more gnttab callback work.
+	 */
+	if (info->rq) {
+		blk_mq_stop_hw_queues(info->rq);
+
+		for (i = 0 ; i < info->nr_rings ; i++) {
+			struct blkfront_ring_info *rinfo = &info->rinfo[i];
+
+			spin_lock_irq(&info->rinfo[i].io_lock);
+			/* Remove all persistent grants */
+			if (!list_empty(&rinfo->grants)) {
+				list_for_each_entry_safe(persistent_gnt, n,
+				                         &rinfo->grants, node) {
+					list_del(&persistent_gnt->node);
+					if (persistent_gnt->gref != GRANT_INVALID_REF) {
+						gnttab_end_foreign_access(persistent_gnt->gref,
+						                          0, 0UL);
+						rinfo->persistent_gnts_c--;
+					}
+					if (info->feature_persistent)
+						__free_page(pfn_to_page(persistent_gnt->pfn));
+					kfree(persistent_gnt);
+				}
+			}
+			BUG_ON(rinfo->persistent_gnts_c != 0);
+
+			/*
+			 * Remove indirect pages, this only happens when using indirect
+			 * descriptors but not persistent grants
+			 */
+			if (!list_empty(&rinfo->indirect_pages)) {
+				struct page *indirect_page, *n;
+
+				BUG_ON(info->feature_persistent);
+				list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
+					list_del(&indirect_page->lru);
+					__free_page(indirect_page);
+				}
+			}
 
+			blkif_free_ring(rinfo, info->feature_persistent);
+
+			gnttab_cancel_free_callback(&rinfo->callback);
+			spin_unlock_irq(&rinfo->io_lock);
+
+			/* Flush gnttab callback work. Must be done with no locks held. */
+			flush_work(&info->rinfo[i].work);
+
+			/* Free resources associated with old device channel. */
+			if (rinfo->ring_ref != GRANT_INVALID_REF) {
+				gnttab_end_foreign_access(rinfo->ring_ref, 0,
+							  (unsigned long)rinfo->ring.sring);
+				rinfo->ring_ref = GRANT_INVALID_REF;
+				rinfo->ring.sring = NULL;
+			}
+			if (rinfo->irq)
+				unbind_from_irqhandler(rinfo->irq, rinfo);
+			rinfo->evtchn = rinfo->irq = 0;
+		}
+	}
 }
 
-static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+static void blkif_completion(struct blk_shadow *s,
+			     struct blkfront_ring_info *rinfo,
 			     struct blkif_response *bret)
 {
+	struct blkfront_info *info = rinfo->info;
 	int i = 0;
 	struct scatterlist *sg;
 	char *bvec_data;
@@ -1071,8 +1129,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 			if (!info->feature_persistent)
 				pr_alert_ratelimited("backed has not unmapped grant: %u\n",
 						     s->grants_used[i]->gref);
-			list_add(&s->grants_used[i]->node, &info->grants);
-			info->persistent_gnts_c++;
+			list_add(&s->grants_used[i]->node, &rinfo->grants);
+			rinfo->persistent_gnts_c++;
 		} else {
 			/*
 			 * If the grant is not mapped by the backend we end the
@@ -1082,7 +1140,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 			 */
 			gnttab_end_foreign_access(s->grants_used[i]->gref, 0, 0UL);
 			s->grants_used[i]->gref = GRANT_INVALID_REF;
-			list_add_tail(&s->grants_used[i]->node, &info->grants);
+			list_add_tail(&s->grants_used[i]->node, &rinfo->grants);
 		}
 	}
 	if (s->req.operation == BLKIF_OP_INDIRECT) {
@@ -1091,8 +1149,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 				if (!info->feature_persistent)
 					pr_alert_ratelimited("backed has not unmapped grant: %u\n",
 							     s->indirect_grants[i]->gref);
-				list_add(&s->indirect_grants[i]->node, &info->grants);
-				info->persistent_gnts_c++;
+				list_add(&s->indirect_grants[i]->node, &rinfo->grants);
+				rinfo->persistent_gnts_c++;
 			} else {
 				struct page *indirect_page;
 
@@ -1102,9 +1160,9 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 				 * available pages for indirect grefs.
 				 */
 				indirect_page = pfn_to_page(s->indirect_grants[i]->pfn);
-				list_add(&indirect_page->lru, &info->indirect_pages);
+				list_add(&indirect_page->lru, &rinfo->indirect_pages);
 				s->indirect_grants[i]->gref = GRANT_INVALID_REF;
-				list_add_tail(&s->indirect_grants[i]->node, &info->grants);
+				list_add_tail(&s->indirect_grants[i]->node, &rinfo->grants);
 			}
 		}
 	}
@@ -1116,24 +1174,25 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	struct blkif_response *bret;
 	RING_IDX i, rp;
 	unsigned long flags;
-	struct blkfront_info *info = (struct blkfront_info *)dev_id;
+	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
+	struct blkfront_info *info = rinfo->info;
 	int error;
 
-	spin_lock_irqsave(&info->io_lock, flags);
+	spin_lock_irqsave(&rinfo->io_lock, flags);
 
 	if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
-		spin_unlock_irqrestore(&info->io_lock, flags);
+		spin_unlock_irqrestore(&rinfo->io_lock, flags);
 		return IRQ_HANDLED;
 	}
 
  again:
-	rp = info->ring.sring->rsp_prod;
+	rp = rinfo->ring.sring->rsp_prod;
 	rmb(); /* Ensure we see queued responses up to 'rp'. */
 
-	for (i = info->ring.rsp_cons; i != rp; i++) {
+	for (i = rinfo->ring.rsp_cons; i != rp; i++) {
 		unsigned long id;
 
-		bret = RING_GET_RESPONSE(&info->ring, i);
+		bret = RING_GET_RESPONSE(&rinfo->ring, i);
 		id   = bret->id;
 		/*
 		 * The backend has messed up and given us an id that we would
@@ -1147,12 +1206,12 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 			 * the id is busted. */
 			continue;
 		}
-		req  = info->shadow[id].request;
+		req  = rinfo->shadow[id].request;
 
 		if (bret->operation != BLKIF_OP_DISCARD)
-			blkif_completion(&info->shadow[id], info, bret);
+			blkif_completion(&rinfo->shadow[id], rinfo, bret);
 
-		if (add_id_to_freelist(info, id)) {
+		if (add_id_to_freelist(rinfo, id)) {
 			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
 			     info->gd->disk_name, op_name(bret->operation), id);
 			continue;
@@ -1181,7 +1240,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				error = req->errors = -EOPNOTSUPP;
 			}
 			if (unlikely(bret->status == BLKIF_RSP_ERROR &&
-				     info->shadow[id].req.u.rw.nr_segments == 0)) {
+				     rinfo->shadow[id].req.u.rw.nr_segments == 0)) {
 				printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
 				       info->gd->disk_name, op_name(bret->operation));
 				error = req->errors = -EOPNOTSUPP;
@@ -1207,31 +1266,31 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 		}
 	}
 
-	info->ring.rsp_cons = i;
+	rinfo->ring.rsp_cons = i;
 
-	if (i != info->ring.req_prod_pvt) {
+	if (i != rinfo->ring.req_prod_pvt) {
 		int more_to_do;
-		RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
+		RING_FINAL_CHECK_FOR_RESPONSES(&rinfo->ring, more_to_do);
 		if (more_to_do)
 			goto again;
 	} else
-		info->ring.sring->rsp_event = i + 1;
+		rinfo->ring.sring->rsp_event = i + 1;
 
-	blkif_restart_queue_callback(info);
+	blkif_restart_queue_callback(rinfo);
 
-	spin_unlock_irqrestore(&info->io_lock, flags);
+	spin_unlock_irqrestore(&rinfo->io_lock, flags);
 
 	return IRQ_HANDLED;
 }
 
 
 static int setup_blkring(struct xenbus_device *dev,
-			 struct blkfront_info *info)
+			 struct blkfront_ring_info *rinfo)
 {
 	struct blkif_sring *sring;
 	int err;
 
-	info->ring_ref = GRANT_INVALID_REF;
+	rinfo->ring_ref = GRANT_INVALID_REF;
 
 	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO | __GFP_HIGH);
 	if (!sring) {
@@ -1239,32 +1298,32 @@ static int setup_blkring(struct xenbus_device *dev,
 		return -ENOMEM;
 	}
 	SHARED_RING_INIT(sring);
-	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
+	FRONT_RING_INIT(&rinfo->ring, sring, PAGE_SIZE);
 
-	err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
+	err = xenbus_grant_ring(dev, virt_to_mfn(rinfo->ring.sring));
 	if (err < 0) {
 		free_page((unsigned long)sring);
-		info->ring.sring = NULL;
+		rinfo->ring.sring = NULL;
 		goto fail;
 	}
-	info->ring_ref = err;
+	rinfo->ring_ref = err;
 
-	err = xenbus_alloc_evtchn(dev, &info->evtchn);
+	err = xenbus_alloc_evtchn(dev, &rinfo->evtchn);
 	if (err)
 		goto fail;
 
-	err = bind_evtchn_to_irqhandler(info->evtchn, blkif_interrupt, 0,
-					"blkif", info);
+	err = bind_evtchn_to_irqhandler(rinfo->evtchn, blkif_interrupt, 0,
+					"blkif", rinfo);
 	if (err <= 0) {
 		xenbus_dev_fatal(dev, err,
 				 "bind_evtchn_to_irqhandler failed");
 		goto fail;
 	}
-	info->irq = err;
+	rinfo->irq = err;
 
 	return 0;
 fail:
-	blkif_free(info, 0);
+	blkif_free(rinfo->info, 0);
 	return err;
 }
 
@@ -1274,13 +1333,16 @@ static int talk_to_blkback(struct xenbus_device *dev,
 			   struct blkfront_info *info)
 {
 	const char *message = NULL;
+	char ring_ref_s[64] = "", evtchn_s[64] = "";
 	struct xenbus_transaction xbt;
-	int err;
+	int i, err;
 
-	/* Create shared ring, alloc event channel. */
-	err = setup_blkring(dev, info);
-	if (err)
-		goto out;
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		/* Create shared ring, alloc event channel. */
+		err = setup_blkring(dev, &info->rinfo[i]);
+		if (err)
+			goto out;
+	}
 
 again:
 	err = xenbus_transaction_start(&xbt);
@@ -1289,18 +1351,24 @@ again:
 		goto destroy_blkring;
 	}
 
-	err = xenbus_printf(xbt, dev->nodename,
-			    "ring-ref", "%u", info->ring_ref);
-	if (err) {
-		message = "writing ring-ref";
-		goto abort_transaction;
-	}
-	err = xenbus_printf(xbt, dev->nodename,
-			    "event-channel", "%u", info->evtchn);
-	if (err) {
-		message = "writing event-channel";
-		goto abort_transaction;
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		BUG_ON(i > 0);
+		snprintf(ring_ref_s, 64, "ring-ref");
+		snprintf(evtchn_s, 64, "event-channel");
+		err = xenbus_printf(xbt, dev->nodename,
+				    ring_ref_s, "%u", info->rinfo[i].ring_ref);
+		if (err) {
+			message = "writing ring-ref";
+			goto abort_transaction;
+		}
+		err = xenbus_printf(xbt, dev->nodename,
+				    evtchn_s, "%u", info->rinfo[i].evtchn);
+		if (err) {
+			message = "writing event-channel";
+			goto abort_transaction;
+		}
 	}
+
 	err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
 			    XEN_IO_PROTO_ABI_NATIVE);
 	if (err) {
@@ -1344,7 +1412,7 @@ again:
 static int blkfront_probe(struct xenbus_device *dev,
 			  const struct xenbus_device_id *id)
 {
-	int err, vdevice, i;
+	int err, vdevice, i, r;
 	struct blkfront_info *info;
 
 	/* FIXME: Use dynamic device id if this is not set. */
@@ -1396,23 +1464,36 @@ static int blkfront_probe(struct xenbus_device *dev,
 	}
 
 	mutex_init(&info->mutex);
-	spin_lock_init(&info->io_lock);
 	info->xbdev = dev;
 	info->vdevice = vdevice;
-	INIT_LIST_HEAD(&info->grants);
-	INIT_LIST_HEAD(&info->indirect_pages);
-	info->persistent_gnts_c = 0;
 	info->connected = BLKIF_STATE_DISCONNECTED;
-	INIT_WORK(&info->work, blkif_restart_queue);
-
-	for (i = 0; i < BLK_RING_SIZE; i++)
-		info->shadow[i].req.u.rw.id = i+1;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
 
 	/* Front end dir is a number, which is used as the id. */
 	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
 	dev_set_drvdata(&dev->dev, info);
 
+	/* Allocate the correct number of rings. */
+	info->nr_rings = 1;
+	pr_info("blkfront: %s: %d rings\n",
+		info->gd->disk_name, info->nr_rings);
+
+	info->rinfo = kzalloc(info->nr_rings *
+				sizeof(struct blkfront_ring_info),
+			      GFP_KERNEL);
+	for (r = 0 ; r < info->nr_rings ; r++) {
+		struct blkfront_ring_info *rinfo = &info->rinfo[r];
+
+		rinfo->info = info;
+		rinfo->persistent_gnts_c = 0;
+		INIT_LIST_HEAD(&rinfo->grants);
+		INIT_LIST_HEAD(&rinfo->indirect_pages);
+		INIT_WORK(&rinfo->work, blkif_restart_queue);
+		spin_lock_init(&rinfo->io_lock);
+		for (i = 0; i < BLK_RING_SIZE; i++)
+			rinfo->shadow[i].req.u.rw.id = i+1;
+		rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+	}
+
 	err = talk_to_blkback(dev, info);
 	if (err) {
 		kfree(info);
@@ -1438,88 +1519,100 @@ static void split_bio_end(struct bio *bio, int error)
 	bio_put(bio);
 }
 
-static int blkif_recover(struct blkfront_info *info)
+static int blkif_setup_shadow(struct blkfront_ring_info *rinfo,
+			      struct blk_shadow **copy)
 {
 	int i;
+
+	/* Stage 1: Make a safe copy of the shadow state. */
+	*copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
+		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
+	if (!*copy)
+		return -ENOMEM;
+
+	/* Stage 2: Set up free list. */
+	memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
+	for (i = 0; i < BLK_RING_SIZE; i++)
+		rinfo->shadow[i].req.u.rw.id = i+1;
+	rinfo->shadow_free = rinfo->ring.req_prod_pvt;
+	rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+
+	return 0;
+}
+
+static int blkif_recover(struct blkfront_info *info)
+{
+	int i, r;
 	struct request *req, *n;
 	struct blk_shadow *copy;
-	int rc;
+	int rc = 0;
 	struct bio *bio, *cloned_bio;
-	struct bio_list bio_list, merge_bio;
+	struct bio_list uninitialized_var(bio_list), merge_bio;
 	unsigned int segs, offset;
 	unsigned long flags;
 	int pending, size;
 	struct split_bio *split_bio;
 	struct list_head requests;
 
-	/* Stage 1: Make a safe copy of the shadow state. */
-	copy = kmemdup(info->shadow, sizeof(info->shadow),
-		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
-	if (!copy)
-		return -ENOMEM;
-
-	/* Stage 2: Set up free list. */
-	memset(&info->shadow, 0, sizeof(info->shadow));
-	for (i = 0; i < BLK_RING_SIZE; i++)
-		info->shadow[i].req.u.rw.id = i+1;
-	info->shadow_free = info->ring.req_prod_pvt;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+	segs = blkfront_gather_indirect(info);
 
-	rc = blkfront_setup_indirect(info);
-	if (rc) {
-		kfree(copy);
-		return rc;
-	}
+	for (r = 0 ; r < info->nr_rings ; r++) {
+		rc |= blkif_setup_shadow(&info->rinfo[r], &copy);
+		rc |= blkfront_setup_indirect(&info->rinfo[r], segs);
+		if (rc) {
+			kfree(copy);
+			return rc;
+		}
 
-	segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-	blk_queue_max_segments(info->rq, segs);
-	bio_list_init(&bio_list);
-	INIT_LIST_HEAD(&requests);
-	for (i = 0; i < BLK_RING_SIZE; i++) {
-		/* Not in use? */
-		if (!copy[i].request)
-			continue;
+		segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
+		blk_queue_max_segments(info->rq, segs);
+		bio_list_init(&bio_list);
+		INIT_LIST_HEAD(&requests);
+		for (i = 0; i < BLK_RING_SIZE; i++) {
+			/* Not in use? */
+			if (!copy[i].request)
+				continue;
 
-		/*
-		 * Get the bios in the request so we can re-queue them.
-		 */
-		if (copy[i].request->cmd_flags &
-		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
 			/*
-			 * Flush operations don't contain bios, so
-			 * we need to requeue the whole request
+			 * Get the bios in the request so we can re-queue them.
 			 */
-			list_add(&copy[i].request->queuelist, &requests);
-			continue;
+			if (copy[i].request->cmd_flags &
+			    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+				/*
+				 * Flush operations don't contain bios, so
+				 * we need to requeue the whole request
+				 */
+				list_add(&copy[i].request->queuelist, &requests);
+				continue;
+			}
+			merge_bio.head = copy[i].request->bio;
+			merge_bio.tail = copy[i].request->biotail;
+			bio_list_merge(&bio_list, &merge_bio);
+			copy[i].request->bio = NULL;
+			blk_put_request(copy[i].request);
 		}
-		merge_bio.head = copy[i].request->bio;
-		merge_bio.tail = copy[i].request->biotail;
-		bio_list_merge(&bio_list, &merge_bio);
-		copy[i].request->bio = NULL;
-		blk_put_request(copy[i].request);
+		kfree(copy);
 	}
 
-	kfree(copy);
-
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
-	spin_lock_irqsave(&info->io_lock, flags);
-
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	/* Kick any other new requests queued since we resumed */
-	kick_pending_request_queues(info, &flags);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		/* Kick any other new requests queued since we resumed */
+		kick_pending_request_queues(&info->rinfo[i], &flags);
 
-	list_for_each_entry_safe(req, n, &requests, queuelist) {
-		/* Requeue pending requests (flush or discard) */
-		list_del_init(&req->queuelist);
-		BUG_ON(req->nr_phys_segments > segs);
-		blk_mq_requeue_request(req);
+		list_for_each_entry_safe(req, n, &requests, queuelist) {
+			/* Requeue pending requests (flush or discard) */
+			list_del_init(&req->queuelist);
+			BUG_ON(req->nr_phys_segments > segs);
+			blk_mq_requeue_request(req);
+		}
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
 	}
 
-	spin_unlock_irqrestore(&info->io_lock, flags);
-
 	while ((bio = bio_list_pop(&bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
 		if (bio_segments(bio) > segs) {
@@ -1643,14 +1736,15 @@ static void blkfront_setup_discard(struct blkfront_info *info)
 		info->feature_secdiscard = !!discard_secure;
 }
 
-static int blkfront_setup_indirect(struct blkfront_info *info)
+
+static int blkfront_gather_indirect(struct blkfront_info *info)
 {
 	unsigned int indirect_segments, segs;
-	int err, i;
+	int err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+				"feature-max-indirect-segments", "%u",
+				&indirect_segments,
+				NULL);
 
-	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-			    "feature-max-indirect-segments", "%u", &indirect_segments,
-			    NULL);
 	if (err) {
 		info->max_indirect_segments = 0;
 		segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
@@ -1660,7 +1754,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		segs = info->max_indirect_segments;
 	}
 
-	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+	return segs;
+}
+
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo,
+				   unsigned int segs)
+{
+	struct blkfront_info *info = rinfo->info;
+	int err, i;
+
+	err = fill_grant_buffer(rinfo, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
 	if (err)
 		goto out_of_memory;
 
@@ -1672,31 +1775,31 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		 */
 		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
 
-		BUG_ON(!list_empty(&info->indirect_pages));
+		BUG_ON(!list_empty(&rinfo->indirect_pages));
 		for (i = 0; i < num; i++) {
 			struct page *indirect_page = alloc_page(GFP_NOIO);
 			if (!indirect_page)
 				goto out_of_memory;
-			list_add(&indirect_page->lru, &info->indirect_pages);
+			list_add(&indirect_page->lru, &rinfo->indirect_pages);
 		}
 	}
 
 	for (i = 0; i < BLK_RING_SIZE; i++) {
-		info->shadow[i].grants_used = kzalloc(
-			sizeof(info->shadow[i].grants_used[0]) * segs,
+		rinfo->shadow[i].grants_used = kzalloc(
+			sizeof(rinfo->shadow[i].grants_used[0]) * segs,
 			GFP_NOIO);
-		info->shadow[i].sg = kzalloc(sizeof(info->shadow[i].sg[0]) * segs, GFP_NOIO);
+		rinfo->shadow[i].sg = kzalloc(sizeof(rinfo->shadow[i].sg[0]) * segs, GFP_NOIO);
 		if (info->max_indirect_segments)
-			info->shadow[i].indirect_grants = kzalloc(
-				sizeof(info->shadow[i].indirect_grants[0]) *
+			rinfo->shadow[i].indirect_grants = kzalloc(
+				sizeof(rinfo->shadow[i].indirect_grants[0]) *
 				INDIRECT_GREFS(segs),
 				GFP_NOIO);
-		if ((info->shadow[i].grants_used == NULL) ||
-			(info->shadow[i].sg == NULL) ||
+		if ((rinfo->shadow[i].grants_used == NULL) ||
+			(rinfo->shadow[i].sg == NULL) ||
 		     (info->max_indirect_segments &&
-		     (info->shadow[i].indirect_grants == NULL)))
+		     (rinfo->shadow[i].indirect_grants == NULL)))
 			goto out_of_memory;
-		sg_init_table(info->shadow[i].sg, segs);
+		sg_init_table(rinfo->shadow[i].sg, segs);
 	}
 
 
@@ -1704,16 +1807,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 
 out_of_memory:
 	for (i = 0; i < BLK_RING_SIZE; i++) {
-		kfree(info->shadow[i].grants_used);
-		info->shadow[i].grants_used = NULL;
-		kfree(info->shadow[i].sg);
-		info->shadow[i].sg = NULL;
-		kfree(info->shadow[i].indirect_grants);
-		info->shadow[i].indirect_grants = NULL;
-	}
-	if (!list_empty(&info->indirect_pages)) {
+		kfree(rinfo->shadow[i].grants_used);
+		rinfo->shadow[i].grants_used = NULL;
+		kfree(rinfo->shadow[i].sg);
+		rinfo->shadow[i].sg = NULL;
+		kfree(rinfo->shadow[i].indirect_grants);
+		rinfo->shadow[i].indirect_grants = NULL;
+	}
+	if (!list_empty(&rinfo->indirect_pages)) {
 		struct page *indirect_page, *n;
-		list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
+		list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
 			list_del(&indirect_page->lru);
 			__free_page(indirect_page);
 		}
@@ -1732,7 +1835,8 @@ static void blkfront_connect(struct blkfront_info *info)
 	unsigned long flags;
 	unsigned int physical_sector_size;
 	unsigned int binfo;
-	int err;
+	unsigned int segs;
+	int i, err;
 	int barrier, flush, discard, persistent;
 
 	switch (info->connected) {
@@ -1836,11 +1940,14 @@ static void blkfront_connect(struct blkfront_info *info)
 	else
 		info->feature_persistent = persistent;
 
-	err = blkfront_setup_indirect(info);
-	if (err) {
-		xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
-				 info->xbdev->otherend);
-		return;
+	segs = blkfront_gather_indirect(info);
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		err = blkfront_setup_indirect(&info->rinfo[i], segs);
+		if (err) {
+			xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
+					 info->xbdev->otherend);
+			return;
+		}
 	}
 
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
@@ -1853,11 +1960,14 @@ static void blkfront_connect(struct blkfront_info *info)
 
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
-	/* Kick pending requests. */
-	spin_lock_irqsave(&info->io_lock, flags);
 	info->connected = BLKIF_STATE_CONNECTED;
-	kick_pending_request_queues(info, &flags);
-	spin_unlock_irqrestore(&info->io_lock, flags);
+
+	/* Kick pending requests. */
+	for (i = 0 ; i < info->nr_rings ; i++) {
+		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
+		kick_pending_request_queues(&info->rinfo[i], &flags);
+		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
+	}
 
 	add_disk(info->gd);
 
-- 
2.1.0


* [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (4 preceding siblings ...)
  2014-09-11 23:57 ` [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-09-12 10:46   ` David Vrabel
                     ` (3 more replies)
  2014-09-11 23:57 ` [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings Arianna Avanzini
                   ` (5 subsequent siblings)
  11 siblings, 4 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

This commit implements the negotiation of the number of block rings
to be used; as a default, the number of rings is decided by the
frontend driver and is equal to the number of hardware queues that
the backend makes available. In case of guest migration towards a
host whose devices expose a different number of hardware queues, the
number of I/O rings used by the frontend driver remains the same;
XenStore keys may vary if the frontend needs to be compatible with
a host not having multi-queue support.
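
As a rough sketch of the negotiation, the XenStore nodes involved look
as follows; the directory names stand for the usual frontend
(dev->nodename) and backend (info->xbdev->otherend) paths, and the
values are only illustrative:

  <backend dir>/
    nr_supported_hw_queues = "4"     advertised by the backend side of
                                     this series, read via
                                     blkfront_gather_hw_queues()

  <frontend dir>/
    nr_blk_rings    = "4"            number of rings the frontend settled on
    ring-ref-0      = "<grant ref>"  one pair per ring when the backend
    event-channel-0 = "<evtchn>"     advertises hardware queues
    ...
    ring-ref        = "<grant ref>"  legacy single-ring keys, written
    event-channel   = "<evtchn>"     instead when nr_supported_hw_queues
                                     is absent

On migration the frontend keeps the ring count it chose at probe time
and, as described above, writes whichever key set the new host
understands.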

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkfront.c | 95 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 84 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 9282df1..77e311d 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -137,7 +137,7 @@ struct blkfront_info
 	int vdevice;
 	blkif_vdev_t handle;
 	enum blkif_state connected;
-	unsigned int nr_rings;
+	unsigned int nr_rings, old_nr_rings;
 	struct blkfront_ring_info *rinfo;
 	struct request_queue *rq;
 	unsigned int feature_flush;
@@ -147,6 +147,7 @@ struct blkfront_info
 	unsigned int discard_granularity;
 	unsigned int discard_alignment;
 	unsigned int feature_persistent:1;
+	unsigned int hardware_queues;
 	unsigned int max_indirect_segments;
 	int is_ready;
 	/* Block layer tags. */
@@ -669,7 +670,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 
 	memset(&info->tag_set, 0, sizeof(info->tag_set));
 	info->tag_set.ops = &blkfront_mq_ops;
-	info->tag_set.nr_hw_queues = 1;
+	info->tag_set.nr_hw_queues = info->hardware_queues ? : 1;
 	info->tag_set.queue_depth = BLK_RING_SIZE;
 	info->tag_set.numa_node = NUMA_NO_NODE;
 	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
@@ -938,6 +939,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	info->gd = NULL;
 }
 
+/* Must be called with io_lock held */
 static void kick_pending_request_queues(struct blkfront_ring_info *rinfo,
 					unsigned long *flags)
 {
@@ -1351,10 +1353,24 @@ again:
 		goto destroy_blkring;
 	}
 
+	/* Advertise the number of rings */
+	err = xenbus_printf(xbt, dev->nodename, "nr_blk_rings",
+			    "%u", info->nr_rings);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "advertising number of rings");
+		goto abort_transaction;
+	}
+
 	for (i = 0 ; i < info->nr_rings ; i++) {
-		BUG_ON(i > 0);
-		snprintf(ring_ref_s, 64, "ring-ref");
-		snprintf(evtchn_s, 64, "event-channel");
+		if (!info->hardware_queues) {
+			BUG_ON(i > 0);
+			/* Support old XenStore keys */
+			snprintf(ring_ref_s, 64, "ring-ref");
+			snprintf(evtchn_s, 64, "event-channel");
+		} else {
+			snprintf(ring_ref_s, 64, "ring-ref-%d", i);
+			snprintf(evtchn_s, 64, "event-channel-%d", i);
+		}
 		err = xenbus_printf(xbt, dev->nodename,
 				    ring_ref_s, "%u", info->rinfo[i].ring_ref);
 		if (err) {
@@ -1403,6 +1419,14 @@ again:
 	return err;
 }
 
+static inline int blkfront_gather_hw_queues(struct blkfront_info *info,
+					    unsigned int *nr_queues)
+{
+	return xenbus_gather(XBT_NIL, info->xbdev->otherend,
+			     "nr_supported_hw_queues", "%u", nr_queues,
+			     NULL);
+}
+
 /**
  * Entry point to this code when a new device is created.  Allocate the basic
  * structures and the ring buffer for communication with the backend, and
@@ -1414,6 +1438,7 @@ static int blkfront_probe(struct xenbus_device *dev,
 {
 	int err, vdevice, i, r;
 	struct blkfront_info *info;
+	unsigned int nr_queues;
 
 	/* FIXME: Use dynamic device id if this is not set. */
 	err = xenbus_scanf(XBT_NIL, dev->nodename,
@@ -1472,10 +1497,19 @@ static int blkfront_probe(struct xenbus_device *dev,
 	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
 	dev_set_drvdata(&dev->dev, info);
 
-	/* Allocate the correct number of rings. */
-	info->nr_rings = 1;
-	pr_info("blkfront: %s: %d rings\n",
-		info->gd->disk_name, info->nr_rings);
+	/* Gather the number of hardware queues as soon as possible */
+	err = blkfront_gather_hw_queues(info, &nr_queues);
+	if (err)
+		info->hardware_queues = 0;
+	else
+		info->hardware_queues = nr_queues;
+	/*
+	 * The backend has told us the number of hw queues it supports.
+	 * Allocate the correct number of rings.
+	 */
+	info->nr_rings = info->hardware_queues ? : 1;
+	pr_info("blkfront: %s: %d hardware queues, %d rings\n",
+		dev->nodename, info->hardware_queues, info->nr_rings);
 
 	info->rinfo = kzalloc(info->nr_rings *
 				sizeof(struct blkfront_ring_info),
@@ -1556,7 +1590,7 @@ static int blkif_recover(struct blkfront_info *info)
 
 	segs = blkfront_gather_indirect(info);
 
-	for (r = 0 ; r < info->nr_rings ; r++) {
+	for (r = 0 ; r < info->old_nr_rings ; r++) {
 		rc |= blkif_setup_shadow(&info->rinfo[r], &copy);
 		rc |= blkfront_setup_indirect(&info->rinfo[r], segs);
 		if (rc) {
@@ -1599,7 +1633,7 @@ static int blkif_recover(struct blkfront_info *info)
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	for (i = 0 ; i < info->nr_rings ; i++) {
+	for (i = 0 ; i < info->old_nr_rings ; i++) {
 		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
 		/* Kick any other new requests queued since we resumed */
 		kick_pending_request_queues(&info->rinfo[i], &flags);
@@ -1659,11 +1693,46 @@ static int blkfront_resume(struct xenbus_device *dev)
 {
 	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
 	int err;
+	unsigned int nr_queues, prev_nr_queues;
+	bool mq_to_rq_transition;
 
 	dev_dbg(&dev->dev, "blkfront_resume: %s\n", dev->nodename);
 
+	prev_nr_queues = info->hardware_queues;
+
+	err = blkfront_gather_hw_queues(info, &nr_queues);
+	if (err < 0)
+		nr_queues = 0;
+	mq_to_rq_transition = prev_nr_queues && !nr_queues;
+
+	if (prev_nr_queues != nr_queues) {
+		printk(KERN_INFO "blkfront: %s: hw queues %u -> %u\n",
+		       info->gd->disk_name, prev_nr_queues, nr_queues);
+		if (mq_to_rq_transition) {
+			struct blk_mq_hw_ctx *hctx;
+			unsigned int i;
+			/*
+			 * Switch from multi-queue to single-queue:
+			 * update hctx-to-ring mapping before
+			 * resubmitting any requests
+			 */
+			queue_for_each_hw_ctx(info->rq, hctx, i)
+				hctx->driver_data = &info->rinfo[0];
+		}
+		info->hardware_queues = nr_queues;
+	}
+
 	blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
 
+	/* Free with old number of rings, but rebuild with new */
+	info->old_nr_rings = info->nr_rings;
+	/*
+	 * Only update nr_rings if the multi-queue to single-queue
+	 * transition happened; otherwise keep the old number of rings.
+	 */
+	if (mq_to_rq_transition)
+		info->nr_rings = 1;
+
 	err = talk_to_blkback(dev, info);
 
 	/*
@@ -1863,6 +1932,10 @@ static void blkfront_connect(struct blkfront_info *info)
 		 * supports indirect descriptors, and how many.
 		 */
 		blkif_recover(info);
+		info->rinfo = krealloc(info->rinfo,
+				       info->nr_rings * sizeof(struct blkfront_ring_info),
+				       GFP_KERNEL);
+
 		return;
 
 	default:
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (5 preceding siblings ...)
  2014-09-11 23:57 ` Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-09-11 23:57 ` Arianna Avanzini
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

This commit adds to xen-blkback support for mapping and making use
of a variable number of ring buffers. The number of rings to be
mapped is hard-coded to one.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkback/blkback.c | 377 ++++++++++++++++---------------
 drivers/block/xen-blkback/common.h  | 110 +++++----
 drivers/block/xen-blkback/xenbus.c  | 432 +++++++++++++++++++++++-------------
 3 files changed, 548 insertions(+), 371 deletions(-)

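As an aside (not part of the patch), the per-ring budgets introduced here
simply split the global request pool and the persistent-grant limit across
the rings, with floors of 8 requests and 16 grants per ring. The stand-alone
user-space sketch below mirrors the XEN_RING_REQS() and XEN_RING_MAX_PGRANTS()
macros added by this patch; the max_pgrants value is an arbitrary illustrative
setting rather than the module parameter's default, and the kernel max()
helper is spelled out as a plain conditional.

#include <stdio.h>

#define XEN_BLKIF_REQS	32

/* Same arithmetic as the patch's macros, with max() written out */
#define RING_REQS(nr_rings) \
	((XEN_BLKIF_REQS / (nr_rings)) > 8 ? \
	 (XEN_BLKIF_REQS / (nr_rings)) : 8)
#define RING_MAX_PGRANTS(max_pgrants, nr_rings) \
	(((max_pgrants) / (nr_rings)) > 16 ? \
	 ((max_pgrants) / (nr_rings)) : 16)

int main(void)
{
	int max_pgrants = 256;	/* illustrative xen_blkif_max_pgrants value */
	int nr_rings;

	for (nr_rings = 1; nr_rings <= 8; nr_rings *= 2)
		printf("%d ring(s): %d reqs/ring, %d persistent grants/ring\n",
		       nr_rings, RING_REQS(nr_rings),
		       RING_MAX_PGRANTS(max_pgrants, nr_rings));
	return 0;
}
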
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 64c60ed..b31acfb 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -80,6 +80,9 @@ module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
 MODULE_PARM_DESC(max_persistent_grants,
                  "Maximum number of grants to map persistently");
 
+#define XEN_RING_MAX_PGRANTS(nr_rings) \
+	(max((int)(xen_blkif_max_pgrants / (nr_rings)), 16))
+
 /*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
@@ -103,71 +106,71 @@ module_param(log_stats, int, 0644);
 /* Number of free pages to remove on each call to free_xenballooned_pages */
 #define NUM_BATCH_FREE_PAGES 10
 
-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page **page)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
-	if (list_empty(&blkif->free_pages)) {
-		BUG_ON(blkif->free_pages_num != 0);
-		spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
+	if (list_empty(&ring->free_pages)) {
+		BUG_ON(ring->free_pages_num != 0);
+		spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 		return alloc_xenballooned_pages(1, page, false);
 	}
-	BUG_ON(blkif->free_pages_num == 0);
-	page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+	BUG_ON(ring->free_pages_num == 0);
+	page[0] = list_first_entry(&ring->free_pages, struct page, lru);
 	list_del(&page[0]->lru);
-	blkif->free_pages_num--;
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	ring->free_pages_num--;
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 
 	return 0;
 }
 
-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
-                                  int num)
+static inline void put_free_pages(struct xen_blkif_ring *ring,
+				  struct page **page, int num)
 {
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
 	for (i = 0; i < num; i++)
-		list_add(&page[i]->lru, &blkif->free_pages);
-	blkif->free_pages_num += num;
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+		list_add(&page[i]->lru, &ring->free_pages);
+	ring->free_pages_num += num;
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 }
 
-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
 {
 	/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
 	struct page *page[NUM_BATCH_FREE_PAGES];
 	unsigned int num_pages = 0;
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
-	while (blkif->free_pages_num > num) {
-		BUG_ON(list_empty(&blkif->free_pages));
-		page[num_pages] = list_first_entry(&blkif->free_pages,
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
+	while (ring->free_pages_num > num) {
+		BUG_ON(list_empty(&ring->free_pages));
+		page[num_pages] = list_first_entry(&ring->free_pages,
 		                                   struct page, lru);
 		list_del(&page[num_pages]->lru);
-		blkif->free_pages_num--;
+		ring->free_pages_num--;
 		if (++num_pages == NUM_BATCH_FREE_PAGES) {
-			spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+			spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 			free_xenballooned_pages(num_pages, page);
-			spin_lock_irqsave(&blkif->free_pages_lock, flags);
+			spin_lock_irqsave(&ring->free_pages_lock, flags);
 			num_pages = 0;
 		}
 	}
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 	if (num_pages != 0)
 		free_xenballooned_pages(num_pages, page);
 }
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req,
 				struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
 			  unsigned short op, int st);
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
@@ -188,19 +191,21 @@ static void make_response(struct xen_blkif *blkif, u64 id,
  * bit operations to modify the flags of a persistent grant and to count
  * the number of used grants.
  */
-static int add_persistent_gnt(struct xen_blkif *blkif,
+static int add_persistent_gnt(struct xen_blkif_ring *ring,
 			       struct persistent_gnt *persistent_gnt)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct rb_node **new = NULL, *parent = NULL;
 	struct persistent_gnt *this;
 
-	if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
+	if (ring->persistent_gnt_c >=
+		XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings)) {
 		if (!blkif->vbd.overflow_max_grants)
 			blkif->vbd.overflow_max_grants = 1;
 		return -EBUSY;
 	}
 	/* Figure out where to put new node */
-	new = &blkif->persistent_gnts.rb_node;
+	new = &ring->persistent_gnts.rb_node;
 	while (*new) {
 		this = container_of(*new, struct persistent_gnt, node);
 
@@ -219,19 +224,19 @@ static int add_persistent_gnt(struct xen_blkif *blkif,
 	set_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
 	/* Add new node and rebalance tree. */
 	rb_link_node(&(persistent_gnt->node), parent, new);
-	rb_insert_color(&(persistent_gnt->node), &blkif->persistent_gnts);
-	blkif->persistent_gnt_c++;
-	atomic_inc(&blkif->persistent_gnt_in_use);
+	rb_insert_color(&(persistent_gnt->node), &ring->persistent_gnts);
+	ring->persistent_gnt_c++;
+	atomic_inc(&ring->persistent_gnt_in_use);
 	return 0;
 }
 
-static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
+static struct persistent_gnt *get_persistent_gnt(struct xen_blkif_ring *ring,
 						 grant_ref_t gref)
 {
 	struct persistent_gnt *data;
 	struct rb_node *node = NULL;
 
-	node = blkif->persistent_gnts.rb_node;
+	node = ring->persistent_gnts.rb_node;
 	while (node) {
 		data = container_of(node, struct persistent_gnt, node);
 
@@ -245,25 +250,25 @@ static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
 				return NULL;
 			}
 			set_bit(PERSISTENT_GNT_ACTIVE, data->flags);
-			atomic_inc(&blkif->persistent_gnt_in_use);
+			atomic_inc(&ring->persistent_gnt_in_use);
 			return data;
 		}
 	}
 	return NULL;
 }
 
-static void put_persistent_gnt(struct xen_blkif *blkif,
+static void put_persistent_gnt(struct xen_blkif_ring *ring,
                                struct persistent_gnt *persistent_gnt)
 {
 	if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
 	          pr_alert_ratelimited(DRV_PFX " freeing a grant already unused");
 	set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
 	clear_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
-	atomic_dec(&blkif->persistent_gnt_in_use);
+	atomic_dec(&ring->persistent_gnt_in_use);
 }
 
-static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
-                                 unsigned int num)
+static void free_persistent_gnts(struct xen_blkif_ring *ring,
+				 struct rb_root *root, unsigned int num)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -288,7 +293,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			put_free_pages(blkif, pages, segs_to_unmap);
+			put_free_pages(ring, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 
@@ -305,10 +310,10 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct persistent_gnt *persistent_gnt;
 	int ret, segs_to_unmap = 0;
-	struct xen_blkif *blkif = container_of(work, typeof(*blkif), persistent_purge_work);
+	struct xen_blkif_ring *ring = container_of(work, typeof(*ring), persistent_purge_work);
 
-	while(!list_empty(&blkif->persistent_purge_list)) {
-		persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
+	while(!list_empty(&ring->persistent_purge_list)) {
+		persistent_gnt = list_first_entry(&ring->persistent_purge_list,
 		                                  struct persistent_gnt,
 		                                  remove_node);
 		list_del(&persistent_gnt->remove_node);
@@ -324,7 +329,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			put_free_pages(blkif, pages, segs_to_unmap);
+			put_free_pages(ring, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 		kfree(persistent_gnt);
@@ -332,34 +337,36 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 	if (segs_to_unmap > 0) {
 		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
 		BUG_ON(ret);
-		put_free_pages(blkif, pages, segs_to_unmap);
+		put_free_pages(ring, pages, segs_to_unmap);
 	}
 }
 
-static void purge_persistent_gnt(struct xen_blkif *blkif)
+static void purge_persistent_gnt(struct xen_blkif_ring *ring)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct persistent_gnt *persistent_gnt;
 	struct rb_node *n;
 	unsigned int num_clean, total;
 	bool scan_used = false, clean_used = false;
 	struct rb_root *root;
+	unsigned nr_rings = ring->blkif->nr_rings;
 
-	if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
-	    (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
+	if (ring->persistent_gnt_c < XEN_RING_MAX_PGRANTS(nr_rings) ||
+	    (ring->persistent_gnt_c == XEN_RING_MAX_PGRANTS(nr_rings) &&
 	    !blkif->vbd.overflow_max_grants)) {
 		return;
 	}
 
-	if (work_pending(&blkif->persistent_purge_work)) {
+	if (work_pending(&ring->persistent_purge_work)) {
 		pr_alert_ratelimited(DRV_PFX "Scheduled work from previous purge is still pending, cannot purge list\n");
 		return;
 	}
 
-	num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
-	num_clean = blkif->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
-	num_clean = min(blkif->persistent_gnt_c, num_clean);
+	num_clean = (XEN_RING_MAX_PGRANTS(nr_rings) / 100) * LRU_PERCENT_CLEAN;
+	num_clean = ring->persistent_gnt_c - XEN_RING_MAX_PGRANTS(nr_rings) + num_clean;
+	num_clean = min(ring->persistent_gnt_c, num_clean);
 	if ((num_clean == 0) ||
-	    (num_clean > (blkif->persistent_gnt_c - atomic_read(&blkif->persistent_gnt_in_use))))
+	    (num_clean > (ring->persistent_gnt_c - atomic_read(&ring->persistent_gnt_in_use))))
 		return;
 
 	/*
@@ -375,8 +382,8 @@ static void purge_persistent_gnt(struct xen_blkif *blkif)
 
 	pr_debug(DRV_PFX "Going to purge %u persistent grants\n", num_clean);
 
-	BUG_ON(!list_empty(&blkif->persistent_purge_list));
-	root = &blkif->persistent_gnts;
+	BUG_ON(!list_empty(&ring->persistent_purge_list));
+	root = &ring->persistent_gnts;
 purge_list:
 	foreach_grant_safe(persistent_gnt, n, root, node) {
 		BUG_ON(persistent_gnt->handle ==
@@ -395,7 +402,7 @@ purge_list:
 
 		rb_erase(&persistent_gnt->node, root);
 		list_add(&persistent_gnt->remove_node,
-		         &blkif->persistent_purge_list);
+		         &ring->persistent_purge_list);
 		if (--num_clean == 0)
 			goto finished;
 	}
@@ -416,11 +423,11 @@ finished:
 		goto purge_list;
 	}
 
-	blkif->persistent_gnt_c -= (total - num_clean);
+	ring->persistent_gnt_c -= (total - num_clean);
 	blkif->vbd.overflow_max_grants = 0;
 
 	/* We can defer this work */
-	schedule_work(&blkif->persistent_purge_work);
+	schedule_work(&ring->persistent_purge_work);
 	pr_debug(DRV_PFX "Purged %u/%u\n", (total - num_clean), total);
 	return;
 }
@@ -428,18 +435,18 @@ finished:
 /*
  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
  */
-static struct pending_req *alloc_req(struct xen_blkif *blkif)
+static struct pending_req *alloc_req(struct xen_blkif_ring *ring)
 {
 	struct pending_req *req = NULL;
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->pending_free_lock, flags);
-	if (!list_empty(&blkif->pending_free)) {
-		req = list_entry(blkif->pending_free.next, struct pending_req,
+	spin_lock_irqsave(&ring->pending_free_lock, flags);
+	if (!list_empty(&ring->pending_free)) {
+		req = list_entry(ring->pending_free.next, struct pending_req,
 				 free_list);
 		list_del(&req->free_list);
 	}
-	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+	spin_unlock_irqrestore(&ring->pending_free_lock, flags);
 	return req;
 }
 
@@ -447,17 +454,17 @@ static struct pending_req *alloc_req(struct xen_blkif *blkif)
  * Return the 'pending_req' structure back to the freepool. We also
  * wake up the thread if it was waiting for a free page.
  */
-static void free_req(struct xen_blkif *blkif, struct pending_req *req)
+static void free_req(struct xen_blkif_ring *ring, struct pending_req *req)
 {
 	unsigned long flags;
 	int was_empty;
 
-	spin_lock_irqsave(&blkif->pending_free_lock, flags);
-	was_empty = list_empty(&blkif->pending_free);
-	list_add(&req->free_list, &blkif->pending_free);
-	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+	spin_lock_irqsave(&ring->pending_free_lock, flags);
+	was_empty = list_empty(&ring->pending_free);
+	list_add(&req->free_list, &ring->pending_free);
+	spin_unlock_irqrestore(&ring->pending_free_lock, flags);
 	if (was_empty)
-		wake_up(&blkif->pending_free_wq);
+		wake_up(&ring->pending_free_wq);
 }
 
 /*
@@ -537,10 +544,10 @@ abort:
 /*
  * Notification from the guest OS.
  */
-static void blkif_notify_work(struct xen_blkif *blkif)
+static void blkif_notify_work(struct xen_blkif_ring *ring)
 {
-	blkif->waiting_reqs = 1;
-	wake_up(&blkif->wq);
+	ring->waiting_reqs = 1;
+	wake_up(&ring->wq);
 }
 
 irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
@@ -553,30 +560,33 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
  * SCHEDULER FUNCTIONS
  */
 
-static void print_stats(struct xen_blkif *blkif)
+static void print_stats(struct xen_blkif_ring *ring)
 {
+	spin_lock_irq(&ring->stats_lock);
 	pr_info("xen-blkback (%s): oo %3llu  |  rd %4llu  |  wr %4llu  |  f %4llu"
 		 "  |  ds %4llu | pg: %4u/%4d\n",
-		 current->comm, blkif->st_oo_req,
-		 blkif->st_rd_req, blkif->st_wr_req,
-		 blkif->st_f_req, blkif->st_ds_req,
-		 blkif->persistent_gnt_c,
-		 xen_blkif_max_pgrants);
-	blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
-	blkif->st_rd_req = 0;
-	blkif->st_wr_req = 0;
-	blkif->st_oo_req = 0;
-	blkif->st_ds_req = 0;
+		 current->comm, ring->st_oo_req,
+		 ring->st_rd_req, ring->st_wr_req,
+		 ring->st_f_req, ring->st_ds_req,
+		 ring->persistent_gnt_c,
+		 XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings));
+	ring->st_print = jiffies + msecs_to_jiffies(10 * 1000);
+	ring->st_rd_req = 0;
+	ring->st_wr_req = 0;
+	ring->st_oo_req = 0;
+	ring->st_ds_req = 0;
+	spin_unlock_irq(&ring->stats_lock);
 }
 
 int xen_blkif_schedule(void *arg)
 {
-	struct xen_blkif *blkif = arg;
+	struct xen_blkif_ring *ring = arg;
+	struct xen_blkif *blkif = ring->blkif;
 	struct xen_vbd *vbd = &blkif->vbd;
 	unsigned long timeout;
 	int ret;
 
-	xen_blkif_get(blkif);
+	xen_ring_get(ring);
 
 	while (!kthread_should_stop()) {
 		if (try_to_freeze())
@@ -587,51 +597,51 @@ int xen_blkif_schedule(void *arg)
 		timeout = msecs_to_jiffies(LRU_INTERVAL);
 
 		timeout = wait_event_interruptible_timeout(
-			blkif->wq,
-			blkif->waiting_reqs || kthread_should_stop(),
+			ring->wq,
+			ring->waiting_reqs || kthread_should_stop(),
 			timeout);
 		if (timeout == 0)
 			goto purge_gnt_list;
 		timeout = wait_event_interruptible_timeout(
-			blkif->pending_free_wq,
-			!list_empty(&blkif->pending_free) ||
+			ring->pending_free_wq,
+			!list_empty(&ring->pending_free) ||
 			kthread_should_stop(),
 			timeout);
 		if (timeout == 0)
 			goto purge_gnt_list;
 
-		blkif->waiting_reqs = 0;
+		ring->waiting_reqs = 0;
 		smp_mb(); /* clear flag *before* checking for work */
 
-		ret = do_block_io_op(blkif);
+		ret = do_block_io_op(ring);
 		if (ret > 0)
-			blkif->waiting_reqs = 1;
+			ring->waiting_reqs = 1;
 		if (ret == -EACCES)
-			wait_event_interruptible(blkif->shutdown_wq,
+			wait_event_interruptible(ring->shutdown_wq,
 						 kthread_should_stop());
 
 purge_gnt_list:
 		if (blkif->vbd.feature_gnt_persistent &&
-		    time_after(jiffies, blkif->next_lru)) {
-			purge_persistent_gnt(blkif);
-			blkif->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
+		    time_after(jiffies, ring->next_lru)) {
+			purge_persistent_gnt(ring);
+			ring->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
 		}
 
 		/* Shrink if we have more than xen_blkif_max_buffer_pages */
-		shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
+		shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);
 
-		if (log_stats && time_after(jiffies, blkif->st_print))
-			print_stats(blkif);
+		if (log_stats && time_after(jiffies, ring->st_print))
+			print_stats(ring);
 	}
 
 	/* Drain pending purge work */
-	flush_work(&blkif->persistent_purge_work);
+	flush_work(&ring->persistent_purge_work);
 
 	if (log_stats)
-		print_stats(blkif);
+		print_stats(ring);
 
-	blkif->xenblkd = NULL;
-	xen_blkif_put(blkif);
+	ring->xenblkd = NULL;
+	xen_ring_put(ring);
 
 	return 0;
 }
@@ -639,25 +649,25 @@ purge_gnt_list:
 /*
  * Remove persistent grants and empty the pool of free pages
  */
-void xen_blkbk_free_caches(struct xen_blkif *blkif)
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring)
 {
 	/* Free all persistent grant pages */
-	if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
-		free_persistent_gnts(blkif, &blkif->persistent_gnts,
-			blkif->persistent_gnt_c);
+	if (!RB_EMPTY_ROOT(&ring->persistent_gnts))
+		free_persistent_gnts(ring, &ring->persistent_gnts,
+			ring->persistent_gnt_c);
 
-	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
-	blkif->persistent_gnt_c = 0;
+	BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+	ring->persistent_gnt_c = 0;
 
 	/* Since we are shutting down remove all pages from the buffer */
-	shrink_free_pagepool(blkif, 0 /* All */);
+	shrink_free_pagepool(ring, 0 /* All */);
 }
 
 /*
  * Unmap the grant references, and also remove the M2P over-rides
  * used in the 'pending_req'.
  */
-static void xen_blkbk_unmap(struct xen_blkif *blkif,
+static void xen_blkbk_unmap(struct xen_blkif_ring *ring,
                             struct grant_page *pages[],
                             int num)
 {
@@ -668,7 +678,7 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 
 	for (i = 0; i < num; i++) {
 		if (pages[i]->persistent_gnt != NULL) {
-			put_persistent_gnt(blkif, pages[i]->persistent_gnt);
+			put_persistent_gnt(ring, pages[i]->persistent_gnt);
 			continue;
 		}
 		if (pages[i]->handle == BLKBACK_INVALID_HANDLE)
@@ -681,21 +691,22 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 			ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
 			                        invcount);
 			BUG_ON(ret);
-			put_free_pages(blkif, unmap_pages, invcount);
+			put_free_pages(ring, unmap_pages, invcount);
 			invcount = 0;
 		}
 	}
 	if (invcount) {
 		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
 		BUG_ON(ret);
-		put_free_pages(blkif, unmap_pages, invcount);
+		put_free_pages(ring, unmap_pages, invcount);
 	}
 }
 
-static int xen_blkbk_map(struct xen_blkif *blkif,
+static int xen_blkbk_map(struct xen_blkif_ring *ring,
 			 struct grant_page *pages[],
 			 int num, bool ro)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct persistent_gnt *persistent_gnt = NULL;
@@ -719,7 +730,7 @@ again:
 
 		if (use_persistent_gnts)
 			persistent_gnt = get_persistent_gnt(
-				blkif,
+				ring,
 				pages[i]->gref);
 
 		if (persistent_gnt) {
@@ -730,7 +741,7 @@ again:
 			pages[i]->page = persistent_gnt->page;
 			pages[i]->persistent_gnt = persistent_gnt;
 		} else {
-			if (get_free_page(blkif, &pages[i]->page))
+			if (get_free_page(ring, &pages[i]->page))
 				goto out_of_memory;
 			addr = vaddr(pages[i]->page);
 			pages_to_gnt[segs_to_map] = pages[i]->page;
@@ -772,7 +783,8 @@ again:
 			continue;
 		}
 		if (use_persistent_gnts &&
-		    blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
+		    ring->persistent_gnt_c <
+			XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings)) {
 			/*
 			 * We are using persistent grants, the grant is
 			 * not mapped but we might have room for it.
@@ -790,7 +802,7 @@ again:
 			persistent_gnt->gnt = map[new_map_idx].ref;
 			persistent_gnt->handle = map[new_map_idx].handle;
 			persistent_gnt->page = pages[seg_idx]->page;
-			if (add_persistent_gnt(blkif,
+			if (add_persistent_gnt(ring,
 			                       persistent_gnt)) {
 				kfree(persistent_gnt);
 				persistent_gnt = NULL;
@@ -798,8 +810,8 @@ again:
 			}
 			pages[seg_idx]->persistent_gnt = persistent_gnt;
 			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
-				 persistent_gnt->gnt, blkif->persistent_gnt_c,
-				 xen_blkif_max_pgrants);
+				 persistent_gnt->gnt, ring->persistent_gnt_c,
+				 XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings));
 			goto next;
 		}
 		if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
@@ -823,7 +835,7 @@ next:
 
 out_of_memory:
 	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
-	put_free_pages(blkif, pages_to_gnt, segs_to_map);
+	put_free_pages(ring, pages_to_gnt, segs_to_map);
 	return -ENOMEM;
 }
 
@@ -831,7 +843,7 @@ static int xen_blkbk_map_seg(struct pending_req *pending_req)
 {
 	int rc;
 
-	rc = xen_blkbk_map(pending_req->blkif, pending_req->segments,
+	rc = xen_blkbk_map(pending_req->ring, pending_req->segments,
 			   pending_req->nr_pages,
 	                   (pending_req->operation != BLKIF_OP_READ));
 
@@ -844,7 +856,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 				    struct phys_req *preq)
 {
 	struct grant_page **pages = pending_req->indirect_pages;
-	struct xen_blkif *blkif = pending_req->blkif;
+	struct xen_blkif_ring *ring = pending_req->ring;
 	int indirect_grefs, rc, n, nseg, i;
 	struct blkif_request_segment *segments = NULL;
 
@@ -855,7 +867,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 	for (i = 0; i < indirect_grefs; i++)
 		pages[i]->gref = req->u.indirect.indirect_grefs[i];
 
-	rc = xen_blkbk_map(blkif, pages, indirect_grefs, true);
+	rc = xen_blkbk_map(ring, pages, indirect_grefs, true);
 	if (rc)
 		goto unmap;
 
@@ -882,20 +894,21 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 unmap:
 	if (segments)
 		kunmap_atomic(segments);
-	xen_blkbk_unmap(blkif, pages, indirect_grefs);
+	xen_blkbk_unmap(ring, pages, indirect_grefs);
 	return rc;
 }
 
-static int dispatch_discard_io(struct xen_blkif *blkif,
+static int dispatch_discard_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req)
 {
 	int err = 0;
 	int status = BLKIF_RSP_OKAY;
+	struct xen_blkif *blkif = ring->blkif;
 	struct block_device *bdev = blkif->vbd.bdev;
 	unsigned long secure;
 	struct phys_req preq;
 
-	xen_blkif_get(blkif);
+	xen_ring_get(ring);
 
 	preq.sector_number = req->u.discard.sector_number;
 	preq.nr_sects      = req->u.discard.nr_sectors;
@@ -907,7 +920,9 @@ static int dispatch_discard_io(struct xen_blkif *blkif,
 			preq.sector_number + preq.nr_sects, blkif->vbd.pdevice);
 		goto fail_response;
 	}
-	blkif->st_ds_req++;
+	spin_lock_irq(&ring->stats_lock);
+	ring->st_ds_req++;
+	spin_unlock_irq(&ring->stats_lock);
 
 	secure = (blkif->vbd.discard_secure &&
 		 (req->u.discard.flag & BLKIF_DISCARD_SECURE)) ?
@@ -923,26 +938,27 @@ fail_response:
 	} else if (err)
 		status = BLKIF_RSP_ERROR;
 
-	make_response(blkif, req->u.discard.id, req->operation, status);
-	xen_blkif_put(blkif);
+	make_response(ring, req->u.discard.id, req->operation, status);
+	xen_ring_put(ring);
 	return err;
 }
 
-static int dispatch_other_io(struct xen_blkif *blkif,
+static int dispatch_other_io(struct xen_blkif_ring *ring,
 			     struct blkif_request *req,
 			     struct pending_req *pending_req)
 {
-	free_req(blkif, pending_req);
-	make_response(blkif, req->u.other.id, req->operation,
+	free_req(ring, pending_req);
+	make_response(ring, req->u.other.id, req->operation,
 		      BLKIF_RSP_EOPNOTSUPP);
 	return -EIO;
 }
 
-static void xen_blk_drain_io(struct xen_blkif *blkif)
+static void xen_blk_drain_io(struct xen_blkif_ring *ring)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	atomic_set(&blkif->drain, 1);
 	do {
-		if (atomic_read(&blkif->inflight) == 0)
+		if (atomic_read(&ring->inflight) == 0)
 			break;
 		wait_for_completion_interruptible_timeout(
 				&blkif->drain_complete, HZ);
@@ -963,12 +979,12 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	if ((pending_req->operation == BLKIF_OP_FLUSH_DISKCACHE) &&
 	    (error == -EOPNOTSUPP)) {
 		pr_debug(DRV_PFX "flush diskcache op failed, not supported\n");
-		xen_blkbk_flush_diskcache(XBT_NIL, pending_req->blkif->be, 0);
+		xen_blkbk_flush_diskcache(XBT_NIL, pending_req->ring->blkif->be, 0);
 		pending_req->status = BLKIF_RSP_EOPNOTSUPP;
 	} else if ((pending_req->operation == BLKIF_OP_WRITE_BARRIER) &&
 		    (error == -EOPNOTSUPP)) {
 		pr_debug(DRV_PFX "write barrier op failed, not supported\n");
-		xen_blkbk_barrier(XBT_NIL, pending_req->blkif->be, 0);
+		xen_blkbk_barrier(XBT_NIL, pending_req->ring->blkif->be, 0);
 		pending_req->status = BLKIF_RSP_EOPNOTSUPP;
 	} else if (error) {
 		pr_debug(DRV_PFX "Buffer not up-to-date at end of operation,"
@@ -982,14 +998,15 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	 * the proper response on the ring.
 	 */
 	if (atomic_dec_and_test(&pending_req->pendcnt)) {
-		struct xen_blkif *blkif = pending_req->blkif;
+		struct xen_blkif_ring *ring = pending_req->ring;
+		struct xen_blkif *blkif = ring->blkif;
 
-		xen_blkbk_unmap(blkif,
+		xen_blkbk_unmap(ring,
 		                pending_req->segments,
 		                pending_req->nr_pages);
-		make_response(blkif, pending_req->id,
+		make_response(ring, pending_req->id,
 			      pending_req->operation, pending_req->status);
-		free_req(blkif, pending_req);
+		free_req(ring, pending_req);
 		/*
 		 * Make sure the request is freed before releasing blkif,
 		 * or there could be a race between free_req and the
@@ -1002,10 +1019,10 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 		 * pending_free_wq if there's a drain going on, but it has
 		 * to be taken into account if the current model is changed.
 		 */
-		if (atomic_dec_and_test(&blkif->inflight) && atomic_read(&blkif->drain)) {
+		if (atomic_dec_and_test(&ring->inflight) && atomic_read(&blkif->drain)) {
 			complete(&blkif->drain_complete);
 		}
-		xen_blkif_put(blkif);
+		xen_ring_put(ring);
 	}
 }
 
@@ -1026,9 +1043,10 @@ static void end_block_io_op(struct bio *bio, int error)
  * and transmute  it to the block API to hand it over to the proper block disk.
  */
 static int
-__do_block_io_op(struct xen_blkif *blkif)
+__do_block_io_op(struct xen_blkif_ring *ring)
 {
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
+	struct xen_blkif *blkif = ring->blkif;
 	struct blkif_request req;
 	struct pending_req *pending_req;
 	RING_IDX rc, rp;
@@ -1054,9 +1072,11 @@ __do_block_io_op(struct xen_blkif *blkif)
 			break;
 		}
 
-		pending_req = alloc_req(blkif);
+		pending_req = alloc_req(ring);
 		if (NULL == pending_req) {
-			blkif->st_oo_req++;
+			spin_lock_irq(&ring->stats_lock);
+			ring->st_oo_req++;
+			spin_unlock_irq(&ring->stats_lock);
 			more_to_do = 1;
 			break;
 		}
@@ -1085,16 +1105,16 @@ __do_block_io_op(struct xen_blkif *blkif)
 		case BLKIF_OP_WRITE_BARRIER:
 		case BLKIF_OP_FLUSH_DISKCACHE:
 		case BLKIF_OP_INDIRECT:
-			if (dispatch_rw_block_io(blkif, &req, pending_req))
+			if (dispatch_rw_block_io(ring, &req, pending_req))
 				goto done;
 			break;
 		case BLKIF_OP_DISCARD:
-			free_req(blkif, pending_req);
-			if (dispatch_discard_io(blkif, &req))
+			free_req(ring, pending_req);
+			if (dispatch_discard_io(ring, &req))
 				goto done;
 			break;
 		default:
-			if (dispatch_other_io(blkif, &req, pending_req))
+			if (dispatch_other_io(ring, &req, pending_req))
 				goto done;
 			break;
 		}
@@ -1107,13 +1127,13 @@ done:
 }
 
 static int
-do_block_io_op(struct xen_blkif *blkif)
+do_block_io_op(struct xen_blkif_ring *ring)
 {
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
 	int more_to_do;
 
 	do {
-		more_to_do = __do_block_io_op(blkif);
+		more_to_do = __do_block_io_op(ring);
 		if (more_to_do)
 			break;
 
@@ -1126,7 +1146,7 @@ do_block_io_op(struct xen_blkif *blkif)
  * Transmutation of the 'struct blkif_request' to a proper 'struct bio'
  * and call the 'submit_bio' to pass it to the underlying storage.
  */
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req,
 				struct pending_req *pending_req)
 {
@@ -1140,6 +1160,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	struct blk_plug plug;
 	bool drain = false;
 	struct grant_page **pages = pending_req->segments;
+	struct xen_blkif *blkif = ring->blkif;
 	unsigned short req_operation;
 
 	req_operation = req->operation == BLKIF_OP_INDIRECT ?
@@ -1152,26 +1173,29 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		goto fail_response;
 	}
 
+	spin_lock_irq(&ring->stats_lock);
 	switch (req_operation) {
 	case BLKIF_OP_READ:
-		blkif->st_rd_req++;
+		ring->st_rd_req++;
 		operation = READ;
 		break;
 	case BLKIF_OP_WRITE:
-		blkif->st_wr_req++;
+		ring->st_wr_req++;
 		operation = WRITE_ODIRECT;
 		break;
 	case BLKIF_OP_WRITE_BARRIER:
 		drain = true;
 	case BLKIF_OP_FLUSH_DISKCACHE:
-		blkif->st_f_req++;
+		ring->st_f_req++;
 		operation = WRITE_FLUSH;
 		break;
 	default:
 		operation = 0; /* make gcc happy */
+		spin_unlock_irq(&ring->stats_lock);
 		goto fail_response;
 		break;
 	}
+	spin_unlock_irq(&ring->stats_lock);
 
 	/* Check that the number of segments is sane. */
 	nseg = req->operation == BLKIF_OP_INDIRECT ?
@@ -1190,7 +1214,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 
 	preq.nr_sects      = 0;
 
-	pending_req->blkif     = blkif;
+	pending_req->ring      = ring;
 	pending_req->id        = req->u.rw.id;
 	pending_req->operation = req_operation;
 	pending_req->status    = BLKIF_RSP_OKAY;
@@ -1243,7 +1267,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	 * issue the WRITE_FLUSH.
 	 */
 	if (drain)
-		xen_blk_drain_io(pending_req->blkif);
+		xen_blk_drain_io(pending_req->ring);
 
 	/*
 	 * If we have failed at this point, we need to undo the M2P override,
@@ -1255,11 +1279,11 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		goto fail_flush;
 
 	/*
-	 * This corresponding xen_blkif_put is done in __end_block_io_op, or
+	 * This corresponding xen_ring_put is done in __end_block_io_op, or
 	 * below (in "!bio") if we are handling a BLKIF_OP_DISCARD.
 	 */
-	xen_blkif_get(blkif);
-	atomic_inc(&blkif->inflight);
+	xen_ring_get(ring);
+	atomic_inc(&ring->inflight);
 
 	for (i = 0; i < nseg; i++) {
 		while ((bio == NULL) ||
@@ -1306,20 +1330,22 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	/* Let the I/Os go.. */
 	blk_finish_plug(&plug);
 
+	spin_lock_irq(&ring->stats_lock);
 	if (operation == READ)
-		blkif->st_rd_sect += preq.nr_sects;
+		ring->st_rd_sect += preq.nr_sects;
 	else if (operation & WRITE)
-		blkif->st_wr_sect += preq.nr_sects;
+		ring->st_wr_sect += preq.nr_sects;
+	spin_unlock_irq(&ring->stats_lock);
 
 	return 0;
 
  fail_flush:
-	xen_blkbk_unmap(blkif, pending_req->segments,
+	xen_blkbk_unmap(ring, pending_req->segments,
 	                pending_req->nr_pages);
  fail_response:
 	/* Haven't submitted any bio's yet. */
-	make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
-	free_req(blkif, pending_req);
+	make_response(ring, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
+	free_req(ring, pending_req);
 	msleep(1); /* back off a bit */
 	return -EIO;
 
@@ -1337,19 +1363,20 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 /*
  * Put a response on the ring on how the operation fared.
  */
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
 			  unsigned short op, int st)
 {
 	struct blkif_response  resp;
 	unsigned long     flags;
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
+	struct xen_blkif *blkif = ring->blkif;
 	int notify;
 
 	resp.id        = id;
 	resp.operation = op;
 	resp.status    = st;
 
-	spin_lock_irqsave(&blkif->blk_ring_lock, flags);
+	spin_lock_irqsave(&ring->blk_ring_lock, flags);
 	/* Place on the response ring for the relevant domain. */
 	switch (blkif->blk_protocol) {
 	case BLKIF_PROTOCOL_NATIVE:
@@ -1369,9 +1396,9 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	}
 	blk_rings->common.rsp_prod_pvt++;
 	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify);
-	spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);
+	spin_unlock_irqrestore(&ring->blk_ring_lock, flags);
 	if (notify)
-		notify_remote_via_irq(blkif->irq);
+		notify_remote_via_irq(ring->irq);
 }
 
 static int __init xen_blkif_init(void)
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index f65b807..6f074ce 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -226,6 +226,7 @@ struct xen_vbd {
 	struct block_device	*bdev;
 	/* Cached size parameter. */
 	sector_t		size;
+	unsigned int		nr_supported_hw_queues;
 	unsigned int		flush_support:1;
 	unsigned int		discard_secure:1;
 	unsigned int		feature_gnt_persistent:1;
@@ -246,6 +247,7 @@ struct backend_info;
 
 /* Number of requests that we can fit in a ring */
 #define XEN_BLKIF_REQS			32
+#define XEN_RING_REQS(nr_rings)	(max((int)(XEN_BLKIF_REQS / (nr_rings)), 8))
 
 struct persistent_gnt {
 	struct page *page;
@@ -256,32 +258,29 @@ struct persistent_gnt {
 	struct list_head remove_node;
 };
 
-struct xen_blkif {
-	/* Unique identifier for this interface. */
-	domid_t			domid;
-	unsigned int		handle;
+struct xen_blkif_ring {
+	union blkif_back_rings	blk_rings;
 	/* Physical parameters of the comms window. */
 	unsigned int		irq;
-	/* Comms information. */
-	enum blkif_protocol	blk_protocol;
-	union blkif_back_rings	blk_rings;
-	void			*blk_ring;
-	/* The VBD attached to this interface. */
-	struct xen_vbd		vbd;
-	/* Back pointer to the backend_info. */
-	struct backend_info	*be;
-	/* Private fields. */
-	spinlock_t		blk_ring_lock;
-	atomic_t		refcnt;
 
 	wait_queue_head_t	wq;
-	/* for barrier (drain) requests */
-	struct completion	drain_complete;
-	atomic_t		drain;
-	atomic_t		inflight;
 	/* One thread per one blkif. */
 	struct task_struct	*xenblkd;
 	unsigned int		waiting_reqs;
+	void			*blk_ring;
+	spinlock_t		blk_ring_lock;
+
+	struct work_struct	free_work;
+	/* Thread shutdown wait queue. */
+	wait_queue_head_t	shutdown_wq;
+
+	/* buffer of free pages to map grant refs */
+	spinlock_t		free_pages_lock;
+	int			free_pages_num;
+
+	/* used by the kworker that offload work from the persistent purge */
+	struct list_head	persistent_purge_list;
+	struct work_struct	persistent_purge_work;
 
 	/* tree to store persistent grants */
 	struct rb_root		persistent_gnts;
@@ -289,13 +288,6 @@ struct xen_blkif {
 	atomic_t		persistent_gnt_in_use;
 	unsigned long           next_lru;
 
-	/* used by the kworker that offload work from the persistent purge */
-	struct list_head	persistent_purge_list;
-	struct work_struct	persistent_purge_work;
-
-	/* buffer of free pages to map grant refs */
-	spinlock_t		free_pages_lock;
-	int			free_pages_num;
 	struct list_head	free_pages;
 
 	/* List of all 'pending_req' available */
@@ -303,20 +295,54 @@ struct xen_blkif {
 	/* And its spinlock. */
 	spinlock_t		pending_free_lock;
 	wait_queue_head_t	pending_free_wq;
+	atomic_t		inflight;
+
+	/* Private fields. */
+	atomic_t		refcnt;
+
+	struct xen_blkif	*blkif;
+	unsigned		ring_index;
 
+	spinlock_t		stats_lock;
 	/* statistics */
 	unsigned long		st_print;
-	unsigned long long			st_rd_req;
-	unsigned long long			st_wr_req;
-	unsigned long long			st_oo_req;
-	unsigned long long			st_f_req;
-	unsigned long long			st_ds_req;
-	unsigned long long			st_rd_sect;
-	unsigned long long			st_wr_sect;
+	unsigned long long	st_rd_req;
+	unsigned long long	st_wr_req;
+	unsigned long long	st_oo_req;
+	unsigned long long	st_f_req;
+	unsigned long long	st_ds_req;
+	unsigned long long	st_rd_sect;
+	unsigned long long	st_wr_sect;
+};
 
-	struct work_struct	free_work;
-	/* Thread shutdown wait queue. */
-	wait_queue_head_t	shutdown_wq;
+struct xen_blkif {
+	/* Unique identifier for this interface. */
+	domid_t			domid;
+	unsigned int		handle;
+	/* Comms information. */
+	enum blkif_protocol	blk_protocol;
+	/* The VBD attached to this interface. */
+	struct xen_vbd		vbd;
+	/* Rings for this device */
+	struct xen_blkif_ring	*rings;
+	unsigned int		nr_rings;
+	/* Back pointer to the backend_info. */
+	struct backend_info	*be;
+
+	/* for barrier (drain) requests */
+	struct completion	drain_complete;
+	atomic_t		drain;
+
+	atomic_t		refcnt;
+
+	/* statistics */
+	unsigned long long	st_rd_req;
+	unsigned long long	st_wr_req;
+	unsigned long long	st_oo_req;
+	unsigned long long	st_f_req;
+	unsigned long long	st_ds_req;
+	unsigned long long	st_rd_sect;
+	unsigned long long	st_wr_sect;
 };
 
 struct seg_buf {
@@ -338,7 +364,7 @@ struct grant_page {
  * response queued for it, with the saved 'id' passed back.
  */
 struct pending_req {
-	struct xen_blkif	*blkif;
+	struct xen_blkif_ring	*ring;
 	u64			id;
 	int			nr_pages;
 	atomic_t		pendcnt;
@@ -357,11 +383,11 @@ struct pending_req {
 			 (_v)->bdev->bd_part->nr_sects : \
 			  get_capacity((_v)->bdev->bd_disk))
 
-#define xen_blkif_get(_b) (atomic_inc(&(_b)->refcnt))
-#define xen_blkif_put(_b)				\
+#define xen_ring_get(_r) (atomic_inc(&(_r)->refcnt))
+#define xen_ring_put(_r)				\
 	do {						\
-		if (atomic_dec_and_test(&(_b)->refcnt))	\
-			schedule_work(&(_b)->free_work);\
+		if (atomic_dec_and_test(&(_r)->refcnt))	\
+			schedule_work(&(_r)->free_work);\
 	} while (0)
 
 struct phys_req {
@@ -377,7 +403,7 @@ int xen_blkif_xenbus_init(void);
 irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
 int xen_blkif_schedule(void *arg);
 int xen_blkif_purge_persistent(void *arg);
-void xen_blkbk_free_caches(struct xen_blkif *blkif);
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring);
 
 int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
 			      struct backend_info *be, int state);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 3a8b810..a4f13cc 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -35,7 +35,7 @@ static void connect(struct backend_info *);
 static int connect_ring(struct backend_info *);
 static void backend_changed(struct xenbus_watch *, const char **,
 			    unsigned int);
-static void xen_blkif_free(struct xen_blkif *blkif);
+static void xen_ring_free(struct xen_blkif_ring *ring);
 static void xen_vbd_free(struct xen_vbd *vbd);
 
 struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
@@ -45,17 +45,17 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
 
 /*
  * The last request could free the device from softirq context and
- * xen_blkif_free() can sleep.
+ * xen_ring_free() can sleep.
  */
-static void xen_blkif_deferred_free(struct work_struct *work)
+static void xen_ring_deferred_free(struct work_struct *work)
 {
-	struct xen_blkif *blkif;
+	struct xen_blkif_ring *ring;
 
-	blkif = container_of(work, struct xen_blkif, free_work);
-	xen_blkif_free(blkif);
+	ring = container_of(work, struct xen_blkif_ring, free_work);
+	xen_ring_free(ring);
 }
 
-static int blkback_name(struct xen_blkif *blkif, char *buf)
+static int blkback_name(struct xen_blkif *blkif, char *buf, bool save_space)
 {
 	char *devpath, *devname;
 	struct xenbus_device *dev = blkif->be->dev;
@@ -70,7 +70,10 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
 	else
 		devname  = devpath;
 
-	snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
+	if (save_space)
+		snprintf(buf, TASK_COMM_LEN, "blkbk.%d.%s", blkif->domid, devname);
+	else
+		snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
 	kfree(devpath);
 
 	return 0;
@@ -78,11 +81,15 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
 
 static void xen_update_blkif_status(struct xen_blkif *blkif)
 {
-	int err;
-	char name[TASK_COMM_LEN];
+	int i, err;
+	char name[TASK_COMM_LEN], per_ring_name[TASK_COMM_LEN];
+	struct xen_blkif_ring *ring;
 
-	/* Not ready to connect? */
-	if (!blkif->irq || !blkif->vbd.bdev)
+	/*
+	 * Not ready to connect? Check irq of first ring as the others
+	 * should all be the same.
+	 */
+	if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
 		return;
 
 	/* Already connected? */
@@ -94,7 +101,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 	if (blkif->be->dev->state != XenbusStateConnected)
 		return;
 
-	err = blkback_name(blkif, name);
+	err = blkback_name(blkif, name, blkif->vbd.nr_supported_hw_queues);
 	if (err) {
 		xenbus_dev_error(blkif->be->dev, err, "get blkback dev name");
 		return;
@@ -107,20 +114,96 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 	}
 	invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
 
-	blkif->xenblkd = kthread_run(xen_blkif_schedule, blkif, "%s", name);
-	if (IS_ERR(blkif->xenblkd)) {
-		err = PTR_ERR(blkif->xenblkd);
-		blkif->xenblkd = NULL;
-		xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
-		return;
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		ring = &blkif->rings[i];
+		if (blkif->vbd.nr_supported_hw_queues)
+			snprintf(per_ring_name, TASK_COMM_LEN, "%s-%d", name, i);
+		else {
+			BUG_ON(i != 0);
+			snprintf(per_ring_name, TASK_COMM_LEN, "%s", name);
+		}
+		ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s", per_ring_name);
+		if (IS_ERR(ring->xenblkd)) {
+			err = PTR_ERR(ring->xenblkd);
+			ring->xenblkd = NULL;
+			xenbus_dev_error(blkif->be->dev, err, "start %s", per_ring_name);
+			return;
+		}
 	}
 }
 
+static struct xen_blkif_ring *xen_blkif_ring_alloc(struct xen_blkif *blkif,
+						   int nr_rings)
+{
+	int r, i, j;
+	struct xen_blkif_ring *rings;
+	struct pending_req *req;
+
+	rings = kzalloc(nr_rings * sizeof(struct xen_blkif_ring),
+			GFP_KERNEL);
+	if (!rings)
+		return NULL;
+
+	for (r = 0 ; r < nr_rings ; r++) {
+		struct xen_blkif_ring *ring = &rings[r];
+
+		spin_lock_init(&ring->blk_ring_lock);
+
+		init_waitqueue_head(&ring->wq);
+		init_waitqueue_head(&ring->shutdown_wq);
+
+		ring->persistent_gnts.rb_node = NULL;
+		spin_lock_init(&ring->free_pages_lock);
+		INIT_LIST_HEAD(&ring->free_pages);
+		INIT_LIST_HEAD(&ring->persistent_purge_list);
+		ring->free_pages_num = 0;
+		atomic_set(&ring->persistent_gnt_in_use, 0);
+		atomic_set(&ring->refcnt, 1);
+		atomic_set(&ring->inflight, 0);
+		INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
+		spin_lock_init(&ring->pending_free_lock);
+		init_waitqueue_head(&ring->pending_free_wq);
+		INIT_LIST_HEAD(&ring->pending_free);
+		for (i = 0; i < XEN_RING_REQS(nr_rings); i++) {
+			req = kzalloc(sizeof(*req), GFP_KERNEL);
+			if (!req)
+				goto fail;
+			list_add_tail(&req->free_list,
+				      &ring->pending_free);
+			for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+				req->segments[j] = kzalloc(sizeof(*req->segments[0]),
+				                           GFP_KERNEL);
+				if (!req->segments[j])
+					goto fail;
+			}
+			for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+				req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
+				                                 GFP_KERNEL);
+				if (!req->indirect_pages[j])
+					goto fail;
+			}
+		}
+
+		INIT_WORK(&ring->free_work, xen_ring_deferred_free);
+		ring->blkif = blkif;
+		ring->ring_index = r;
+
+		spin_lock_init(&ring->stats_lock);
+		ring->st_print = jiffies;
+
+		atomic_inc(&blkif->refcnt);
+	}
+
+	return rings;
+
+fail:
+	kfree(rings);
+	return NULL;
+}
+
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
 	struct xen_blkif *blkif;
-	struct pending_req *req, *n;
-	int i, j;
 
 	BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -129,80 +212,26 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 		return ERR_PTR(-ENOMEM);
 
 	blkif->domid = domid;
-	spin_lock_init(&blkif->blk_ring_lock);
-	atomic_set(&blkif->refcnt, 1);
-	init_waitqueue_head(&blkif->wq);
 	init_completion(&blkif->drain_complete);
 	atomic_set(&blkif->drain, 0);
-	blkif->st_print = jiffies;
-	blkif->persistent_gnts.rb_node = NULL;
-	spin_lock_init(&blkif->free_pages_lock);
-	INIT_LIST_HEAD(&blkif->free_pages);
-	INIT_LIST_HEAD(&blkif->persistent_purge_list);
-	blkif->free_pages_num = 0;
-	atomic_set(&blkif->persistent_gnt_in_use, 0);
-	atomic_set(&blkif->inflight, 0);
-	INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
-
-	INIT_LIST_HEAD(&blkif->pending_free);
-	INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
-
-	for (i = 0; i < XEN_BLKIF_REQS; i++) {
-		req = kzalloc(sizeof(*req), GFP_KERNEL);
-		if (!req)
-			goto fail;
-		list_add_tail(&req->free_list,
-		              &blkif->pending_free);
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-			req->segments[j] = kzalloc(sizeof(*req->segments[0]),
-			                           GFP_KERNEL);
-			if (!req->segments[j])
-				goto fail;
-		}
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-			req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
-			                                 GFP_KERNEL);
-			if (!req->indirect_pages[j])
-				goto fail;
-		}
-	}
-	spin_lock_init(&blkif->pending_free_lock);
-	init_waitqueue_head(&blkif->pending_free_wq);
-	init_waitqueue_head(&blkif->shutdown_wq);
 
 	return blkif;
-
-fail:
-	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
-		list_del(&req->free_list);
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-			if (!req->segments[j])
-				break;
-			kfree(req->segments[j]);
-		}
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-			if (!req->indirect_pages[j])
-				break;
-			kfree(req->indirect_pages[j]);
-		}
-		kfree(req);
-	}
-
-	kmem_cache_free(xen_blkif_cachep, blkif);
-
-	return ERR_PTR(-ENOMEM);
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
-			 unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,
+			 unsigned int evtchn, unsigned int ring_idx)
 {
 	int err;
+	struct xen_blkif *blkif;
+	char dev_name[64];
 
 	/* Already connected through? */
-	if (blkif->irq)
+	if (ring->irq)
 		return 0;
 
-	err = xenbus_map_ring_valloc(blkif->be->dev, shared_page, &blkif->blk_ring);
+	blkif = ring->blkif;
+
+	err = xenbus_map_ring_valloc(ring->blkif->be->dev, shared_page, &ring->blk_ring);
 	if (err < 0)
 		return err;
 
@@ -210,64 +239,73 @@ static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
 	case BLKIF_PROTOCOL_NATIVE:
 	{
 		struct blkif_sring *sring;
-		sring = (struct blkif_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
+		sring = (struct blkif_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.native, sring, PAGE_SIZE);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_32:
 	{
 		struct blkif_x86_32_sring *sring_x86_32;
-		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
+		sring_x86_32 = (struct blkif_x86_32_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_64:
 	{
 		struct blkif_x86_64_sring *sring_x86_64;
-		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
+		sring_x86_64 = (struct blkif_x86_64_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
 		break;
 	}
 	default:
 		BUG();
 	}
 
+	if (blkif->vbd.nr_supported_hw_queues)
+		snprintf(dev_name, 64, "blkif-backend-%d", ring_idx);
+	else
+		snprintf(dev_name, 64, "blkif-backend");
 	err = bind_interdomain_evtchn_to_irqhandler(blkif->domid, evtchn,
 						    xen_blkif_be_int, 0,
-						    "blkif-backend", blkif);
+						    dev_name, ring);
 	if (err < 0) {
-		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
-		blkif->blk_rings.common.sring = NULL;
+		xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+		ring->blk_rings.common.sring = NULL;
 		return err;
 	}
-	blkif->irq = err;
+	ring->irq = err;
 
 	return 0;
 }
 
 static int xen_blkif_disconnect(struct xen_blkif *blkif)
 {
-	if (blkif->xenblkd) {
-		kthread_stop(blkif->xenblkd);
-		wake_up(&blkif->shutdown_wq);
-		blkif->xenblkd = NULL;
-	}
+	int i;
+
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		struct xen_blkif_ring *ring = &blkif->rings[i];
+		if (ring->xenblkd) {
+			kthread_stop(ring->xenblkd);
+			wake_up(&ring->shutdown_wq);
+			ring->xenblkd = NULL;
+		}
 
-	/* The above kthread_stop() guarantees that at this point we
-	 * don't have any discard_io or other_io requests. So, checking
-	 * for inflight IO is enough.
-	 */
-	if (atomic_read(&blkif->inflight) > 0)
-		return -EBUSY;
+		/* The above kthread_stop() guarantees that at this point we
+		 * don't have any discard_io or other_io requests. So, checking
+		 * for inflight IO is enough.
+		 */
+		if (atomic_read(&ring->inflight) > 0)
+			return -EBUSY;
 
-	if (blkif->irq) {
-		unbind_from_irqhandler(blkif->irq, blkif);
-		blkif->irq = 0;
-	}
+		if (ring->irq) {
+			unbind_from_irqhandler(ring->irq, ring);
+			ring->irq = 0;
+		}
 
-	if (blkif->blk_rings.common.sring) {
-		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
-		blkif->blk_rings.common.sring = NULL;
+		if (ring->blk_rings.common.sring) {
+			xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+			ring->blk_rings.common.sring = NULL;
+		}
 	}
 
 	return 0;
@@ -275,40 +313,52 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 
 static void xen_blkif_free(struct xen_blkif *blkif)
 {
-	struct pending_req *req, *n;
-	int i = 0, j;
 
 	xen_blkif_disconnect(blkif);
 	xen_vbd_free(&blkif->vbd);
 
+	kfree(blkif->rings);
+
+	kmem_cache_free(xen_blkif_cachep, blkif);
+}
+
+static void xen_ring_free(struct xen_blkif_ring *ring)
+{
+	struct pending_req *req, *n;
+	int i, j;
+
 	/* Remove all persistent grants and the cache of ballooned pages. */
-	xen_blkbk_free_caches(blkif);
+	xen_blkbk_free_caches(ring);
 
 	/* Make sure everything is drained before shutting down */
-	BUG_ON(blkif->persistent_gnt_c != 0);
-	BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
-	BUG_ON(blkif->free_pages_num != 0);
-	BUG_ON(!list_empty(&blkif->persistent_purge_list));
-	BUG_ON(!list_empty(&blkif->free_pages));
-	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
-
+	BUG_ON(ring->persistent_gnt_c != 0);
+	BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
+	BUG_ON(ring->free_pages_num != 0);
+	BUG_ON(!list_empty(&ring->persistent_purge_list));
+	BUG_ON(!list_empty(&ring->free_pages));
+	BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+
+	i = 0;
 	/* Check that there is no request in use */
-	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
+	list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
 		list_del(&req->free_list);
-
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
+		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+			if (!req->segments[j])
+				break;
 			kfree(req->segments[j]);
-
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++)
+		}
+		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+			if (!req->indirect_pages[j])
+				break;
 			kfree(req->indirect_pages[j]);
-
+		}
 		kfree(req);
 		i++;
 	}
+	WARN_ON(i != XEN_RING_REQS(ring->blkif->nr_rings));
 
-	WARN_ON(i != XEN_BLKIF_REQS);
-
-	kmem_cache_free(xen_blkif_cachep, blkif);
+	if (atomic_dec_and_test(&ring->blkif->refcnt))
+		xen_blkif_free(ring->blkif);
 }
 
 int __init xen_blkif_interface_init(void)
@@ -333,6 +383,29 @@ int __init xen_blkif_interface_init(void)
 	{								\
 		struct xenbus_device *dev = to_xenbus_device(_dev);	\
 		struct backend_info *be = dev_get_drvdata(&dev->dev);	\
+		struct xen_blkif *blkif = be->blkif;			\
+		struct xen_blkif_ring *ring;				\
+		int i;							\
+									\
+		blkif->st_oo_req = 0;					\
+		blkif->st_rd_req = 0;					\
+		blkif->st_wr_req = 0;					\
+		blkif->st_f_req = 0;					\
+		blkif->st_ds_req = 0;					\
+		blkif->st_rd_sect = 0;					\
+		blkif->st_wr_sect = 0;					\
+		for (i = 0 ; i < blkif->nr_rings ; i++) {		\
+			ring = &blkif->rings[i];			\
+			spin_lock_irq(&ring->stats_lock);		\
+			blkif->st_oo_req += ring->st_oo_req;		\
+			blkif->st_rd_req += ring->st_rd_req;		\
+			blkif->st_wr_req += ring->st_wr_req;		\
+			blkif->st_f_req += ring->st_f_req;		\
+			blkif->st_ds_req += ring->st_ds_req;		\
+			blkif->st_rd_sect += ring->st_rd_sect;		\
+			blkif->st_wr_sect += ring->st_wr_sect;		\
+			spin_unlock_irq(&ring->stats_lock);		\
+		}							\
 									\
 		return sprintf(buf, format, ##args);			\
 	}								\
@@ -453,6 +526,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 		handle, blkif->domid);
 	return 0;
 }
+
 static int xen_blkbk_remove(struct xenbus_device *dev)
 {
 	struct backend_info *be = dev_get_drvdata(&dev->dev);
@@ -468,13 +542,14 @@ static int xen_blkbk_remove(struct xenbus_device *dev)
 		be->backend_watch.node = NULL;
 	}
 
-	dev_set_drvdata(&dev->dev, NULL);
-
 	if (be->blkif) {
+		int i = 0;
 		xen_blkif_disconnect(be->blkif);
-		xen_blkif_put(be->blkif);
+		for (; i < be->blkif->nr_rings ; i++)
+			xen_ring_put(&be->blkif->rings[i]);
 	}
 
+	dev_set_drvdata(&dev->dev, NULL);
 	kfree(be->mode);
 	kfree(be);
 	return 0;
@@ -851,21 +926,46 @@ again:
 static int connect_ring(struct backend_info *be)
 {
 	struct xenbus_device *dev = be->dev;
-	unsigned long ring_ref;
-	unsigned int evtchn;
+	struct xen_blkif *blkif = be->blkif;
+	unsigned long *ring_ref;
+	unsigned int *evtchn;
 	unsigned int pers_grants;
-	char protocol[64] = "";
-	int err;
+	char protocol[64] = "", ring_ref_s[64] = "", evtchn_s[64] = "";
+	int i, err;
 
 	DPRINTK("%s", dev->otherend);
 
-	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
-			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
-	if (err) {
-		xenbus_dev_fatal(dev, err,
-				 "reading %s/ring-ref and event-channel",
-				 dev->otherend);
-		return err;
+	blkif->nr_rings = 1;
+
+	ring_ref = kzalloc(sizeof(unsigned long) * blkif->nr_rings, GFP_KERNEL);
+	if (!ring_ref)
+		return -ENOMEM;
+	evtchn = kzalloc(sizeof(unsigned int) * blkif->nr_rings,
+			 GFP_KERNEL);
+	if (!evtchn) {
+		kfree(ring_ref);
+		return -ENOMEM;
+	}
+
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		if (blkif->vbd.nr_supported_hw_queues == 0) {
+			BUG_ON(i != 0);
+			/* Support old XenStore keys for compatibility */
+			snprintf(ring_ref_s, 64, "ring-ref");
+			snprintf(evtchn_s, 64, "event-channel");
+		} else {
+			snprintf(ring_ref_s, 64, "ring-ref-%d", i);
+			snprintf(evtchn_s, 64, "event-channel-%d", i);
+		}
+		err = xenbus_gather(XBT_NIL, dev->otherend,
+				    ring_ref_s, "%lu", &ring_ref[i],
+				    evtchn_s, "%u", &evtchn[i], NULL);
+		if (err) {
+			xenbus_dev_fatal(dev, err,
+					 "reading %s/%s and event-channel",
+					 dev->otherend, ring_ref_s);
+			goto fail;
+		}
 	}
 
 	be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
@@ -881,7 +981,8 @@ static int connect_ring(struct backend_info *be)
 		be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
 	else {
 		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
-		return -1;
+		err = -1;
+		goto fail;
 	}
 	err = xenbus_gather(XBT_NIL, dev->otherend,
 			    "feature-persistent", "%u",
@@ -892,19 +993,42 @@ static int connect_ring(struct backend_info *be)
 	be->blkif->vbd.feature_gnt_persistent = pers_grants;
 	be->blkif->vbd.overflow_max_grants = 0;
 
-	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
-		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
-		pers_grants ? "persistent grants" : "");
+	blkif->rings = xen_blkif_ring_alloc(blkif, blkif->nr_rings);
+	if (!blkif->rings) {
+		err = -ENOMEM;
+		goto fail;
+	}
+	/*
+	 * Enforce postcondition on number of allocated rings; note that if we are
+	 * resuming it might happen that nr_supported_hw_queues != nr_rings,
+	 * so we cannot rely on such a postcondition.
+	 */
+	BUG_ON(!blkif->vbd.nr_supported_hw_queues &&
+		blkif->nr_rings != 1);
 
-	/* Map the shared frame, irq etc. */
-	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
-	if (err) {
-		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
-				 ring_ref, evtchn);
-		return err;
+	for (i = 0; i < blkif->nr_rings ; i++) {
+		pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
+			ring_ref[i], evtchn[i], blkif->blk_protocol, protocol,
+			pers_grants ? "persistent grants" : "");
+
+		/* Map the shared frame, irq etc. */
+		err = xen_blkif_map(&blkif->rings[i], ring_ref[i], evtchn[i], i);
+		if (err) {
+			xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u of ring %d",
+					 ring_ref[i], evtchn[i], i);
+			goto fail;
+		}
 	}
 
+	kfree(ring_ref);
+	kfree(evtchn);
+
 	return 0;
+
+fail:
+	kfree(ring_ref);
+	kfree(evtchn);
+	return err;
 }
 
 
-- 
2.1.0
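
As a rough standalone illustration (not part of the patch) of the XenStore
key negotiation implemented in connect_ring() above: the backend keeps the
legacy "ring-ref"/"event-channel" keys when the frontend does not advertise
multiple hardware queues, and otherwise reads one "ring-ref-%d" plus
"event-channel-%d" pair per ring. A minimal user-space sketch of that
naming scheme (not kernel code):

/*
 * Sketch only: mirrors the per-ring key-name selection done in
 * connect_ring(). nr_supported_hw_queues == 0 means a legacy
 * single-ring frontend.
 */
#include <stdio.h>

static void ring_key_names(unsigned int nr_supported_hw_queues, int ring_idx,
			   char *ring_ref_s, char *evtchn_s, size_t len)
{
	if (nr_supported_hw_queues == 0) {
		/* Only ring 0 exists in this case (cf. the BUG_ON above). */
		snprintf(ring_ref_s, len, "ring-ref");
		snprintf(evtchn_s, len, "event-channel");
	} else {
		snprintf(ring_ref_s, len, "ring-ref-%d", ring_idx);
		snprintf(evtchn_s, len, "event-channel-%d", ring_idx);
	}
}

int main(void)
{
	char r[64], e[64];
	int i;

	/* Legacy frontend: single ring, old key names kept for compatibility. */
	ring_key_names(0, 0, r, e, sizeof(r));
	printf("legacy:  %s / %s\n", r, e);

	/* Multi-queue frontend advertising 4 hardware queues. */
	for (i = 0; i < 4; i++) {
		ring_key_names(4, i, r, e, sizeof(r));
		printf("ring %d: %s / %s\n", i, r, e);
	}
	return 0;
}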



* [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (6 preceding siblings ...)
  2014-09-11 23:57 ` [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-09-11 23:57 ` [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend Arianna Avanzini
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, axboe, felipe.franciosi, avanzini.arianna

This commit adds to xen-blkback support for mapping and making use
of a variable number of ring buffers. For now, the number of rings
to be mapped is forced to one.
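
The per-ring resource budgets introduced below split the existing global
ones: XEN_RING_REQS() divides XEN_BLKIF_REQS (32) across the rings with a
floor of 8 requests per ring, and XEN_RING_MAX_PGRANTS() divides the
xen_blkif_max_pgrants module parameter with a floor of 16 persistent
grants per ring. A standalone sketch of that arithmetic (not part of the
patch; the max_pgrants value used here is illustrative, not the driver
default):

#include <stdio.h>

#define XEN_BLKIF_REQS	32

static int ring_reqs(int nr_rings)
{
	int v = XEN_BLKIF_REQS / nr_rings;
	return v > 8 ? v : 8;		/* floor of 8 requests per ring */
}

static int ring_max_pgrants(int max_pgrants, int nr_rings)
{
	int v = max_pgrants / nr_rings;
	return v > 16 ? v : 16;		/* floor of 16 persistent grants per ring */
}

int main(void)
{
	const int max_pgrants = 256;	/* illustrative module-parameter value */
	int nr_rings;

	for (nr_rings = 1; nr_rings <= 8; nr_rings *= 2)
		printf("%d ring(s): %2d reqs/ring, %3d persistent grants/ring\n",
		       nr_rings, ring_reqs(nr_rings),
		       ring_max_pgrants(max_pgrants, nr_rings));
	return 0;
}

With the default of 32 requests, any configuration of four or more rings
ends up at the 8-request floor per ring.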

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkback/blkback.c | 377 ++++++++++++++++---------------
 drivers/block/xen-blkback/common.h  | 110 +++++----
 drivers/block/xen-blkback/xenbus.c  | 432 +++++++++++++++++++++++-------------
 3 files changed, 548 insertions(+), 371 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 64c60ed..b31acfb 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -80,6 +80,9 @@ module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
 MODULE_PARM_DESC(max_persistent_grants,
                  "Maximum number of grants to map persistently");
 
+#define XEN_RING_MAX_PGRANTS(nr_rings) \
+	(max((int)(xen_blkif_max_pgrants / nr_rings), 16))
+
 /*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
@@ -103,71 +106,71 @@ module_param(log_stats, int, 0644);
 /* Number of free pages to remove on each call to free_xenballooned_pages */
 #define NUM_BATCH_FREE_PAGES 10
 
-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page **page)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
-	if (list_empty(&blkif->free_pages)) {
-		BUG_ON(blkif->free_pages_num != 0);
-		spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
+	if (list_empty(&ring->free_pages)) {
+		BUG_ON(ring->free_pages_num != 0);
+		spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 		return alloc_xenballooned_pages(1, page, false);
 	}
-	BUG_ON(blkif->free_pages_num == 0);
-	page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+	BUG_ON(ring->free_pages_num == 0);
+	page[0] = list_first_entry(&ring->free_pages, struct page, lru);
 	list_del(&page[0]->lru);
-	blkif->free_pages_num--;
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	ring->free_pages_num--;
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 
 	return 0;
 }
 
-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
-                                  int num)
+static inline void put_free_pages(struct xen_blkif_ring *ring,
+				  struct page **page, int num)
 {
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
 	for (i = 0; i < num; i++)
-		list_add(&page[i]->lru, &blkif->free_pages);
-	blkif->free_pages_num += num;
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+		list_add(&page[i]->lru, &ring->free_pages);
+	ring->free_pages_num += num;
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 }
 
-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
 {
 	/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
 	struct page *page[NUM_BATCH_FREE_PAGES];
 	unsigned int num_pages = 0;
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->free_pages_lock, flags);
-	while (blkif->free_pages_num > num) {
-		BUG_ON(list_empty(&blkif->free_pages));
-		page[num_pages] = list_first_entry(&blkif->free_pages,
+	spin_lock_irqsave(&ring->free_pages_lock, flags);
+	while (ring->free_pages_num > num) {
+		BUG_ON(list_empty(&ring->free_pages));
+		page[num_pages] = list_first_entry(&ring->free_pages,
 		                                   struct page, lru);
 		list_del(&page[num_pages]->lru);
-		blkif->free_pages_num--;
+		ring->free_pages_num--;
 		if (++num_pages == NUM_BATCH_FREE_PAGES) {
-			spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+			spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 			free_xenballooned_pages(num_pages, page);
-			spin_lock_irqsave(&blkif->free_pages_lock, flags);
+			spin_lock_irqsave(&ring->free_pages_lock, flags);
 			num_pages = 0;
 		}
 	}
-	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	spin_unlock_irqrestore(&ring->free_pages_lock, flags);
 	if (num_pages != 0)
 		free_xenballooned_pages(num_pages, page);
 }
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req,
 				struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
 			  unsigned short op, int st);
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
@@ -188,19 +191,21 @@ static void make_response(struct xen_blkif *blkif, u64 id,
  * bit operations to modify the flags of a persistent grant and to count
  * the number of used grants.
  */
-static int add_persistent_gnt(struct xen_blkif *blkif,
+static int add_persistent_gnt(struct xen_blkif_ring *ring,
 			       struct persistent_gnt *persistent_gnt)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct rb_node **new = NULL, *parent = NULL;
 	struct persistent_gnt *this;
 
-	if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
+	if (ring->persistent_gnt_c >=
+		XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings)) {
 		if (!blkif->vbd.overflow_max_grants)
 			blkif->vbd.overflow_max_grants = 1;
 		return -EBUSY;
 	}
 	/* Figure out where to put new node */
-	new = &blkif->persistent_gnts.rb_node;
+	new = &ring->persistent_gnts.rb_node;
 	while (*new) {
 		this = container_of(*new, struct persistent_gnt, node);
 
@@ -219,19 +224,19 @@ static int add_persistent_gnt(struct xen_blkif *blkif,
 	set_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
 	/* Add new node and rebalance tree. */
 	rb_link_node(&(persistent_gnt->node), parent, new);
-	rb_insert_color(&(persistent_gnt->node), &blkif->persistent_gnts);
-	blkif->persistent_gnt_c++;
-	atomic_inc(&blkif->persistent_gnt_in_use);
+	rb_insert_color(&(persistent_gnt->node), &ring->persistent_gnts);
+	ring->persistent_gnt_c++;
+	atomic_inc(&ring->persistent_gnt_in_use);
 	return 0;
 }
 
-static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
+static struct persistent_gnt *get_persistent_gnt(struct xen_blkif_ring *ring,
 						 grant_ref_t gref)
 {
 	struct persistent_gnt *data;
 	struct rb_node *node = NULL;
 
-	node = blkif->persistent_gnts.rb_node;
+	node = ring->persistent_gnts.rb_node;
 	while (node) {
 		data = container_of(node, struct persistent_gnt, node);
 
@@ -245,25 +250,25 @@ static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
 				return NULL;
 			}
 			set_bit(PERSISTENT_GNT_ACTIVE, data->flags);
-			atomic_inc(&blkif->persistent_gnt_in_use);
+			atomic_inc(&ring->persistent_gnt_in_use);
 			return data;
 		}
 	}
 	return NULL;
 }
 
-static void put_persistent_gnt(struct xen_blkif *blkif,
+static void put_persistent_gnt(struct xen_blkif_ring *ring,
                                struct persistent_gnt *persistent_gnt)
 {
 	if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
 	          pr_alert_ratelimited(DRV_PFX " freeing a grant already unused");
 	set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
 	clear_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
-	atomic_dec(&blkif->persistent_gnt_in_use);
+	atomic_dec(&ring->persistent_gnt_in_use);
 }
 
-static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
-                                 unsigned int num)
+static void free_persistent_gnts(struct xen_blkif_ring *ring,
+				 struct rb_root *root, unsigned int num)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -288,7 +293,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			put_free_pages(blkif, pages, segs_to_unmap);
+			put_free_pages(ring, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 
@@ -305,10 +310,10 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct persistent_gnt *persistent_gnt;
 	int ret, segs_to_unmap = 0;
-	struct xen_blkif *blkif = container_of(work, typeof(*blkif), persistent_purge_work);
+	struct xen_blkif_ring *ring = container_of(work, typeof(*ring), persistent_purge_work);
 
-	while(!list_empty(&blkif->persistent_purge_list)) {
-		persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
+	while(!list_empty(&ring->persistent_purge_list)) {
+		persistent_gnt = list_first_entry(&ring->persistent_purge_list,
 		                                  struct persistent_gnt,
 		                                  remove_node);
 		list_del(&persistent_gnt->remove_node);
@@ -324,7 +329,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			put_free_pages(blkif, pages, segs_to_unmap);
+			put_free_pages(ring, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 		kfree(persistent_gnt);
@@ -332,34 +337,36 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
 	if (segs_to_unmap > 0) {
 		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
 		BUG_ON(ret);
-		put_free_pages(blkif, pages, segs_to_unmap);
+		put_free_pages(ring, pages, segs_to_unmap);
 	}
 }
 
-static void purge_persistent_gnt(struct xen_blkif *blkif)
+static void purge_persistent_gnt(struct xen_blkif_ring *ring)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct persistent_gnt *persistent_gnt;
 	struct rb_node *n;
 	unsigned int num_clean, total;
 	bool scan_used = false, clean_used = false;
 	struct rb_root *root;
+	unsigned nr_rings = ring->blkif->nr_rings;
 
-	if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
-	    (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
+	if (ring->persistent_gnt_c < XEN_RING_MAX_PGRANTS(nr_rings) ||
+	    (ring->persistent_gnt_c == XEN_RING_MAX_PGRANTS(nr_rings) &&
 	    !blkif->vbd.overflow_max_grants)) {
 		return;
 	}
 
-	if (work_pending(&blkif->persistent_purge_work)) {
+	if (work_pending(&ring->persistent_purge_work)) {
 		pr_alert_ratelimited(DRV_PFX "Scheduled work from previous purge is still pending, cannot purge list\n");
 		return;
 	}
 
-	num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
-	num_clean = blkif->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
-	num_clean = min(blkif->persistent_gnt_c, num_clean);
+	num_clean = (XEN_RING_MAX_PGRANTS(nr_rings) / 100) * LRU_PERCENT_CLEAN;
+	num_clean = ring->persistent_gnt_c - XEN_RING_MAX_PGRANTS(nr_rings) + num_clean;
+	num_clean = min(ring->persistent_gnt_c, num_clean);
 	if ((num_clean == 0) ||
-	    (num_clean > (blkif->persistent_gnt_c - atomic_read(&blkif->persistent_gnt_in_use))))
+	    (num_clean > (ring->persistent_gnt_c - atomic_read(&ring->persistent_gnt_in_use))))
 		return;
 
 	/*
@@ -375,8 +382,8 @@ static void purge_persistent_gnt(struct xen_blkif *blkif)
 
 	pr_debug(DRV_PFX "Going to purge %u persistent grants\n", num_clean);
 
-	BUG_ON(!list_empty(&blkif->persistent_purge_list));
-	root = &blkif->persistent_gnts;
+	BUG_ON(!list_empty(&ring->persistent_purge_list));
+	root = &ring->persistent_gnts;
 purge_list:
 	foreach_grant_safe(persistent_gnt, n, root, node) {
 		BUG_ON(persistent_gnt->handle ==
@@ -395,7 +402,7 @@ purge_list:
 
 		rb_erase(&persistent_gnt->node, root);
 		list_add(&persistent_gnt->remove_node,
-		         &blkif->persistent_purge_list);
+		         &ring->persistent_purge_list);
 		if (--num_clean == 0)
 			goto finished;
 	}
@@ -416,11 +423,11 @@ finished:
 		goto purge_list;
 	}
 
-	blkif->persistent_gnt_c -= (total - num_clean);
+	ring->persistent_gnt_c -= (total - num_clean);
 	blkif->vbd.overflow_max_grants = 0;
 
 	/* We can defer this work */
-	schedule_work(&blkif->persistent_purge_work);
+	schedule_work(&ring->persistent_purge_work);
 	pr_debug(DRV_PFX "Purged %u/%u\n", (total - num_clean), total);
 	return;
 }
@@ -428,18 +435,18 @@ finished:
 /*
  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
  */
-static struct pending_req *alloc_req(struct xen_blkif *blkif)
+static struct pending_req *alloc_req(struct xen_blkif_ring *ring)
 {
 	struct pending_req *req = NULL;
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkif->pending_free_lock, flags);
-	if (!list_empty(&blkif->pending_free)) {
-		req = list_entry(blkif->pending_free.next, struct pending_req,
+	spin_lock_irqsave(&ring->pending_free_lock, flags);
+	if (!list_empty(&ring->pending_free)) {
+		req = list_entry(ring->pending_free.next, struct pending_req,
 				 free_list);
 		list_del(&req->free_list);
 	}
-	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+	spin_unlock_irqrestore(&ring->pending_free_lock, flags);
 	return req;
 }
 
@@ -447,17 +454,17 @@ static struct pending_req *alloc_req(struct xen_blkif *blkif)
  * Return the 'pending_req' structure back to the freepool. We also
  * wake up the thread if it was waiting for a free page.
  */
-static void free_req(struct xen_blkif *blkif, struct pending_req *req)
+static void free_req(struct xen_blkif_ring *ring, struct pending_req *req)
 {
 	unsigned long flags;
 	int was_empty;
 
-	spin_lock_irqsave(&blkif->pending_free_lock, flags);
-	was_empty = list_empty(&blkif->pending_free);
-	list_add(&req->free_list, &blkif->pending_free);
-	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+	spin_lock_irqsave(&ring->pending_free_lock, flags);
+	was_empty = list_empty(&ring->pending_free);
+	list_add(&req->free_list, &ring->pending_free);
+	spin_unlock_irqrestore(&ring->pending_free_lock, flags);
 	if (was_empty)
-		wake_up(&blkif->pending_free_wq);
+		wake_up(&ring->pending_free_wq);
 }
 
 /*
@@ -537,10 +544,10 @@ abort:
 /*
  * Notification from the guest OS.
  */
-static void blkif_notify_work(struct xen_blkif *blkif)
+static void blkif_notify_work(struct xen_blkif_ring *ring)
 {
-	blkif->waiting_reqs = 1;
-	wake_up(&blkif->wq);
+	ring->waiting_reqs = 1;
+	wake_up(&ring->wq);
 }
 
 irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
@@ -553,30 +560,33 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
  * SCHEDULER FUNCTIONS
  */
 
-static void print_stats(struct xen_blkif *blkif)
+static void print_stats(struct xen_blkif_ring *ring)
 {
+	spin_lock_irq(&ring->stats_lock);
 	pr_info("xen-blkback (%s): oo %3llu  |  rd %4llu  |  wr %4llu  |  f %4llu"
 		 "  |  ds %4llu | pg: %4u/%4d\n",
-		 current->comm, blkif->st_oo_req,
-		 blkif->st_rd_req, blkif->st_wr_req,
-		 blkif->st_f_req, blkif->st_ds_req,
-		 blkif->persistent_gnt_c,
-		 xen_blkif_max_pgrants);
-	blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
-	blkif->st_rd_req = 0;
-	blkif->st_wr_req = 0;
-	blkif->st_oo_req = 0;
-	blkif->st_ds_req = 0;
+		 current->comm, ring->st_oo_req,
+		 ring->st_rd_req, ring->st_wr_req,
+		 ring->st_f_req, ring->st_ds_req,
+		 ring->persistent_gnt_c,
+		 XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings));
+	ring->st_print = jiffies + msecs_to_jiffies(10 * 1000);
+	ring->st_rd_req = 0;
+	ring->st_wr_req = 0;
+	ring->st_oo_req = 0;
+	ring->st_ds_req = 0;
+	spin_unlock_irq(&ring->stats_lock);
 }
 
 int xen_blkif_schedule(void *arg)
 {
-	struct xen_blkif *blkif = arg;
+	struct xen_blkif_ring *ring = arg;
+	struct xen_blkif *blkif = ring->blkif;
 	struct xen_vbd *vbd = &blkif->vbd;
 	unsigned long timeout;
 	int ret;
 
-	xen_blkif_get(blkif);
+	xen_ring_get(ring);
 
 	while (!kthread_should_stop()) {
 		if (try_to_freeze())
@@ -587,51 +597,51 @@ int xen_blkif_schedule(void *arg)
 		timeout = msecs_to_jiffies(LRU_INTERVAL);
 
 		timeout = wait_event_interruptible_timeout(
-			blkif->wq,
-			blkif->waiting_reqs || kthread_should_stop(),
+			ring->wq,
+			ring->waiting_reqs || kthread_should_stop(),
 			timeout);
 		if (timeout == 0)
 			goto purge_gnt_list;
 		timeout = wait_event_interruptible_timeout(
-			blkif->pending_free_wq,
-			!list_empty(&blkif->pending_free) ||
+			ring->pending_free_wq,
+			!list_empty(&ring->pending_free) ||
 			kthread_should_stop(),
 			timeout);
 		if (timeout == 0)
 			goto purge_gnt_list;
 
-		blkif->waiting_reqs = 0;
+		ring->waiting_reqs = 0;
 		smp_mb(); /* clear flag *before* checking for work */
 
-		ret = do_block_io_op(blkif);
+		ret = do_block_io_op(ring);
 		if (ret > 0)
-			blkif->waiting_reqs = 1;
+			ring->waiting_reqs = 1;
 		if (ret == -EACCES)
-			wait_event_interruptible(blkif->shutdown_wq,
+			wait_event_interruptible(ring->shutdown_wq,
 						 kthread_should_stop());
 
 purge_gnt_list:
 		if (blkif->vbd.feature_gnt_persistent &&
-		    time_after(jiffies, blkif->next_lru)) {
-			purge_persistent_gnt(blkif);
-			blkif->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
+		    time_after(jiffies, ring->next_lru)) {
+			purge_persistent_gnt(ring);
+			ring->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
 		}
 
 		/* Shrink if we have more than xen_blkif_max_buffer_pages */
-		shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
+		shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);
 
-		if (log_stats && time_after(jiffies, blkif->st_print))
-			print_stats(blkif);
+		if (log_stats && time_after(jiffies, ring->st_print))
+			print_stats(ring);
 	}
 
 	/* Drain pending purge work */
-	flush_work(&blkif->persistent_purge_work);
+	flush_work(&ring->persistent_purge_work);
 
 	if (log_stats)
-		print_stats(blkif);
+		print_stats(ring);
 
-	blkif->xenblkd = NULL;
-	xen_blkif_put(blkif);
+	ring->xenblkd = NULL;
+	xen_ring_put(ring);
 
 	return 0;
 }
@@ -639,25 +649,25 @@ purge_gnt_list:
 /*
  * Remove persistent grants and empty the pool of free pages
  */
-void xen_blkbk_free_caches(struct xen_blkif *blkif)
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring)
 {
 	/* Free all persistent grant pages */
-	if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
-		free_persistent_gnts(blkif, &blkif->persistent_gnts,
-			blkif->persistent_gnt_c);
+	if (!RB_EMPTY_ROOT(&ring->persistent_gnts))
+		free_persistent_gnts(ring, &ring->persistent_gnts,
+			ring->persistent_gnt_c);
 
-	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
-	blkif->persistent_gnt_c = 0;
+	BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+	ring->persistent_gnt_c = 0;
 
 	/* Since we are shutting down remove all pages from the buffer */
-	shrink_free_pagepool(blkif, 0 /* All */);
+	shrink_free_pagepool(ring, 0 /* All */);
 }
 
 /*
  * Unmap the grant references, and also remove the M2P over-rides
  * used in the 'pending_req'.
  */
-static void xen_blkbk_unmap(struct xen_blkif *blkif,
+static void xen_blkbk_unmap(struct xen_blkif_ring *ring,
                             struct grant_page *pages[],
                             int num)
 {
@@ -668,7 +678,7 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 
 	for (i = 0; i < num; i++) {
 		if (pages[i]->persistent_gnt != NULL) {
-			put_persistent_gnt(blkif, pages[i]->persistent_gnt);
+			put_persistent_gnt(ring, pages[i]->persistent_gnt);
 			continue;
 		}
 		if (pages[i]->handle == BLKBACK_INVALID_HANDLE)
@@ -681,21 +691,22 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 			ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
 			                        invcount);
 			BUG_ON(ret);
-			put_free_pages(blkif, unmap_pages, invcount);
+			put_free_pages(ring, unmap_pages, invcount);
 			invcount = 0;
 		}
 	}
 	if (invcount) {
 		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
 		BUG_ON(ret);
-		put_free_pages(blkif, unmap_pages, invcount);
+		put_free_pages(ring, unmap_pages, invcount);
 	}
 }
 
-static int xen_blkbk_map(struct xen_blkif *blkif,
+static int xen_blkbk_map(struct xen_blkif_ring *ring,
 			 struct grant_page *pages[],
 			 int num, bool ro)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct persistent_gnt *persistent_gnt = NULL;
@@ -719,7 +730,7 @@ again:
 
 		if (use_persistent_gnts)
 			persistent_gnt = get_persistent_gnt(
-				blkif,
+				ring,
 				pages[i]->gref);
 
 		if (persistent_gnt) {
@@ -730,7 +741,7 @@ again:
 			pages[i]->page = persistent_gnt->page;
 			pages[i]->persistent_gnt = persistent_gnt;
 		} else {
-			if (get_free_page(blkif, &pages[i]->page))
+			if (get_free_page(ring, &pages[i]->page))
 				goto out_of_memory;
 			addr = vaddr(pages[i]->page);
 			pages_to_gnt[segs_to_map] = pages[i]->page;
@@ -772,7 +783,8 @@ again:
 			continue;
 		}
 		if (use_persistent_gnts &&
-		    blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
+		    ring->persistent_gnt_c <
+			XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings)) {
 			/*
 			 * We are using persistent grants, the grant is
 			 * not mapped but we might have room for it.
@@ -790,7 +802,7 @@ again:
 			persistent_gnt->gnt = map[new_map_idx].ref;
 			persistent_gnt->handle = map[new_map_idx].handle;
 			persistent_gnt->page = pages[seg_idx]->page;
-			if (add_persistent_gnt(blkif,
+			if (add_persistent_gnt(ring,
 			                       persistent_gnt)) {
 				kfree(persistent_gnt);
 				persistent_gnt = NULL;
@@ -798,8 +810,8 @@ again:
 			}
 			pages[seg_idx]->persistent_gnt = persistent_gnt;
 			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
-				 persistent_gnt->gnt, blkif->persistent_gnt_c,
-				 xen_blkif_max_pgrants);
+				 persistent_gnt->gnt, ring->persistent_gnt_c,
+				 XEN_RING_MAX_PGRANTS(ring->blkif->nr_rings));
 			goto next;
 		}
 		if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
@@ -823,7 +835,7 @@ next:
 
 out_of_memory:
 	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
-	put_free_pages(blkif, pages_to_gnt, segs_to_map);
+	put_free_pages(ring, pages_to_gnt, segs_to_map);
 	return -ENOMEM;
 }
 
@@ -831,7 +843,7 @@ static int xen_blkbk_map_seg(struct pending_req *pending_req)
 {
 	int rc;
 
-	rc = xen_blkbk_map(pending_req->blkif, pending_req->segments,
+	rc = xen_blkbk_map(pending_req->ring, pending_req->segments,
 			   pending_req->nr_pages,
 	                   (pending_req->operation != BLKIF_OP_READ));
 
@@ -844,7 +856,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 				    struct phys_req *preq)
 {
 	struct grant_page **pages = pending_req->indirect_pages;
-	struct xen_blkif *blkif = pending_req->blkif;
+	struct xen_blkif_ring *ring = pending_req->ring;
 	int indirect_grefs, rc, n, nseg, i;
 	struct blkif_request_segment *segments = NULL;
 
@@ -855,7 +867,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 	for (i = 0; i < indirect_grefs; i++)
 		pages[i]->gref = req->u.indirect.indirect_grefs[i];
 
-	rc = xen_blkbk_map(blkif, pages, indirect_grefs, true);
+	rc = xen_blkbk_map(ring, pages, indirect_grefs, true);
 	if (rc)
 		goto unmap;
 
@@ -882,20 +894,21 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
 unmap:
 	if (segments)
 		kunmap_atomic(segments);
-	xen_blkbk_unmap(blkif, pages, indirect_grefs);
+	xen_blkbk_unmap(ring, pages, indirect_grefs);
 	return rc;
 }
 
-static int dispatch_discard_io(struct xen_blkif *blkif,
+static int dispatch_discard_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req)
 {
 	int err = 0;
 	int status = BLKIF_RSP_OKAY;
+	struct xen_blkif *blkif = ring->blkif;
 	struct block_device *bdev = blkif->vbd.bdev;
 	unsigned long secure;
 	struct phys_req preq;
 
-	xen_blkif_get(blkif);
+	xen_ring_get(ring);
 
 	preq.sector_number = req->u.discard.sector_number;
 	preq.nr_sects      = req->u.discard.nr_sectors;
@@ -907,7 +920,9 @@ static int dispatch_discard_io(struct xen_blkif *blkif,
 			preq.sector_number + preq.nr_sects, blkif->vbd.pdevice);
 		goto fail_response;
 	}
-	blkif->st_ds_req++;
+	spin_lock_irq(&ring->stats_lock);
+	ring->st_ds_req++;
+	spin_unlock_irq(&ring->stats_lock);
 
 	secure = (blkif->vbd.discard_secure &&
 		 (req->u.discard.flag & BLKIF_DISCARD_SECURE)) ?
@@ -923,26 +938,27 @@ fail_response:
 	} else if (err)
 		status = BLKIF_RSP_ERROR;
 
-	make_response(blkif, req->u.discard.id, req->operation, status);
-	xen_blkif_put(blkif);
+	make_response(ring, req->u.discard.id, req->operation, status);
+	xen_ring_put(ring);
 	return err;
 }
 
-static int dispatch_other_io(struct xen_blkif *blkif,
+static int dispatch_other_io(struct xen_blkif_ring *ring,
 			     struct blkif_request *req,
 			     struct pending_req *pending_req)
 {
-	free_req(blkif, pending_req);
-	make_response(blkif, req->u.other.id, req->operation,
+	free_req(ring, pending_req);
+	make_response(ring, req->u.other.id, req->operation,
 		      BLKIF_RSP_EOPNOTSUPP);
 	return -EIO;
 }
 
-static void xen_blk_drain_io(struct xen_blkif *blkif)
+static void xen_blk_drain_io(struct xen_blkif_ring *ring)
 {
+	struct xen_blkif *blkif = ring->blkif;
 	atomic_set(&blkif->drain, 1);
 	do {
-		if (atomic_read(&blkif->inflight) == 0)
+		if (atomic_read(&ring->inflight) == 0)
 			break;
 		wait_for_completion_interruptible_timeout(
 				&blkif->drain_complete, HZ);
@@ -963,12 +979,12 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	if ((pending_req->operation == BLKIF_OP_FLUSH_DISKCACHE) &&
 	    (error == -EOPNOTSUPP)) {
 		pr_debug(DRV_PFX "flush diskcache op failed, not supported\n");
-		xen_blkbk_flush_diskcache(XBT_NIL, pending_req->blkif->be, 0);
+		xen_blkbk_flush_diskcache(XBT_NIL, pending_req->ring->blkif->be, 0);
 		pending_req->status = BLKIF_RSP_EOPNOTSUPP;
 	} else if ((pending_req->operation == BLKIF_OP_WRITE_BARRIER) &&
 		    (error == -EOPNOTSUPP)) {
 		pr_debug(DRV_PFX "write barrier op failed, not supported\n");
-		xen_blkbk_barrier(XBT_NIL, pending_req->blkif->be, 0);
+		xen_blkbk_barrier(XBT_NIL, pending_req->ring->blkif->be, 0);
 		pending_req->status = BLKIF_RSP_EOPNOTSUPP;
 	} else if (error) {
 		pr_debug(DRV_PFX "Buffer not up-to-date at end of operation,"
@@ -982,14 +998,15 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	 * the proper response on the ring.
 	 */
 	if (atomic_dec_and_test(&pending_req->pendcnt)) {
-		struct xen_blkif *blkif = pending_req->blkif;
+		struct xen_blkif_ring *ring = pending_req->ring;
+		struct xen_blkif *blkif = ring->blkif;
 
-		xen_blkbk_unmap(blkif,
+		xen_blkbk_unmap(ring,
 		                pending_req->segments,
 		                pending_req->nr_pages);
-		make_response(blkif, pending_req->id,
+		make_response(ring, pending_req->id,
 			      pending_req->operation, pending_req->status);
-		free_req(blkif, pending_req);
+		free_req(ring, pending_req);
 		/*
 		 * Make sure the request is freed before releasing blkif,
 		 * or there could be a race between free_req and the
@@ -1002,10 +1019,10 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 		 * pending_free_wq if there's a drain going on, but it has
 		 * to be taken into account if the current model is changed.
 		 */
-		if (atomic_dec_and_test(&blkif->inflight) && atomic_read(&blkif->drain)) {
+		if (atomic_dec_and_test(&ring->inflight) && atomic_read(&blkif->drain)) {
 			complete(&blkif->drain_complete);
 		}
-		xen_blkif_put(blkif);
+		xen_ring_put(ring);
 	}
 }
 
@@ -1026,9 +1043,10 @@ static void end_block_io_op(struct bio *bio, int error)
  * and transmute  it to the block API to hand it over to the proper block disk.
  */
 static int
-__do_block_io_op(struct xen_blkif *blkif)
+__do_block_io_op(struct xen_blkif_ring *ring)
 {
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
+	struct xen_blkif *blkif = ring->blkif;
 	struct blkif_request req;
 	struct pending_req *pending_req;
 	RING_IDX rc, rp;
@@ -1054,9 +1072,11 @@ __do_block_io_op(struct xen_blkif *blkif)
 			break;
 		}
 
-		pending_req = alloc_req(blkif);
+		pending_req = alloc_req(ring);
 		if (NULL == pending_req) {
-			blkif->st_oo_req++;
+			spin_lock_irq(&ring->stats_lock);
+			ring->st_oo_req++;
+			spin_unlock_irq(&ring->stats_lock);
 			more_to_do = 1;
 			break;
 		}
@@ -1085,16 +1105,16 @@ __do_block_io_op(struct xen_blkif *blkif)
 		case BLKIF_OP_WRITE_BARRIER:
 		case BLKIF_OP_FLUSH_DISKCACHE:
 		case BLKIF_OP_INDIRECT:
-			if (dispatch_rw_block_io(blkif, &req, pending_req))
+			if (dispatch_rw_block_io(ring, &req, pending_req))
 				goto done;
 			break;
 		case BLKIF_OP_DISCARD:
-			free_req(blkif, pending_req);
-			if (dispatch_discard_io(blkif, &req))
+			free_req(ring, pending_req);
+			if (dispatch_discard_io(ring, &req))
 				goto done;
 			break;
 		default:
-			if (dispatch_other_io(blkif, &req, pending_req))
+			if (dispatch_other_io(ring, &req, pending_req))
 				goto done;
 			break;
 		}
@@ -1107,13 +1127,13 @@ done:
 }
 
 static int
-do_block_io_op(struct xen_blkif *blkif)
+do_block_io_op(struct xen_blkif_ring *ring)
 {
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
 	int more_to_do;
 
 	do {
-		more_to_do = __do_block_io_op(blkif);
+		more_to_do = __do_block_io_op(ring);
 		if (more_to_do)
 			break;
 
@@ -1126,7 +1146,7 @@ do_block_io_op(struct xen_blkif *blkif)
  * Transmutation of the 'struct blkif_request' to a proper 'struct bio'
  * and call the 'submit_bio' to pass it to the underlying storage.
  */
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 				struct blkif_request *req,
 				struct pending_req *pending_req)
 {
@@ -1140,6 +1160,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	struct blk_plug plug;
 	bool drain = false;
 	struct grant_page **pages = pending_req->segments;
+	struct xen_blkif *blkif = ring->blkif;
 	unsigned short req_operation;
 
 	req_operation = req->operation == BLKIF_OP_INDIRECT ?
@@ -1152,26 +1173,29 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		goto fail_response;
 	}
 
+	spin_lock_irq(&ring->stats_lock);
 	switch (req_operation) {
 	case BLKIF_OP_READ:
-		blkif->st_rd_req++;
+		ring->st_rd_req++;
 		operation = READ;
 		break;
 	case BLKIF_OP_WRITE:
-		blkif->st_wr_req++;
+		ring->st_wr_req++;
 		operation = WRITE_ODIRECT;
 		break;
 	case BLKIF_OP_WRITE_BARRIER:
 		drain = true;
 	case BLKIF_OP_FLUSH_DISKCACHE:
-		blkif->st_f_req++;
+		ring->st_f_req++;
 		operation = WRITE_FLUSH;
 		break;
 	default:
 		operation = 0; /* make gcc happy */
+		spin_unlock_irq(&ring->stats_lock);
 		goto fail_response;
 		break;
 	}
+	spin_unlock_irq(&ring->stats_lock);
 
 	/* Check that the number of segments is sane. */
 	nseg = req->operation == BLKIF_OP_INDIRECT ?
@@ -1190,7 +1214,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 
 	preq.nr_sects      = 0;
 
-	pending_req->blkif     = blkif;
+	pending_req->ring      = ring;
 	pending_req->id        = req->u.rw.id;
 	pending_req->operation = req_operation;
 	pending_req->status    = BLKIF_RSP_OKAY;
@@ -1243,7 +1267,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	 * issue the WRITE_FLUSH.
 	 */
 	if (drain)
-		xen_blk_drain_io(pending_req->blkif);
+		xen_blk_drain_io(pending_req->ring);
 
 	/*
 	 * If we have failed at this point, we need to undo the M2P override,
@@ -1255,11 +1279,11 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		goto fail_flush;
 
 	/*
-	 * This corresponding xen_blkif_put is done in __end_block_io_op, or
+	 * This corresponding xen_ring_put is done in __end_block_io_op, or
 	 * below (in "!bio") if we are handling a BLKIF_OP_DISCARD.
 	 */
-	xen_blkif_get(blkif);
-	atomic_inc(&blkif->inflight);
+	xen_ring_get(ring);
+	atomic_inc(&ring->inflight);
 
 	for (i = 0; i < nseg; i++) {
 		while ((bio == NULL) ||
@@ -1306,20 +1330,22 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	/* Let the I/Os go.. */
 	blk_finish_plug(&plug);
 
+	spin_lock_irq(&ring->stats_lock);
 	if (operation == READ)
-		blkif->st_rd_sect += preq.nr_sects;
+		ring->st_rd_sect += preq.nr_sects;
 	else if (operation & WRITE)
-		blkif->st_wr_sect += preq.nr_sects;
+		ring->st_wr_sect += preq.nr_sects;
+	spin_unlock_irq(&ring->stats_lock);
 
 	return 0;
 
  fail_flush:
-	xen_blkbk_unmap(blkif, pending_req->segments,
+	xen_blkbk_unmap(ring, pending_req->segments,
 	                pending_req->nr_pages);
  fail_response:
 	/* Haven't submitted any bio's yet. */
-	make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
-	free_req(blkif, pending_req);
+	make_response(ring, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
+	free_req(ring, pending_req);
 	msleep(1); /* back off a bit */
 	return -EIO;
 
@@ -1337,19 +1363,20 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 /*
  * Put a response on the ring on how the operation fared.
  */
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
 			  unsigned short op, int st)
 {
 	struct blkif_response  resp;
 	unsigned long     flags;
-	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	union blkif_back_rings *blk_rings = &ring->blk_rings;
+	struct xen_blkif *blkif = ring->blkif;
 	int notify;
 
 	resp.id        = id;
 	resp.operation = op;
 	resp.status    = st;
 
-	spin_lock_irqsave(&blkif->blk_ring_lock, flags);
+	spin_lock_irqsave(&ring->blk_ring_lock, flags);
 	/* Place on the response ring for the relevant domain. */
 	switch (blkif->blk_protocol) {
 	case BLKIF_PROTOCOL_NATIVE:
@@ -1369,9 +1396,9 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	}
 	blk_rings->common.rsp_prod_pvt++;
 	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify);
-	spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);
+	spin_unlock_irqrestore(&ring->blk_ring_lock, flags);
 	if (notify)
-		notify_remote_via_irq(blkif->irq);
+		notify_remote_via_irq(ring->irq);
 }
 
 static int __init xen_blkif_init(void)
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index f65b807..6f074ce 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -226,6 +226,7 @@ struct xen_vbd {
 	struct block_device	*bdev;
 	/* Cached size parameter. */
 	sector_t		size;
+	unsigned int		nr_supported_hw_queues;
 	unsigned int		flush_support:1;
 	unsigned int		discard_secure:1;
 	unsigned int		feature_gnt_persistent:1;
@@ -246,6 +247,7 @@ struct backend_info;
 
 /* Number of requests that we can fit in a ring */
 #define XEN_BLKIF_REQS			32
+#define XEN_RING_REQS(nr_rings)	(max((int)(XEN_BLKIF_REQS / nr_rings), 8))
 
 struct persistent_gnt {
 	struct page *page;
@@ -256,32 +258,29 @@ struct persistent_gnt {
 	struct list_head remove_node;
 };
 
-struct xen_blkif {
-	/* Unique identifier for this interface. */
-	domid_t			domid;
-	unsigned int		handle;
+struct xen_blkif_ring {
+	union blkif_back_rings	blk_rings;
 	/* Physical parameters of the comms window. */
 	unsigned int		irq;
-	/* Comms information. */
-	enum blkif_protocol	blk_protocol;
-	union blkif_back_rings	blk_rings;
-	void			*blk_ring;
-	/* The VBD attached to this interface. */
-	struct xen_vbd		vbd;
-	/* Back pointer to the backend_info. */
-	struct backend_info	*be;
-	/* Private fields. */
-	spinlock_t		blk_ring_lock;
-	atomic_t		refcnt;
 
 	wait_queue_head_t	wq;
-	/* for barrier (drain) requests */
-	struct completion	drain_complete;
-	atomic_t		drain;
-	atomic_t		inflight;
 	/* One thread per one blkif. */
 	struct task_struct	*xenblkd;
 	unsigned int		waiting_reqs;
+	void			*blk_ring;
+	spinlock_t		blk_ring_lock;
+
+	struct work_struct	free_work;
+	/* Thread shutdown wait queue. */
+	wait_queue_head_t	shutdown_wq;
+
+	/* buffer of free pages to map grant refs */
+	spinlock_t		free_pages_lock;
+	int			free_pages_num;
+
+	/* used by the kworker that offload work from the persistent purge */
+	struct list_head	persistent_purge_list;
+	struct work_struct	persistent_purge_work;
 
 	/* tree to store persistent grants */
 	struct rb_root		persistent_gnts;
@@ -289,13 +288,6 @@ struct xen_blkif {
 	atomic_t		persistent_gnt_in_use;
 	unsigned long           next_lru;
 
-	/* used by the kworker that offload work from the persistent purge */
-	struct list_head	persistent_purge_list;
-	struct work_struct	persistent_purge_work;
-
-	/* buffer of free pages to map grant refs */
-	spinlock_t		free_pages_lock;
-	int			free_pages_num;
 	struct list_head	free_pages;
 
 	/* List of all 'pending_req' available */
@@ -303,20 +295,54 @@ struct xen_blkif {
 	/* And its spinlock. */
 	spinlock_t		pending_free_lock;
 	wait_queue_head_t	pending_free_wq;
+	atomic_t		inflight;
+
+	/* Private fields. */
+	atomic_t		refcnt;
+
+	struct xen_blkif	*blkif;
+	unsigned		ring_index;
 
+	spinlock_t		stats_lock;
 	/* statistics */
 	unsigned long		st_print;
-	unsigned long long			st_rd_req;
-	unsigned long long			st_wr_req;
-	unsigned long long			st_oo_req;
-	unsigned long long			st_f_req;
-	unsigned long long			st_ds_req;
-	unsigned long long			st_rd_sect;
-	unsigned long long			st_wr_sect;
+	unsigned long long	st_rd_req;
+	unsigned long long	st_wr_req;
+	unsigned long long	st_oo_req;
+	unsigned long long	st_f_req;
+	unsigned long long	st_ds_req;
+	unsigned long long	st_rd_sect;
+	unsigned long long	st_wr_sect;
+};
 
-	struct work_struct	free_work;
-	/* Thread shutdown wait queue. */
-	wait_queue_head_t	shutdown_wq;
+struct xen_blkif {
+	/* Unique identifier for this interface. */
+	domid_t			domid;
+	unsigned int		handle;
+	/* Comms information. */
+	enum blkif_protocol	blk_protocol;
+	/* The VBD attached to this interface. */
+	struct xen_vbd		vbd;
+	/* Rings for this device */
+	struct xen_blkif_ring	*rings;
+	unsigned int		nr_rings;
+	/* Back pointer to the backend_info. */
+	struct backend_info	*be;
+
+	/* for barrier (drain) requests */
+	struct completion	drain_complete;
+	atomic_t		drain;
+
+	atomic_t		refcnt;
+
+	/* statistics */
+	unsigned long long	st_rd_req;
+	unsigned long long	st_wr_req;
+	unsigned long long	st_oo_req;
+	unsigned long long	st_f_req;
+	unsigned long long	st_ds_req;
+	unsigned long long	st_rd_sect;
+	unsigned long long	st_wr_sect;
 };
 
 struct seg_buf {
@@ -338,7 +364,7 @@ struct grant_page {
  * response queued for it, with the saved 'id' passed back.
  */
 struct pending_req {
-	struct xen_blkif	*blkif;
+	struct xen_blkif_ring	*ring;
 	u64			id;
 	int			nr_pages;
 	atomic_t		pendcnt;
@@ -357,11 +383,11 @@ struct pending_req {
 			 (_v)->bdev->bd_part->nr_sects : \
 			  get_capacity((_v)->bdev->bd_disk))
 
-#define xen_blkif_get(_b) (atomic_inc(&(_b)->refcnt))
-#define xen_blkif_put(_b)				\
+#define xen_ring_get(_r) (atomic_inc(&(_r)->refcnt))
+#define xen_ring_put(_r)				\
 	do {						\
-		if (atomic_dec_and_test(&(_b)->refcnt))	\
-			schedule_work(&(_b)->free_work);\
+		if (atomic_dec_and_test(&(_r)->refcnt))	\
+			schedule_work(&(_r)->free_work);\
 	} while (0)
 
 struct phys_req {
@@ -377,7 +403,7 @@ int xen_blkif_xenbus_init(void);
 irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
 int xen_blkif_schedule(void *arg);
 int xen_blkif_purge_persistent(void *arg);
-void xen_blkbk_free_caches(struct xen_blkif *blkif);
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring);
 
 int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
 			      struct backend_info *be, int state);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 3a8b810..a4f13cc 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -35,7 +35,7 @@ static void connect(struct backend_info *);
 static int connect_ring(struct backend_info *);
 static void backend_changed(struct xenbus_watch *, const char **,
 			    unsigned int);
-static void xen_blkif_free(struct xen_blkif *blkif);
+static void xen_ring_free(struct xen_blkif_ring *ring);
 static void xen_vbd_free(struct xen_vbd *vbd);
 
 struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
@@ -45,17 +45,17 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
 
 /*
  * The last request could free the device from softirq context and
- * xen_blkif_free() can sleep.
+ * xen_ring_free() can sleep.
  */
-static void xen_blkif_deferred_free(struct work_struct *work)
+static void xen_ring_deferred_free(struct work_struct *work)
 {
-	struct xen_blkif *blkif;
+	struct xen_blkif_ring *ring;
 
-	blkif = container_of(work, struct xen_blkif, free_work);
-	xen_blkif_free(blkif);
+	ring = container_of(work, struct xen_blkif_ring, free_work);
+	xen_ring_free(ring);
 }
 
-static int blkback_name(struct xen_blkif *blkif, char *buf)
+static int blkback_name(struct xen_blkif *blkif, char *buf, bool save_space)
 {
 	char *devpath, *devname;
 	struct xenbus_device *dev = blkif->be->dev;
@@ -70,7 +70,10 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
 	else
 		devname  = devpath;
 
-	snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
+	if (save_space)
+		snprintf(buf, TASK_COMM_LEN, "blkbk.%d.%s", blkif->domid, devname);
+	else
+		snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
 	kfree(devpath);
 
 	return 0;
@@ -78,11 +81,15 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
 
 static void xen_update_blkif_status(struct xen_blkif *blkif)
 {
-	int err;
-	char name[TASK_COMM_LEN];
+	int i, err;
+	char name[TASK_COMM_LEN], per_ring_name[TASK_COMM_LEN];
+	struct xen_blkif_ring *ring;
 
-	/* Not ready to connect? */
-	if (!blkif->irq || !blkif->vbd.bdev)
+	/*
+	 * Not ready to connect? Check irq of first ring as the others
+	 * should all be the same.
+	 */
+	if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
 		return;
 
 	/* Already connected? */
@@ -94,7 +101,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 	if (blkif->be->dev->state != XenbusStateConnected)
 		return;
 
-	err = blkback_name(blkif, name);
+	err = blkback_name(blkif, name, blkif->vbd.nr_supported_hw_queues);
 	if (err) {
 		xenbus_dev_error(blkif->be->dev, err, "get blkback dev name");
 		return;
@@ -107,20 +114,96 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 	}
 	invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
 
-	blkif->xenblkd = kthread_run(xen_blkif_schedule, blkif, "%s", name);
-	if (IS_ERR(blkif->xenblkd)) {
-		err = PTR_ERR(blkif->xenblkd);
-		blkif->xenblkd = NULL;
-		xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
-		return;
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		ring = &blkif->rings[i];
+		if (blkif->vbd.nr_supported_hw_queues)
+			snprintf(per_ring_name, TASK_COMM_LEN, "%s-%d", name, i);
+		else {
+			BUG_ON(i != 0);
+			snprintf(per_ring_name, TASK_COMM_LEN, "%s", name);
+		}
+		ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s", per_ring_name);
+		if (IS_ERR(ring->xenblkd)) {
+			err = PTR_ERR(ring->xenblkd);
+			ring->xenblkd = NULL;
+			xenbus_dev_error(blkif->be->dev, err, "start %s", per_ring_name);
+			return;
+		}
 	}
 }
 
+static struct xen_blkif_ring *xen_blkif_ring_alloc(struct xen_blkif *blkif,
+						   int nr_rings)
+{
+	int r, i, j;
+	struct xen_blkif_ring *rings;
+	struct pending_req *req;
+
+	rings = kzalloc(nr_rings * sizeof(struct xen_blkif_ring),
+			GFP_KERNEL);
+	if (!rings)
+		return NULL;
+
+	for (r = 0 ; r < nr_rings ; r++) {
+		struct xen_blkif_ring *ring = &rings[r];
+
+		spin_lock_init(&ring->blk_ring_lock);
+
+		init_waitqueue_head(&ring->wq);
+		init_waitqueue_head(&ring->shutdown_wq);
+
+		ring->persistent_gnts.rb_node = NULL;
+		spin_lock_init(&ring->free_pages_lock);
+		INIT_LIST_HEAD(&ring->free_pages);
+		INIT_LIST_HEAD(&ring->persistent_purge_list);
+		ring->free_pages_num = 0;
+		atomic_set(&ring->persistent_gnt_in_use, 0);
+		atomic_set(&ring->refcnt, 1);
+		atomic_set(&ring->inflight, 0);
+		INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
+		spin_lock_init(&ring->pending_free_lock);
+		init_waitqueue_head(&ring->pending_free_wq);
+		INIT_LIST_HEAD(&ring->pending_free);
+		for (i = 0; i < XEN_RING_REQS(nr_rings); i++) {
+			req = kzalloc(sizeof(*req), GFP_KERNEL);
+			if (!req)
+				goto fail;
+			list_add_tail(&req->free_list,
+				      &ring->pending_free);
+			for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+				req->segments[j] = kzalloc(sizeof(*req->segments[0]),
+				                           GFP_KERNEL);
+				if (!req->segments[j])
+					goto fail;
+			}
+			for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+				req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
+				                                 GFP_KERNEL);
+				if (!req->indirect_pages[j])
+					goto fail;
+			}
+		}
+
+		INIT_WORK(&ring->free_work, xen_ring_deferred_free);
+		ring->blkif = blkif;
+		ring->ring_index = r;
+
+		spin_lock_init(&ring->stats_lock);
+		ring->st_print = jiffies;
+
+		atomic_inc(&blkif->refcnt);
+	}
+
+	return rings;
+
+fail:
+	kfree(rings);
+	return NULL;
+}
+
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
 	struct xen_blkif *blkif;
-	struct pending_req *req, *n;
-	int i, j;
 
 	BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
 
@@ -129,80 +212,26 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 		return ERR_PTR(-ENOMEM);
 
 	blkif->domid = domid;
-	spin_lock_init(&blkif->blk_ring_lock);
-	atomic_set(&blkif->refcnt, 1);
-	init_waitqueue_head(&blkif->wq);
 	init_completion(&blkif->drain_complete);
 	atomic_set(&blkif->drain, 0);
-	blkif->st_print = jiffies;
-	blkif->persistent_gnts.rb_node = NULL;
-	spin_lock_init(&blkif->free_pages_lock);
-	INIT_LIST_HEAD(&blkif->free_pages);
-	INIT_LIST_HEAD(&blkif->persistent_purge_list);
-	blkif->free_pages_num = 0;
-	atomic_set(&blkif->persistent_gnt_in_use, 0);
-	atomic_set(&blkif->inflight, 0);
-	INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
-
-	INIT_LIST_HEAD(&blkif->pending_free);
-	INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
-
-	for (i = 0; i < XEN_BLKIF_REQS; i++) {
-		req = kzalloc(sizeof(*req), GFP_KERNEL);
-		if (!req)
-			goto fail;
-		list_add_tail(&req->free_list,
-		              &blkif->pending_free);
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-			req->segments[j] = kzalloc(sizeof(*req->segments[0]),
-			                           GFP_KERNEL);
-			if (!req->segments[j])
-				goto fail;
-		}
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-			req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
-			                                 GFP_KERNEL);
-			if (!req->indirect_pages[j])
-				goto fail;
-		}
-	}
-	spin_lock_init(&blkif->pending_free_lock);
-	init_waitqueue_head(&blkif->pending_free_wq);
-	init_waitqueue_head(&blkif->shutdown_wq);
 
 	return blkif;
-
-fail:
-	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
-		list_del(&req->free_list);
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
-			if (!req->segments[j])
-				break;
-			kfree(req->segments[j]);
-		}
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
-			if (!req->indirect_pages[j])
-				break;
-			kfree(req->indirect_pages[j]);
-		}
-		kfree(req);
-	}
-
-	kmem_cache_free(xen_blkif_cachep, blkif);
-
-	return ERR_PTR(-ENOMEM);
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
-			 unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,
+			 unsigned int evtchn, unsigned int ring_idx)
 {
 	int err;
+	struct xen_blkif *blkif;
+	char dev_name[64];
 
 	/* Already connected through? */
-	if (blkif->irq)
+	if (ring->irq)
 		return 0;
 
-	err = xenbus_map_ring_valloc(blkif->be->dev, shared_page, &blkif->blk_ring);
+	blkif = ring->blkif;
+
+	err = xenbus_map_ring_valloc(ring->blkif->be->dev, shared_page, &ring->blk_ring);
 	if (err < 0)
 		return err;
 
@@ -210,64 +239,73 @@ static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
 	case BLKIF_PROTOCOL_NATIVE:
 	{
 		struct blkif_sring *sring;
-		sring = (struct blkif_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
+		sring = (struct blkif_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.native, sring, PAGE_SIZE);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_32:
 	{
 		struct blkif_x86_32_sring *sring_x86_32;
-		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
+		sring_x86_32 = (struct blkif_x86_32_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_64:
 	{
 		struct blkif_x86_64_sring *sring_x86_64;
-		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
+		sring_x86_64 = (struct blkif_x86_64_sring *)ring->blk_ring;
+		BACK_RING_INIT(&ring->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
 		break;
 	}
 	default:
 		BUG();
 	}
 
+	if (blkif->vbd.nr_supported_hw_queues)
+		snprintf(dev_name, 64, "blkif-backend-%d", ring_idx);
+	else
+		snprintf(dev_name, 64, "blkif-backend");
 	err = bind_interdomain_evtchn_to_irqhandler(blkif->domid, evtchn,
 						    xen_blkif_be_int, 0,
-						    "blkif-backend", blkif);
+						    dev_name, ring);
 	if (err < 0) {
-		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
-		blkif->blk_rings.common.sring = NULL;
+		xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+		ring->blk_rings.common.sring = NULL;
 		return err;
 	}
-	blkif->irq = err;
+	ring->irq = err;
 
 	return 0;
 }
 
 static int xen_blkif_disconnect(struct xen_blkif *blkif)
 {
-	if (blkif->xenblkd) {
-		kthread_stop(blkif->xenblkd);
-		wake_up(&blkif->shutdown_wq);
-		blkif->xenblkd = NULL;
-	}
+	int i;
+
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		struct xen_blkif_ring *ring = &blkif->rings[i];
+		if (ring->xenblkd) {
+			kthread_stop(ring->xenblkd);
+			wake_up(&ring->shutdown_wq);
+			ring->xenblkd = NULL;
+		}
 
-	/* The above kthread_stop() guarantees that at this point we
-	 * don't have any discard_io or other_io requests. So, checking
-	 * for inflight IO is enough.
-	 */
-	if (atomic_read(&blkif->inflight) > 0)
-		return -EBUSY;
+		/* The above kthread_stop() guarantees that at this point we
+		 * don't have any discard_io or other_io requests. So, checking
+		 * for inflight IO is enough.
+		 */
+		if (atomic_read(&ring->inflight) > 0)
+			return -EBUSY;
 
-	if (blkif->irq) {
-		unbind_from_irqhandler(blkif->irq, blkif);
-		blkif->irq = 0;
-	}
+		if (ring->irq) {
+			unbind_from_irqhandler(ring->irq, ring);
+			ring->irq = 0;
+		}
 
-	if (blkif->blk_rings.common.sring) {
-		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
-		blkif->blk_rings.common.sring = NULL;
+		if (ring->blk_rings.common.sring) {
+			xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+			ring->blk_rings.common.sring = NULL;
+		}
 	}
 
 	return 0;
@@ -275,40 +313,52 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 
 static void xen_blkif_free(struct xen_blkif *blkif)
 {
-	struct pending_req *req, *n;
-	int i = 0, j;
 
 	xen_blkif_disconnect(blkif);
 	xen_vbd_free(&blkif->vbd);
 
+	kfree(blkif->rings);
+
+	kmem_cache_free(xen_blkif_cachep, blkif);
+}
+
+static void xen_ring_free(struct xen_blkif_ring *ring)
+{
+	struct pending_req *req, *n;
+	int i, j;
+
 	/* Remove all persistent grants and the cache of ballooned pages. */
-	xen_blkbk_free_caches(blkif);
+	xen_blkbk_free_caches(ring);
 
 	/* Make sure everything is drained before shutting down */
-	BUG_ON(blkif->persistent_gnt_c != 0);
-	BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
-	BUG_ON(blkif->free_pages_num != 0);
-	BUG_ON(!list_empty(&blkif->persistent_purge_list));
-	BUG_ON(!list_empty(&blkif->free_pages));
-	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
-
+	BUG_ON(ring->persistent_gnt_c != 0);
+	BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
+	BUG_ON(ring->free_pages_num != 0);
+	BUG_ON(!list_empty(&ring->persistent_purge_list));
+	BUG_ON(!list_empty(&ring->free_pages));
+	BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+
+	i = 0;
 	/* Check that there is no request in use */
-	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
+	list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
 		list_del(&req->free_list);
-
-		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
+		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+			if (!req->segments[j])
+				break;
 			kfree(req->segments[j]);
-
-		for (j = 0; j < MAX_INDIRECT_PAGES; j++)
+		}
+		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+			if (!req->indirect_pages[j])
+				break;
 			kfree(req->indirect_pages[j]);
-
+		}
 		kfree(req);
 		i++;
 	}
+	WARN_ON(i != XEN_RING_REQS(ring->blkif->nr_rings));
 
-	WARN_ON(i != XEN_BLKIF_REQS);
-
-	kmem_cache_free(xen_blkif_cachep, blkif);
+	if (atomic_dec_and_test(&ring->blkif->refcnt))
+		xen_blkif_free(ring->blkif);
 }
 
 int __init xen_blkif_interface_init(void)
@@ -333,6 +383,29 @@ int __init xen_blkif_interface_init(void)
 	{								\
 		struct xenbus_device *dev = to_xenbus_device(_dev);	\
 		struct backend_info *be = dev_get_drvdata(&dev->dev);	\
+		struct xen_blkif *blkif = be->blkif;			\
+		struct xen_blkif_ring *ring;				\
+		int i;							\
+									\
+		blkif->st_oo_req = 0;					\
+		blkif->st_rd_req = 0;					\
+		blkif->st_wr_req = 0;					\
+		blkif->st_f_req = 0;					\
+		blkif->st_ds_req = 0;					\
+		blkif->st_rd_sect = 0;					\
+		blkif->st_wr_sect = 0;					\
+		for (i = 0 ; i < blkif->nr_rings ; i++) {		\
+			ring = &blkif->rings[i];			\
+			spin_lock_irq(&ring->stats_lock);		\
+			blkif->st_oo_req += ring->st_oo_req;		\
+			blkif->st_rd_req += ring->st_rd_req;		\
+			blkif->st_wr_req += ring->st_wr_req;		\
+			blkif->st_f_req += ring->st_f_req;		\
+			blkif->st_ds_req += ring->st_ds_req;		\
+			blkif->st_rd_sect += ring->st_rd_sect;		\
+			blkif->st_wr_sect += ring->st_wr_sect;		\
+			spin_unlock_irq(&ring->stats_lock);		\
+		}							\
 									\
 		return sprintf(buf, format, ##args);			\
 	}								\
@@ -453,6 +526,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 		handle, blkif->domid);
 	return 0;
 }
+
 static int xen_blkbk_remove(struct xenbus_device *dev)
 {
 	struct backend_info *be = dev_get_drvdata(&dev->dev);
@@ -468,13 +542,14 @@ static int xen_blkbk_remove(struct xenbus_device *dev)
 		be->backend_watch.node = NULL;
 	}
 
-	dev_set_drvdata(&dev->dev, NULL);
-
 	if (be->blkif) {
+		int i = 0;
 		xen_blkif_disconnect(be->blkif);
-		xen_blkif_put(be->blkif);
+		for (; i < be->blkif->nr_rings ; i++)
+			xen_ring_put(&be->blkif->rings[i]);
 	}
 
+	dev_set_drvdata(&dev->dev, NULL);
 	kfree(be->mode);
 	kfree(be);
 	return 0;
@@ -851,21 +926,46 @@ again:
 static int connect_ring(struct backend_info *be)
 {
 	struct xenbus_device *dev = be->dev;
-	unsigned long ring_ref;
-	unsigned int evtchn;
+	struct xen_blkif *blkif = be->blkif;
+	unsigned long *ring_ref;
+	unsigned int *evtchn;
 	unsigned int pers_grants;
-	char protocol[64] = "";
-	int err;
+	char protocol[64] = "", ring_ref_s[64] = "", evtchn_s[64] = "";
+	int i, err;
 
 	DPRINTK("%s", dev->otherend);
 
-	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
-			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
-	if (err) {
-		xenbus_dev_fatal(dev, err,
-				 "reading %s/ring-ref and event-channel",
-				 dev->otherend);
-		return err;
+	blkif->nr_rings = 1;
+
+	ring_ref = kzalloc(sizeof(unsigned long) * blkif->nr_rings, GFP_KERNEL);
+	if (!ring_ref)
+		return -ENOMEM;
+	evtchn = kzalloc(sizeof(unsigned int) * blkif->nr_rings,
+			 GFP_KERNEL);
+	if (!evtchn) {
+		kfree(ring_ref);
+		return -ENOMEM;
+	}
+
+	for (i = 0 ; i < blkif->nr_rings ; i++) {
+		if (blkif->vbd.nr_supported_hw_queues == 0) {
+			BUG_ON(i != 0);
+			/* Support old XenStore keys for compatibility */
+			snprintf(ring_ref_s, 64, "ring-ref");
+			snprintf(evtchn_s, 64, "event-channel");
+		} else {
+			snprintf(ring_ref_s, 64, "ring-ref-%d", i);
+			snprintf(evtchn_s, 64, "event-channel-%d", i);
+		}
+		err = xenbus_gather(XBT_NIL, dev->otherend,
+				    ring_ref_s, "%lu", &ring_ref[i],
+				    evtchn_s, "%u", &evtchn[i], NULL);
+		if (err) {
+			xenbus_dev_fatal(dev, err,
+					 "reading %s/%s and event-channel",
+					 dev->otherend, ring_ref_s);
+			goto fail;
+		}
 	}
 
 	be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
@@ -881,7 +981,8 @@ static int connect_ring(struct backend_info *be)
 		be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
 	else {
 		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
-		return -1;
+		err = -1;
+		goto fail;
 	}
 	err = xenbus_gather(XBT_NIL, dev->otherend,
 			    "feature-persistent", "%u",
@@ -892,19 +993,42 @@ static int connect_ring(struct backend_info *be)
 	be->blkif->vbd.feature_gnt_persistent = pers_grants;
 	be->blkif->vbd.overflow_max_grants = 0;
 
-	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
-		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
-		pers_grants ? "persistent grants" : "");
+	blkif->rings = xen_blkif_ring_alloc(blkif, blkif->nr_rings);
+	if (!blkif->rings) {
+		err = -ENOMEM;
+		goto fail;
+	}
+	/*
+	 * Enforce postcondition on number of allocated rings; note that if we are
+	 * resuming it might happen that nr_supported_hw_queues != nr_rings,
+	 * so we cannot rely on such a postcondition.
+	 */
+	BUG_ON(!blkif->vbd.nr_supported_hw_queues &&
+		blkif->nr_rings != 1);
 
-	/* Map the shared frame, irq etc. */
-	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
-	if (err) {
-		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
-				 ring_ref, evtchn);
-		return err;
+	for (i = 0; i < blkif->nr_rings ; i++) {
+		pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
+			ring_ref[i], evtchn[i], blkif->blk_protocol, protocol,
+			pers_grants ? "persistent grants" : "");
+
+		/* Map the shared frame, irq etc. */
+		err = xen_blkif_map(&blkif->rings[i], ring_ref[i], evtchn[i], i);
+		if (err) {
+			xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u of ring %d",
+					 ring_ref[i], evtchn[i], i);
+			goto fail;
+		}
 	}
 
+	kfree(ring_ref);
+	kfree(evtchn);
+
 	return 0;
+
+fail:
+	kfree(ring_ref);
+	kfree(evtchn);
+	return err;
 }
 
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (8 preceding siblings ...)
  2014-09-11 23:57 ` [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend Arianna Avanzini
@ 2014-09-11 23:57 ` Arianna Avanzini
  2014-09-12 10:58   ` David Vrabel
                     ` (3 more replies)
  2014-10-01 20:27 ` [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Konrad Rzeszutek Wilk
  2014-10-01 20:27 ` Konrad Rzeszutek Wilk
  11 siblings, 4 replies; 63+ messages in thread
From: Arianna Avanzini @ 2014-09-11 23:57 UTC (permalink / raw)
  To: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe, avanzini.arianna

This commit lets the backend driver advertise the number of available
hardware queues; it also implements gathering from the frontend driver
the number of rings actually available for mapping.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 drivers/block/xen-blkback/xenbus.c | 44 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index a4f13cc..9ff6ced 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -477,6 +477,34 @@ static void xen_vbd_free(struct xen_vbd *vbd)
 	vbd->bdev = NULL;
 }
 
+static int xen_advertise_hw_queues(struct xen_blkif *blkif,
+				   struct request_queue *q)
+{
+	struct xen_vbd *vbd = &blkif->vbd;
+	struct xenbus_transaction xbt;
+	int err;
+
+	if (q && q->mq_ops)
+		vbd->nr_supported_hw_queues = q->nr_hw_queues;
+
+	err = xenbus_transaction_start(&xbt);
+	if (err) {
+		BUG_ON(!blkif->be);
+		xenbus_dev_fatal(blkif->be->dev, err, "starting transaction (hw queues)");
+		return err;
+	}
+
+	err = xenbus_printf(xbt, blkif->be->dev->nodename, "nr_supported_hw_queues", "%u",
+			    blkif->vbd.nr_supported_hw_queues);
+	if (err)
+		xenbus_dev_error(blkif->be->dev, err, "writing %s/nr_supported_hw_queues",
+				 blkif->be->dev->nodename);
+
+	xenbus_transaction_end(xbt, 0);
+
+	return err;
+}
+
 static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 			  unsigned major, unsigned minor, int readonly,
 			  int cdrom)
@@ -484,6 +512,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 	struct xen_vbd *vbd;
 	struct block_device *bdev;
 	struct request_queue *q;
+	int err;
 
 	vbd = &blkif->vbd;
 	vbd->handle   = handle;
@@ -522,6 +551,10 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 	if (q && blk_queue_secdiscard(q))
 		vbd->discard_secure = true;
 
+	err = xen_advertise_hw_queues(blkif, q);
+	if (err)
+		return -ENOENT;
+
 	DPRINTK("Successful creation of handle=%04x (dom=%u)\n",
 		handle, blkif->domid);
 	return 0;
@@ -935,7 +968,16 @@ static int connect_ring(struct backend_info *be)
 
 	DPRINTK("%s", dev->otherend);
 
-	blkif->nr_rings = 1;
+	err = xenbus_gather(XBT_NIL, dev->otherend, "nr_blk_rings",
+			    "%u", &blkif->nr_rings, NULL);
+	if (err) {
+		/*
+		 * Frontend does not support multiqueue; force compatibility
+		 * mode of the driver.
+		 */
+		blkif->vbd.nr_supported_hw_queues = 0;
+		blkif->nr_rings = 1;
+	}
 
 	ring_ref = kzalloc(sizeof(unsigned long) * blkif->nr_rings, GFP_KERNEL);
 	if (!ring_ref)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend
  2014-09-11 23:57 ` Arianna Avanzini
  2014-09-12 10:46   ` David Vrabel
@ 2014-09-12 10:46   ` David Vrabel
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 63+ messages in thread
From: David Vrabel @ 2014-09-12 10:46 UTC (permalink / raw)
  To: Arianna Avanzini, konrad.wilk, boris.ostrovsky, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe

On 12/09/14 00:57, Arianna Avanzini wrote:
> This commit implements the negotiation of the number of block rings
> to be used; as a default, the number of rings is decided by the
> frontend driver and is equal to the number of hardware queues that
> the backend makes available. In case of guest migration towards a
> host whose devices expose a different number of hardware queues, the
> number of I/O rings used by the frontend driver remains the same;
> XenStore keys may vary if the frontend needs to be compatible with
> a host not having multi-queue support.
> 
> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
> ---
>  drivers/block/xen-blkfront.c | 95 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 84 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 9282df1..77e311d 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -137,7 +137,7 @@ struct blkfront_info
>  	int vdevice;
>  	blkif_vdev_t handle;
>  	enum blkif_state connected;
> -	unsigned int nr_rings;
> +	unsigned int nr_rings, old_nr_rings;

I don't think you need old_nr_rings.  nr_rings is the current number of
rings and you should only update nr_rings to a new number after tearing
down all the old rings.
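
Just to illustrate the ordering (a rough sketch only; blkif_free(),
talk_to_blkback() and new_nr_rings are stand-ins for whatever the resume
path ends up calling):

	/* tear down all info->nr_rings existing rings first... */
	blkif_free(info, 0);
	/* ...and only then adopt the newly negotiated count */
	info->nr_rings = new_nr_rings;
	err = talk_to_blkback(dev, info);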

>  	struct blkfront_ring_info *rinfo;
>  	struct request_queue *rq;
>  	unsigned int feature_flush;
> @@ -147,6 +147,7 @@ struct blkfront_info
>  	unsigned int discard_granularity;
>  	unsigned int discard_alignment;
>  	unsigned int feature_persistent:1;
> +	unsigned int hardware_queues;

hardware_queues seems to have the same purpose as nr_rings and isn't
needed.  nr_rings == 1 can mean write old keys for non-multi-queue
capable backends (or a mq capable one that only wants 1 queue).

> @@ -1351,10 +1353,24 @@ again:
>  		goto destroy_blkring;
>  	}
>  
> +	/* Advertise the number of rings */
> +	err = xenbus_printf(xbt, dev->nodename, "nr_blk_rings",
> +			    "%u", info->nr_rings);
> +	if (err) {
> +		xenbus_dev_fatal(dev, err, "advertising number of rings");
> +		goto abort_transaction;
> +	}
> +
>  	for (i = 0 ; i < info->nr_rings ; i++) {
> -		BUG_ON(i > 0);
> -		snprintf(ring_ref_s, 64, "ring-ref");
> -		snprintf(evtchn_s, 64, "event-channel");
> +		if (!info->hardware_queues) {

   if (info->nr_rings == 1)

> +			BUG_ON(i > 0);
> +			/* Support old XenStore keys */
> +			snprintf(ring_ref_s, 64, "ring-ref");
> +			snprintf(evtchn_s, 64, "event-channel");
> +		} else {
> +			snprintf(ring_ref_s, 64, "ring-ref-%d", i);
> +			snprintf(evtchn_s, 64, "event-channel-%d", i);
> +		}
>  		err = xenbus_printf(xbt, dev->nodename,
>  				    ring_ref_s, "%u", info->rinfo[i].ring_ref);
>  		if (err) {
[...]
> @@ -1659,11 +1693,46 @@ static int blkfront_resume(struct xenbus_device *dev)
>  {
>  	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
>  	int err;
> +	unsigned int nr_queues, prev_nr_queues;
> +	bool mq_to_rq_transition;
>  
>  	dev_dbg(&dev->dev, "blkfront_resume: %s\n", dev->nodename);
>  
> +	prev_nr_queues = info->hardware_queues;
> +
> +	err = blkfront_gather_hw_queues(info, &nr_queues);
> +	if (err < 0)
> +		nr_queues = 0;
> +	mq_to_rq_transition = prev_nr_queues && !nr_queues;
> +
> +	if (prev_nr_queues != nr_queues) {
> +		printk(KERN_INFO "blkfront: %s: hw queues %u -> %u\n",
> +		       info->gd->disk_name, prev_nr_queues, nr_queues);
> +		if (mq_to_rq_transition) {
> +			struct blk_mq_hw_ctx *hctx;
> +			unsigned int i;
> +			/*
> +			 * Switch from multi-queue to single-queue:
> +			 * update hctx-to-ring mapping before
> +			 * resubmitting any requests
> +			 */
> +			queue_for_each_hw_ctx(info->rq, hctx, i)
> +				hctx->driver_data = &info->rinfo[0];

I think this does give a mechanism to change (reduce) the number of
rings used if the backend supports fewer.  You don't need to map all
hctxs to one ring.  You can distribute them amongst the available rings.
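
Something along these lines, for instance (only a sketch, reusing the
names from the quoted hunk):

	/* spread the hardware contexts over whatever rings are available */
	queue_for_each_hw_ctx(info->rq, hctx, i)
		hctx->driver_data = &info->rinfo[i % info->nr_rings];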

> @@ -1863,6 +1932,10 @@ static void blkfront_connect(struct blkfront_info *info)
>  		 * supports indirect descriptors, and how many.
>  		 */
>  		blkif_recover(info);
> +		info->rinfo = krealloc(info->rinfo,
> +				       info->nr_rings * sizeof(struct blkfront_ring_info),
> +				       GFP_KERNEL);
> +

You don't check for allocation failure here.
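
E.g. (illustrative only; 'rinfo' is just a local temporary so the old
array is not lost when krealloc() fails):

	struct blkfront_ring_info *rinfo;

	rinfo = krealloc(info->rinfo,
			 info->nr_rings * sizeof(struct blkfront_ring_info),
			 GFP_KERNEL);
	if (!rinfo)
		return;		/* or fail the recovery path with -ENOMEM */
	info->rinfo = rinfo;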

David

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend
  2014-09-11 23:57 ` Arianna Avanzini
  2014-09-12 10:58   ` David Vrabel
@ 2014-09-12 10:58   ` David Vrabel
  2014-10-01 20:23   ` Konrad Rzeszutek Wilk
  2014-10-01 20:23   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 63+ messages in thread
From: David Vrabel @ 2014-09-12 10:58 UTC (permalink / raw)
  To: Arianna Avanzini, konrad.wilk, boris.ostrovsky, xen-devel, linux-kernel
  Cc: hch, bob.liu, felipe.franciosi, axboe

On 12/09/14 00:57, Arianna Avanzini wrote:
> This commit lets the backend driver advertise the number of available
> hardware queues; it also implements gathering from the frontend driver
> the number of rings actually available for mapping.
[...]
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -477,6 +477,34 @@ static void xen_vbd_free(struct xen_vbd *vbd)
>  	vbd->bdev = NULL;
>  }
>  
> +static int xen_advertise_hw_queues(struct xen_blkif *blkif,
> +				   struct request_queue *q)
> +{
> +	struct xen_vbd *vbd = &blkif->vbd;
> +	struct xenbus_transaction xbt;
> +	int err;
> +
> +	if (q && q->mq_ops)
> +		vbd->nr_supported_hw_queues = q->nr_hw_queues;
> +
> +	err = xenbus_transaction_start(&xbt);
> +	if (err) {
> +		BUG_ON(!blkif->be);

This BUG_ON() isn't useful.

> +		xenbus_dev_fatal(blkif->be->dev, err, "starting transaction (hw queues)");
> +		return err;
> +	}
> +
> +	err = xenbus_printf(xbt, blkif->be->dev->nodename, "nr_supported_hw_queues", "%u",
> +			    blkif->vbd.nr_supported_hw_queues);
> +	if (err)
> +		xenbus_dev_error(blkif->be->dev, err, "writing %s/nr_supported_hw_queues",
> +				 blkif->be->dev->nodename);
> +
> +	xenbus_transaction_end(xbt, 0);

Transactions are expensive and not needed to write a single key.

Can you use the same key names as netback (multi-queue-max-queues I
think)? I don't see why we can't use a common set of key names for this.

See interface/io/netif.h for full set of keys VIFs use.
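
I.e. something as simple as (a sketch; "multi-queue-max-queues" is only
the netback-style name from memory, to be checked against netif.h):

	err = xenbus_printf(XBT_NIL, blkif->be->dev->nodename,
			    "multi-queue-max-queues", "%u",
			    blkif->vbd.nr_supported_hw_queues);
	if (err)
		xenbus_dev_error(blkif->be->dev, err,
				 "writing %s/multi-queue-max-queues",
				 blkif->be->dev->nodename);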

David

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 1/5] xen, blkfront: port to the the multi-queue block layer API
  2014-09-11 23:57 ` Arianna Avanzini
@ 2014-09-13 19:29   ` Christoph Hellwig
  2014-09-13 19:29   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2014-09-13 19:29 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: konrad.wilk, boris.ostrovsky, david.vrabel, xen-devel,
	linux-kernel, hch, bob.liu, felipe.franciosi, axboe

> +static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
>  {
> +	struct blkfront_info *info = req->rq_disk->private_data;
>  
> +	spin_lock_irq(&info->io_lock);
> +	if (RING_FULL(&info->ring))
> +		goto wait;
>  
> -		blk_start_request(req);
> +	if ((req->cmd_type != REQ_TYPE_FS) ||
> +			((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
> +			 !info->flush_op)) {
> +		req->errors = -EIO;
> +		blk_mq_complete_request(req);
> +		spin_unlock_irq(&info->io_lock);
> +		return BLK_MQ_RQ_QUEUE_ERROR;

As mentioned during the last round this should only return
BLK_MQ_RQ_QUEUE_ERROR, and not also set req->errors and complete the
request.
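
Roughly (sketch only):

	if (req->cmd_type != REQ_TYPE_FS ||
	    ((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) && !info->flush_op)) {
		spin_unlock_irq(&info->io_lock);
		/* blk-mq fails the request itself on this return value */
		return BLK_MQ_RQ_QUEUE_ERROR;
	}
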

> +	}
>  
> +	if (blkif_queue_request(req)) {
> +		blk_mq_requeue_request(req);
> +		goto wait;

Same here, this should only return BLK_MQ_RQ_QUEUE_BUSY after the wait
label, but not also requeue the request.  While the error case above
is harmless due to the double completion protection in blk-mq, this one
actually is actively harmful.
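
I.e. the tail of blkfront_queue_rq() would look something like this
(sketch; flush_requests() as in the existing driver):

	if (blkif_queue_request(req))
		goto wait;

	flush_requests(info);
	spin_unlock_irq(&info->io_lock);
	return BLK_MQ_RQ_QUEUE_OK;

 wait:
	/* Avoid pointless unplugs. */
	blk_mq_stop_hw_queue(hctx);
	spin_unlock_irq(&info->io_lock);
	/* blk-mq keeps the request and resubmits it when the queue restarts */
	return BLK_MQ_RQ_QUEUE_BUSY;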

>  wait:
> +	/* Avoid pointless unplugs. */
> +	blk_mq_stop_hw_queue(hctx);
> +	spin_unlock_irq(&info->io_lock);

In general you should try to do calls into the blk_mq code without holding
your locks to simplify the locking hierarchy and reduce lock hold times.

> -static void kick_pending_request_queues(struct blkfront_info *info)
> +static void kick_pending_request_queues(struct blkfront_info *info,
> +					unsigned long *flags)
>  {
>  	if (!RING_FULL(&info->ring)) {
> -		/* Re-enable calldowns. */
> -		blk_start_queue(info->rq);
> -		/* Kick things off immediately. */
> -		do_blkif_request(info->rq);
> +		spin_unlock_irqrestore(&info->io_lock, *flags);
> +		blk_mq_start_stopped_hw_queues(info->rq, 0);
> +		spin_lock_irqsave(&info->io_lock, *flags);
>  	}

The second parameter to blk_mq_start_stopped_hw_queues is a bool,
so you should pass false instead of 0 here.

Also the locking in this area seems wrong as most callers immediately
acquire and/or release the io_lock, so it seems more useful in general
to expect this function to be called without it.
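
Something like this for the process-context callers (sketch; the
interrupt path would want the async variant anyway):

static void kick_pending_request_queues(struct blkfront_info *info)
{
	bool kick;

	/* called without io_lock held */
	spin_lock_irq(&info->io_lock);
	kick = !RING_FULL(&info->ring);
	spin_unlock_irq(&info->io_lock);

	if (kick)
		blk_mq_start_stopped_hw_queues(info->rq, false);
}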

>  static void blkif_restart_queue(struct work_struct *work)
>  {
>  	struct blkfront_info *info = container_of(work, struct blkfront_info, work);
> +	unsigned long flags;
>  
> -	spin_lock_irq(&info->io_lock);
> +	spin_lock_irqsave(&info->io_lock, flags);

There shouldn't be any need to ever take a lock as _irqsave from a work
queue handler.

Note that you might be able to get rid of your own workqueue here by
simply using blk_mq_start_stopped_hw_queues with the async parameter set
to true.
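
i.e. in the spots that currently schedule info->work, something like
(sketch):

	/* let blk-mq restart the stopped queues from kblockd context */
	blk_mq_start_stopped_hw_queues(info->rq, true);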

>  
> -		error = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
> +		error = req->errors = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;

I don't think you need the error variable any more as blk-mq always uses
req->errors to pass the errno value.

> -	kick_pending_request_queues(info);
> +	kick_pending_request_queues(info, &flags);
>  
>  	list_for_each_entry_safe(req, n, &requests, queuelist) {
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
>  		BUG_ON(req->nr_phys_segments > segs);
> -		blk_requeue_request(info->rq, req);
> +		blk_mq_requeue_request(req);

Note that blk_mq_requeue_request calls will need a
blk_mq_kick_requeue_list call to be actually requeued.  It should be
fine to have one past this loop here.
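
E.g. (sketch):

	list_for_each_entry_safe(req, n, &requests, queuelist) {
		/* Requeue pending requests (flush or discard) */
		list_del_init(&req->queuelist);
		blk_mq_requeue_request(req);
	}
	/* now actually run the requeue work for everything queued above */
	blk_mq_kick_requeue_list(info->rq);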


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings
  2014-09-11 23:57 ` [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings Arianna Avanzini
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
@ 2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:18 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

On Fri, Sep 12, 2014 at 01:57:23AM +0200, Arianna Avanzini wrote:
> This commit adds to xen-blkback the support to map and make use
> of a variable number of ringbuffers. The number of rings to be
> mapped is forcedly set to one.

Please add:
 - An explanation of the 'xen_blkif_ring' and 'xen_blkif' split as well.
 - The addition of the 'stats_lock' (which really should be a separate
   patch). Please remember: one logical change per patch.

> 
> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
> ---
>  drivers/block/xen-blkback/blkback.c | 377 ++++++++++++++++---------------
>  drivers/block/xen-blkback/common.h  | 110 +++++----
>  drivers/block/xen-blkback/xenbus.c  | 432 +++++++++++++++++++++++-------------
>  3 files changed, 548 insertions(+), 371 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 64c60ed..b31acfb 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -80,6 +80,9 @@ module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
>  MODULE_PARM_DESC(max_persistent_grants,
>                   "Maximum number of grants to map persistently");
>

A bit fat comment here please :-)
  
> +#define XEN_RING_MAX_PGRANTS(nr_rings) \
> +	(max((int)(xen_blkif_max_pgrants / nr_rings), 16))
> +
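
Something along these lines, perhaps (the wording is only a suggestion):

/*
 * Each ring gets an even share of the module-wide persistent grant
 * budget (xen_blkif_max_pgrants), with a floor of 16 so that a ring
 * can still make progress when many rings are in use.
 */
#define XEN_RING_MAX_PGRANTS(nr_rings) \
	(max((int)(xen_blkif_max_pgrants / nr_rings), 16))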

.. Giant snip ..
>  static int __init xen_blkif_init(void)
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index f65b807..6f074ce 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -226,6 +226,7 @@ struct xen_vbd {
>  	struct block_device	*bdev;
>  	/* Cached size parameter. */
>  	sector_t		size;
> +	unsigned int		nr_supported_hw_queues;

nr_rings would do.
>  	unsigned int		flush_support:1;
>  	unsigned int		discard_secure:1;
>  	unsigned int		feature_gnt_persistent:1;
> @@ -246,6 +247,7 @@ struct backend_info;
>  
>  /* Number of requests that we can fit in a ring */
>  #define XEN_BLKIF_REQS			32
> +#define XEN_RING_REQS(nr_rings)	(max((int)(XEN_BLKIF_REQS / nr_rings), 8))

Bit giant comment here please.
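
Same here, for example (again just a suggestion):

/*
 * Per-ring share of the XEN_BLKIF_REQS request pool, with a floor of
 * 8 requests so that no ring is starved when many rings are in use.
 */
#define XEN_RING_REQS(nr_rings)	(max((int)(XEN_BLKIF_REQS / nr_rings), 8))
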
>  
>  struct persistent_gnt {
>  	struct page *page;
> @@ -256,32 +258,29 @@ struct persistent_gnt {
>  	struct list_head remove_node;
>  };
>  
> -struct xen_blkif {
> -	/* Unique identifier for this interface. */
> -	domid_t			domid;
> -	unsigned int		handle;
> +struct xen_blkif_ring {
> +	union blkif_back_rings	blk_rings;
>  	/* Physical parameters of the comms window. */
>  	unsigned int		irq;
> -	/* Comms information. */
> -	enum blkif_protocol	blk_protocol;
> -	union blkif_back_rings	blk_rings;
> -	void			*blk_ring;
> -	/* The VBD attached to this interface. */
> -	struct xen_vbd		vbd;
> -	/* Back pointer to the backend_info. */
> -	struct backend_info	*be;
> -	/* Private fields. */
> -	spinlock_t		blk_ring_lock;
> -	atomic_t		refcnt;
>  
>  	wait_queue_head_t	wq;
> -	/* for barrier (drain) requests */
> -	struct completion	drain_complete;
> -	atomic_t		drain;
> -	atomic_t		inflight;
>  	/* One thread per one blkif. */
>  	struct task_struct	*xenblkd;
>  	unsigned int		waiting_reqs;
> +	void			*blk_ring;
> +	spinlock_t		blk_ring_lock;
> +
> +	struct work_struct	free_work;
> +	/* Thread shutdown wait queue. */
> +	wait_queue_head_t	shutdown_wq;
> +
> +	/* buffer of free pages to map grant refs */
> +	spinlock_t		free_pages_lock;
> +	int			free_pages_num;
> +
> +	/* used by the kworker that offload work from the persistent purge */
> +	struct list_head	persistent_purge_list;
> +	struct work_struct	persistent_purge_work;
>  
>  	/* tree to store persistent grants */
>  	struct rb_root		persistent_gnts;
> @@ -289,13 +288,6 @@ struct xen_blkif {
>  	atomic_t		persistent_gnt_in_use;
>  	unsigned long           next_lru;
>  
> -	/* used by the kworker that offload work from the persistent purge */
> -	struct list_head	persistent_purge_list;
> -	struct work_struct	persistent_purge_work;
> -
> -	/* buffer of free pages to map grant refs */
> -	spinlock_t		free_pages_lock;
> -	int			free_pages_num;
>  	struct list_head	free_pages;
>  
>  	/* List of all 'pending_req' available */
> @@ -303,20 +295,54 @@ struct xen_blkif {
>  	/* And its spinlock. */
>  	spinlock_t		pending_free_lock;
>  	wait_queue_head_t	pending_free_wq;
> +	atomic_t		inflight;
> +
> +	/* Private fields. */
> +	atomic_t		refcnt;
> +
> +	struct xen_blkif	*blkif;
> +	unsigned		ring_index;


>  
> +	spinlock_t		stats_lock;
>  	/* statistics */
>  	unsigned long		st_print;
> -	unsigned long long			st_rd_req;
> -	unsigned long long			st_wr_req;
> -	unsigned long long			st_oo_req;
> -	unsigned long long			st_f_req;
> -	unsigned long long			st_ds_req;
> -	unsigned long long			st_rd_sect;
> -	unsigned long long			st_wr_sect;
> +	unsigned long long	st_rd_req;
> +	unsigned long long	st_wr_req;
> +	unsigned long long	st_oo_req;
> +	unsigned long long	st_f_req;
> +	unsigned long long	st_ds_req;
> +	unsigned long long	st_rd_sect;
> +	unsigned long long	st_wr_sect;
> +};
>  
> -	struct work_struct	free_work;
> -	/* Thread shutdown wait queue. */
> -	wait_queue_head_t	shutdown_wq;
> +struct xen_blkif {
> +	/* Unique identifier for this interface. */
> +	domid_t			domid;
> +	unsigned int		handle;
> +	/* Comms information. */
> +	enum blkif_protocol	blk_protocol;
> +	/* The VBD attached to this interface. */
> +	struct xen_vbd		vbd;
> +	/* Rings for this device */
> +	struct xen_blkif_ring	*rings;
> +	unsigned int		nr_rings;
> +	/* Back pointer to the backend_info. */
> +	struct backend_info	*be;
> +
> +	/* for barrier (drain) requests */
> +	struct completion	drain_complete;
> +	atomic_t		drain;
> +
> +	atomic_t		refcnt;
> +
> +	/* statistics */
> +	unsigned long long	st_rd_req;
> +	unsigned long long	st_wr_req;
> +	unsigned long long	st_oo_req;
> +	unsigned long long	st_f_req;
> +	unsigned long long	st_ds_req;
> +	unsigned long long	st_rd_sect;
> +	unsigned long long	st_wr_sect;


Can those go now that they are in xen_blkif_ring?
[edit: I see that in VBD_SHOW why you use them]

Perhaps change 'statistics' to 'full device statistics - including all of the rings values'

Or you could have each ring just increment the 'blkif' counters
instead of doing it per ring?

But there is something useful about having those values per ring too.
Let's leave it as is then.

>  };
>  
>  struct seg_buf {
> @@ -338,7 +364,7 @@ struct grant_page {
>   * response queued for it, with the saved 'id' passed back.
>   */
>  struct pending_req {
> -	struct xen_blkif	*blkif;
> +	struct xen_blkif_ring	*ring;
>  	u64			id;
>  	int			nr_pages;
>  	atomic_t		pendcnt;
> @@ -357,11 +383,11 @@ struct pending_req {
>  			 (_v)->bdev->bd_part->nr_sects : \
>  			  get_capacity((_v)->bdev->bd_disk))
>  
> -#define xen_blkif_get(_b) (atomic_inc(&(_b)->refcnt))
> -#define xen_blkif_put(_b)				\
> +#define xen_ring_get(_r) (atomic_inc(&(_r)->refcnt))
> +#define xen_ring_put(_r)				\
>  	do {						\
> -		if (atomic_dec_and_test(&(_b)->refcnt))	\
> -			schedule_work(&(_b)->free_work);\
> +		if (atomic_dec_and_test(&(_r)->refcnt))	\
> +			schedule_work(&(_r)->free_work);\
>  	} while (0)
>  
>  struct phys_req {
> @@ -377,7 +403,7 @@ int xen_blkif_xenbus_init(void);
>  irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
>  int xen_blkif_schedule(void *arg);
>  int xen_blkif_purge_persistent(void *arg);
> -void xen_blkbk_free_caches(struct xen_blkif *blkif);
> +void xen_blkbk_free_caches(struct xen_blkif_ring *ring);
>  
>  int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
>  			      struct backend_info *be, int state);
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 3a8b810..a4f13cc 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -35,7 +35,7 @@ static void connect(struct backend_info *);
>  static int connect_ring(struct backend_info *);
>  static void backend_changed(struct xenbus_watch *, const char **,
>  			    unsigned int);
> -static void xen_blkif_free(struct xen_blkif *blkif);
> +static void xen_ring_free(struct xen_blkif_ring *ring);
>  static void xen_vbd_free(struct xen_vbd *vbd);
>  
>  struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
> @@ -45,17 +45,17 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be)
>  
>  /*
>   * The last request could free the device from softirq context and
> - * xen_blkif_free() can sleep.
> + * xen_ring_free() can sleep.
>   */
> -static void xen_blkif_deferred_free(struct work_struct *work)
> +static void xen_ring_deferred_free(struct work_struct *work)
>  {
> -	struct xen_blkif *blkif;
> +	struct xen_blkif_ring *ring;
>  
> -	blkif = container_of(work, struct xen_blkif, free_work);
> -	xen_blkif_free(blkif);
> +	ring = container_of(work, struct xen_blkif_ring, free_work);
> +	xen_ring_free(ring);
>  }
>  
> -static int blkback_name(struct xen_blkif *blkif, char *buf)
> +static int blkback_name(struct xen_blkif *blkif, char *buf, bool save_space)
>  {
>  	char *devpath, *devname;
>  	struct xenbus_device *dev = blkif->be->dev;
> @@ -70,7 +70,10 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
>  	else
>  		devname  = devpath;
>  
> -	snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
> +	if (save_space)
> +		snprintf(buf, TASK_COMM_LEN, "blkbk.%d.%s", blkif->domid, devname);
> +	else
> +		snprintf(buf, TASK_COMM_LEN, "blkback.%d.%s", blkif->domid, devname);
>  	kfree(devpath);
>  
>  	return 0;
> @@ -78,11 +81,15 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)
>  
>  static void xen_update_blkif_status(struct xen_blkif *blkif)
>  {
> -	int err;
> -	char name[TASK_COMM_LEN];
> +	int i, err;
> +	char name[TASK_COMM_LEN], per_ring_name[TASK_COMM_LEN];
> +	struct xen_blkif_ring *ring;
>  
> -	/* Not ready to connect? */
> -	if (!blkif->irq || !blkif->vbd.bdev)
> +	/*
> +	 * Not ready to connect? Check irq of first ring as the others
> +	 * should all be the same.
> +	 */
> +	if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
>  		return;
>  
>  	/* Already connected? */
> @@ -94,7 +101,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
>  	if (blkif->be->dev->state != XenbusStateConnected)
>  		return;
>  
> -	err = blkback_name(blkif, name);
> +	err = blkback_name(blkif, name, blkif->vbd.nr_supported_hw_queues);
>  	if (err) {
>  		xenbus_dev_error(blkif->be->dev, err, "get blkback dev name");
>  		return;
> @@ -107,20 +114,96 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
>  	}
>  	invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
>  
> -	blkif->xenblkd = kthread_run(xen_blkif_schedule, blkif, "%s", name);
> -	if (IS_ERR(blkif->xenblkd)) {
> -		err = PTR_ERR(blkif->xenblkd);
> -		blkif->xenblkd = NULL;
> -		xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
> -		return;
> +	for (i = 0 ; i < blkif->nr_rings ; i++) {
> +		ring = &blkif->rings[i];
> +		if (blkif->vbd.nr_supported_hw_queues)
> +			snprintf(per_ring_name, TASK_COMM_LEN, "%s-%d", name, i);
> +		else {
> +			BUG_ON(i != 0);
> +			snprintf(per_ring_name, TASK_COMM_LEN, "%s", name);
> +		}
> +		ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s", per_ring_name);
> +		if (IS_ERR(ring->xenblkd)) {
> +			err = PTR_ERR(ring->xenblkd);
> +			ring->xenblkd = NULL;
> +			xenbus_dev_error(blkif->be->dev, err, "start %s", per_ring_name);
> +			return;

That looks to be dangerous. We could fail to start one of the threads and just return.
The caller doesn't care about the error so we continue on our way. The frontend
thinks everything is OK, but when it tries to put I/Os on one of the rings - it is silent.

Perhaps what we should do is:
 1). Return an error.
 2). The callers of it ('xen_update_blkif_status' and 'frontend_changed')
     can take steps to either call 'xenbus_dev_fatal' (that will move the state
     of the driver to Closed) or also call 'xen_blkif_disconnect'. Before
     doing all of that, reset nr_rings (the number of rings we can support) to the 'i' value.
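
Roughly something along these lines (untested, and the helper name is made up, just
to illustrate the idea):

	static int xen_blkif_start_rings(struct xen_blkif *blkif, const char *name)
	{
		int i, err;

		for (i = 0 ; i < blkif->nr_rings ; i++) {
			struct xen_blkif_ring *ring = &blkif->rings[i];

			... /* build per_ring_name as above */
			ring->xenblkd = kthread_run(xen_blkif_schedule, ring,
						    "%s", per_ring_name);
			if (IS_ERR(ring->xenblkd)) {
				err = PTR_ERR(ring->xenblkd);
				ring->xenblkd = NULL;
				/* Only keep the rings whose threads did start. */
				blkif->nr_rings = i;
				return err;
			}
		}
		return 0;
	}

and then 'xen_update_blkif_status' (and 'frontend_changed') could do:

	err = xen_blkif_start_rings(blkif, name);
	if (err) {
		xenbus_dev_fatal(blkif->be->dev, err, "start xenblkd");
		xen_blkif_disconnect(blkif);
		return;
	}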

> +		}
>  	}
>  }
>  
> +static struct xen_blkif_ring *xen_blkif_ring_alloc(struct xen_blkif *blkif,
> +						   int nr_rings)
> +{
> +	int r, i, j;
> +	struct xen_blkif_ring *rings;
> +	struct pending_req *req;
> +
> +	rings = kzalloc(nr_rings * sizeof(struct xen_blkif_ring),
> +			GFP_KERNEL);
> +	if (!rings)
> +		return NULL;
> +
> +	for (r = 0 ; r < nr_rings ; r++) {
> +		struct xen_blkif_ring *ring = &rings[r];
> +
> +		spin_lock_init(&ring->blk_ring_lock);
> +
> +		init_waitqueue_head(&ring->wq);
> +		init_waitqueue_head(&ring->shutdown_wq);
> +
> +		ring->persistent_gnts.rb_node = NULL;
> +		spin_lock_init(&ring->free_pages_lock);
> +		INIT_LIST_HEAD(&ring->free_pages);
> +		INIT_LIST_HEAD(&ring->persistent_purge_list);
> +		ring->free_pages_num = 0;
> +		atomic_set(&ring->persistent_gnt_in_use, 0);
> +		atomic_set(&ring->refcnt, 1);
> +		atomic_set(&ring->inflight, 0);
> +		INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
> +		spin_lock_init(&ring->pending_free_lock);
> +		init_waitqueue_head(&ring->pending_free_wq);
> +		INIT_LIST_HEAD(&ring->pending_free);
> +		for (i = 0; i < XEN_RING_REQS(nr_rings); i++) {
> +			req = kzalloc(sizeof(*req), GFP_KERNEL);
> +			if (!req)
> +				goto fail;
> +			list_add_tail(&req->free_list,
> +				      &ring->pending_free);
> +			for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
> +				req->segments[j] = kzalloc(sizeof(*req->segments[0]),
> +				                           GFP_KERNEL);
> +				if (!req->segments[j])
> +					goto fail;
> +			}
> +			for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
> +				req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
> +				                                 GFP_KERNEL);
> +				if (!req->indirect_pages[j])
> +					goto fail;
> +			}
> +		}
> +
> +		INIT_WORK(&ring->free_work, xen_ring_deferred_free);
> +		ring->blkif = blkif;
> +		ring->ring_index = r;
> +
> +		spin_lock_init(&ring->stats_lock);
> +		ring->st_print = jiffies;
> +
> +		atomic_inc(&blkif->refcnt);
> +	}
> +
> +	return rings;
> +
> +fail:
> +	kfree(rings);

Uh, what about the req->segments[j] and req->indirect_pages[j] freeing?


> +	return NULL;
> +}
> +

Like it was before:
> -
> -fail:
> -	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
> -		list_del(&req->free_list);
> -		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
> -			if (!req->segments[j])
> -				break;
> -			kfree(req->segments[j]);
> -		}
> -		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
> -			if (!req->indirect_pages[j])
> -				break;
> -			kfree(req->indirect_pages[j]);
> -		}
> -		kfree(req);
> -	}
> -
> -	kmem_cache_free(xen_blkif_cachep, blkif);
> -
> -	return ERR_PTR(-ENOMEM);

And that return above should stay in.
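
I.e. the fail path needs to walk back over the rings set up so far and free the
partially filled pending_free lists before freeing 'rings'. Roughly (untested
sketch; 'n' would also have to be declared in the new function):

	fail:
		for ( ; r >= 0; r--) {
			struct xen_blkif_ring *ring = &rings[r];

			list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
				list_del(&req->free_list);
				for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
					if (!req->segments[j])
						break;
					kfree(req->segments[j]);
				}
				for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
					if (!req->indirect_pages[j])
						break;
					kfree(req->indirect_pages[j]);
				}
				kfree(req);
			}
		}
		kfree(rings);

(and the blkif->refcnt references taken for the already completed rings would have
to be dropped as well).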

>  }
>  
> -static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
> -			 unsigned int evtchn)
> +static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,
> +			 unsigned int evtchn, unsigned int ring_idx)
>  {
>  	int err;
> +	struct xen_blkif *blkif;
> +	char dev_name[64];
>  
>  	/* Already connected through? */
> -	if (blkif->irq)
> +	if (ring->irq)
>  		return 0;
>  
> -	err = xenbus_map_ring_valloc(blkif->be->dev, shared_page, &blkif->blk_ring);
> +	blkif = ring->blkif;
> +
> +	err = xenbus_map_ring_valloc(ring->blkif->be->dev, shared_page, &ring->blk_ring);
>  	if (err < 0)
>  		return err;
>  
> @@ -210,64 +239,73 @@ static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
>  	case BLKIF_PROTOCOL_NATIVE:
>  	{
>  		struct blkif_sring *sring;
> -		sring = (struct blkif_sring *)blkif->blk_ring;
> -		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
> +		sring = (struct blkif_sring *)ring->blk_ring;
> +		BACK_RING_INIT(&ring->blk_rings.native, sring, PAGE_SIZE);
>  		break;
>  	}
>  	case BLKIF_PROTOCOL_X86_32:
>  	{
>  		struct blkif_x86_32_sring *sring_x86_32;
> -		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
> -		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
> +		sring_x86_32 = (struct blkif_x86_32_sring *)ring->blk_ring;
> +		BACK_RING_INIT(&ring->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
>  		break;
>  	}
>  	case BLKIF_PROTOCOL_X86_64:
>  	{
>  		struct blkif_x86_64_sring *sring_x86_64;
> -		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
> -		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
> +		sring_x86_64 = (struct blkif_x86_64_sring *)ring->blk_ring;
> +		BACK_RING_INIT(&ring->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
>  		break;
>  	}
>  	default:
>  		BUG();
>  	}
>  
> +	if (blkif->vbd.nr_supported_hw_queues)
> +		snprintf(dev_name, 64, "blkif-backend-%d", ring_idx);
> +	else
> +		snprintf(dev_name, 64, "blkif-backend");
>  	err = bind_interdomain_evtchn_to_irqhandler(blkif->domid, evtchn,
>  						    xen_blkif_be_int, 0,
> -						    "blkif-backend", blkif);
> +						    dev_name, ring);
>  	if (err < 0) {
> -		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
> -		blkif->blk_rings.common.sring = NULL;
> +		xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
> +		ring->blk_rings.common.sring = NULL;
>  		return err;
>  	}
> -	blkif->irq = err;
> +	ring->irq = err;
>  
>  	return 0;
>  }
>  
>  static int xen_blkif_disconnect(struct xen_blkif *blkif)
>  {
> -	if (blkif->xenblkd) {
> -		kthread_stop(blkif->xenblkd);
> -		wake_up(&blkif->shutdown_wq);
> -		blkif->xenblkd = NULL;
> -	}
> +	int i;
> +
> +	for (i = 0 ; i < blkif->nr_rings ; i++) {
> +		struct xen_blkif_ring *ring = &blkif->rings[i];
> +		if (ring->xenblkd) {
> +			kthread_stop(ring->xenblkd);
> +			wake_up(&ring->shutdown_wq);
> +			ring->xenblkd = NULL;
> +		}
>  
> -	/* The above kthread_stop() guarantees that at this point we
> -	 * don't have any discard_io or other_io requests. So, checking
> -	 * for inflight IO is enough.
> -	 */
> -	if (atomic_read(&blkif->inflight) > 0)
> -		return -EBUSY;
> +		/* The above kthread_stop() guarantees that at this point we
> +		 * don't have any discard_io or other_io requests. So, checking
> +		 * for inflight IO is enough.
> +		 */
> +		if (atomic_read(&ring->inflight) > 0)
> +			return -EBUSY;
>  
> -	if (blkif->irq) {
> -		unbind_from_irqhandler(blkif->irq, blkif);
> -		blkif->irq = 0;
> -	}
> +		if (ring->irq) {
> +			unbind_from_irqhandler(ring->irq, ring);
> +			ring->irq = 0;
> +		}
>  
> -	if (blkif->blk_rings.common.sring) {
> -		xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
> -		blkif->blk_rings.common.sring = NULL;
> +		if (ring->blk_rings.common.sring) {
> +			xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
> +			ring->blk_rings.common.sring = NULL;
> +		}
>  	}
>  
>  	return 0;
> @@ -275,40 +313,52 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
>  
>  static void xen_blkif_free(struct xen_blkif *blkif)
>  {
> -	struct pending_req *req, *n;
> -	int i = 0, j;
>  
>  	xen_blkif_disconnect(blkif);
>  	xen_vbd_free(&blkif->vbd);
>  
> +	kfree(blkif->rings);
> +
> +	kmem_cache_free(xen_blkif_cachep, blkif);
> +}
> +
> +static void xen_ring_free(struct xen_blkif_ring *ring)
> +{
> +	struct pending_req *req, *n;
> +	int i, j;
> +
>  	/* Remove all persistent grants and the cache of ballooned pages. */
> -	xen_blkbk_free_caches(blkif);
> +	xen_blkbk_free_caches(ring);
>  
>  	/* Make sure everything is drained before shutting down */
> -	BUG_ON(blkif->persistent_gnt_c != 0);
> -	BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
> -	BUG_ON(blkif->free_pages_num != 0);
> -	BUG_ON(!list_empty(&blkif->persistent_purge_list));
> -	BUG_ON(!list_empty(&blkif->free_pages));
> -	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> -
> +	BUG_ON(ring->persistent_gnt_c != 0);
> +	BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
> +	BUG_ON(ring->free_pages_num != 0);
> +	BUG_ON(!list_empty(&ring->persistent_purge_list));
> +	BUG_ON(!list_empty(&ring->free_pages));
> +	BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
> +
> +	i = 0;
>  	/* Check that there is no request in use */
> -	list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
> +	list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
>  		list_del(&req->free_list);
> -
> -		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
> +		for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
> +			if (!req->segments[j])
> +				break;
>  			kfree(req->segments[j]);
> -
> -		for (j = 0; j < MAX_INDIRECT_PAGES; j++)
> +		}
> +		for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
> +			if (!req->segments[j])
> +				break;
>  			kfree(req->indirect_pages[j]);
> -
> +		}
>  		kfree(req);
>  		i++;
>  	}
> +	WARN_ON(i != XEN_RING_REQS(ring->blkif->nr_rings));
>  
> -	WARN_ON(i != XEN_BLKIF_REQS);
> -
> -	kmem_cache_free(xen_blkif_cachep, blkif);
> +	if (atomic_dec_and_test(&ring->blkif->refcnt))
> +		xen_blkif_free(ring->blkif);
>  }
>  
>  int __init xen_blkif_interface_init(void)
> @@ -333,6 +383,29 @@ int __init xen_blkif_interface_init(void)
>  	{								\
>  		struct xenbus_device *dev = to_xenbus_device(_dev);	\
>  		struct backend_info *be = dev_get_drvdata(&dev->dev);	\
> +		struct xen_blkif *blkif = be->blkif;			\
> +		struct xen_blkif_ring *ring;				\
> +		int i;							\
> +									\
> +		blkif->st_oo_req = 0;					\
> +		blkif->st_rd_req = 0;					\
> +		blkif->st_wr_req = 0;					\
> +		blkif->st_f_req = 0;					\
> +		blkif->st_ds_req = 0;					\
> +		blkif->st_rd_sect = 0;					\
> +		blkif->st_wr_sect = 0;					\
> +		for (i = 0 ; i < blkif->nr_rings ; i++) {		\
> +			ring = &blkif->rings[i];			\
> +			spin_lock_irq(&ring->stats_lock);		\
> +			blkif->st_oo_req += ring->st_oo_req;		\
> +			blkif->st_rd_req += ring->st_rd_req;		\
> +			blkif->st_wr_req += ring->st_wr_req;		\
> +			blkif->st_f_req += ring->st_f_req;		\
> +			blkif->st_ds_req += ring->st_ds_req;		\
> +			blkif->st_rd_sect += ring->st_rd_sect;		\
> +			blkif->st_wr_sect += ring->st_wr_sect;		\
> +			spin_unlock_irq(&ring->stats_lock);		\
> +		}							\


Ah, that is why you had extra statistics! Please mention that
in the commit description

Could we make this macro a bit smarter and just compute the appropriate
value that is asked for?

Right now if I just want to see ds_req I end up computing not only 'st_ds_req'
but also all of the rest of them.

But making that fit nicely in this macro would be ugly.

Perhaps you can use offsetof?

Like so:

struct vbd_stats_offset {
	unsigned int global;
	unsigned int per_ring;
};

static const struct vbd_stats_offset vbd_offsets[] = {
	{offsetof(struct xen_blkif, st_oo_req), offsetof(struct xen_blkif_ring, st_oo_req)},
	...
};

And in the VBD macro:

	unsigned long long val = 0;
	/* assumes the macro's 'name' argument matches the st_ field suffix */
	unsigned int offset = offsetof(struct xen_blkif, st_##name);
	unsigned int i, j;

	for (i = 0; i < ARRAY_SIZE(vbd_offsets); i++) {
		const struct vbd_stats_offset *offsets = &vbd_offsets[i];

		if (offsets->global == offset) {
			for (j = 0; j < blkif->nr_rings; j++) {
				/* per_ring is a byte offset into the ring structure */
				char *ring = (char *)&blkif->rings[j];

				val += *(unsigned long long *)(ring + offsets->per_ring);
			}
			break;
		}
	}
	return sprintf(buf, format, val);

Minus any bugs in the above code..

Which should make it:
 - Faster (as we would be taking the lock only when looking in the ring)
 - No need to have the extra global statistics as we compute them on demand.

>  									\
>  		return sprintf(buf, format, ##args);			\
>  	}								\
> @@ -453,6 +526,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  		handle, blkif->domid);
>  	return 0;
>  }
> +

Spurious change.
>  static int xen_blkbk_remove(struct xenbus_device *dev)
>  {
>  	struct backend_info *be = dev_get_drvdata(&dev->dev);
> @@ -468,13 +542,14 @@ static int xen_blkbk_remove(struct xenbus_device *dev)
>  		be->backend_watch.node = NULL;
>  	}
>  
> -	dev_set_drvdata(&dev->dev, NULL);
> -
>  	if (be->blkif) {
> +		int i = 0;
>  		xen_blkif_disconnect(be->blkif);
> -		xen_blkif_put(be->blkif);
> +		for (; i < be->blkif->nr_rings ; i++)

Let's do the 'i = 0' in the loop in case somebody in the future
modifies it and does some operation on 'i' before the loop.

> +			xen_ring_put(&be->blkif->rings[i]);
>  	}
>  
> +	dev_set_drvdata(&dev->dev, NULL);

How come you move it _after_ ?

>  	kfree(be->mode);
>  	kfree(be);
>  	return 0;
> @@ -851,21 +926,46 @@ again:
>  static int connect_ring(struct backend_info *be)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 1/5] xen, blkfront: port to the the multi-queue block layer API
  2014-09-11 23:57 ` Arianna Avanzini
  2014-09-13 19:29   ` Christoph Hellwig
  2014-09-13 19:29   ` Christoph Hellwig
@ 2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:18 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

On Fri, Sep 12, 2014 at 01:57:20AM +0200, Arianna Avanzini wrote:
> This commit introduces support for the multi-queue block layer API,
> and at the same time removes the existing request_queue API support.
> The changes are only structural, and the number of supported hardware
> contexts is forcedly set to one.

Hey Arianna,

Thank you for posting them and sorry for the long time in reviewing it.

> 
> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
> ---
>  drivers/block/xen-blkfront.c | 171 ++++++++++++++++++++-----------------------
>  1 file changed, 80 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5deb235..109add6 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -37,6 +37,7 @@
>  
>  #include <linux/interrupt.h>
>  #include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
>  #include <linux/hdreg.h>
>  #include <linux/cdrom.h>
>  #include <linux/module.h>
> @@ -134,6 +135,8 @@ struct blkfront_info
>  	unsigned int feature_persistent:1;
>  	unsigned int max_indirect_segments;
>  	int is_ready;
> +	/* Block layer tags. */
> +	struct blk_mq_tag_set tag_set;
>  };
>  
>  static unsigned int nr_minors;
> @@ -582,66 +585,69 @@ static inline void flush_requests(struct blkfront_info *info)
>  		notify_remote_via_irq(info->irq);
>  }
>  
> -/*
> - * do_blkif_request
> - *  read a block; request is in a request queue
> - */
> -static void do_blkif_request(struct request_queue *rq)

> +static int blkfront_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
>  {
> -	struct blkfront_info *info = NULL;
> -	struct request *req;
> -	int queued;
> -
> -	pr_debug("Entered do_blkif_request\n");
> -
> -	queued = 0;
> -

This loop allowed us to queue up on the ring up to 32 requests
(or more, depending on the options), and then when done
(or when no more could be fitted), call 'flush_requests'
which would update the producer index and kick the backend (if needed).

With the removal of the loop we could be called for each
I/O and do a bunch of: write to the ring, update producer index, kick backend,
write to the ring, update producer index, kick backend, etc.


> -	while ((req = blk_peek_request(rq)) != NULL) {
> -		info = req->rq_disk->private_data;
> +	struct blkfront_info *info = req->rq_disk->private_data;
>  
> -		if (RING_FULL(&info->ring))
> -			goto wait;
> +	spin_lock_irq(&info->io_lock);
> +	if (RING_FULL(&info->ring))
> +		goto wait;
>  
> -		blk_start_request(req);
> +	if ((req->cmd_type != REQ_TYPE_FS) ||
> +			((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
> +			 !info->flush_op)) {

You had an earlier patch ("xen, blkfront: factor out flush-related checks from
do_blkif_request()") which made that 'if' statement much nicer. I put said patch
on the 3.18 train for Jens - so once v3.18-rc1 is out you can rebase on that and
pick this up.

> +		req->errors = -EIO;
> +		blk_mq_complete_request(req);
> +		spin_unlock_irq(&info->io_lock);
> +		return BLK_MQ_RQ_QUEUE_ERROR;
> +	}
>  
> -		if ((req->cmd_type != REQ_TYPE_FS) ||
> -		    ((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
> -		    !info->flush_op)) {
> -			__blk_end_request_all(req, -EIO);
> -			continue;
> -		}
> +	if (blkif_queue_request(req)) {
> +		blk_mq_requeue_request(req);
> +		goto wait;
> +	}
>  
> -		pr_debug("do_blk_req %p: cmd %p, sec %lx, "
> -			 "(%u/%u) [%s]\n",
> -			 req, req->cmd, (unsigned long)blk_rq_pos(req),
> -			 blk_rq_cur_sectors(req), blk_rq_sectors(req),
> -			 rq_data_dir(req) ? "write" : "read");
> +	flush_requests(info);
> +	spin_unlock_irq(&info->io_lock);
> +	return BLK_MQ_RQ_QUEUE_OK;
>  
> -		if (blkif_queue_request(req)) {
> -			blk_requeue_request(rq, req);
>  wait:
> -			/* Avoid pointless unplugs. */
> -			blk_stop_queue(rq);
> -			break;
> -		}
> -
> -		queued++;
> -	}
> -
> -	if (queued != 0)
> -		flush_requests(info);
> +	/* Avoid pointless unplugs. */
> +	blk_mq_stop_hw_queue(hctx);
> +	spin_unlock_irq(&info->io_lock);
> +	return BLK_MQ_RQ_QUEUE_BUSY;
>  }
>  
> +static struct blk_mq_ops blkfront_mq_ops = {
> +	.queue_rq = blkfront_queue_rq,
> +	.map_queue = blk_mq_map_queue,
> +};
> +
>  static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
>  				unsigned int physical_sector_size,
>  				unsigned int segments)
>  {
>  	struct request_queue *rq;
>  	struct blkfront_info *info = gd->private_data;
> +	int ret;
> +
> +	memset(&info->tag_set, 0, sizeof(info->tag_set));

You can ditch that. In 'blkfront_probe' we use 'kzalloc' to setup
the 'info' structure so it is already set to zero values.

> +	info->tag_set.ops = &blkfront_mq_ops;
> +	info->tag_set.nr_hw_queues = 1;
> +	info->tag_set.queue_depth = BLK_RING_SIZE;

Could you add a simple comment (or I can do it when I pick up v3):
/* TODO: Need to figure out how to square this with 'xen_blkif_max_segments' */

> +	info->tag_set.numa_node = NUMA_NO_NODE;
> +	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> +	info->tag_set.cmd_size = 0;

Could you add a comment about why we have zero here? My recollection is
that we do not need it as we keep track of the request internally (for
migration purposes and if we did persistent mapping).

/*
 * And the reason we want it set to zero is because we want to be responsible
 * for page recycling (in case it is persistent and we need to manually
 * deal with unmapping it, etc).
 */

> +	info->tag_set.driver_data = info;

Or perhaps mention that since we have 'driver_data' we can access 'info'
and do all internal mapping/unmapping/etc on that.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings
  2014-09-11 23:57 ` Arianna Avanzini
@ 2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:18 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

On Fri, Sep 12, 2014 at 01:57:21AM +0200, Arianna Avanzini wrote:
> This commit introduces in xen-blkfront actual support for multiple
> block rings. The number of block rings to be used is still forced
> to one.

I would add:

Since this is just a patch hoisting members from 'struct blkfront_info'
into a 'struct blkfront_ring_info' to allow multiple rings, we do not
introduce any new behavior to take advantage of it. The patch
consists of:
 - Accessing members through 'info->rinfo' instead of directly
   through 'info'.
 - Adding the 'init_ctx' hook for the MQ API to properly initialize
   the multiple-ring support.
 - Moving 'fill_grant_buffer' to a different function so that it is
   done per ring, instead of in the function that gathers the count
   of segments - which is only called once - since we need to call
   'fill_grant_buffer' for each ring.

That said, some comments:

> +static void blkif_free(struct blkfront_info *info, int suspend)
> +{
> +	struct grant *persistent_gnt;
> +	struct grant *n;
> +	int i;
>  
> -	/* Free resources associated with old device channel. */
> -	if (info->ring_ref != GRANT_INVALID_REF) {
> -		gnttab_end_foreign_access(info->ring_ref, 0,
> -					  (unsigned long)info->ring.sring);
> -		info->ring_ref = GRANT_INVALID_REF;
> -		info->ring.sring = NULL;
> -	}
> -	if (info->irq)
> -		unbind_from_irqhandler(info->irq, info);
> -	info->evtchn = info->irq = 0;
> +	info->connected = suspend ?
> +		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> +
> +	/*
> +	 * Prevent new requests being issued until we fix things up:
> +	 * no more blkif_request() and no more gnttab callback work.
> +	 */
> +	if (info->rq) {

I would have done it the other way around to make it easier to read the code.

That is, do:

	if (!info->rq)
		return;

And then you can continue with the code without having to worry about
hitting StyleGuide issues (the 80- or 120-character line limit or such).

> +		blk_mq_stop_hw_queues(info->rq);
> +
> +		for (i = 0 ; i < info->nr_rings ; i++) {
> +			struct blkfront_ring_info *rinfo = &info->rinfo[i];
> +
> +			spin_lock_irq(&info->rinfo[i].io_lock);
> +			/* Remove all persistent grants */
> +			if (!list_empty(&rinfo->grants)) {
> +				list_for_each_entry_safe(persistent_gnt, n,
> +				                         &rinfo->grants, node) {
> +					list_del(&persistent_gnt->node);
> +					if (persistent_gnt->gref != GRANT_INVALID_REF) {
> +						gnttab_end_foreign_access(persistent_gnt->gref,
> +						                          0, 0UL);
> +						rinfo->persistent_gnts_c--;
> +					}
> +					if (info->feature_persistent)
> +						__free_page(pfn_to_page(persistent_gnt->pfn));
> +					kfree(persistent_gnt);
> +				}
> +			}
> +			BUG_ON(rinfo->persistent_gnts_c != 0);
> +
> +			/*
> +			 * Remove indirect pages, this only happens when using indirect
> +			 * descriptors but not persistent grants
> +			 */
> +			if (!list_empty(&rinfo->indirect_pages)) {
> +				struct page *indirect_page, *n;
> +
> +				BUG_ON(info->feature_persistent);
> +				list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
> +					list_del(&indirect_page->lru);
> +					__free_page(indirect_page);
> +				}
> +			}
>  
> +			blkif_free_ring(rinfo, info->feature_persistent);
> +
> +			gnttab_cancel_free_callback(&rinfo->callback);
> +			spin_unlock_irq(&rinfo->io_lock);
> +
> +			/* Flush gnttab callback work. Must be done with no locks held. */
> +			flush_work(&info->rinfo[i].work);
> +
> +			/* Free resources associated with old device channel. */
> +			if (rinfo->ring_ref != GRANT_INVALID_REF) {
> +				gnttab_end_foreign_access(rinfo->ring_ref, 0,
> +							  (unsigned long)rinfo->ring.sring);
> +				rinfo->ring_ref = GRANT_INVALID_REF;
> +				rinfo->ring.sring = NULL;
> +			}
> +			if (rinfo->irq)
> +				unbind_from_irqhandler(rinfo->irq, rinfo);
> +			rinfo->evtchn = rinfo->irq = 0;
> +		}
> +	}
>  }
>  

.. snip..
> @@ -1274,13 +1333,16 @@ static int talk_to_blkback(struct xenbus_device *dev,
>  			   struct blkfront_info *info)
>  {
>  	const char *message = NULL;
> +	char ring_ref_s[64] = "", evtchn_s[64] = "";

Why the usage of these, since:
>  	struct xenbus_transaction xbt;
> -	int err;
> +	int i, err;
>  
> -	/* Create shared ring, alloc event channel. */
> -	err = setup_blkring(dev, info);
> -	if (err)
> -		goto out;
> +	for (i = 0 ; i < info->nr_rings ; i++) {
> +		/* Create shared ring, alloc event channel. */
> +		err = setup_blkring(dev, &info->rinfo[i]);
> +		if (err)
> +			goto out;
> +	}
>  
>  again:
>  	err = xenbus_transaction_start(&xbt);
> @@ -1289,18 +1351,24 @@ again:
>  		goto destroy_blkring;
>  	}
>  
> -	err = xenbus_printf(xbt, dev->nodename,
> -			    "ring-ref", "%u", info->ring_ref);
> -	if (err) {
> -		message = "writing ring-ref";
> -		goto abort_transaction;
> -	}
> -	err = xenbus_printf(xbt, dev->nodename,
> -			    "event-channel", "%u", info->evtchn);
> -	if (err) {
> -		message = "writing event-channel";
> -		goto abort_transaction;
> +	for (i = 0 ; i < info->nr_rings ; i++) {
> +		BUG_ON(i > 0);
> +		snprintf(ring_ref_s, 64, "ring-ref");
> +		snprintf(evtchn_s, 64, "event-channel");

You seem to just be writing the same value on each iteration? Is that
to prepare for when we would actually use the index? I would rather ditch
this here and keep it in the patch that actually uses it.

Also the 64 needs a #define.
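Something like this, say (the name is just a suggestion):

	#define XENSTORE_KEY_LEN	32	/* "event-channel-" + up to 10 digits + NUL */

	char ring_ref_s[XENSTORE_KEY_LEN] = "", evtchn_s[XENSTORE_KEY_LEN] = "";
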
> +		err = xenbus_printf(xbt, dev->nodename,
> +				    ring_ref_s, "%u", info->rinfo[i].ring_ref);
> +		if (err) {
> +			message = "writing ring-ref";
> +			goto abort_transaction;
> +		}
> +		err = xenbus_printf(xbt, dev->nodename,
> +				    evtchn_s, "%u", info->rinfo[i].evtchn);
> +		if (err) {
> +			message = "writing event-channel";
> +			goto abort_transaction;
> +		}
>  	}
> +
>  	err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
>  			    XEN_IO_PROTO_ABI_NATIVE);
>  	if (err) {
> @@ -1344,7 +1412,7 @@ again:
>  static int blkfront_probe(struct xenbus_device *dev,
>  			  const struct xenbus_device_id *id)
>  {
> -	int err, vdevice, i;
> +	int err, vdevice, i, r;
>  	struct blkfront_info *info;
>  
>  	/* FIXME: Use dynamic device id if this is not set. */
> @@ -1396,23 +1464,36 @@ static int blkfront_probe(struct xenbus_device *dev,
>  	}
>  
>  	mutex_init(&info->mutex);
> -	spin_lock_init(&info->io_lock);
>  	info->xbdev = dev;
>  	info->vdevice = vdevice;
> -	INIT_LIST_HEAD(&info->grants);
> -	INIT_LIST_HEAD(&info->indirect_pages);
> -	info->persistent_gnts_c = 0;
>  	info->connected = BLKIF_STATE_DISCONNECTED;
> -	INIT_WORK(&info->work, blkif_restart_queue);
> -
> -	for (i = 0; i < BLK_RING_SIZE; i++)
> -		info->shadow[i].req.u.rw.id = i+1;
> -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
>  
>  	/* Front end dir is a number, which is used as the id. */
>  	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
>  	dev_set_drvdata(&dev->dev, info);
>  
> +	/* Allocate the correct number of rings. */
> +	info->nr_rings = 1;
> +	pr_info("blkfront: %s: %d rings\n",
> +		info->gd->disk_name, info->nr_rings);

pr_debug.


> +
> +	info->rinfo = kzalloc(info->nr_rings *
> +				sizeof(struct blkfront_ring_info),
> +			      GFP_KERNEL);

Something is off with the tabs/spaces.

And please do check whether the allocation actually succeeded. As in

if (!info->rinfo) {
	err = -ENOMEM;
	goto out;
}

And add the label 'out' ..
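I.e. something along these lines (untested sketch, reusing the cleanup
already done at the end of blkfront_probe):

	info->rinfo = kzalloc(info->nr_rings * sizeof(*info->rinfo),
			      GFP_KERNEL);
	if (!info->rinfo) {
		err = -ENOMEM;
		goto out;
	}
	...
 out:
	kfree(info);
	dev_set_drvdata(&dev->dev, NULL);
	return err;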

> +	for (r = 0 ; r < info->nr_rings ; r++) {
> +		struct blkfront_ring_info *rinfo = &info->rinfo[r];
> +
> +		rinfo->info = info;
> +		rinfo->persistent_gnts_c = 0;
> +		INIT_LIST_HEAD(&rinfo->grants);
> +		INIT_LIST_HEAD(&rinfo->indirect_pages);
> +		INIT_WORK(&rinfo->work, blkif_restart_queue);
> +		spin_lock_init(&rinfo->io_lock);
> +		for (i = 0; i < BLK_RING_SIZE; i++)
> +			rinfo->shadow[i].req.u.rw.id = i+1;
> +		rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> +	}
> +

==> Right here, 'out:' label :-)

>  	err = talk_to_blkback(dev, info);
>  	if (err) {
>  		kfree(info);
> @@ -1438,88 +1519,100 @@ static void split_bio_end(struct bio *bio, int error)
>  	bio_put(bio);
>  }
>  
> -static int blkif_recover(struct blkfront_info *info)
> +static int blkif_setup_shadow(struct blkfront_ring_info *rinfo,
> +			      struct blk_shadow **copy)
>  {
>  	int i;
> +
> +	/* Stage 1: Make a safe copy of the shadow state. */
> +	*copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
> +		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
> +	if (!*copy)
> +		return -ENOMEM;
> +
> +	/* Stage 2: Set up free list. */
> +	memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
> +	for (i = 0; i < BLK_RING_SIZE; i++)
> +		rinfo->shadow[i].req.u.rw.id = i+1;
> +	rinfo->shadow_free = rinfo->ring.req_prod_pvt;
> +	rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> +
> +	return 0;
> +}
> +
> +static int blkif_recover(struct blkfront_info *info)
> +{
> +	int i, r;
>  	struct request *req, *n;
>  	struct blk_shadow *copy;
> -	int rc;
> +	int rc = 0;
>  	struct bio *bio, *cloned_bio;
> -	struct bio_list bio_list, merge_bio;
> +	struct bio_list uninitialized_var(bio_list), merge_bio;
>  	unsigned int segs, offset;
>  	unsigned long flags;
>  	int pending, size;
>  	struct split_bio *split_bio;
>  	struct list_head requests;
>  
> -	/* Stage 1: Make a safe copy of the shadow state. */
> -	copy = kmemdup(info->shadow, sizeof(info->shadow),
> -		       GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
> -	if (!copy)
> -		return -ENOMEM;
> -
> -	/* Stage 2: Set up free list. */
> -	memset(&info->shadow, 0, sizeof(info->shadow));
> -	for (i = 0; i < BLK_RING_SIZE; i++)
> -		info->shadow[i].req.u.rw.id = i+1;
> -	info->shadow_free = info->ring.req_prod_pvt;
> -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> +	segs = blkfront_gather_indirect(info);
>  
> -	rc = blkfront_setup_indirect(info);
> -	if (rc) {
> -		kfree(copy);
> -		return rc;
> -	}
> +	for (r = 0 ; r < info->nr_rings ; r++) {
> +		rc |= blkif_setup_shadow(&info->rinfo[r], &copy);
> +		rc |= blkfront_setup_indirect(&info->rinfo[r], segs);

Could you add a comment explaining why it is OK to continue if
'blkif_setup_shadow' fails (say with -ENOMEM)?
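If it is not OK to continue, checking each call separately would make the
intent obvious, e.g. (untested):

		rc = blkif_setup_shadow(&info->rinfo[r], &copy);
		if (rc)
			return rc;

		rc = blkfront_setup_indirect(&info->rinfo[r], segs);
		if (rc) {
			kfree(copy);
			return rc;
		}
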
> +		if (rc) {
> +			kfree(copy);
> +			return rc;
> +		}
>  
> -	segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
> -	blk_queue_max_segments(info->rq, segs);
> -	bio_list_init(&bio_list);
> -	INIT_LIST_HEAD(&requests);
> -	for (i = 0; i < BLK_RING_SIZE; i++) {
> -		/* Not in use? */
> -		if (!copy[i].request)
> -			continue;
> +		segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +		blk_queue_max_segments(info->rq, segs);
> +		bio_list_init(&bio_list);
> +		INIT_LIST_HEAD(&requests);
> +		for (i = 0; i < BLK_RING_SIZE; i++) {
> +			/* Not in use? */
> +			if (!copy[i].request)
> +				continue;
>  
> -		/*
> -		 * Get the bios in the request so we can re-queue them.
> -		 */
> -		if (copy[i].request->cmd_flags &
> -		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
>  			/*
> -			 * Flush operations don't contain bios, so
> -			 * we need to requeue the whole request
> +			 * Get the bios in the request so we can re-queue them.
>  			 */
> -			list_add(&copy[i].request->queuelist, &requests);
> -			continue;
> +			if (copy[i].request->cmd_flags &
> +			    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
> +				/*
> +				 * Flush operations don't contain bios, so
> +				 * we need to requeue the whole request
> +				 */
> +				list_add(&copy[i].request->queuelist, &requests);
> +				continue;
> +			}
> +			merge_bio.head = copy[i].request->bio;
> +			merge_bio.tail = copy[i].request->biotail;
> +			bio_list_merge(&bio_list, &merge_bio);
> +			copy[i].request->bio = NULL;
> +			blk_put_request(copy[i].request);
>  		}
> -		merge_bio.head = copy[i].request->bio;
> -		merge_bio.tail = copy[i].request->biotail;
> -		bio_list_merge(&bio_list, &merge_bio);
> -		copy[i].request->bio = NULL;
> -		blk_put_request(copy[i].request);
> +		kfree(copy);
>  	}
>  
> -	kfree(copy);
> -
>  	xenbus_switch_state(info->xbdev, XenbusStateConnected);
>  
> -	spin_lock_irqsave(&info->io_lock, flags);
> -
>  	/* Now safe for us to use the shared ring */
>  	info->connected = BLKIF_STATE_CONNECTED;
>  
> -	/* Kick any other new requests queued since we resumed */
> -	kick_pending_request_queues(info, &flags);
> +	for (i = 0 ; i < info->nr_rings ; i++) {
> +		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
> +		/* Kick any other new requests queued since we resumed */
> +		kick_pending_request_queues(&info->rinfo[i], &flags);
>  
> -	list_for_each_entry_safe(req, n, &requests, queuelist) {
> -		/* Requeue pending requests (flush or discard) */
> -		list_del_init(&req->queuelist);
> -		BUG_ON(req->nr_phys_segments > segs);
> -		blk_mq_requeue_request(req);
> +		list_for_each_entry_safe(req, n, &requests, queuelist) {
> +			/* Requeue pending requests (flush or discard) */
> +			list_del_init(&req->queuelist);
> +			BUG_ON(req->nr_phys_segments > segs);
> +			blk_mq_requeue_request(req);
> +		}
> +		spin_unlock_irqrestore(&info->rinfo[i].io_lock, flags);
>  	}
>  
> -	spin_unlock_irqrestore(&info->io_lock, flags);
> -
>  	while ((bio = bio_list_pop(&bio_list)) != NULL) {
>  		/* Traverse the list of pending bios and re-queue them */
>  		if (bio_segments(bio) > segs) {
> @@ -1643,14 +1736,15 @@ static void blkfront_setup_discard(struct blkfront_info *info)
>  		info->feature_secdiscard = !!discard_secure;
>  }
>  
> -static int blkfront_setup_indirect(struct blkfront_info *info)
> +
> +static int blkfront_gather_indirect(struct blkfront_info *info)
>  {
>  	unsigned int indirect_segments, segs;
> -	int err, i;
> +	int err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> +				"feature-max-indirect-segments", "%u",
> +				&indirect_segments,
> +				NULL);
>  
> -	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> -			    "feature-max-indirect-segments", "%u", &indirect_segments,
> -			    NULL);


That looks like an unrelated change. Could you leave it as it was, please?

>  	if (err) {
>  		info->max_indirect_segments = 0;
>  		segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> @@ -1660,7 +1754,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>  		segs = info->max_indirect_segments;
>  	}
>  
> -	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
> +	return segs;
> +}
> +
> +static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo,
> +				   unsigned int segs)
> +{
> +	struct blkfront_info *info = rinfo->info;
> +	int err, i;
> +
> +	err = fill_grant_buffer(rinfo, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);


Can you say a bit about why you are moving it from 'blkfront_gather_indirect' to
'blkfront_setup_indirect' - preferably in the commit description.

I presume it would be because in the past we did it in blkfront_gather_indirect
and since it does not iterate over the number of rings (but just gives you the
segment value), we need to move it to the function that will be called for
each ring?

It actually seems that this sort of change could be done earlier (as a standalone
patch), as I think the fill_grant_buffer call in blkfront_gather_indirect is rather
ill-placed (that function is about gathering data, not setting things up!).

Anyhow, it is OK to have it in this patch - but just please mention that in the commit
description.
>  	if (err)
>  		goto out_of_memory;
>  

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend
  2014-09-11 23:57 ` Arianna Avanzini
                     ` (2 preceding siblings ...)
  2014-10-01 20:18   ` Konrad Rzeszutek Wilk
@ 2014-10-01 20:18   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:18 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

On Fri, Sep 12, 2014 at 01:57:22AM +0200, Arianna Avanzini wrote:
> This commit implements the negotiation of the number of block rings
> to be used; as a default, the number of rings is decided by the
> frontend driver and is equal to the number of hardware queues that
> the backend makes available. In case of guest migration towards a
> host whose devices expose a different number of hardware queues, the
> number of I/O rings used by the frontend driver remains the same;
> XenStore keys may vary if the frontend needs to be compatible with
> a host not having multi-queue support.
> 
> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
> ---
>  drivers/block/xen-blkfront.c | 95 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 84 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 9282df1..77e311d 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -137,7 +137,7 @@ struct blkfront_info
>  	int vdevice;
>  	blkif_vdev_t handle;
>  	enum blkif_state connected;
> -	unsigned int nr_rings;
> +	unsigned int nr_rings, old_nr_rings;
>  	struct blkfront_ring_info *rinfo;
>  	struct request_queue *rq;
>  	unsigned int feature_flush;
> @@ -147,6 +147,7 @@ struct blkfront_info
>  	unsigned int discard_granularity;
>  	unsigned int discard_alignment;
>  	unsigned int feature_persistent:1;
> +	unsigned int hardware_queues;
>  	unsigned int max_indirect_segments;
>  	int is_ready;
>  	/* Block layer tags. */
> @@ -669,7 +670,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
>  
>  	memset(&info->tag_set, 0, sizeof(info->tag_set));
>  	info->tag_set.ops = &blkfront_mq_ops;
> -	info->tag_set.nr_hw_queues = 1;
> +	info->tag_set.nr_hw_queues = info->hardware_queues ? : 1;
>  	info->tag_set.queue_depth = BLK_RING_SIZE;
>  	info->tag_set.numa_node = NUMA_NO_NODE;
>  	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> @@ -938,6 +939,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
>  	info->gd = NULL;
>  }
>  
> +/* Must be called with io_lock held */
>  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo,
>  					unsigned long *flags)
>  {
> @@ -1351,10 +1353,24 @@ again:
>  		goto destroy_blkring;
>  	}
>  
> +	/* Advertise the number of rings */
> +	err = xenbus_printf(xbt, dev->nodename, "nr_blk_rings",
> +			    "%u", info->nr_rings);
> +	if (err) {
> +		xenbus_dev_fatal(dev, err, "advertising number of rings");
> +		goto abort_transaction;


Not sure I understand the purpose of this?

The 'hardware_queues' value is gathered via 'nr_supported_hw_queues', which is what
the backend advertises. Then you tell the backend how many you are going to use via
a different XenStore key?


I would simplify this by having one key: 'multi-queue-max-queues' or such. The backend
would write its max in its directory
(/local/domain/0/backend/drivers/<front domid>/vbd/<id>/multi-queue-max-queues).

The frontend would then write 'multi-queue-num-queues' in its own directory. The value
it writes MUST be less than or equal to what the backend says.

And of course if we don't get the value we assume '1'.

This follows what xen-netfront does.

That would also simplify this code as we could just assume that 'nr_rings == hardware_queues'
throughout. And in 'blkfront_gather_hw_queues' you can add:

 unsigned int val;

 int err = xenbus_gather(..);
 if (err)
	/* Backend too old. One ring. */
	return 1;
 return min(num_online_cpus(), val);

And then in 'talk_to_blkback' you can write that 'info->nr_rings' value in 'feature-nr-rings'.

If the backend is deaf (too old) it will ignore it. If it is newer, it will do the
right thing.
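A rough, untested sketch of that; the key names below follow the xen-netfront
convention, so substitute whichever names we finally agree on:

static unsigned int blkfront_gather_hw_queues(struct blkfront_info *info)
{
	unsigned int backend_max;

	if (xenbus_scanf(XBT_NIL, info->xbdev->otherend,
			 "multi-queue-max-queues", "%u", &backend_max) != 1)
		return 1;	/* backend too old: stick to a single ring */

	return min(num_online_cpus(), backend_max);
}

and in talk_to_blkback():

	err = xenbus_printf(xbt, dev->nodename, "multi-queue-num-queues",
			    "%u", info->nr_rings);
	if (err) {
		message = "writing multi-queue-num-queues";
		goto abort_transaction;
	}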

> +	}
> +
>  	for (i = 0 ; i < info->nr_rings ; i++) {
> -		BUG_ON(i > 0);
> -		snprintf(ring_ref_s, 64, "ring-ref");
> -		snprintf(evtchn_s, 64, "event-channel");

Instead of 64, I would use a macro for the value. And I think you can
do 19 + 13 = 32 (19 to represent a 2^64 value, and 13 for 'event-channel').

> +		if (!info->hardware_queues) {
> +			BUG_ON(i > 0);
> +			/* Support old XenStore keys */
> +			snprintf(ring_ref_s, 64, "ring-ref");
> +			snprintf(evtchn_s, 64, "event-channel");
> +		} else {
> +			snprintf(ring_ref_s, 64, "ring-ref-%d", i);
> +			snprintf(evtchn_s, 64, "event-channel-%d", i);
> +		}
>  		err = xenbus_printf(xbt, dev->nodename,
>  				    ring_ref_s, "%u", info->rinfo[i].ring_ref);

That actually looks odd. %u is for unsigned int, but 'ring_ref' is an int. Hm, another bug.

>  		if (err) {
> @@ -1403,6 +1419,14 @@ again:
>  	return err;
>  }
>  
> +static inline int blkfront_gather_hw_queues(struct blkfront_info *info,
> +					    unsigned int *nr_queues)
> +{
> +	return xenbus_gather(XBT_NIL, info->xbdev->otherend,
> +			     "nr_supported_hw_queues", "%u", nr_queues,
> +			     NULL);
> +}
> +

See above about my comment.

Oh, and also this patch should have a blurb in the blkif.h about this.
>  /**
>   * Entry point to this code when a new device is created.  Allocate the basic
>   * structures and the ring buffer for communication with the backend, and
> @@ -1414,6 +1438,7 @@ static int blkfront_probe(struct xenbus_device *dev,
>  {
>  	int err, vdevice, i, r;
>  	struct blkfront_info *info;
> +	unsigned int nr_queues;
>  
>  	/* FIXME: Use dynamic device id if this is not set. */
>  	err = xenbus_scanf(XBT_NIL, dev->nodename,
> @@ -1472,10 +1497,19 @@ static int blkfront_probe(struct xenbus_device *dev,
>  	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
>  	dev_set_drvdata(&dev->dev, info);
>  
> -	/* Allocate the correct number of rings. */
> -	info->nr_rings = 1;
> -	pr_info("blkfront: %s: %d rings\n",
> -		info->gd->disk_name, info->nr_rings);
> +	/* Gather the number of hardware queues as soon as possible */
> +	err = blkfront_gather_hw_queues(info, &nr_queues);
> +	if (err)
> +		info->hardware_queues = 0;
> +	else
> +		info->hardware_queues = nr_queues;
> +	/*
> +	 * The backend has told us the number of hw queues he wants.
> +	 * Allocate the correct number of rings.
> +	 */
> +	info->nr_rings = info->hardware_queues ? : 1;
> +	pr_info("blkfront: %s: %d hardware queues, %d rings\n",
> +		info->gd->disk_name, info->hardware_queues, info->nr_rings);

pr_debug

>  
>  	info->rinfo = kzalloc(info->nr_rings *
>  				sizeof(struct blkfront_ring_info),
> @@ -1556,7 +1590,7 @@ static int blkif_recover(struct blkfront_info *info)
>  
>  	segs = blkfront_gather_indirect(info);
>  
> -	for (r = 0 ; r < info->nr_rings ; r++) {
> +	for (r = 0 ; r < info->old_nr_rings ; r++) {

I would stash 'old_nr_rings' as a parameter for 'blkif_recover'.

But that means the caller of 'blkif_recover' has to pass this in (talk_to_blkback).
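I.e. something like:

	static int blkif_recover(struct blkfront_info *info,
				 unsigned int old_nr_rings);

with the caller saving the old value before it gets overwritten and passing it in.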


>  		rc |= blkif_setup_shadow(&info->rinfo[r], &copy);
>  		rc |= blkfront_setup_indirect(&info->rinfo[r], segs);
>  		if (rc) {
> @@ -1599,7 +1633,7 @@ static int blkif_recover(struct blkfront_info *info)
>  	/* Now safe for us to use the shared ring */
>  	info->connected = BLKIF_STATE_CONNECTED;
>  
> -	for (i = 0 ; i < info->nr_rings ; i++) {
> +	for (i = 0 ; i < info->old_nr_rings ; i++) {
>  		spin_lock_irqsave(&info->rinfo[i].io_lock, flags);
>  		/* Kick any other new requests queued since we resumed */
>  		kick_pending_request_queues(&info->rinfo[i], &flags);
> @@ -1659,11 +1693,46 @@ static int blkfront_resume(struct xenbus_device *dev)
>  {
>  	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
>  	int err;
> +	unsigned int nr_queues, prev_nr_queues;
> +	bool mq_to_rq_transition;
>  
>  	dev_dbg(&dev->dev, "blkfront_resume: %s\n", dev->nodename);
>  
> +	prev_nr_queues = info->hardware_queues;
> +
> +	err = blkfront_gather_hw_queues(info, &nr_queues);
> +	if (err < 0)
> +		nr_queues = 0;

Or you could have set 'nr_queues' when defining the value.

Anyhow, with my idea of making blkfront_gather_hw_queues set
info->nr_rings to the value from the backend (or to one for old
backends), this means your code here:
> +	mq_to_rq_transition = prev_nr_queues && !nr_queues;

Will require that you do:
	mq_to_rq_transition = info->nr_shadow_rings != info->nr_rings;

> +
> +	if (prev_nr_queues != nr_queues) {
> +		printk(KERN_INFO "blkfront: %s: hw queues %u -> %u\n",
> +		       info->gd->disk_name, prev_nr_queues, nr_queues);

pr_info. Or pr_debug.

> +		if (mq_to_rq_transition) {

OK, so if we transition to _less_ than what we had we assume one ring.

Why not just negotiate:
	max(info->nr_shadows_rings, info->nr_rings);

And we assume that nr_shadows_rings will be set on the first allocation
to info->nr_rings. That should make it possible to use up to the maximum
number of contexts we had when we initially started the driver.

Granted, once we migrate once more, nr_shadows_rings gets updated
to a new value and info->nr_rings too.. so eventually that might
go down to one and stay there, even if we migrate back to the initial
host that has 32 queues or such.

I think we might need three members in 'info' (see the sketch after this list):

 'shadow_nr_ring' - used by blkif_recover only. It is set to some value in
     'blkfront_resume' and reset to zero once the recovery code is done.

 'init_nr_ring' - the initial value of contexts we want. Perhaps ignore the
    XenBus state and just base it on the number of CPUs we booted with? That
    way we have ample room. And if we migrate to old backends we just use one
    ring, and if we migrate between different hosts that have a different
    number of hardware contexts we just use up to 'init_nr_ring'.

    Oh, and this should probably be gated by some module parameter for
    adventurous users to tweak.

 'nr_ring' - the currently negotiated number of rings. Can fluctuate between
   migrations, but will always be equal to or below 'init_nr_ring'. When migrating
   its value will be written to 'shadow_nr_ring' to figure out how many
   rings to iterate through.
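A rough sketch of those three members (purely illustrative, following the
description above):

	struct blkfront_info {
		...
		/* Rings blkif_recover() must still walk; set on resume,
		 * cleared once recovery is done. */
		unsigned int shadow_nr_ring;
		/* Upper bound picked when the driver is first probed. */
		unsigned int init_nr_ring;
		/* Currently negotiated count, always <= init_nr_ring. */
		unsigned int nr_ring;
		...
	};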
 
> +			struct blk_mq_hw_ctx *hctx;
> +			unsigned int i;
> +			/*
> +			 * Switch from multi-queue to single-queue:
> +			 * update hctx-to-ring mapping before
> +			 * resubmitting any requests
> +			 */
> +			queue_for_each_hw_ctx(info->rq, hctx, i)
> +				hctx->driver_data = &info->rinfo[0];
> +		}
> +		info->hardware_queues = nr_queues;
> +	}
> +
>  	blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
>  
> +	/* Free with old number of rings, but rebuild with new */
> +	info->old_nr_rings = info->nr_rings;
> +	/*
> +	 * Must not update if transition didn't happen, we're keeping
> +	 * the old number of rings.
> +	 */
> +	if (mq_to_rq_transition)
> +		info->nr_rings = 1;
> +
>  	err = talk_to_blkback(dev, info);
>  
>  	/*
> @@ -1863,6 +1932,10 @@ static void blkfront_connect(struct blkfront_info *info)
>  		 * supports indirect descriptors, and how many.
>  		 */
>  		blkif_recover(info);
> +		info->rinfo = krealloc(info->rinfo,
> +				       info->nr_rings * sizeof(struct blkfront_ring_info),
> +				       GFP_KERNEL);
> +

David already mentioned it, but we also need to deal with the case of this
giving us -ENOMEM and scaling back (perhaps to just one ring).
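At the very least the old pointer must not be lost on failure; untested sketch
(actually scaling the ring count back would also have to be re-negotiated with
the backend):

		struct blkfront_ring_info *rinfo;

		rinfo = krealloc(info->rinfo,
				 info->nr_rings * sizeof(*rinfo), GFP_KERNEL);
		if (rinfo)
			info->rinfo = rinfo;
		else
			/* krealloc failure leaves the old allocation intact */
			info->nr_rings = info->old_nr_rings;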

>  		return;
>  
>  	default:
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend
  2014-09-11 23:57 ` Arianna Avanzini
  2014-09-12 10:58   ` David Vrabel
  2014-09-12 10:58   ` David Vrabel
@ 2014-10-01 20:23   ` Konrad Rzeszutek Wilk
  2014-10-01 20:23   ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:23 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

On Fri, Sep 12, 2014 at 01:57:24AM +0200, Arianna Avanzini wrote:
> This commit lets the backend driver advertise the number of available
> hardware queues; it also implements gathering from the frontend driver
> the number of rings actually available for mapping.
> 
> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
> ---
>  drivers/block/xen-blkback/xenbus.c | 44 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index a4f13cc..9ff6ced 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -477,6 +477,34 @@ static void xen_vbd_free(struct xen_vbd *vbd)
>  	vbd->bdev = NULL;
>  }
>  
> +static int xen_advertise_hw_queues(struct xen_blkif *blkif,
> +				   struct request_queue *q)
> +{
> +	struct xen_vbd *vbd = &blkif->vbd;
> +	struct xenbus_transaction xbt;
> +	int err;
> +
> +	if (q && q->mq_ops)
> +		vbd->nr_supported_hw_queues = q->nr_hw_queues;
> +
> +	err = xenbus_transaction_start(&xbt);
> +	if (err) {
> +		BUG_ON(!blkif->be);
> +		xenbus_dev_fatal(blkif->be->dev, err, "starting transaction (hw queues)");
> +		return err;
> +	}
> +
> +	err = xenbus_printf(xbt, blkif->be->dev->nodename, "nr_supported_hw_queues", "%u",
> +			    blkif->vbd.nr_supported_hw_queues);

I would (as David had mentioned) use the same keys that netfront is using for negotiating
this.

Plus that means you can copy-n-paste the text from netif.h to blkif.h instead of having
to write it :-)
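Something roughly along these lines in blkif.h - treat the wording as a
placeholder to be adapted from netif.h:

/*
 * multi-queue-max-queues
 *	Backend: the number of rings (hardware queues) it can support.
 *
 * multi-queue-num-queues
 *	Frontend: the number of rings it will actually use. Must be less
 *	than or equal to the backend's multi-queue-max-queues. If the key
 *	is absent the backend assumes a single ring.
 */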

> +	if (err)
> +		xenbus_dev_error(blkif->be->dev, err, "writing %s/nr_supported_hw_queues",
> +				 blkif->be->dev->nodename);
> +
> +	xenbus_transaction_end(xbt, 0);
> +
> +	return err;
> +}
> +
>  static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  			  unsigned major, unsigned minor, int readonly,
>  			  int cdrom)
> @@ -484,6 +512,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  	struct xen_vbd *vbd;
>  	struct block_device *bdev;
>  	struct request_queue *q;
> +	int err;
>  
>  	vbd = &blkif->vbd;
>  	vbd->handle   = handle;
> @@ -522,6 +551,10 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  	if (q && blk_queue_secdiscard(q))
>  		vbd->discard_secure = true;
>  
> +	err = xen_advertise_hw_queues(blkif, q);
> +	if (err)
> +		return -ENOENT;
> +
>  	DPRINTK("Successful creation of handle=%04x (dom=%u)\n",
>  		handle, blkif->domid);
>  	return 0;
> @@ -935,7 +968,16 @@ static int connect_ring(struct backend_info *be)
>  
>  	DPRINTK("%s", dev->otherend);
>  
> -	blkif->nr_rings = 1;
> +	err = xenbus_gather(XBT_NIL, dev->otherend, "nr_blk_rings",
> +			    "%u", &blkif->nr_rings, NULL);
> +	if (err) {
> +		/*
> +		 * Frontend does not support multiqueue; force compatibility
> +		 * mode of the driver.
> +		 */
> +		blkif->vbd.nr_supported_hw_queues = 0;
> +		blkif->nr_rings = 1;
I would also add:

	pr_debug("Advertised %u queues, frontend deaf - using one ring.\n");

To make it easier during debugging/testing to figure out why the frontend
does not seem to be using multiple rings.

> +	}
>  
>  	ring_ref = kzalloc(sizeof(unsigned long) * blkif->nr_rings, GFP_KERNEL);
>  	if (!ring_ref)
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
                   ` (9 preceding siblings ...)
  2014-09-11 23:57 ` Arianna Avanzini
@ 2014-10-01 20:27 ` Konrad Rzeszutek Wilk
  2015-04-28  7:36   ` Christoph Hellwig
  2015-04-28  7:36   ` Christoph Hellwig
  2014-10-01 20:27 ` Konrad Rzeszutek Wilk
  11 siblings, 2 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-01 20:27 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, hch,
	bob.liu, felipe.franciosi, axboe

> Any comments or suggestions are more than welcome.

Hey Arianna,

Thank you for posting this patchset. I've gone over each patch and the design
looks quite sound. There are just some required changes, mentioned by Christoph,
David, and me, regarding the different uses of APIs and such.

And I should be able to review the next revision much quicker now that the
giant nexus of _everything_ (Xen 4.5 feature freeze, LinuxCon, tons of reviews,
internal bugs, etc.) has slowed down.

Again, thank you for doing this work and posting the patchset!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2014-10-01 20:27 ` [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Konrad Rzeszutek Wilk
@ 2015-04-28  7:36   ` Christoph Hellwig
  2015-04-28  7:46     ` Arianna Avanzini
  2015-04-28  7:46     ` Arianna Avanzini
  2015-04-28  7:36   ` Christoph Hellwig
  1 sibling, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-04-28  7:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Arianna Avanzini, boris.ostrovsky, david.vrabel, xen-devel,
	linux-kernel, hch, bob.liu, felipe.franciosi, axboe

What happened to this patchset?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-04-28  7:36   ` Christoph Hellwig
  2015-04-28  7:46     ` Arianna Avanzini
@ 2015-04-28  7:46     ` Arianna Avanzini
  2015-05-13 10:29       ` Bob Liu
  2015-05-13 10:29       ` Bob Liu
  1 sibling, 2 replies; 63+ messages in thread
From: Arianna Avanzini @ 2015-04-28  7:46 UTC (permalink / raw)
  To: Christoph Hellwig, Konrad Rzeszutek Wilk
  Cc: boris.ostrovsky, david.vrabel, xen-devel, linux-kernel, bob.liu,
	felipe.franciosi, axboe

Hello Christoph,

Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
> What happened to this patchset?
>

It was passed on to Bob Liu, who published a follow-up patchset here: 
https://lkml.org/lkml/2015/2/15/46

Thanks,
Arianna


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-04-28  7:46     ` Arianna Avanzini
  2015-05-13 10:29       ` Bob Liu
@ 2015-05-13 10:29       ` Bob Liu
  2015-06-30 14:21         ` Marcus Granado
  2015-06-30 14:21         ` [Xen-devel] " Marcus Granado
  1 sibling, 2 replies; 63+ messages in thread
From: Bob Liu @ 2015-05-13 10:29 UTC (permalink / raw)
  To: Arianna Avanzini
  Cc: Christoph Hellwig, Konrad Rzeszutek Wilk, boris.ostrovsky,
	david.vrabel, xen-devel, linux-kernel, felipe.franciosi, axboe


On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
> Hello Christoph,
> 
> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>> What happened to this patchset?
>>
> 
> It was passed on to Bob Liu, who published a follow-up patchset here: https://lkml.org/lkml/2015/2/15/46
> 

Right, and then I was interrupted by another xen-block feature: the 'multi-page' ring.
I will get back to this patchset soon. Thank you!

-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-05-13 10:29       ` Bob Liu
  2015-06-30 14:21         ` Marcus Granado
@ 2015-06-30 14:21         ` Marcus Granado
  2015-07-01  0:04           ` Bob Liu
                             ` (3 more replies)
  1 sibling, 4 replies; 63+ messages in thread
From: Marcus Granado @ 2015-06-30 14:21 UTC (permalink / raw)
  To: Bob Liu, Arianna Avanzini
  Cc: axboe, felipe.franciosi, linux-kernel, Christoph Hellwig,
	david.vrabel, xen-devel, boris.ostrovsky, Jonathan Davies,
	Rafal Mielniczuk

On 13/05/15 11:29, Bob Liu wrote:
>
> On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
>> Hello Christoph,
>>
>> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>>> What happened to this patchset?
>>>
>>
>> It was passed on to Bob Liu, who published a follow-up patchset here: https://lkml.org/lkml/2015/2/15/46
>>
>
> Right, and then I was interrupted by another xen-block feature: 'multi-page' ring.
> Will back on this patchset soon. Thank you!
>
> -Bob
>

Hi,

Our measurements for the multiqueue patch indicate a clear improvement 
in iops when more queues are used.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to 
a dom0 kernel 4.0 on 8 vcpus.

- using a recent Ubuntu 15.04 kernel 3.19 with the multiqueue frontend patch 
applied, used as a guest on 4 vcpus.

- using a Micron RealSSD P320h as the underlying local storage on a Dell 
PowerEdge R720 with 2 Xeon E5-2643 v2 CPUs.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. 
We used direct I/O to skip caching in the guest and ran fio for 60s, 
reading at block sizes ranging from 512 bytes to 4MiB. A queue depth 
of 32 for each queue was used to saturate individual vcpus in the 
guest.

We were interested in observing storage iops for different values of 
block sizes. Our expectation was that iops would improve when increasing 
the number of queues, because both the guest and dom0 would be able to 
make use of more vcpus to handle these requests.

These are the results (as aggregate iops for all the fio threads) that 
we got for the conditions above with sequential reads:

fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
     8           32       512           158K         264K
     8           32        1K           157K         260K
     8           32        2K           157K         258K
     8           32        4K           148K         257K
     8           32        8K           124K         207K
     8           32       16K            84K         105K
     8           32       32K            50K          54K
     8           32       64K            24K          27K
     8           32      128K            11K          13K

8-queue iops was better than single queue iops for all the block sizes. 
There were very good improvements as well for sequential writes with 
block size 4K (from 80K iops with single queue to 230K iops with 8 
queues), and no regressions were visible in any measurement performed.

Marcus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-06-30 14:21         ` [Xen-devel] " Marcus Granado
@ 2015-07-01  0:04           ` Bob Liu
  2015-07-01  0:04           ` Bob Liu
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Bob Liu @ 2015-07-01  0:04 UTC (permalink / raw)
  To: Marcus Granado
  Cc: Arianna Avanzini, axboe, felipe.franciosi, linux-kernel,
	Christoph Hellwig, david.vrabel, xen-devel, boris.ostrovsky,
	Jonathan Davies, Rafal Mielniczuk


On 06/30/2015 10:21 PM, Marcus Granado wrote:
> On 13/05/15 11:29, Bob Liu wrote:
>>
>> On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
>>> Hello Christoph,
>>>
>>> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>>>> What happened to this patchset?
>>>>
>>>
>>> It was passed on to Bob Liu, who published a follow-up patchset here: https://lkml.org/lkml/2015/2/15/46
>>>
>>
>> Right, and then I was interrupted by another xen-block feature: 'multi-page' ring.
>> Will back on this patchset soon. Thank you!
>>
>> -Bob
>>
> 
> Hi,
> 
> Our measurements for the multiqueue patch indicate a clear improvement in iops when more queues are used.
> 
> The measurements were obtained under the following conditions:
> 
> - using blkback as the dom0 backend with the multiqueue patch applied to a dom0 kernel 4.0 on 8 vcpus.
> 
> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend applied to be used as a guest on 4 vcpus
> 
> - using a micron RealSSD P320h as the underlying local storage on a Dell PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
> 
> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. We used direct_io to skip caching in the guest and ran fio for 60s reading a number of block sizes ranging from 512 bytes to 4MiB. Queue depth of 32 for each queue was used to saturate individual vcpus in the guest.
> 
> We were interested in observing storage iops for different values of block sizes. Our expectation was that iops would improve when increasing the number of queues, because both the guest and dom0 would be able to make use of more vcpus to handle these requests.
> 
> These are the results (as aggregate iops for all the fio threads) that we got for the conditions above with sequential reads:
> 
> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>     8           32       512           158K         264K
>     8           32        1K           157K         260K
>     8           32        2K           157K         258K
>     8           32        4K           148K         257K
>     8           32        8K           124K         207K
>     8           32       16K            84K         105K
>     8           32       32K            50K          54K
>     8           32       64K            24K          27K
>     8           32      128K            11K          13K
> 
> 8-queue iops was better than single queue iops for all the block sizes. There were very good improvements as well for sequential writes with block size 4K (from 80K iops with single queue to 230K iops with 8 queues), and no regressions were visible in any measurement performed.
> 

Great! Thank you very much for the test.

I'm trying to rebase these patches onto the latest kernel version (v4.1) and will send them out in the following days.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-06-30 14:21         ` [Xen-devel] " Marcus Granado
                             ` (2 preceding siblings ...)
  2015-07-01  3:02           ` Jens Axboe
@ 2015-07-01  3:02           ` Jens Axboe
  2015-08-10 11:03             ` Rafal Mielniczuk
  2015-08-10 11:03             ` [Xen-devel] " Rafal Mielniczuk
  3 siblings, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2015-07-01  3:02 UTC (permalink / raw)
  To: Marcus Granado, Bob Liu, Arianna Avanzini
  Cc: felipe.franciosi, linux-kernel, Christoph Hellwig, david.vrabel,
	xen-devel, boris.ostrovsky, Jonathan Davies, Rafal Mielniczuk

On 06/30/2015 08:21 AM, Marcus Granado wrote:
> On 13/05/15 11:29, Bob Liu wrote:
>>
>> On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
>>> Hello Christoph,
>>>
>>> Il 28/04/2015 09:36, Christoph Hellwig ha scritto:
>>>> What happened to this patchset?
>>>>
>>>
>>> It was passed on to Bob Liu, who published a follow-up patchset here:
>>> https://lkml.org/lkml/2015/2/15/46
>>>
>>
>> Right, and then I was interrupted by another xen-block feature:
>> 'multi-page' ring.
>> Will back on this patchset soon. Thank you!
>>
>> -Bob
>>
>
> Hi,
>
> Our measurements for the multiqueue patch indicate a clear improvement
> in iops when more queues are used.
>
> The measurements were obtained under the following conditions:
>
> - using blkback as the dom0 backend with the multiqueue patch applied to
> a dom0 kernel 4.0 on 8 vcpus.
>
> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
> applied to be used as a guest on 4 vcpus
>
> - using a micron RealSSD P320h as the underlying local storage on a Dell
> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>
> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
> We used direct_io to skip caching in the guest and ran fio for 60s
> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
> depth of 32 for each queue was used to saturate individual vcpus in the
> guest.
>
> We were interested in observing storage iops for different values of
> block sizes. Our expectation was that iops would improve when increasing
> the number of queues, because both the guest and dom0 would be able to
> make use of more vcpus to handle these requests.
>
> These are the results (as aggregate iops for all the fio threads) that
> we got for the conditions above with sequential reads:
>
> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>      8           32       512           158K         264K
>      8           32        1K           157K         260K
>      8           32        2K           157K         258K
>      8           32        4K           148K         257K
>      8           32        8K           124K         207K
>      8           32       16K            84K         105K
>      8           32       32K            50K          54K
>      8           32       64K            24K          27K
>      8           32      128K            11K          13K
>
> 8-queue iops was better than single queue iops for all the block sizes.
> There were very good improvements as well for sequential writes with
> block size 4K (from 80K iops with single queue to 230K iops with 8
> queues), and no regressions were visible in any measurement performed.

Great results! And I don't know why this code has lingered for so long, 
so thanks for helping get some attention to this again.

Personally I'd be really interested in the results for the same set of 
tests, but without the blk-mq patches. Do you have them, or could you 
potentially run them?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-07-01  3:02           ` [Xen-devel] " Jens Axboe
  2015-08-10 11:03             ` Rafal Mielniczuk
@ 2015-08-10 11:03             ` Rafal Mielniczuk
  2015-08-10 11:14               ` Bob Liu
                                 ` (3 more replies)
  1 sibling, 4 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-10 11:03 UTC (permalink / raw)
  To: Jens Axboe, Marcus Granado, Bob Liu, Arianna Avanzini
  Cc: Felipe Franciosi, linux-kernel, Christoph Hellwig, David Vrabel,
	xen-devel, boris.ostrovsky, Jonathan Davies

On 01/07/15 04:03, Jens Axboe wrote:
> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>> Hi,
>>
>> Our measurements for the multiqueue patch indicate a clear improvement
>> in iops when more queues are used.
>>
>> The measurements were obtained under the following conditions:
>>
>> - using blkback as the dom0 backend with the multiqueue patch applied to
>> a dom0 kernel 4.0 on 8 vcpus.
>>
>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>> applied to be used as a guest on 4 vcpus
>>
>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>
>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>> We used direct_io to skip caching in the guest and ran fio for 60s
>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>> depth of 32 for each queue was used to saturate individual vcpus in the
>> guest.
>>
>> We were interested in observing storage iops for different values of
>> block sizes. Our expectation was that iops would improve when increasing
>> the number of queues, because both the guest and dom0 would be able to
>> make use of more vcpus to handle these requests.
>>
>> These are the results (as aggregate iops for all the fio threads) that
>> we got for the conditions above with sequential reads:
>>
>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>      8           32       512           158K         264K
>>      8           32        1K           157K         260K
>>      8           32        2K           157K         258K
>>      8           32        4K           148K         257K
>>      8           32        8K           124K         207K
>>      8           32       16K            84K         105K
>>      8           32       32K            50K          54K
>>      8           32       64K            24K          27K
>>      8           32      128K            11K          13K
>>
>> 8-queue iops was better than single queue iops for all the block sizes.
>> There were very good improvements as well for sequential writes with
>> block size 4K (from 80K iops with single queue to 230K iops with 8
>> queues), and no regressions were visible in any measurement performed.
> Great results! And I don't know why this code has lingered for so long, 
> so thanks for helping get some attention to this again.
>
> Personally I'd be really interested in the results for the same set of 
> tests, but without the blk-mq patches. Do you have them, or could you 
> potentially run them?
>
Hello,

We reran the tests for sequential reads with identical settings but with Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
The results we obtained were *better* than the results we got with the multiqueue patches applied:

fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
     8           32       512           158K         264K         321K
     8           32        1K           157K         260K         328K
     8           32        2K           157K         258K         336K
     8           32        4K           148K         257K         308K
     8           32        8K           124K         207K         188K
     8           32       16K            84K         105K         82K
     8           32       32K            50K          54K         36K
     8           32       64K            24K          27K         16K
     8           32      128K            11K          13K         11K

We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).

We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD.

As I understand it, the blk-mq layer bypasses the I/O scheduler, which also effectively disables merges.
Could you explain why it is difficult to enable merging in the blk-mq layer?
That could help close the performance gap we observed.

Otherwise, the tests show that the multiqueue patches do not improve the performance,
at least when it comes to sequential read/write operations.

Rafal



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-10 11:03             ` [Xen-devel] " Rafal Mielniczuk
  2015-08-10 11:14               ` Bob Liu
@ 2015-08-10 11:14               ` Bob Liu
  2015-08-10 15:52               ` Jens Axboe
  2015-08-10 15:52               ` [Xen-devel] " Jens Axboe
  3 siblings, 0 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-10 11:14 UTC (permalink / raw)
  To: Rafal Mielniczuk
  Cc: Jens Axboe, Marcus Granado, Arianna Avanzini, Felipe Franciosi,
	linux-kernel, Christoph Hellwig, David Vrabel, xen-devel,
	boris.ostrovsky, Jonathan Davies


On 08/10/2015 07:03 PM, Rafal Mielniczuk wrote:
> On 01/07/15 04:03, Jens Axboe wrote:
>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>> Hi,
>>>
>>> Our measurements for the multiqueue patch indicate a clear improvement
>>> in iops when more queues are used.
>>>
>>> The measurements were obtained under the following conditions:
>>>
>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>> a dom0 kernel 4.0 on 8 vcpus.
>>>
>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>> applied to be used as a guest on 4 vcpus
>>>
>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>
>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>> guest.
>>>
>>> We were interested in observing storage iops for different values of
>>> block sizes. Our expectation was that iops would improve when increasing
>>> the number of queues, because both the guest and dom0 would be able to
>>> make use of more vcpus to handle these requests.
>>>
>>> These are the results (as aggregate iops for all the fio threads) that
>>> we got for the conditions above with sequential reads:
>>>
>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>      8           32       512           158K         264K
>>>      8           32        1K           157K         260K
>>>      8           32        2K           157K         258K
>>>      8           32        4K           148K         257K
>>>      8           32        8K           124K         207K
>>>      8           32       16K            84K         105K
>>>      8           32       32K            50K          54K
>>>      8           32       64K            24K          27K
>>>      8           32      128K            11K          13K
>>>
>>> 8-queue iops was better than single queue iops for all the block sizes.
>>> There were very good improvements as well for sequential writes with
>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>> queues), and no regressions were visible in any measurement performed.
>> Great results! And I don't know why this code has lingered for so long, 
>> so thanks for helping get some attention to this again.
>>
>> Personally I'd be really interested in the results for the same set of 
>> tests, but without the blk-mq patches. Do you have them, or could you 
>> potentially run them?
>>
> Hello,
> 
> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
> The results we obtained were *better* than the results we got with multiqueue patches applied:
> 
> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>      8           32       512           158K         264K         321K
>      8           32        1K           157K         260K         328K
>      8           32        2K           157K         258K         336K
>      8           32        4K           148K         257K         308K
>      8           32        8K           124K         207K         188K
>      8           32       16K            84K         105K         82K
>      8           32       32K            50K          54K         36K
>      8           32       64K            24K          27K         16K
>      8           32      128K            11K          13K         11K
> 
> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
> 
> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
> 

Which block scheduler was used in domU? Please check with "cat /sys/block/sdxxx/queue/scheduler".
What is the result when using the "noop" scheduler?

Thanks,
Bob Liu

> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
> Could you explain why it is difficult to enable merging in the blk-mq layer?
> That could help closing the performance gap we observed.
> 
> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
> at least when it comes to sequential read/writes operations.
> 
> Rafal
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-10 11:03             ` [Xen-devel] " Rafal Mielniczuk
                                 ` (2 preceding siblings ...)
  2015-08-10 15:52               ` Jens Axboe
@ 2015-08-10 15:52               ` Jens Axboe
  2015-08-11  6:07                 ` Bob Liu
  2015-08-11  6:07                 ` [Xen-devel] " Bob Liu
  3 siblings, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2015-08-10 15:52 UTC (permalink / raw)
  To: Rafal Mielniczuk, Marcus Granado, Bob Liu, Arianna Avanzini
  Cc: Felipe Franciosi, linux-kernel, Christoph Hellwig, David Vrabel,
	xen-devel, boris.ostrovsky, Jonathan Davies

On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
> On 01/07/15 04:03, Jens Axboe wrote:
>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>> Hi,
>>>
>>> Our measurements for the multiqueue patch indicate a clear improvement
>>> in iops when more queues are used.
>>>
>>> The measurements were obtained under the following conditions:
>>>
>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>> a dom0 kernel 4.0 on 8 vcpus.
>>>
>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>> applied to be used as a guest on 4 vcpus
>>>
>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>
>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>> guest.
>>>
>>> We were interested in observing storage iops for different values of
>>> block sizes. Our expectation was that iops would improve when increasing
>>> the number of queues, because both the guest and dom0 would be able to
>>> make use of more vcpus to handle these requests.
>>>
>>> These are the results (as aggregate iops for all the fio threads) that
>>> we got for the conditions above with sequential reads:
>>>
>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>       8           32       512           158K         264K
>>>       8           32        1K           157K         260K
>>>       8           32        2K           157K         258K
>>>       8           32        4K           148K         257K
>>>       8           32        8K           124K         207K
>>>       8           32       16K            84K         105K
>>>       8           32       32K            50K          54K
>>>       8           32       64K            24K          27K
>>>       8           32      128K            11K          13K
>>>
>>> 8-queue iops was better than single queue iops for all the block sizes.
>>> There were very good improvements as well for sequential writes with
>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>> queues), and no regressions were visible in any measurement performed.
>> Great results! And I don't know why this code has lingered for so long,
>> so thanks for helping get some attention to this again.
>>
>> Personally I'd be really interested in the results for the same set of
>> tests, but without the blk-mq patches. Do you have them, or could you
>> potentially run them?
>>
> Hello,
>
> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
> The results we obtained were *better* than the results we got with multiqueue patches applied:
>
> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>       8           32       512           158K         264K         321K
>       8           32        1K           157K         260K         328K
>       8           32        2K           157K         258K         336K
>       8           32        4K           148K         257K         308K
>       8           32        8K           124K         207K         188K
>       8           32       16K            84K         105K         82K
>       8           32       32K            50K          54K         36K
>       8           32       64K            24K          27K         16K
>       8           32      128K            11K          13K         11K
>
> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>
> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>
> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
> Could you explain why it is difficult to enable merging in the blk-mq layer?
> That could help closing the performance gap we observed.
>
> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
> at least when it comes to sequential read/writes operations.

blk-mq still provides merging, so there should be no difference there. Do 
the xen patches set BLK_MQ_F_SHOULD_MERGE?
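
For context, a sketch (not the actual xen-blkfront code) of where that flag
lives: it goes into the driver's blk_mq_tag_set before blk_mq_alloc_tag_set()
is called. The queue depth here is illustrative, and the tag set is assumed to
have its ->ops (and ->cmd_size, if needed) filled in elsewhere:

#include <linux/blk-mq.h>
#include <linux/numa.h>

/* Illustrative only: enable software-queue merging on a tag set whose
 * ->ops have already been filled in by the driver.
 */
static int example_init_tag_set(struct blk_mq_tag_set *set,
				unsigned int nr_rings)
{
	set->nr_hw_queues = nr_rings;		/* negotiated ring count */
	set->queue_depth  = 32;			/* illustrative depth */
	set->numa_node    = NUMA_NO_NODE;
	set->flags        = BLK_MQ_F_SHOULD_MERGE;

	return blk_mq_alloc_tag_set(set);
}

If the flag is missing, blk-mq skips its software-queue merge attempt, which
could explain the missing merges Rafal observed.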

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-10 11:03             ` [Xen-devel] " Rafal Mielniczuk
  2015-08-10 11:14               ` Bob Liu
  2015-08-10 11:14               ` [Xen-devel] " Bob Liu
@ 2015-08-10 15:52               ` Jens Axboe
  2015-08-10 15:52               ` [Xen-devel] " Jens Axboe
  3 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2015-08-10 15:52 UTC (permalink / raw)
  To: Rafal Mielniczuk, Marcus Granado, Bob Liu, Arianna Avanzini
  Cc: Jonathan Davies, Felipe Franciosi, linux-kernel,
	Christoph Hellwig, David Vrabel, xen-devel, boris.ostrovsky

On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
> On 01/07/15 04:03, Jens Axboe wrote:
>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>> Hi,
>>>
>>> Our measurements for the multiqueue patch indicate a clear improvement
>>> in iops when more queues are used.
>>>
>>> The measurements were obtained under the following conditions:
>>>
>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>> a dom0 kernel 4.0 on 8 vcpus.
>>>
>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>> applied to be used as a guest on 4 vcpus
>>>
>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>
>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>> guest.
>>>
>>> We were interested in observing storage iops for different values of
>>> block sizes. Our expectation was that iops would improve when increasing
>>> the number of queues, because both the guest and dom0 would be able to
>>> make use of more vcpus to handle these requests.
>>>
>>> These are the results (as aggregate iops for all the fio threads) that
>>> we got for the conditions above with sequential reads:
>>>
>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>       8           32       512           158K         264K
>>>       8           32        1K           157K         260K
>>>       8           32        2K           157K         258K
>>>       8           32        4K           148K         257K
>>>       8           32        8K           124K         207K
>>>       8           32       16K            84K         105K
>>>       8           32       32K            50K          54K
>>>       8           32       64K            24K          27K
>>>       8           32      128K            11K          13K
>>>
>>> 8-queue iops was better than single queue iops for all the block sizes.
>>> There were very good improvements as well for sequential writes with
>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>> queues), and no regressions were visible in any measurement performed.
>> Great results! And I don't know why this code has lingered for so long,
>> so thanks for helping get some attention to this again.
>>
>> Personally I'd be really interested in the results for the same set of
>> tests, but without the blk-mq patches. Do you have them, or could you
>> potentially run them?
>>
> Hello,
>
> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
> The results we obtained were *better* than the results we got with multiqueue patches applied:
>
> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>       8           32       512           158K         264K         321K
>       8           32        1K           157K         260K         328K
>       8           32        2K           157K         258K         336K
>       8           32        4K           148K         257K         308K
>       8           32        8K           124K         207K         188K
>       8           32       16K            84K         105K         82K
>       8           32       32K            50K          54K         36K
>       8           32       64K            24K          27K         16K
>       8           32      128K            11K          13K         11K
>
> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>
> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>
> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
> Could you explain why it is difficult to enable merging in the blk-mq layer?
> That could help closing the performance gap we observed.
>
> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
> at least when it comes to sequential read/writes operations.

blk-mq still provides merging; there should be no difference there. Do 
the xen patches set BLK_MQ_F_SHOULD_MERGE?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-10 15:52               ` [Xen-devel] " Jens Axboe
  2015-08-11  6:07                 ` Bob Liu
@ 2015-08-11  6:07                 ` Bob Liu
  2015-08-11  9:45                   ` Rafal Mielniczuk
  2015-08-11  9:45                   ` [Xen-devel] " Rafal Mielniczuk
  1 sibling, 2 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-11  6:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Rafal Mielniczuk, Marcus Granado, Arianna Avanzini,
	Felipe Franciosi, linux-kernel, Christoph Hellwig, David Vrabel,
	xen-devel, boris.ostrovsky, Jonathan Davies


On 08/10/2015 11:52 PM, Jens Axboe wrote:
> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>> On 01/07/15 04:03, Jens Axboe wrote:
>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>> Hi,
>>>>
>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>> in iops when more queues are used.
>>>>
>>>> The measurements were obtained under the following conditions:
>>>>
>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>
>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>> applied to be used as a guest on 4 vcpus
>>>>
>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>
>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>> guest.
>>>>
>>>> We were interested in observing storage iops for different values of
>>>> block sizes. Our expectation was that iops would improve when increasing
>>>> the number of queues, because both the guest and dom0 would be able to
>>>> make use of more vcpus to handle these requests.
>>>>
>>>> These are the results (as aggregate iops for all the fio threads) that
>>>> we got for the conditions above with sequential reads:
>>>>
>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>       8           32       512           158K         264K
>>>>       8           32        1K           157K         260K
>>>>       8           32        2K           157K         258K
>>>>       8           32        4K           148K         257K
>>>>       8           32        8K           124K         207K
>>>>       8           32       16K            84K         105K
>>>>       8           32       32K            50K          54K
>>>>       8           32       64K            24K          27K
>>>>       8           32      128K            11K          13K
>>>>
>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>> There were very good improvements as well for sequential writes with
>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>> queues), and no regressions were visible in any measurement performed.
>>> Great results! And I don't know why this code has lingered for so long,
>>> so thanks for helping get some attention to this again.
>>>
>>> Personally I'd be really interested in the results for the same set of
>>> tests, but without the blk-mq patches. Do you have them, or could you
>>> potentially run them?
>>>
>> Hello,
>>
>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>
>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>       8           32       512           158K         264K         321K
>>       8           32        1K           157K         260K         328K
>>       8           32        2K           157K         258K         336K
>>       8           32        4K           148K         257K         308K
>>       8           32        8K           124K         207K         188K
>>       8           32       16K            84K         105K         82K
>>       8           32       32K            50K          54K         36K
>>       8           32       64K            24K          27K         16K
>>       8           32      128K            11K          13K         11K
>>
>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>
>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>
>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>> That could help closing the performance gap we observed.
>>
>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>> at least when it comes to sequential read/writes operations.
> 
> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
> 

Yes.
Is it possible that the xen-blkfront driver dequeues requests too fast once we have multiple hardware queues?
New requests then don't get the chance to merge with old requests which were already dequeued and issued.
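
As a rough sketch of the window in question (not the ->queue_rq() hook
from this series; struct example_dev, example_ring_full() and
example_issue() are hypothetical helpers): once this callback runs, the
request has left the software queue, so later bios can no longer be
merged into it.

static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
                            const struct blk_mq_queue_data *qd)
{
        struct example_dev *dev = hctx->queue->queuedata; /* set at init time */
        struct request *req = qd->rq;

        if (example_ring_full(dev))
                return BLK_MQ_RQ_QUEUE_BUSY;    /* back off, retry later */

        blk_mq_start_request(req);
        example_issue(dev, req);                /* put the request on the I/O ring */
        return BLK_MQ_RQ_QUEUE_OK;
}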

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-10 15:52               ` [Xen-devel] " Jens Axboe
@ 2015-08-11  6:07                 ` Bob Liu
  2015-08-11  6:07                 ` [Xen-devel] " Bob Liu
  1 sibling, 0 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-11  6:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Davies, Felipe Franciosi, Rafal Mielniczuk,
	linux-kernel, Marcus Granado, Christoph Hellwig, David Vrabel,
	Arianna Avanzini, xen-devel, boris.ostrovsky


On 08/10/2015 11:52 PM, Jens Axboe wrote:
> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>> On 01/07/15 04:03, Jens Axboe wrote:
>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>> Hi,
>>>>
>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>> in iops when more queues are used.
>>>>
>>>> The measurements were obtained under the following conditions:
>>>>
>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>
>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>> applied to be used as a guest on 4 vcpus
>>>>
>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>
>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>> guest.
>>>>
>>>> We were interested in observing storage iops for different values of
>>>> block sizes. Our expectation was that iops would improve when increasing
>>>> the number of queues, because both the guest and dom0 would be able to
>>>> make use of more vcpus to handle these requests.
>>>>
>>>> These are the results (as aggregate iops for all the fio threads) that
>>>> we got for the conditions above with sequential reads:
>>>>
>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>       8           32       512           158K         264K
>>>>       8           32        1K           157K         260K
>>>>       8           32        2K           157K         258K
>>>>       8           32        4K           148K         257K
>>>>       8           32        8K           124K         207K
>>>>       8           32       16K            84K         105K
>>>>       8           32       32K            50K          54K
>>>>       8           32       64K            24K          27K
>>>>       8           32      128K            11K          13K
>>>>
>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>> There were very good improvements as well for sequential writes with
>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>> queues), and no regressions were visible in any measurement performed.
>>> Great results! And I don't know why this code has lingered for so long,
>>> so thanks for helping get some attention to this again.
>>>
>>> Personally I'd be really interested in the results for the same set of
>>> tests, but without the blk-mq patches. Do you have them, or could you
>>> potentially run them?
>>>
>> Hello,
>>
>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>
>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>       8           32       512           158K         264K         321K
>>       8           32        1K           157K         260K         328K
>>       8           32        2K           157K         258K         336K
>>       8           32        4K           148K         257K         308K
>>       8           32        8K           124K         207K         188K
>>       8           32       16K            84K         105K         82K
>>       8           32       32K            50K          54K         36K
>>       8           32       64K            24K          27K         16K
>>       8           32      128K            11K          13K         11K
>>
>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>
>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>
>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>> That could help closing the performance gap we observed.
>>
>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>> at least when it comes to sequential read/writes operations.
> 
> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
> 

Yes.
Is it possible that the xen-blkfront driver dequeues requests too fast once we have multiple hardware queues?
New requests then don't get the chance to merge with old requests which were already dequeued and issued.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11  6:07                 ` [Xen-devel] " Bob Liu
  2015-08-11  9:45                   ` Rafal Mielniczuk
@ 2015-08-11  9:45                   ` Rafal Mielniczuk
  2015-08-11 17:32                     ` Jens Axboe
  2015-08-11 17:32                     ` Jens Axboe
  1 sibling, 2 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-11  9:45 UTC (permalink / raw)
  To: Bob Liu, Jens Axboe
  Cc: Marcus Granado, Arianna Avanzini, Felipe Franciosi, linux-kernel,
	Christoph Hellwig, David Vrabel, xen-devel, boris.ostrovsky,
	Jonathan Davies

On 11/08/15 07:08, Bob Liu wrote:
> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>> On 01/07/15 04:03, Jens Axboe wrote:
>>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>>> Hi,
>>>>>
>>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>>> in iops when more queues are used.
>>>>>
>>>>> The measurements were obtained under the following conditions:
>>>>>
>>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>>
>>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>>> applied to be used as a guest on 4 vcpus
>>>>>
>>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>>
>>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>>> guest.
>>>>>
>>>>> We were interested in observing storage iops for different values of
>>>>> block sizes. Our expectation was that iops would improve when increasing
>>>>> the number of queues, because both the guest and dom0 would be able to
>>>>> make use of more vcpus to handle these requests.
>>>>>
>>>>> These are the results (as aggregate iops for all the fio threads) that
>>>>> we got for the conditions above with sequential reads:
>>>>>
>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>>       8           32       512           158K         264K
>>>>>       8           32        1K           157K         260K
>>>>>       8           32        2K           157K         258K
>>>>>       8           32        4K           148K         257K
>>>>>       8           32        8K           124K         207K
>>>>>       8           32       16K            84K         105K
>>>>>       8           32       32K            50K          54K
>>>>>       8           32       64K            24K          27K
>>>>>       8           32      128K            11K          13K
>>>>>
>>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>>> There were very good improvements as well for sequential writes with
>>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>>> queues), and no regressions were visible in any measurement performed.
>>>> Great results! And I don't know why this code has lingered for so long,
>>>> so thanks for helping get some attention to this again.
>>>>
>>>> Personally I'd be really interested in the results for the same set of
>>>> tests, but without the blk-mq patches. Do you have them, or could you
>>>> potentially run them?
>>>>
>>> Hello,
>>>
>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>
>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>       8           32       512           158K         264K         321K
>>>       8           32        1K           157K         260K         328K
>>>       8           32        2K           157K         258K         336K
>>>       8           32        4K           148K         257K         308K
>>>       8           32        8K           124K         207K         188K
>>>       8           32       16K            84K         105K         82K
>>>       8           32       32K            50K          54K         36K
>>>       8           32       64K            24K          27K         16K
>>>       8           32      128K            11K          13K         11K
>>>
>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>
>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>
>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>> That could help closing the performance gap we observed.
>>>
>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>> at least when it comes to sequential read/writes operations.
>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>
> Yes.
> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>

For some reason we don't see merges even when we set multiqueue to 1.
Below are some stats from the guest system when doing sequential 4KB reads:

$ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
      --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
--filename=/dev/xvdb

$ iostat -xt 5 /dev/xvdb
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    2.73   85.14    2.00    9.63

Device:   rrqm/s  wrqm/s        r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
xvdb        0.00    0.00  156926.00  0.00  627704.00   0.00      8.00     30.06   0.19     0.19     0.00   0.01  100.48

$ cat /sys/block/xvdb/queue/scheduler
none

$ cat /sys/block/xvdb/queue/nomerges
0

Relevant bits from the xenstore configuration on the dom0:

/local/domain/0/backend/vbd/2/51728/dev = "xvdb"
/local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
/local/domain/0/backend/vbd/2/51728/type = "phy"
/local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"

/local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
/local/domain/2/device/vbd/51728/ring-ref = "9"
/local/domain/2/device/vbd/51728/event-channel = "60"


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11  6:07                 ` [Xen-devel] " Bob Liu
@ 2015-08-11  9:45                   ` Rafal Mielniczuk
  2015-08-11  9:45                   ` [Xen-devel] " Rafal Mielniczuk
  1 sibling, 0 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-11  9:45 UTC (permalink / raw)
  To: Bob Liu, Jens Axboe
  Cc: Jonathan Davies, Felipe Franciosi, linux-kernel, Marcus Granado,
	Christoph Hellwig, David Vrabel, Arianna Avanzini, xen-devel,
	boris.ostrovsky

On 11/08/15 07:08, Bob Liu wrote:
> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>> On 01/07/15 04:03, Jens Axboe wrote:
>>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>>> Hi,
>>>>>
>>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>>> in iops when more queues are used.
>>>>>
>>>>> The measurements were obtained under the following conditions:
>>>>>
>>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>>
>>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>>> applied to be used as a guest on 4 vcpus
>>>>>
>>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>>
>>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>>> guest.
>>>>>
>>>>> We were interested in observing storage iops for different values of
>>>>> block sizes. Our expectation was that iops would improve when increasing
>>>>> the number of queues, because both the guest and dom0 would be able to
>>>>> make use of more vcpus to handle these requests.
>>>>>
>>>>> These are the results (as aggregate iops for all the fio threads) that
>>>>> we got for the conditions above with sequential reads:
>>>>>
>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>>       8           32       512           158K         264K
>>>>>       8           32        1K           157K         260K
>>>>>       8           32        2K           157K         258K
>>>>>       8           32        4K           148K         257K
>>>>>       8           32        8K           124K         207K
>>>>>       8           32       16K            84K         105K
>>>>>       8           32       32K            50K          54K
>>>>>       8           32       64K            24K          27K
>>>>>       8           32      128K            11K          13K
>>>>>
>>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>>> There were very good improvements as well for sequential writes with
>>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>>> queues), and no regressions were visible in any measurement performed.
>>>> Great results! And I don't know why this code has lingered for so long,
>>>> so thanks for helping get some attention to this again.
>>>>
>>>> Personally I'd be really interested in the results for the same set of
>>>> tests, but without the blk-mq patches. Do you have them, or could you
>>>> potentially run them?
>>>>
>>> Hello,
>>>
>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>
>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>       8           32       512           158K         264K         321K
>>>       8           32        1K           157K         260K         328K
>>>       8           32        2K           157K         258K         336K
>>>       8           32        4K           148K         257K         308K
>>>       8           32        8K           124K         207K         188K
>>>       8           32       16K            84K         105K         82K
>>>       8           32       32K            50K          54K         36K
>>>       8           32       64K            24K          27K         16K
>>>       8           32      128K            11K          13K         11K
>>>
>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>
>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>
>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>> That could help closing the performance gap we observed.
>>>
>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>> at least when it comes to sequential read/writes operations.
>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>
> Yes.
> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>

For some reason we don't see merges even when we set multiqueue to 1.
Below are some stats from the guest system when doing sequential 4KB reads:

$ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
      --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
--filename=/dev/xvdb

$ iostat -xt 5 /dev/xvdb
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    2.73   85.14    2.00    9.63

Device:   rrqm/s  wrqm/s        r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
xvdb        0.00    0.00  156926.00  0.00  627704.00   0.00      8.00     30.06   0.19     0.19     0.00   0.01  100.48

$ cat /sys/block/xvdb/queue/scheduler
none

$ cat /sys/block/xvdb/queue/nomerges
0

Relevant bits from the xenstore configuration on the dom0:

/local/domain/0/backend/vbd/2/51728/dev = "xvdb"
/local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
/local/domain/0/backend/vbd/2/51728/type = "phy"
/local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"

/local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
/local/domain/2/device/vbd/51728/ring-ref = "9"
/local/domain/2/device/vbd/51728/event-channel = "60"

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11  9:45                   ` [Xen-devel] " Rafal Mielniczuk
@ 2015-08-11 17:32                     ` Jens Axboe
  2015-08-12 10:16                       ` Bob Liu
  2015-08-12 10:16                       ` Bob Liu
  2015-08-11 17:32                     ` Jens Axboe
  1 sibling, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2015-08-11 17:32 UTC (permalink / raw)
  To: Rafal Mielniczuk, Bob Liu
  Cc: Marcus Granado, Arianna Avanzini, Felipe Franciosi, linux-kernel,
	Christoph Hellwig, David Vrabel, xen-devel, boris.ostrovsky,
	Jonathan Davies

On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
> On 11/08/15 07:08, Bob Liu wrote:
>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>>> On 01/07/15 04:03, Jens Axboe wrote:
>>>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>>>> in iops when more queues are used.
>>>>>>
>>>>>> The measurements were obtained under the following conditions:
>>>>>>
>>>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>>>
>>>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>>>> applied to be used as a guest on 4 vcpus
>>>>>>
>>>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>>>
>>>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>>>> guest.
>>>>>>
>>>>>> We were interested in observing storage iops for different values of
>>>>>> block sizes. Our expectation was that iops would improve when increasing
>>>>>> the number of queues, because both the guest and dom0 would be able to
>>>>>> make use of more vcpus to handle these requests.
>>>>>>
>>>>>> These are the results (as aggregate iops for all the fio threads) that
>>>>>> we got for the conditions above with sequential reads:
>>>>>>
>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>>>        8           32       512           158K         264K
>>>>>>        8           32        1K           157K         260K
>>>>>>        8           32        2K           157K         258K
>>>>>>        8           32        4K           148K         257K
>>>>>>        8           32        8K           124K         207K
>>>>>>        8           32       16K            84K         105K
>>>>>>        8           32       32K            50K          54K
>>>>>>        8           32       64K            24K          27K
>>>>>>        8           32      128K            11K          13K
>>>>>>
>>>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>>>> There were very good improvements as well for sequential writes with
>>>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>>>> queues), and no regressions were visible in any measurement performed.
>>>>> Great results! And I don't know why this code has lingered for so long,
>>>>> so thanks for helping get some attention to this again.
>>>>>
>>>>> Personally I'd be really interested in the results for the same set of
>>>>> tests, but without the blk-mq patches. Do you have them, or could you
>>>>> potentially run them?
>>>>>
>>>> Hello,
>>>>
>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>
>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>        8           32       512           158K         264K         321K
>>>>        8           32        1K           157K         260K         328K
>>>>        8           32        2K           157K         258K         336K
>>>>        8           32        4K           148K         257K         308K
>>>>        8           32        8K           124K         207K         188K
>>>>        8           32       16K            84K         105K         82K
>>>>        8           32       32K            50K          54K         36K
>>>>        8           32       64K            24K          27K         16K
>>>>        8           32      128K            11K          13K         11K
>>>>
>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>
>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>
>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>> That could help closing the performance gap we observed.
>>>>
>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>> at least when it comes to sequential read/writes operations.
>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>
>> Yes.
>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>
>
> For some reason we don't see merges even when we set multiqueue to 1.
> Below are some stats from the guest system when doing sequential 4KB reads:
>
> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
> --filename=/dev/xvdb
>
> $ iostat -xt 5 /dev/xvdb
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.50    0.00    2.73   85.14    2.00    9.63
>
> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>
> $ cat /sys/block/xvdb/queue/scheduler
> none
>
> $ cat /sys/block/xvdb/queue/nomerges
> 0
>
> Relevant bits from the xenstore configuration on the dom0:
>
> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
> /local/domain/0/backend/vbd/2/51728/type = "phy"
> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>
> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
> /local/domain/2/device/vbd/51728/ring-ref = "9"
> /local/domain/2/device/vbd/51728/event-channel = "60"

What if you add --iodepth_batch=16 to that fio command line? Both mq 
and non-mq rely on plugging to get batching in the use case above; 
otherwise IO is dispatched immediately, and O_DIRECT is immediate. I'd 
be more interested in seeing a test case with buffered IO of a file 
system on top of the xvdb device; if we're missing merging for that 
case, then that's a much bigger issue.
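
A small illustrative fragment of what plugging buys (assuming a
2015-era kernel; submit_batch() is a hypothetical helper, not code from
this series): bios submitted while a plug is held sit on a per-task
list, where the block layer gets a chance to merge them before they are
dispatched to the driver.

#include <linux/blkdev.h>

static void submit_batch(struct bio **bios, int nr)
{
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);                  /* hold back dispatch */
        for (i = 0; i < nr; i++)
                generic_make_request(bios[i]);  /* candidates for merging */
        blk_finish_plug(&plug);                 /* flush: merged requests reach the driver */
}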

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11  9:45                   ` [Xen-devel] " Rafal Mielniczuk
  2015-08-11 17:32                     ` Jens Axboe
@ 2015-08-11 17:32                     ` Jens Axboe
  1 sibling, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2015-08-11 17:32 UTC (permalink / raw)
  To: Rafal Mielniczuk, Bob Liu
  Cc: Jonathan Davies, Felipe Franciosi, linux-kernel, Marcus Granado,
	Christoph Hellwig, David Vrabel, Arianna Avanzini, xen-devel,
	boris.ostrovsky

On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
> On 11/08/15 07:08, Bob Liu wrote:
>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>>> On 01/07/15 04:03, Jens Axboe wrote:
>>>>> On 06/30/2015 08:21 AM, Marcus Granado wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Our measurements for the multiqueue patch indicate a clear improvement
>>>>>> in iops when more queues are used.
>>>>>>
>>>>>> The measurements were obtained under the following conditions:
>>>>>>
>>>>>> - using blkback as the dom0 backend with the multiqueue patch applied to
>>>>>> a dom0 kernel 4.0 on 8 vcpus.
>>>>>>
>>>>>> - using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
>>>>>> applied to be used as a guest on 4 vcpus
>>>>>>
>>>>>> - using a micron RealSSD P320h as the underlying local storage on a Dell
>>>>>> PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
>>>>>>
>>>>>> - fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
>>>>>> We used direct_io to skip caching in the guest and ran fio for 60s
>>>>>> reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
>>>>>> depth of 32 for each queue was used to saturate individual vcpus in the
>>>>>> guest.
>>>>>>
>>>>>> We were interested in observing storage iops for different values of
>>>>>> block sizes. Our expectation was that iops would improve when increasing
>>>>>> the number of queues, because both the guest and dom0 would be able to
>>>>>> make use of more vcpus to handle these requests.
>>>>>>
>>>>>> These are the results (as aggregate iops for all the fio threads) that
>>>>>> we got for the conditions above with sequential reads:
>>>>>>
>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops
>>>>>>        8           32       512           158K         264K
>>>>>>        8           32        1K           157K         260K
>>>>>>        8           32        2K           157K         258K
>>>>>>        8           32        4K           148K         257K
>>>>>>        8           32        8K           124K         207K
>>>>>>        8           32       16K            84K         105K
>>>>>>        8           32       32K            50K          54K
>>>>>>        8           32       64K            24K          27K
>>>>>>        8           32      128K            11K          13K
>>>>>>
>>>>>> 8-queue iops was better than single queue iops for all the block sizes.
>>>>>> There were very good improvements as well for sequential writes with
>>>>>> block size 4K (from 80K iops with single queue to 230K iops with 8
>>>>>> queues), and no regressions were visible in any measurement performed.
>>>>> Great results! And I don't know why this code has lingered for so long,
>>>>> so thanks for helping get some attention to this again.
>>>>>
>>>>> Personally I'd be really interested in the results for the same set of
>>>>> tests, but without the blk-mq patches. Do you have them, or could you
>>>>> potentially run them?
>>>>>
>>>> Hello,
>>>>
>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>
>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>        8           32       512           158K         264K         321K
>>>>        8           32        1K           157K         260K         328K
>>>>        8           32        2K           157K         258K         336K
>>>>        8           32        4K           148K         257K         308K
>>>>        8           32        8K           124K         207K         188K
>>>>        8           32       16K            84K         105K         82K
>>>>        8           32       32K            50K          54K         36K
>>>>        8           32       64K            24K          27K         16K
>>>>        8           32      128K            11K          13K         11K
>>>>
>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>
>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>
>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>> That could help closing the performance gap we observed.
>>>>
>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>> at least when it comes to sequential read/writes operations.
>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>
>> Yes.
>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>
>
> For some reason we don't see merges even when we set multiqueue to 1.
> Below are some stats from the guest system when doing sequential 4KB reads:
>
> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
> --filename=/dev/xvdb
>
> $ iostat -xt 5 /dev/xvdb
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.50    0.00    2.73   85.14    2.00    9.63
>
> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>
> $ cat /sys/block/xvdb/queue/scheduler
> none
>
> $ cat /sys/block/xvdb/queue/nomerges
> 0
>
> Relevant bits from the xenstore configuration on the dom0:
>
> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
> /local/domain/0/backend/vbd/2/51728/type = "phy"
> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>
> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
> /local/domain/2/device/vbd/51728/ring-ref = "9"
> /local/domain/2/device/vbd/51728/event-channel = "60"

What if you add --iodepth_batch=16 to that fio command line? Both mq 
and non-mq rely on plugging to get batching in the use case above; 
otherwise IO is dispatched immediately, and O_DIRECT is immediate. I'd 
be more interested in seeing a test case with buffered IO of a file 
system on top of the xvdb device; if we're missing merging for that 
case, then that's a much bigger issue.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11 17:32                     ` Jens Axboe
@ 2015-08-12 10:16                       ` Bob Liu
  2015-08-12 16:46                         ` Rafal Mielniczuk
  2015-08-12 16:46                         ` Rafal Mielniczuk
  2015-08-12 10:16                       ` Bob Liu
  1 sibling, 2 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-12 10:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Rafal Mielniczuk, Marcus Granado, Arianna Avanzini,
	Felipe Franciosi, linux-kernel, Christoph Hellwig, David Vrabel,
	xen-devel, boris.ostrovsky, Jonathan Davies


On 08/12/2015 01:32 AM, Jens Axboe wrote:
> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>> On 11/08/15 07:08, Bob Liu wrote:
>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
...
>>>>> Hello,
>>>>>
>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>
>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>        8           32       512           158K         264K         321K
>>>>>        8           32        1K           157K         260K         328K
>>>>>        8           32        2K           157K         258K         336K
>>>>>        8           32        4K           148K         257K         308K
>>>>>        8           32        8K           124K         207K         188K
>>>>>        8           32       16K            84K         105K         82K
>>>>>        8           32       32K            50K          54K         36K
>>>>>        8           32       64K            24K          27K         16K
>>>>>        8           32      128K            11K          13K         11K
>>>>>
>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>
>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>
>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>> That could help closing the performance gap we observed.
>>>>>
>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>> at least when it comes to sequential read/writes operations.
>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>
>>> Yes.
>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>
>>
>> For some reason we don't see merges even when we set multiqueue to 1.
>> Below are some stats from the guest system when doing sequential 4KB reads:
>>
>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>> --filename=/dev/xvdb
>>
>> $ iostat -xt 5 /dev/xvdb
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>             0.50    0.00    2.73   85.14    2.00    9.63
>>
>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>
>> $ cat /sys/block/xvdb/queue/scheduler
>> none
>>
>> $ cat /sys/block/xvdb/queue/nomerges
>> 0
>>
>> Relevant bits from the xenstore configuration on the dom0:
>>
>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>
>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>> /local/domain/2/device/vbd/51728/event-channel = "60"
> 
> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
> if we're missing merging for that case, then that's a much bigger issue.
>
 
I was using the null block driver for the xen blk-mq test.

There were no merges happening any more even after this patch:
https://lkml.org/lkml/2015/7/13/185
(which just converted the xen block driver to use the blk-mq APIs).

I will try a file system soon.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-11 17:32                     ` Jens Axboe
  2015-08-12 10:16                       ` Bob Liu
@ 2015-08-12 10:16                       ` Bob Liu
  1 sibling, 0 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-12 10:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Davies, Felipe Franciosi, Rafal Mielniczuk,
	linux-kernel, Marcus Granado, Christoph Hellwig, David Vrabel,
	Arianna Avanzini, xen-devel, boris.ostrovsky


On 08/12/2015 01:32 AM, Jens Axboe wrote:
> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>> On 11/08/15 07:08, Bob Liu wrote:
>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
...
>>>>> Hello,
>>>>>
>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>
>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>        8           32       512           158K         264K         321K
>>>>>        8           32        1K           157K         260K         328K
>>>>>        8           32        2K           157K         258K         336K
>>>>>        8           32        4K           148K         257K         308K
>>>>>        8           32        8K           124K         207K         188K
>>>>>        8           32       16K            84K         105K         82K
>>>>>        8           32       32K            50K          54K         36K
>>>>>        8           32       64K            24K          27K         16K
>>>>>        8           32      128K            11K          13K         11K
>>>>>
>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>
>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>
>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>> That could help closing the performance gap we observed.
>>>>>
>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>> at least when it comes to sequential read/writes operations.
>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>
>>> Yes.
>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>
>>
>> For some reason we don't see merges even when we set multiqueue to 1.
>> Below are some stats from the guest system when doing sequential 4KB reads:
>>
>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>> --filename=/dev/xvdb
>>
>> $ iostat -xt 5 /dev/xvdb
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>             0.50    0.00    2.73   85.14    2.00    9.63
>>
>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>
>> $ cat /sys/block/xvdb/queue/scheduler
>> none
>>
>> $ cat /sys/block/xvdb/queue/nomerges
>> 0
>>
>> Relevant bits from the xenstore configuration on the dom0:
>>
>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>
>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>> /local/domain/2/device/vbd/51728/event-channel = "60"
> 
> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
> if we're missing merging for that case, then that's a much bigger issue.
>
 
I was using the null block driver for the xen blk-mq test.

There were no merges happening any more even after this patch:
https://lkml.org/lkml/2015/7/13/185
(which just converted the xen block driver to use the blk-mq APIs).

I will try a file system soon.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-12 10:16                       ` Bob Liu
@ 2015-08-12 16:46                         ` Rafal Mielniczuk
  2015-08-14  8:29                           ` Bob Liu
  2015-08-14  8:29                           ` Bob Liu
  2015-08-12 16:46                         ` Rafal Mielniczuk
  1 sibling, 2 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-12 16:46 UTC (permalink / raw)
  To: Bob Liu, Jens Axboe
  Cc: Marcus Granado, Arianna Avanzini, Felipe Franciosi, linux-kernel,
	Christoph Hellwig, David Vrabel, xen-devel, boris.ostrovsky,
	Jonathan Davies

On 12/08/15 11:17, Bob Liu wrote:
> On 08/12/2015 01:32 AM, Jens Axboe wrote:
>> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>>> On 11/08/15 07:08, Bob Liu wrote:
>>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
> ...
>>>>>> Hello,
>>>>>>
>>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>>
>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>>        8           32       512           158K         264K         321K
>>>>>>        8           32        1K           157K         260K         328K
>>>>>>        8           32        2K           157K         258K         336K
>>>>>>        8           32        4K           148K         257K         308K
>>>>>>        8           32        8K           124K         207K         188K
>>>>>>        8           32       16K            84K         105K         82K
>>>>>>        8           32       32K            50K          54K         36K
>>>>>>        8           32       64K            24K          27K         16K
>>>>>>        8           32      128K            11K          13K         11K
>>>>>>
>>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>>
>>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>>
>>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>>> That could help closing the performance gap we observed.
>>>>>>
>>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>>> at least when it comes to sequential read/writes operations.
>>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>>
>>>> Yes.
>>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>>
>>> For some reason we don't see merges even when we set multiqueue to 1.
>>> Below are some stats from the guest system when doing sequential 4KB reads:
>>>
>>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>>> --filename=/dev/xvdb
>>>
>>> $ iostat -xt 5 /dev/xvdb
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>             0.50    0.00    2.73   85.14    2.00    9.63
>>>
>>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>>
>>> $ cat /sys/block/xvdb/queue/scheduler
>>> none
>>>
>>> $ cat /sys/block/xvdb/queue/nomerges
>>> 0
>>>
>>> Relevant bits from the xenstore configuration on the dom0:
>>>
>>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>>
>>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>>> /local/domain/2/device/vbd/51728/event-channel = "60"
>> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
>> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
>> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
>> if we're missing merging for that case, then that's a much bigger issue.
>>
>  
> I was using the null block driver for xen blk-mq test.
>
> There were not merges happen any more even after patch: 
> https://lkml.org/lkml/2015/7/13/185
> (Which just converted xen block driver to use blk-mq apis)
>
> Will try a file system soon.
>
I have more results for the guest with and without the patch
https://lkml.org/lkml/2015/7/13/185
applied to the latest stable kernel (4.1.5).

Command line used was:
fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
    --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
    --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16

without patch (--direct=1):
  xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, util=100.00%

with patch (--direct=1):
  xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, util=100.00%

without patch buffered (--direct=0):
  xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60

with patch buffered (--direct=0):
  xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%
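
For reference, the merge counters can also be watched directly while such
a job runs (a sketch, not part of the measurements above): in
/sys/block/<dev>/stat the second and sixth fields are the read and write
merge counts.

    # hypothetical helper: poll the sysfs stat file once per second
    while sleep 1; do
        awk '{ print "read merges:", $2, "  write merges:", $6 }' /sys/block/xvdb/stat
    done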


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-12 16:46                         ` Rafal Mielniczuk
@ 2015-08-14  8:29                           ` Bob Liu
  2015-08-14 12:30                             ` Rafal Mielniczuk
  2015-08-14 12:30                             ` [Xen-devel] " Rafal Mielniczuk
  2015-08-14  8:29                           ` Bob Liu
  1 sibling, 2 replies; 63+ messages in thread
From: Bob Liu @ 2015-08-14  8:29 UTC (permalink / raw)
  To: Rafal Mielniczuk
  Cc: Jens Axboe, Marcus Granado, Arianna Avanzini, Felipe Franciosi,
	linux-kernel, Christoph Hellwig, David Vrabel, xen-devel,
	boris.ostrovsky, Jonathan Davies


On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
> On 12/08/15 11:17, Bob Liu wrote:
>> On 08/12/2015 01:32 AM, Jens Axboe wrote:
>>> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>>>> On 11/08/15 07:08, Bob Liu wrote:
>>>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>> ...
>>>>>>> Hello,
>>>>>>>
>>>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>>>
>>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>>>        8           32       512           158K         264K         321K
>>>>>>>        8           32        1K           157K         260K         328K
>>>>>>>        8           32        2K           157K         258K         336K
>>>>>>>        8           32        4K           148K         257K         308K
>>>>>>>        8           32        8K           124K         207K         188K
>>>>>>>        8           32       16K            84K         105K         82K
>>>>>>>        8           32       32K            50K          54K         36K
>>>>>>>        8           32       64K            24K          27K         16K
>>>>>>>        8           32      128K            11K          13K         11K
>>>>>>>
>>>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>>>
>>>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>>>
>>>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>>>> That could help closing the performance gap we observed.
>>>>>>>
>>>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>>>> at least when it comes to sequential read/writes operations.
>>>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>>>
>>>>> Yes.
>>>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>>>
>>>> For some reason we don't see merges even when we set multiqueue to 1.
>>>> Below are some stats from the guest system when doing sequential 4KB reads:
>>>>
>>>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>>>> --filename=/dev/xvdb
>>>>
>>>> $ iostat -xt 5 /dev/xvdb
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>             0.50    0.00    2.73   85.14    2.00    9.63
>>>>
>>>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>>>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>>>
>>>> $ cat /sys/block/xvdb/queue/scheduler
>>>> none
>>>>
>>>> $ cat /sys/block/xvdb/queue/nomerges
>>>> 0
>>>>
>>>> Relevant bits from the xenstore configuration on the dom0:
>>>>
>>>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>>>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>>>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>>>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>>>
>>>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>>>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>>>> /local/domain/2/device/vbd/51728/event-channel = "60"
>>> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
>>> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
>>> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
>>> if we're missing merging for that case, then that's a much bigger issue.
>>>
>>  
>> I was using the null block driver for xen blk-mq test.
>>
>> There were not merges happen any more even after patch: 
>> https://lkml.org/lkml/2015/7/13/185
>> (Which just converted xen block driver to use blk-mq apis)
>>
>> Will try a file system soon.
>>
> I have more results for the guest with and without the patch
> https://lkml.org/lkml/2015/7/13/185
> applied to the latest stable kernel (4.1.5).
> 

Thank you.

> Command line used was:
> fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
>     --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
>     --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16
> 
> without patch (--direct=1):
>   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, util=100.00%
> 
> with patch (--direct=1):
>   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, util=100.00%
> 

So request merges can still happen, they are just harder to trigger.
What are the IOPS in both cases?

> without patch buffered (--direct=0):
>   xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60
> 
> with patch buffered (--direct=0):
>   xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%
> 

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-14  8:29                           ` Bob Liu
  2015-08-14 12:30                             ` Rafal Mielniczuk
@ 2015-08-14 12:30                             ` Rafal Mielniczuk
  2015-08-18  9:45                               ` Rafal Mielniczuk
  2015-08-18  9:45                               ` [Xen-devel] " Rafal Mielniczuk
  1 sibling, 2 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-14 12:30 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jens Axboe, Marcus Granado, Arianna Avanzini, Felipe Franciosi,
	linux-kernel, Christoph Hellwig, David Vrabel, xen-devel,
	boris.ostrovsky, Jonathan Davies

On 14/08/15 09:31, Bob Liu wrote:
> On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
>> On 12/08/15 11:17, Bob Liu wrote:
>>> On 08/12/2015 01:32 AM, Jens Axboe wrote:
>>>> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>>>>> On 11/08/15 07:08, Bob Liu wrote:
>>>>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>> ...
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>>>>
>>>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>>>>        8           32       512           158K         264K         321K
>>>>>>>>        8           32        1K           157K         260K         328K
>>>>>>>>        8           32        2K           157K         258K         336K
>>>>>>>>        8           32        4K           148K         257K         308K
>>>>>>>>        8           32        8K           124K         207K         188K
>>>>>>>>        8           32       16K            84K         105K         82K
>>>>>>>>        8           32       32K            50K          54K         36K
>>>>>>>>        8           32       64K            24K          27K         16K
>>>>>>>>        8           32      128K            11K          13K         11K
>>>>>>>>
>>>>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>>>>
>>>>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>>>>
>>>>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>>>>> That could help closing the performance gap we observed.
>>>>>>>>
>>>>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>>>>> at least when it comes to sequential read/writes operations.
>>>>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>>>>
>>>>>> Yes.
>>>>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>>>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>>>>
>>>>> For some reason we don't see merges even when we set multiqueue to 1.
>>>>> Below are some stats from the guest system when doing sequential 4KB reads:
>>>>>
>>>>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>>>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>>>>> --filename=/dev/xvdb
>>>>>
>>>>> $ iostat -xt 5 /dev/xvdb
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>             0.50    0.00    2.73   85.14    2.00    9.63
>>>>>
>>>>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>>>>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>>>>
>>>>> $ cat /sys/block/xvdb/queue/scheduler
>>>>> none
>>>>>
>>>>> $ cat /sys/block/xvdb/queue/nomerges
>>>>> 0
>>>>>
>>>>> Relevant bits from the xenstore configuration on the dom0:
>>>>>
>>>>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>>>>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>>>>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>>>>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>>>>
>>>>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>>>>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>>>>> /local/domain/2/device/vbd/51728/event-channel = "60"
>>>> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
>>>> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
>>>> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
>>>> if we're missing merging for that case, then that's a much bigger issue.
>>>>
>>>  
>>> I was using the null block driver for xen blk-mq test.
>>>
>>> There were not merges happen any more even after patch: 
>>> https://lkml.org/lkml/2015/7/13/185
>>> (Which just converted xen block driver to use blk-mq apis)
>>>
>>> Will try a file system soon.
>>>
>> I have more results for the guest with and without the patch
>> https://lkml.org/lkml/2015/7/13/185
>> applied to the latest stable kernel (4.1.5).
>>
> Thank you.
>
>> Command line used was:
>> fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
>>     --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
>>     --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16
>>
>> without patch (--direct=1):
>>   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, util=100.00%
>>
>> with patch (--direct=1):
>>   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, util=100.00%
>>
> So request merge can happen just more difficult to be triggered.
> How about the iops of both cases?

Without the patch it is 318K IOPS, with the patch 146K IOPS.
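(That is roughly consistent with the disk stats above: without the patch
the 18696304 completed requests carried another 75763177 merged bios,
i.e. about (18696304 + 75763177) / 300 s ≈ 315K application-level 4KB
reads per second, while with the patch it is 43709976 / 300 s ≈ 146K.)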

>> without patch buffered (--direct=0):
>>   xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60
>>
>> with patch buffered (--direct=0):
>>   xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%
>>

There seems to be very little difference when we measure buffered
sequential reads. Although iostat shows almost no merges happening in
either case, the avgrq-sz is around 250 sectors (125KB). Does that mean
that the merges are actually happening, just on some other layer that is
not visible to iostat?

There is a big discrepancy for direct sequential reads with small block
sizes, where we are missing the merges that were happening in the version
before the patch. It looks like the requests do not stay in the queue
long enough to get merged.

One thing I noticed is that in block/blk-mq.c, in the function

bool blk_mq_attempt_merge(struct request_queue *q,
                          struct blk_mq_ctx *ctx, struct bio *bio)

the ctx->rq_list queue is mostly empty, so the for loop inside the body
of the function is almost never executed.
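
One way to double-check that from outside the kernel (a sketch only, not
something that was run as part of these tests) is to enable the block
layer's merge and plug tracepoints while the fio job is running and see
whether any events appear for the device:

    # assumes debugfs is mounted at /sys/kernel/debug
    cd /sys/kernel/debug/tracing
    echo 1 > events/block/block_bio_backmerge/enable
    echo 1 > events/block/block_bio_frontmerge/enable
    echo 1 > events/block/block_plug/enable
    echo 1 > events/block/block_unplug/enable
    cat trace_pipe    # merge/plug events (devices are shown as major,minor)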

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
  2015-08-14 12:30                             ` [Xen-devel] " Rafal Mielniczuk
  2015-08-18  9:45                               ` Rafal Mielniczuk
@ 2015-08-18  9:45                               ` Rafal Mielniczuk
  1 sibling, 0 replies; 63+ messages in thread
From: Rafal Mielniczuk @ 2015-08-18  9:45 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jens Axboe, Marcus Granado, Arianna Avanzini, Felipe Franciosi,
	linux-kernel, Christoph Hellwig, David Vrabel, xen-devel,
	boris.ostrovsky, Jonathan Davies

On 14/08/15 13:30, Rafal Mielniczuk wrote:
> On 14/08/15 09:31, Bob Liu wrote:
>> On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
>>> On 12/08/15 11:17, Bob Liu wrote:
>>>> On 08/12/2015 01:32 AM, Jens Axboe wrote:
>>>>> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>>>>>> On 11/08/15 07:08, Bob Liu wrote:
>>>>>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>>>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>>> ...
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We rerun the tests for sequential reads with the identical settings but with Bob Liu's multiqueue patches reverted from dom0 and guest kernels.
>>>>>>>>> The results we obtained were *better* than the results we got with multiqueue patches applied:
>>>>>>>>>
>>>>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>>>>>        8           32       512           158K         264K         321K
>>>>>>>>>        8           32        1K           157K         260K         328K
>>>>>>>>>        8           32        2K           157K         258K         336K
>>>>>>>>>        8           32        4K           148K         257K         308K
>>>>>>>>>        8           32        8K           124K         207K         188K
>>>>>>>>>        8           32       16K            84K         105K         82K
>>>>>>>>>        8           32       32K            50K          54K         36K
>>>>>>>>>        8           32       64K            24K          27K         16K
>>>>>>>>>        8           32      128K            11K          13K         11K
>>>>>>>>>
>>>>>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>>>>>
>>>>>>>>> We observed similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD
>>>>>>>>>
>>>>>>>>> As I understand blk-mq layer bypasses I/O scheduler which also effectively disables merges.
>>>>>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>>>>>> That could help closing the performance gap we observed.
>>>>>>>>>
>>>>>>>>> Otherwise, the tests shows that the multiqueue patches does not improve the performance,
>>>>>>>>> at least when it comes to sequential read/writes operations.
>>>>>>>> blk-mq still provides merging, there should be no difference there. Does the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>>>>>
>>>>>>> Yes.
>>>>>>> Is it possible that xen-blkfront driver dequeue requests too fast after we have multiple hardware queues?
>>>>>>> Because new requests don't have the chance merging with old requests which were already dequeued and issued.
>>>>>>>
>>>>>> For some reason we don't see merges even when we set multiqueue to 1.
>>>>>> Below are some stats from the guest system when doing sequential 4KB reads:
>>>>>>
>>>>>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8
>>>>>>        --iodepth=32 --time_based=1 --runtime=300 --bs=4KB
>>>>>> --filename=/dev/xvdb
>>>>>>
>>>>>> $ iostat -xt 5 /dev/xvdb
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>             0.50    0.00    2.73   85.14    2.00    9.63
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s       r/s     w/s     rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> xvdb              0.00     0.00 156926.00    0.00 627704.00     0.00
>>>>>> 8.00    30.06    0.19    0.19    0.00   0.01 100.48
>>>>>>
>>>>>> $ cat /sys/block/xvdb/queue/scheduler
>>>>>> none
>>>>>>
>>>>>> $ cat /sys/block/xvdb/queue/nomerges
>>>>>> 0
>>>>>>
>>>>>> Relevant bits from the xenstore configuration on the dom0:
>>>>>>
>>>>>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>>>>>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>>>>>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>>>>>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>>>>>
>>>>>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>>>>>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>>>>>> /local/domain/2/device/vbd/51728/event-channel = "60"
>>>>> If you add --iodepth-batch=16 to that fio command line? Both mq and non-mq relies on plugging to get
>>>>> batching in the use case above, otherwise IO is dispatched immediately. O_DIRECT is immediate. 
>>>>> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device,
>>>>> if we're missing merging for that case, then that's a much bigger issue.
>>>>>
>>>>  
>>>> I was using the null block driver for xen blk-mq test.
>>>>
>>>> There were not merges happen any more even after patch: 
>>>> https://lkml.org/lkml/2015/7/13/185
>>>> (Which just converted xen block driver to use blk-mq apis)
>>>>
>>>> Will try a file system soon.
>>>>
>>> I have more results for the guest with and without the patch
>>> https://lkml.org/lkml/2015/7/13/185
>>> applied to the latest stable kernel (4.1.5).
>>>
>> Thank you.
>>
>>> Command line used was:
>>> fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
>>>     --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
>>>     --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16
>>>
>>> without patch (--direct=1):
>>>   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, util=100.00%
>>>
>>> with patch (--direct=1):
>>>   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, util=100.00%
>>>
>> So request merge can happen just more difficult to be triggered.
>> How about the iops of both cases?
> Without the patch it is 318Kiops, with the patch 146Kiops
>
>>> without patch buffered (--direct=0):
>>>   xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60
>>>
>>> with patch buffered (--direct=0):
>>>   xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%
>>>
> There seems to be very little difference when we measure buffered
> sequential reads.
> Although iostat shows that there are almost no merges happening for both
> cases,
> the avgrq-sz is around 250 sectors (125KB). Does that mean that the
> merges are actually happening
> but on some other layer, not visible to the iostat?
>
> There is a big discrepancy for direct sequential reads and small block
> sizes,
> where we are missing merges that were happening in the version before
> the patch.
> It looks like the request does not reside in the queue for long enough
> to get merged.
>
> One thing I noticed is that in block/blk-mq.c in function
>
> bool blk_mq_attempt_merge(struct request_queue *q,
>                           struct blk_mq_ctx *ctx, struct bio *bio)
>
> The ctx->rq_list queue is mostly empty, the for loop inside the body
> of the function is almost never executed.
>
Hi,

I was able to reproduce Bob's results with a null_blk device using the
default module parameters.

Also, when I increased the completion time of the requests, I could see
merges happening in the version without the patch, which resulted in
greater throughput.

Could it be that the requests had time to accumulate in the queue and so
had a chance to be merged? Why did merges not happen in the version with
the patch, then? Is the patch missing the plugging Jens mentioned, or is
it a problem in blk-mq itself?

fio --name=test --ioengine=libaio --rw=read --numjobs=8 --iodepth=32 \
    --time_based=1 --runtime=30 --bs=4KB --filename=/dev/xvdb \
    --direct=1 --group_reporting=1 --iodepth_batch=16

========================================================================
modprobe null_blk
========================================================================
------------------------------------------------------------------------
*no patch* (avgrq-sz = 8.00 avgqu-sz=5.00)
------------------------------------------------------------------------
READ: io=10655MB, aggrb=363694KB/s, minb=363694KB/s, maxb=363694KB/s, mint=30001msec, maxt=30001msec

Disk stats (read/write):
  xvdb: ios=2715852/0, merge=1089/0, ticks=126572/0, in_queue=127456, util=100.00%

------------------------------------------------------------------------
*with patch* (avgrq-sz = 8.00 avgqu-sz=8.00)
------------------------------------------------------------------------
READ: io=20655MB, aggrb=705010KB/s, minb=705010KB/s, maxb=705010KB/s, mint=30001msec, maxt=30001msec

Disk stats (read/write):
  xvdb: ios=5274633/0, merge=22/0, ticks=243208/0, in_queue=242908, util=99.98%

========================================================================
modprobe null_blk irqmode=2 completion_nsec=1000000
========================================================================
------------------------------------------------------------------------
*no patch* (avgrq-sz = 34.00 avgqu-sz=38.00)
------------------------------------------------------------------------
READ: io=10372MB, aggrb=354008KB/s, minb=354008KB/s, maxb=354008KB/s, mint=30003msec, maxt=30003msec

Disk stats (read/write):
  xvdb: ios=621760/0, *merge=1988170/0*, ticks=1136700/0, in_queue=1146020, util=99.76%

------------------------------------------------------------------------
*with patch* (avgrq-sz = 8.00 avgqu-sz=28.00)
------------------------------------------------------------------------
READ: io=2876.8MB, aggrb=98187KB/s, minb=98187KB/s, maxb=98187KB/s, mint=30002msec, maxt=30002msec

Disk stats (read/write):
  xvdb: ios=734048/0, merge=0/0, ticks=843584/0, in_queue=843080, util=99.72%
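
As a possible follow-up (a sketch under assumptions, not something run
here): the same two null_blk configurations could be compared back to
back on a host with no guest attached to the device, bypassing the Xen
frontend entirely, to see whether the missing merges are specific to
xen-blkfront or to blk-mq itself. This assumes null_blk is built as a
module and that /dev/nullb0 is free to test against:

    for opts in "" "irqmode=2 completion_nsec=1000000"; do
        modprobe -r null_blk 2>/dev/null
        modprobe null_blk $opts
        fio --name=test --ioengine=libaio --rw=read --numjobs=8 --iodepth=32 \
            --time_based=1 --runtime=30 --bs=4KB --filename=/dev/nullb0 \
            --direct=1 --group_reporting=1 --iodepth_batch=16
    done

The merge counts then show up in fio's disk stats line for nullb0, the
same way they do for xvdb above.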

^ permalink raw reply	[flat|nested] 63+ messages in thread

Thread overview: 63+ messages
2014-09-11 23:57 [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Arianna Avanzini
2014-09-11 23:57 ` [PATCH RFC v2 1/5] xen, blkfront: port to the the multi-queue block layer API Arianna Avanzini
2014-09-11 23:57 ` Arianna Avanzini
2014-09-13 19:29   ` Christoph Hellwig
2014-09-13 19:29   ` Christoph Hellwig
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-09-11 23:57 ` [PATCH RFC v2 2/5] xen, blkfront: introduce support for multiple block rings Arianna Avanzini
2014-09-11 23:57 ` Arianna Avanzini
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-09-11 23:57 ` [PATCH RFC v2 3/5] xen, blkfront: negotiate the number of block rings with the backend Arianna Avanzini
2014-09-11 23:57 ` Arianna Avanzini
2014-09-12 10:46   ` David Vrabel
2014-09-12 10:46   ` David Vrabel
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-09-11 23:57 ` [PATCH RFC v2 4/5] xen, blkback: introduce support for multiple block rings Arianna Avanzini
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-10-01 20:18   ` Konrad Rzeszutek Wilk
2014-09-11 23:57 ` Arianna Avanzini
2014-09-11 23:57 ` [PATCH RFC v2 5/5] xen, blkback: negotiate of the number of block rings with the frontend Arianna Avanzini
2014-09-11 23:57 ` Arianna Avanzini
2014-09-12 10:58   ` David Vrabel
2014-09-12 10:58   ` David Vrabel
2014-10-01 20:23   ` Konrad Rzeszutek Wilk
2014-10-01 20:23   ` Konrad Rzeszutek Wilk
2014-10-01 20:27 ` [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback Konrad Rzeszutek Wilk
2015-04-28  7:36   ` Christoph Hellwig
2015-04-28  7:46     ` Arianna Avanzini
2015-04-28  7:46     ` Arianna Avanzini
2015-05-13 10:29       ` Bob Liu
2015-05-13 10:29       ` Bob Liu
2015-06-30 14:21         ` Marcus Granado
2015-06-30 14:21         ` [Xen-devel] " Marcus Granado
2015-07-01  0:04           ` Bob Liu
2015-07-01  0:04           ` Bob Liu
2015-07-01  3:02           ` Jens Axboe
2015-07-01  3:02           ` [Xen-devel] " Jens Axboe
2015-08-10 11:03             ` Rafal Mielniczuk
2015-08-10 11:03             ` [Xen-devel] " Rafal Mielniczuk
2015-08-10 11:14               ` Bob Liu
2015-08-10 11:14               ` [Xen-devel] " Bob Liu
2015-08-10 15:52               ` Jens Axboe
2015-08-10 15:52               ` [Xen-devel] " Jens Axboe
2015-08-11  6:07                 ` Bob Liu
2015-08-11  6:07                 ` [Xen-devel] " Bob Liu
2015-08-11  9:45                   ` Rafal Mielniczuk
2015-08-11  9:45                   ` [Xen-devel] " Rafal Mielniczuk
2015-08-11 17:32                     ` Jens Axboe
2015-08-12 10:16                       ` Bob Liu
2015-08-12 16:46                         ` Rafal Mielniczuk
2015-08-14  8:29                           ` Bob Liu
2015-08-14 12:30                             ` Rafal Mielniczuk
2015-08-14 12:30                             ` [Xen-devel] " Rafal Mielniczuk
2015-08-18  9:45                               ` Rafal Mielniczuk
2015-08-18  9:45                               ` [Xen-devel] " Rafal Mielniczuk
2015-08-14  8:29                           ` Bob Liu
2015-08-12 16:46                         ` Rafal Mielniczuk
2015-08-12 10:16                       ` Bob Liu
2015-08-11 17:32                     ` Jens Axboe
2015-04-28  7:36   ` Christoph Hellwig
2014-10-01 20:27 ` Konrad Rzeszutek Wilk
