xen-devel.lists.xenproject.org archive mirror
* [PATCH 2/3] driver: xen-blkfront: move talk_to_blkback to a more suitable place
       [not found] <1433310003-13089-1-git-send-email-bob.liu@oracle.com>
@ 2015-06-03  5:40 ` Bob Liu
  2015-06-03  5:40 ` [PATCH 3/3] xen/block: add multi-page ring support Bob Liu
       [not found] ` <1433310003-13089-3-git-send-email-bob.liu@oracle.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Bob Liu @ 2015-06-03  5:40 UTC (permalink / raw)
  To: xen-devel
  Cc: linux-kernel, julien.grall, justing, Bob Liu, paul.durrant,
	david.vrabel, roger.pau

The major responsibility of talk_to_blkback() is to allocate and initialize
the request ring and to write the ring info to xenstore.
But this work should be done after the backend has entered 'XenbusStateInitWait',
as defined in the protocol file.
See xen/include/public/io/blkif.h in the Xen git tree:
Front                                Back
=================================    =====================================
XenbusStateInitialising              XenbusStateInitialising
 o Query virtual device               o Query backend device identification
   properties.                          data.
 o Setup OS device instance.          o Open and validate backend device.
                                      o Publish backend features and
                                        transport parameters.
                                                     |
                                                     |
                                                     V
                                     XenbusStateInitWait

o Query backend features and
  transport parameters.
o Allocate and initialize the
  request ring.

There is no problem with this yet, but it is a violation of the design, and
furthermore it would not allow the frontend and backend to negotiate the
'multi-page' and 'multi-queue' features.
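
For illustration only (not part of this patch): once talk_to_blkback() runs
from the XenbusStateInitWait handler, the frontend can read backend-advertised
transport parameters before it allocates the ring. A minimal sketch, assuming
<xen/xenbus.h> and the 'max-ring-page-order' key introduced by patch 3/3 of
this series (the helper name is hypothetical):

/* Sketch: query the backend's advertised maximum ring page order.
 * A missing key means the backend only supports a single-page ring. */
static unsigned int example_read_max_ring_page_order(struct xenbus_device *dev)
{
	unsigned int max_page_order = 0;

	if (xenbus_scanf(XBT_NIL, dev->otherend,
			 "max-ring-page-order", "%u", &max_page_order) != 1)
		max_page_order = 0;

	return max_page_order;
}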

Changes in v2:
 - Rewrite the commit message to be clearer.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
---
 drivers/block/xen-blkfront.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c61cf8..88e23fd 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1430,13 +1430,6 @@ static int blkfront_probe(struct xenbus_device *dev,
 	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
 	dev_set_drvdata(&dev->dev, info);
 
-	err = talk_to_blkback(dev, info);
-	if (err) {
-		kfree(info);
-		dev_set_drvdata(&dev->dev, NULL);
-		return err;
-	}
-
 	return 0;
 }
 
@@ -1906,8 +1899,13 @@ static void blkback_changed(struct xenbus_device *dev,
 	dev_dbg(&dev->dev, "blkfront:blkback_changed to state %d.\n", backend_state);
 
 	switch (backend_state) {
-	case XenbusStateInitialising:
 	case XenbusStateInitWait:
+		if (talk_to_blkback(dev, info)) {
+			kfree(info);
+			dev_set_drvdata(&dev->dev, NULL);
+			break;
+		}
+	case XenbusStateInitialising:
 	case XenbusStateInitialised:
 	case XenbusStateReconfiguring:
 	case XenbusStateReconfigured:
-- 
1.7.10.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* [PATCH 3/3] xen/block: add multi-page ring support
       [not found] <1433310003-13089-1-git-send-email-bob.liu@oracle.com>
  2015-06-03  5:40 ` [PATCH 2/3] driver: xen-blkfront: move talk_to_blkback to a more suitable place Bob Liu
@ 2015-06-03  5:40 ` Bob Liu
       [not found] ` <1433310003-13089-3-git-send-email-bob.liu@oracle.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Bob Liu @ 2015-06-03  5:40 UTC (permalink / raw)
  To: xen-devel
  Cc: linux-kernel, julien.grall, justing, Bob Liu, paul.durrant,
	david.vrabel, roger.pau

Extend xen/block to support a multi-page ring, so that more requests can be
issued by using more than one page as the request ring between blkfront
and the backend.
As a result, performance can improve significantly.

We saw some impressive improvements on our high-end iSCSI storage cluster
backend. Using 64 pages for the ring, IOPS increased about 15 times in the
throughput test and more than doubled in the latency test.

The reason is that the limit on outstanding requests is 32 when only a
one-page ring is used, but in our case the iSCSI LUN was spread across about
100 physical drives, and 32 requests were not enough to keep them busy.
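
For reference, a rough sketch of how the slot count scales with the ring page
order (all numbers are assumptions for illustration: 4K pages, a 112-byte ring
slot, a 64-byte ring header, and the round-down-to-a-power-of-two behaviour of
the ring macros; none are taken from this patch):

#include <stdio.h>

/* Approximate number of blkif ring slots for a given ring-page-order. */
static unsigned int ring_slots(unsigned int page_order)
{
	unsigned long bytes = (4096UL << page_order) - 64;
	unsigned int slots = bytes / 112, pow2 = 1;

	while (pow2 * 2 <= slots)	/* round down to a power of two */
		pow2 *= 2;
	return pow2;
}

int main(void)
{
	/* order 0 -> 32 slots (the old limit), order 6 (64 pages) -> ~2048 */
	printf("order 0: %u slots, order 6: %u slots\n",
	       ring_slots(0), ring_slots(6));
	return 0;
}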

Changes in v2:
 - Rebased to 4.0-rc6.
 - Document how the multi-page ring feature works in linux io/blkif.h.

Changes in v3:
 - Remove changes to linux io/blkif.h and follow the protocol defined
   in io/blkif.h of the Xen tree.
 - Rebased to 4.1-rc3

Changes in v4:
 - Switch to using 'ring-page-order' and 'max-ring-page-order'.
 - Address a few comments from Roger.

Changes in v5:
 - Clarify the 4K granularity in the comments
 - Address more comments from Roger

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |   13 ++++
 drivers/block/xen-blkback/common.h  |    2 +
 drivers/block/xen-blkback/xenbus.c  |   89 +++++++++++++++++------
 drivers/block/xen-blkfront.c        |  135 +++++++++++++++++++++++++----------
 4 files changed, 180 insertions(+), 59 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 713fc9f..2126842 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
                  "Maximum number of grants to map persistently");
 
 /*
+ * Maximum order of pages to be used for the shared ring between front and
+ * backend, 4KB page granularity is used.
+ */
+unsigned int xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, S_IRUGO);
+MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the shared ring");
+/*
  * The LRU mechanism to clean the lists of persistent grants needs to
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
@@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
 	if (!xen_domain())
 		return -ENODEV;
 
+	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
+		pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
+			xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
+		xen_blkif_max_ring_order = XENBUS_MAX_RING_PAGE_ORDER;
+	}
+
 	rc = xen_blkif_interface_init();
 	if (rc)
 		goto failed_init;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 043f13b..8ccc49d 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -44,6 +44,7 @@
 #include <xen/interface/io/blkif.h>
 #include <xen/interface/io/protocols.h>
 
+extern unsigned int xen_blkif_max_ring_order;
 /*
  * This is the maximum number of segments that would be allowed in indirect
  * requests. This value will also be passed to the frontend.
@@ -320,6 +321,7 @@ struct xen_blkif {
 	struct work_struct	free_work;
 	/* Thread shutdown wait queue. */
 	wait_queue_head_t	shutdown_wq;
+	unsigned int nr_ring_pages;
 };
 
 struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index c212d41..deb3f00 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -25,6 +25,7 @@
 
 /* Enlarge the array size in order to fully show blkback name. */
 #define BLKBACK_NAME_LEN (20)
+#define RINGREF_NAME_LEN (20)
 
 struct backend_info {
 	struct xenbus_device	*dev;
@@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 	return blkif;
 }
 
-static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
-			 unsigned int evtchn)
+static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
+			 unsigned int nr_grefs, unsigned int evtchn)
 {
 	int err;
 
@@ -165,7 +166,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
 	if (blkif->irq)
 		return 0;
 
-	err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
+	err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
 				     &blkif->blk_ring);
 	if (err < 0)
 		return err;
@@ -175,21 +176,21 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
 	{
 		struct blkif_sring *sring;
 		sring = (struct blkif_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
+		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE * nr_grefs);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_32:
 	{
 		struct blkif_x86_32_sring *sring_x86_32;
 		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
+		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE * nr_grefs);
 		break;
 	}
 	case BLKIF_PROTOCOL_X86_64:
 	{
 		struct blkif_x86_64_sring *sring_x86_64;
 		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
-		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
+		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE * nr_grefs);
 		break;
 	}
 	default:
@@ -270,7 +271,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
 		i++;
 	}
 
-	WARN_ON(i != XEN_BLKIF_REQS_PER_PAGE);
+	WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
 
 	kmem_cache_free(xen_blkif_cachep, blkif);
 }
@@ -555,6 +556,11 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
 	if (err)
 		goto fail;
 
+	err = xenbus_printf(XBT_NIL, dev->nodename, "max-ring-page-order", "%u",
+			    xen_blkif_max_ring_order);
+	if (err)
+		pr_warn("%s write out 'max-ring-page-order' failed\n", __func__);
+
 	err = xenbus_switch_state(dev, XenbusStateInitWait);
 	if (err)
 		goto fail;
@@ -818,8 +824,8 @@ again:
 static int connect_ring(struct backend_info *be)
 {
 	struct xenbus_device *dev = be->dev;
-	unsigned long ring_ref;
-	unsigned int evtchn;
+	unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
+	unsigned int evtchn, nr_grefs, ring_page_order;
 	unsigned int pers_grants;
 	char protocol[64] = "";
 	struct pending_req *req, *n;
@@ -827,14 +833,57 @@ static int connect_ring(struct backend_info *be)
 
 	pr_debug("%s %s\n", __func__, dev->otherend);
 
-	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
-			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
-	if (err) {
-		xenbus_dev_fatal(dev, err,
-				 "reading %s/ring-ref and event-channel",
+	err = xenbus_scanf(XBT_NIL, dev->otherend, "event-channel", "%u",
+			  &evtchn);
+	if (err != 1) {
+		err = -EINVAL;
+		xenbus_dev_fatal(dev, err, "reading %s/event-channel",
 				 dev->otherend);
 		return err;
 	}
+	pr_info("event-channel %u\n", evtchn);
+
+	err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
+			  &ring_page_order);
+	if (err != 1) {
+		err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref",
+				  "%u", &ring_ref[0]);
+		if (err != 1) {
+			err = -EINVAL;
+			xenbus_dev_fatal(dev, err, "reading %s/ring-ref",
+					 dev->otherend);
+			return err;
+		}
+		nr_grefs = 1;
+		pr_info("%s:using single page: ring-ref %d\n", dev->otherend,
+			ring_ref[0]);
+	} else {
+		unsigned int i;
+
+		if (ring_page_order > xen_blkif_max_ring_order) {
+			err = -EINVAL;
+			xenbus_dev_fatal(dev, err, "%s/request %d ring page order exceed max:%d",
+					 dev->otherend, ring_page_order,
+					 xen_blkif_max_ring_order);
+			return err;
+		}
+
+		nr_grefs = 1 << ring_page_order;
+		for (i = 0; i < nr_grefs; i++) {
+			char ring_ref_name[RINGREF_NAME_LEN];
+
+			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+			err = xenbus_scanf(XBT_NIL, dev->otherend, ring_ref_name,
+					   "%u", &ring_ref[i]);
+			if (err != 1) {
+				err = -EINVAL;
+				xenbus_dev_fatal(dev, err, "reading %s/%s",
+						 dev->otherend, ring_ref_name);
+				return err;
+			}
+			pr_info("ring-ref%u: %u\n", i, ring_ref[i]);
+		}
+	}
 
 	be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
 	err = xenbus_gather(XBT_NIL, dev->otherend, "protocol",
@@ -859,12 +908,13 @@ static int connect_ring(struct backend_info *be)
 
 	be->blkif->vbd.feature_gnt_persistent = pers_grants;
 	be->blkif->vbd.overflow_max_grants = 0;
+	be->blkif->nr_ring_pages = nr_grefs;
 
-	pr_info("ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
-		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
+	pr_info("ring-pages:%d, event-channel %d, protocol %d (%s) %s\n",
+		nr_grefs, evtchn, be->blkif->blk_protocol, protocol,
 		pers_grants ? "persistent grants" : "");
 
-	for (i = 0; i < XEN_BLKIF_REQS_PER_PAGE; i++) {
+	for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
 		req = kzalloc(sizeof(*req), GFP_KERNEL);
 		if (!req)
 			goto fail;
@@ -883,10 +933,9 @@ static int connect_ring(struct backend_info *be)
 	}
 
 	/* Map the shared frame, irq etc. */
-	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
+	err = xen_blkif_map(be->blkif, ring_ref, nr_grefs, evtchn);
 	if (err) {
-		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
-				 ring_ref, evtchn);
+		xenbus_dev_fatal(dev, err, "mapping ring-ref port %u", evtchn);
 		return err;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 88e23fd..d3c1a95 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -98,7 +98,21 @@ static unsigned int xen_blkif_max_segments = 32;
 module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
 MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default is 32)");
 
-#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
+/*
+ * Maximum order of pages to be used for the shared ring between front and
+ * backend, 4KB page granularity is used.
+ */
+static unsigned int xen_blkif_max_ring_order;
+module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, S_IRUGO);
+MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the shared ring");
+
+#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE * (info)->nr_ring_pages)
+#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * XENBUS_MAX_RING_PAGES)
+/*
+ * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
+ * characters are enough. Define to 20 to keep consist with backend.
+ */
+#define RINGREF_NAME_LEN (20)
 
 /*
  * We have one of these per vbd, whether ide, scsi or 'other'.  They
@@ -114,13 +128,14 @@ struct blkfront_info
 	int vdevice;
 	blkif_vdev_t handle;
 	enum blkif_state connected;
-	int ring_ref;
+	int ring_ref[XENBUS_MAX_RING_PAGES];
+	unsigned int nr_ring_pages;
 	struct blkif_front_ring ring;
 	unsigned int evtchn, irq;
 	struct request_queue *rq;
 	struct work_struct work;
 	struct gnttab_free_callback callback;
-	struct blk_shadow shadow[BLK_RING_SIZE];
+	struct blk_shadow shadow[BLK_MAX_RING_SIZE];
 	struct list_head grants;
 	struct list_head indirect_pages;
 	unsigned int persistent_gnts_c;
@@ -139,8 +154,6 @@ static unsigned int nr_minors;
 static unsigned long *minors;
 static DEFINE_SPINLOCK(minor_lock);
 
-#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
-	(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
 #define GRANT_INVALID_REF	0
 
 #define PARTS_PER_DISK		16
@@ -170,7 +183,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info);
 static int get_id_from_freelist(struct blkfront_info *info)
 {
 	unsigned long free = info->shadow_free;
-	BUG_ON(free >= BLK_RING_SIZE);
+	BUG_ON(free >= BLK_RING_SIZE(info));
 	info->shadow_free = info->shadow[free].req.u.rw.id;
 	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
 	return free;
@@ -983,7 +996,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		}
 	}
 
-	for (i = 0; i < BLK_RING_SIZE; i++) {
+	for (i = 0; i < BLK_RING_SIZE(info); i++) {
 		/*
 		 * Clear persistent grants present in requests already
 		 * on the shared ring
@@ -1033,12 +1046,15 @@ free_shadow:
 	flush_work(&info->work);
 
 	/* Free resources associated with old device channel. */
-	if (info->ring_ref != GRANT_INVALID_REF) {
-		gnttab_end_foreign_access(info->ring_ref, 0,
-					  (unsigned long)info->ring.sring);
-		info->ring_ref = GRANT_INVALID_REF;
-		info->ring.sring = NULL;
+	for (i = 0; i < info->nr_ring_pages; i++) {
+		if (info->ring_ref[i] != GRANT_INVALID_REF) {
+			gnttab_end_foreign_access(info->ring_ref[i], 0, 0);
+			info->ring_ref[i] = GRANT_INVALID_REF;
+		}
 	}
+	free_pages((unsigned long)info->ring.sring, get_order(info->nr_ring_pages * PAGE_SIZE));
+	info->ring.sring = NULL;
+
 	if (info->irq)
 		unbind_from_irqhandler(info->irq, info);
 	info->evtchn = info->irq = 0;
@@ -1157,7 +1173,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 		 * never have given to it (we stamp it up to BLK_RING_SIZE -
 		 * look in get_id_from_freelist.
 		 */
-		if (id >= BLK_RING_SIZE) {
+		if (id >= BLK_RING_SIZE(info)) {
 			WARN(1, "%s: response to %s has incorrect id (%ld)\n",
 			     info->gd->disk_name, op_name(bret->operation), id);
 			/* We can't safely get the 'struct request' as
@@ -1245,26 +1261,30 @@ static int setup_blkring(struct xenbus_device *dev,
 			 struct blkfront_info *info)
 {
 	struct blkif_sring *sring;
-	grant_ref_t gref;
-	int err;
+	int err, i;
+	unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE;
+	grant_ref_t gref[XENBUS_MAX_RING_PAGES];
 
-	info->ring_ref = GRANT_INVALID_REF;
+	for (i = 0; i < info->nr_ring_pages; i++)
+		info->ring_ref[i] = GRANT_INVALID_REF;
 
-	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO | __GFP_HIGH);
+	sring = (struct blkif_sring *)__get_free_pages(GFP_NOIO | __GFP_HIGH,
+						       get_order(ring_size));
 	if (!sring) {
 		xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring");
 		return -ENOMEM;
 	}
 	SHARED_RING_INIT(sring);
-	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
+	FRONT_RING_INIT(&info->ring, sring, ring_size);
 
-	err = xenbus_grant_ring(dev, info->ring.sring, 1, &gref);
+	err = xenbus_grant_ring(dev, info->ring.sring, info->nr_ring_pages, gref);
 	if (err < 0) {
-		free_page((unsigned long)sring);
+		free_pages((unsigned long)sring, get_order(ring_size));
 		info->ring.sring = NULL;
 		goto fail;
 	}
-	info->ring_ref = gref;
+	for (i = 0; i < info->nr_ring_pages; i++)
+		info->ring_ref[i] = gref[i];
 
 	err = xenbus_alloc_evtchn(dev, &info->evtchn);
 	if (err)
@@ -1292,7 +1312,18 @@ static int talk_to_blkback(struct xenbus_device *dev,
 {
 	const char *message = NULL;
 	struct xenbus_transaction xbt;
-	int err;
+	int err, i;
+	unsigned int max_page_order = 0;
+	unsigned int ring_page_order = 0;
+
+	err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
+			   "max-ring-page-order", "%u", &max_page_order);
+	if (err != 1)
+		info->nr_ring_pages = 1;
+	else {
+		ring_page_order = min(xen_blkif_max_ring_order, max_page_order);
+		info->nr_ring_pages = 1 << ring_page_order;
+	}
 
 	/* Create shared ring, alloc event channel. */
 	err = setup_blkring(dev, info);
@@ -1306,11 +1337,32 @@ again:
 		goto destroy_blkring;
 	}
 
-	err = xenbus_printf(xbt, dev->nodename,
-			    "ring-ref", "%u", info->ring_ref);
-	if (err) {
-		message = "writing ring-ref";
-		goto abort_transaction;
+	if (info->nr_ring_pages == 1) {
+		err = xenbus_printf(xbt, dev->nodename,
+				    "ring-ref", "%u", info->ring_ref[0]);
+		if (err) {
+			message = "writing ring-ref";
+			goto abort_transaction;
+		}
+	} else {
+		err = xenbus_printf(xbt, dev->nodename,
+				    "ring-page-order", "%u", ring_page_order);
+		if (err) {
+			message = "writing ring-page-order";
+			goto abort_transaction;
+		}
+
+		for (i = 0; i < info->nr_ring_pages; i++) {
+			char ring_ref_name[RINGREF_NAME_LEN];
+
+			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+			err = xenbus_printf(xbt, dev->nodename, ring_ref_name,
+					    "%u", info->ring_ref[i]);
+			if (err) {
+				message = "writing ring-ref";
+				goto abort_transaction;
+			}
+		}
 	}
 	err = xenbus_printf(xbt, dev->nodename,
 			    "event-channel", "%u", info->evtchn);
@@ -1338,6 +1390,9 @@ again:
 		goto destroy_blkring;
 	}
 
+	for (i = 0; i < BLK_RING_SIZE(info); i++)
+		info->shadow[i].req.u.rw.id = i+1;
+	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
 	xenbus_switch_state(dev, XenbusStateInitialised);
 
 	return 0;
@@ -1361,7 +1416,7 @@ again:
 static int blkfront_probe(struct xenbus_device *dev,
 			  const struct xenbus_device_id *id)
 {
-	int err, vdevice, i;
+	int err, vdevice;
 	struct blkfront_info *info;
 
 	/* FIXME: Use dynamic device id if this is not set. */
@@ -1422,10 +1477,6 @@ static int blkfront_probe(struct xenbus_device *dev,
 	info->connected = BLKIF_STATE_DISCONNECTED;
 	INIT_WORK(&info->work, blkif_restart_queue);
 
-	for (i = 0; i < BLK_RING_SIZE; i++)
-		info->shadow[i].req.u.rw.id = i+1;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
-
 	/* Front end dir is a number, which is used as the id. */
 	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
 	dev_set_drvdata(&dev->dev, info);
@@ -1469,10 +1520,10 @@ static int blkif_recover(struct blkfront_info *info)
 
 	/* Stage 2: Set up free list. */
 	memset(&info->shadow, 0, sizeof(info->shadow));
-	for (i = 0; i < BLK_RING_SIZE; i++)
+	for (i = 0; i < BLK_RING_SIZE(info); i++)
 		info->shadow[i].req.u.rw.id = i+1;
 	info->shadow_free = info->ring.req_prod_pvt;
-	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
 
 	rc = blkfront_setup_indirect(info);
 	if (rc) {
@@ -1484,7 +1535,7 @@ static int blkif_recover(struct blkfront_info *info)
 	blk_queue_max_segments(info->rq, segs);
 	bio_list_init(&bio_list);
 	INIT_LIST_HEAD(&requests);
-	for (i = 0; i < BLK_RING_SIZE; i++) {
+	for (i = 0; i < BLK_RING_SIZE(info); i++) {
 		/* Not in use? */
 		if (!copy[i].request)
 			continue;
@@ -1690,7 +1741,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		segs = info->max_indirect_segments;
 	}
 
-	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE(info));
 	if (err)
 		goto out_of_memory;
 
@@ -1700,7 +1751,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		 * grants, we need to allocate a set of pages that can be
 		 * used for mapping indirect grefs
 		 */
-		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
+		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE(info);
 
 		BUG_ON(!list_empty(&info->indirect_pages));
 		for (i = 0; i < num; i++) {
@@ -1711,7 +1762,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 		}
 	}
 
-	for (i = 0; i < BLK_RING_SIZE; i++) {
+	for (i = 0; i < BLK_RING_SIZE(info); i++) {
 		info->shadow[i].grants_used = kzalloc(
 			sizeof(info->shadow[i].grants_used[0]) * segs,
 			GFP_NOIO);
@@ -1733,7 +1784,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
 	return 0;
 
 out_of_memory:
-	for (i = 0; i < BLK_RING_SIZE; i++) {
+	for (i = 0; i < BLK_RING_SIZE(info); i++) {
 		kfree(info->shadow[i].grants_used);
 		info->shadow[i].grants_used = NULL;
 		kfree(info->shadow[i].sg);
@@ -2089,6 +2140,12 @@ static int __init xlblk_init(void)
 	if (!xen_domain())
 		return -ENODEV;
 
+	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
+		pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
+			xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
+		xen_blkif_max_ring_order = 0;
+	}
+
 	if (!xen_has_pv_disk_devices())
 		return -ENODEV;
 
-- 
1.7.10.4


* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found] ` <1433310003-13089-3-git-send-email-bob.liu@oracle.com>
@ 2015-06-09  8:50   ` Bob Liu
       [not found]   ` <5576A8C0.8000804@oracle.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Bob Liu @ 2015-06-09  8:50 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, xen-devel, julien.grall, justing, paul.durrant,
	david.vrabel, roger.pau


On 06/03/2015 01:40 PM, Bob Liu wrote:
> Extend xen/block to support a multi-page ring, so that more requests can be
> issued by using more than one page as the request ring between blkfront
> and the backend.
> As a result, performance can improve significantly.
> 
> We saw some impressive improvements on our high-end iSCSI storage cluster
> backend. Using 64 pages for the ring, IOPS increased about 15 times in the
> throughput test and more than doubled in the latency test.
> 
> The reason is that the limit on outstanding requests is 32 when only a
> one-page ring is used, but in our case the iSCSI LUN was spread across about
> 100 physical drives, and 32 requests were not enough to keep them busy.
> 
> Changes in v2:
>  - Rebased to 4.0-rc6.
>  - Document how the multi-page ring feature works in linux io/blkif.h.
> 
> Changes in v3:
>  - Remove changes to linux io/blkif.h and follow the protocol defined
>    in io/blkif.h of the Xen tree.
>  - Rebased to 4.1-rc3
> 
> Changes in v4:
>  - Switch to using 'ring-page-order' and 'max-ring-page-order'.
>  - Address a few comments from Roger.
> 
> Changes in v5:
>  - Clarify the 4K granularity in the comments
>  - Address more comments from Roger
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>

Also tested the Windows PV driver, which also works fine when the multi-page
ring feature is enabled in the Linux backend.
http://www.xenproject.org/downloads/windows-pv-drivers.html

Regards,
-Bob


* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]   ` <5576A8C0.8000804@oracle.com>
@ 2015-06-09  8:52     ` Paul Durrant
       [not found]     ` <9AAE0902D5BC7E449B7C8E4E778ABCD0259410F3@AMSPEX01CL01.citrite.net>
  1 sibling, 0 replies; 16+ messages in thread
From: Paul Durrant @ 2015-06-09  8:52 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-kernel, xen-devel, Julien Grall, justing, David Vrabel,
	Roger Pau Monne

> -----Original Message-----
> From: Bob Liu [mailto:bob.liu@oracle.com]
> Sent: 09 June 2015 09:50
> To: Bob Liu
> Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
> konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
> 
> 
> On 06/03/2015 01:40 PM, Bob Liu wrote:
> > Extend xen/block to support a multi-page ring, so that more requests can be
> > issued by using more than one page as the request ring between blkfront
> > and the backend.
> > As a result, performance can improve significantly.
> >
> > We saw some impressive improvements on our high-end iSCSI storage cluster
> > backend. Using 64 pages for the ring, IOPS increased about 15 times in the
> > throughput test and more than doubled in the latency test.
> >
> > The reason is that the limit on outstanding requests is 32 when only a
> > one-page ring is used, but in our case the iSCSI LUN was spread across about
> > 100 physical drives, and 32 requests were not enough to keep them busy.
> >
> > Changes in v2:
> >  - Rebased to 4.0-rc6.
> >  - Document how the multi-page ring feature works in linux io/blkif.h.
> >
> > Changes in v3:
> >  - Remove changes to linux io/blkif.h and follow the protocol defined
> >    in io/blkif.h of the Xen tree.
> >  - Rebased to 4.1-rc3
> >
> > Changes in v4:
> >  - Switch to using 'ring-page-order' and 'max-ring-page-order'.
> >  - Address a few comments from Roger.
> >
> > Changes in v5:
> >  - Clarify the 4K granularity in the comments
> >  - Address more comments from Roger
> >
> > Signed-off-by: Bob Liu <bob.liu@oracle.com>
> 
> Also tested the Windows PV driver, which also works fine when the multi-page
> ring feature is enabled in the Linux backend.
> http://www.xenproject.org/downloads/windows-pv-drivers.html
> 

Great! Thanks for verifying that :-)

  Paul

> Regards,
> -Bob
> 
> > ---
> >  drivers/block/xen-blkback/blkback.c |   13 ++++
> >  drivers/block/xen-blkback/common.h  |    2 +
> >  drivers/block/xen-blkback/xenbus.c  |   89 +++++++++++++++++------
> >  drivers/block/xen-blkfront.c        |  135 +++++++++++++++++++++++++----
> ------
> >  4 files changed, 180 insertions(+), 59 deletions(-)
> >
> > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-
> blkback/blkback.c
> > index 713fc9f..2126842 100644
> > --- a/drivers/block/xen-blkback/blkback.c
> > +++ b/drivers/block/xen-blkback/blkback.c
> > @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
> >                   "Maximum number of grants to map persistently");
> >
> >  /*
> > + * Maximum order of pages to be used for the shared ring between front
> and
> > + * backend, 4KB page granularity is used.
> > + */
> > +unsigned int xen_blkif_max_ring_order =
> XENBUS_MAX_RING_PAGE_ORDER;
> > +module_param_named(max_ring_page_order,
> xen_blkif_max_ring_order, int, S_IRUGO);
> > +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
> to be used for the shared ring");
> > +/*
> >   * The LRU mechanism to clean the lists of persistent grants needs to
> >   * be executed periodically. The time interval between consecutive
> executions
> >   * of the purge mechanism is set in ms.
> > @@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
> >  	if (!xen_domain())
> >  		return -ENODEV;
> >
> > +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
> {
> > +		pr_info("Invalid max_ring_order (%d), will use default max:
> %d.\n",
> > +			xen_blkif_max_ring_order,
> XENBUS_MAX_RING_PAGE_ORDER);
> > +		xen_blkif_max_ring_order =
> XENBUS_MAX_RING_PAGE_ORDER;
> > +	}
> > +
> >  	rc = xen_blkif_interface_init();
> >  	if (rc)
> >  		goto failed_init;
> > diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-
> blkback/common.h
> > index 043f13b..8ccc49d 100644
> > --- a/drivers/block/xen-blkback/common.h
> > +++ b/drivers/block/xen-blkback/common.h
> > @@ -44,6 +44,7 @@
> >  #include <xen/interface/io/blkif.h>
> >  #include <xen/interface/io/protocols.h>
> >
> > +extern unsigned int xen_blkif_max_ring_order;
> >  /*
> >   * This is the maximum number of segments that would be allowed in
> indirect
> >   * requests. This value will also be passed to the frontend.
> > @@ -320,6 +321,7 @@ struct xen_blkif {
> >  	struct work_struct	free_work;
> >  	/* Thread shutdown wait queue. */
> >  	wait_queue_head_t	shutdown_wq;
> > +	unsigned int nr_ring_pages;
> >  };
> >
> >  struct seg_buf {
> > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-
> blkback/xenbus.c
> > index c212d41..deb3f00 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -25,6 +25,7 @@
> >
> >  /* Enlarge the array size in order to fully show blkback name. */
> >  #define BLKBACK_NAME_LEN (20)
> > +#define RINGREF_NAME_LEN (20)
> >
> >  struct backend_info {
> >  	struct xenbus_device	*dev;
> > @@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t
> domid)
> >  	return blkif;
> >  }
> >
> > -static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
> > -			 unsigned int evtchn)
> > +static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
> > +			 unsigned int nr_grefs, unsigned int evtchn)
> >  {
> >  	int err;
> >
> > @@ -165,7 +166,7 @@ static int xen_blkif_map(struct xen_blkif *blkif,
> grant_ref_t gref,
> >  	if (blkif->irq)
> >  		return 0;
> >
> > -	err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
> > +	err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
> >  				     &blkif->blk_ring);
> >  	if (err < 0)
> >  		return err;
> > @@ -175,21 +176,21 @@ static int xen_blkif_map(struct xen_blkif *blkif,
> grant_ref_t gref,
> >  	{
> >  		struct blkif_sring *sring;
> >  		sring = (struct blkif_sring *)blkif->blk_ring;
> > -		BACK_RING_INIT(&blkif->blk_rings.native, sring,
> PAGE_SIZE);
> > +		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE
> * nr_grefs);
> >  		break;
> >  	}
> >  	case BLKIF_PROTOCOL_X86_32:
> >  	{
> >  		struct blkif_x86_32_sring *sring_x86_32;
> >  		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
> > -		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
> PAGE_SIZE);
> > +		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
> PAGE_SIZE * nr_grefs);
> >  		break;
> >  	}
> >  	case BLKIF_PROTOCOL_X86_64:
> >  	{
> >  		struct blkif_x86_64_sring *sring_x86_64;
> >  		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
> > -		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
> PAGE_SIZE);
> > +		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
> PAGE_SIZE * nr_grefs);
> >  		break;
> >  	}
> >  	default:
> > @@ -270,7 +271,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
> >  		i++;
> >  	}
> >
> > -	WARN_ON(i != XEN_BLKIF_REQS_PER_PAGE);
> > +	WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif-
> >nr_ring_pages));
> >
> >  	kmem_cache_free(xen_blkif_cachep, blkif);
> >  }
> > @@ -555,6 +556,11 @@ static int xen_blkbk_probe(struct xenbus_device
> *dev,
> >  	if (err)
> >  		goto fail;
> >
> > +	err = xenbus_printf(XBT_NIL, dev->nodename, "max-ring-page-
> order", "%u",
> > +			    xen_blkif_max_ring_order);
> > +	if (err)
> > +		pr_warn("%s write out 'max-ring-page-order' failed\n",
> __func__);
> > +
> >  	err = xenbus_switch_state(dev, XenbusStateInitWait);
> >  	if (err)
> >  		goto fail;
> > @@ -818,8 +824,8 @@ again:
> >  static int connect_ring(struct backend_info *be)
> >  {
> >  	struct xenbus_device *dev = be->dev;
> > -	unsigned long ring_ref;
> > -	unsigned int evtchn;
> > +	unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
> > +	unsigned int evtchn, nr_grefs, ring_page_order;
> >  	unsigned int pers_grants;
> >  	char protocol[64] = "";
> >  	struct pending_req *req, *n;
> > @@ -827,14 +833,57 @@ static int connect_ring(struct backend_info *be)
> >
> >  	pr_debug("%s %s\n", __func__, dev->otherend);
> >
> > -	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
> > -			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
> > -	if (err) {
> > -		xenbus_dev_fatal(dev, err,
> > -				 "reading %s/ring-ref and event-channel",
> > +	err = xenbus_scanf(XBT_NIL, dev->otherend, "event-channel",
> "%u",
> > +			  &evtchn);
> > +	if (err != 1) {
> > +		err = -EINVAL;
> > +		xenbus_dev_fatal(dev, err, "reading %s/event-channel",
> >  				 dev->otherend);
> >  		return err;
> >  	}
> > +	pr_info("event-channel %u\n", evtchn);
> > +
> > +	err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order",
> "%u",
> > +			  &ring_page_order);
> > +	if (err != 1) {
> > +		err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref",
> > +				  "%u", &ring_ref[0]);
> > +		if (err != 1) {
> > +			err = -EINVAL;
> > +			xenbus_dev_fatal(dev, err, "reading %s/ring-ref",
> > +					 dev->otherend);
> > +			return err;
> > +		}
> > +		nr_grefs = 1;
> > +		pr_info("%s:using single page: ring-ref %d\n", dev-
> >otherend,
> > +			ring_ref[0]);
> > +	} else {
> > +		unsigned int i;
> > +
> > +		if (ring_page_order > xen_blkif_max_ring_order) {
> > +			err = -EINVAL;
> > +			xenbus_dev_fatal(dev, err, "%s/request %d ring
> page order exceed max:%d",
> > +					 dev->otherend, ring_page_order,
> > +					 xen_blkif_max_ring_order);
> > +			return err;
> > +		}
> > +
> > +		nr_grefs = 1 << ring_page_order;
> > +		for (i = 0; i < nr_grefs; i++) {
> > +			char ring_ref_name[RINGREF_NAME_LEN];
> > +
> > +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-
> ref%u", i);
> > +			err = xenbus_scanf(XBT_NIL, dev->otherend,
> ring_ref_name,
> > +					   "%u", &ring_ref[i]);
> > +			if (err != 1) {
> > +				err = -EINVAL;
> > +				xenbus_dev_fatal(dev, err, "reading %s/%s",
> > +						 dev->otherend,
> ring_ref_name);
> > +				return err;
> > +			}
> > +			pr_info("ring-ref%u: %u\n", i, ring_ref[i]);
> > +		}
> > +	}
> >
> >  	be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
> >  	err = xenbus_gather(XBT_NIL, dev->otherend, "protocol",
> > @@ -859,12 +908,13 @@ static int connect_ring(struct backend_info *be)
> >
> >  	be->blkif->vbd.feature_gnt_persistent = pers_grants;
> >  	be->blkif->vbd.overflow_max_grants = 0;
> > +	be->blkif->nr_ring_pages = nr_grefs;
> >
> > -	pr_info("ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
> > -		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> > +	pr_info("ring-pages:%d, event-channel %d, protocol %d (%s) %s\n",
> > +		nr_grefs, evtchn, be->blkif->blk_protocol, protocol,
> >  		pers_grants ? "persistent grants" : "");
> >
> > -	for (i = 0; i < XEN_BLKIF_REQS_PER_PAGE; i++) {
> > +	for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
> >  		req = kzalloc(sizeof(*req), GFP_KERNEL);
> >  		if (!req)
> >  			goto fail;
> > @@ -883,10 +933,9 @@ static int connect_ring(struct backend_info *be)
> >  	}
> >
> >  	/* Map the shared frame, irq etc. */
> > -	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> > +	err = xen_blkif_map(be->blkif, ring_ref, nr_grefs, evtchn);
> >  	if (err) {
> > -		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
> > -				 ring_ref, evtchn);
> > +		xenbus_dev_fatal(dev, err, "mapping ring-ref port %u",
> evtchn);
> >  		return err;
> >  	}
> >
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 88e23fd..d3c1a95 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -98,7 +98,21 @@ static unsigned int xen_blkif_max_segments = 32;
> >  module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
> >  MODULE_PARM_DESC(max, "Maximum amount of segments in indirect
> requests (default is 32)");
> >
> > -#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
> > +/*
> > + * Maximum order of pages to be used for the shared ring between front
> and
> > + * backend, 4KB page granularity is used.
> > + */
> > +static unsigned int xen_blkif_max_ring_order;
> > +module_param_named(max_ring_page_order,
> xen_blkif_max_ring_order, int, S_IRUGO);
> > +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
> to be used for the shared ring");
> > +
> > +#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE *
> (info)->nr_ring_pages)
> > +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE *
> XENBUS_MAX_RING_PAGES)
> > +/*
> > + * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
> > + * characters are enough. Define it as 20 to stay consistent with the backend.
> > + */
> > +#define RINGREF_NAME_LEN (20)
> >
> >  /*
> >   * We have one of these per vbd, whether ide, scsi or 'other'.  They
> > @@ -114,13 +128,14 @@ struct blkfront_info
> >  	int vdevice;
> >  	blkif_vdev_t handle;
> >  	enum blkif_state connected;
> > -	int ring_ref;
> > +	int ring_ref[XENBUS_MAX_RING_PAGES];
> > +	unsigned int nr_ring_pages;
> >  	struct blkif_front_ring ring;
> >  	unsigned int evtchn, irq;
> >  	struct request_queue *rq;
> >  	struct work_struct work;
> >  	struct gnttab_free_callback callback;
> > -	struct blk_shadow shadow[BLK_RING_SIZE];
> > +	struct blk_shadow shadow[BLK_MAX_RING_SIZE];
> >  	struct list_head grants;
> >  	struct list_head indirect_pages;
> >  	unsigned int persistent_gnts_c;
> > @@ -139,8 +154,6 @@ static unsigned int nr_minors;
> >  static unsigned long *minors;
> >  static DEFINE_SPINLOCK(minor_lock);
> >
> > -#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
> > -	(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
> >  #define GRANT_INVALID_REF	0
> >
> >  #define PARTS_PER_DISK		16
> > @@ -170,7 +183,7 @@ static int blkfront_setup_indirect(struct
> blkfront_info *info);
> >  static int get_id_from_freelist(struct blkfront_info *info)
> >  {
> >  	unsigned long free = info->shadow_free;
> > -	BUG_ON(free >= BLK_RING_SIZE);
> > +	BUG_ON(free >= BLK_RING_SIZE(info));
> >  	info->shadow_free = info->shadow[free].req.u.rw.id;
> >  	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
> >  	return free;
> > @@ -983,7 +996,7 @@ static void blkif_free(struct blkfront_info *info, int
> suspend)
> >  		}
> >  	}
> >
> > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> >  		/*
> >  		 * Clear persistent grants present in requests already
> >  		 * on the shared ring
> > @@ -1033,12 +1046,15 @@ free_shadow:
> >  	flush_work(&info->work);
> >
> >  	/* Free resources associated with old device channel. */
> > -	if (info->ring_ref != GRANT_INVALID_REF) {
> > -		gnttab_end_foreign_access(info->ring_ref, 0,
> > -					  (unsigned long)info->ring.sring);
> > -		info->ring_ref = GRANT_INVALID_REF;
> > -		info->ring.sring = NULL;
> > +	for (i = 0; i < info->nr_ring_pages; i++) {
> > +		if (info->ring_ref[i] != GRANT_INVALID_REF) {
> > +			gnttab_end_foreign_access(info->ring_ref[i], 0, 0);
> > +			info->ring_ref[i] = GRANT_INVALID_REF;
> > +		}
> >  	}
> > +	free_pages((unsigned long)info->ring.sring, get_order(info-
> >nr_ring_pages * PAGE_SIZE));
> > +	info->ring.sring = NULL;
> > +
> >  	if (info->irq)
> >  		unbind_from_irqhandler(info->irq, info);
> >  	info->evtchn = info->irq = 0;
> > @@ -1157,7 +1173,7 @@ static irqreturn_t blkif_interrupt(int irq, void
> *dev_id)
> >  		 * never have given to it (we stamp it up to BLK_RING_SIZE -
> >  		 * look in get_id_from_freelist.
> >  		 */
> > -		if (id >= BLK_RING_SIZE) {
> > +		if (id >= BLK_RING_SIZE(info)) {
> >  			WARN(1, "%s: response to %s has incorrect id
> (%ld)\n",
> >  			     info->gd->disk_name, op_name(bret->operation),
> id);
> >  			/* We can't safely get the 'struct request' as
> > @@ -1245,26 +1261,30 @@ static int setup_blkring(struct xenbus_device
> *dev,
> >  			 struct blkfront_info *info)
> >  {
> >  	struct blkif_sring *sring;
> > -	grant_ref_t gref;
> > -	int err;
> > +	int err, i;
> > +	unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE;
> > +	grant_ref_t gref[XENBUS_MAX_RING_PAGES];
> >
> > -	info->ring_ref = GRANT_INVALID_REF;
> > +	for (i = 0; i < info->nr_ring_pages; i++)
> > +		info->ring_ref[i] = GRANT_INVALID_REF;
> >
> > -	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO |
> __GFP_HIGH);
> > +	sring = (struct blkif_sring *)__get_free_pages(GFP_NOIO |
> __GFP_HIGH,
> > +						       get_order(ring_size));
> >  	if (!sring) {
> >  		xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring");
> >  		return -ENOMEM;
> >  	}
> >  	SHARED_RING_INIT(sring);
> > -	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
> > +	FRONT_RING_INIT(&info->ring, sring, ring_size);
> >
> > -	err = xenbus_grant_ring(dev, info->ring.sring, 1, &gref);
> > +	err = xenbus_grant_ring(dev, info->ring.sring, info->nr_ring_pages,
> gref);
> >  	if (err < 0) {
> > -		free_page((unsigned long)sring);
> > +		free_pages((unsigned long)sring, get_order(ring_size));
> >  		info->ring.sring = NULL;
> >  		goto fail;
> >  	}
> > -	info->ring_ref = gref;
> > +	for (i = 0; i < info->nr_ring_pages; i++)
> > +		info->ring_ref[i] = gref[i];
> >
> >  	err = xenbus_alloc_evtchn(dev, &info->evtchn);
> >  	if (err)
> > @@ -1292,7 +1312,18 @@ static int talk_to_blkback(struct xenbus_device
> *dev,
> >  {
> >  	const char *message = NULL;
> >  	struct xenbus_transaction xbt;
> > -	int err;
> > +	int err, i;
> > +	unsigned int max_page_order = 0;
> > +	unsigned int ring_page_order = 0;
> > +
> > +	err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
> > +			   "max-ring-page-order", "%u", &max_page_order);
> > +	if (err != 1)
> > +		info->nr_ring_pages = 1;
> > +	else {
> > +		ring_page_order = min(xen_blkif_max_ring_order,
> max_page_order);
> > +		info->nr_ring_pages = 1 << ring_page_order;
> > +	}
> >
> >  	/* Create shared ring, alloc event channel. */
> >  	err = setup_blkring(dev, info);
> > @@ -1306,11 +1337,32 @@ again:
> >  		goto destroy_blkring;
> >  	}
> >
> > -	err = xenbus_printf(xbt, dev->nodename,
> > -			    "ring-ref", "%u", info->ring_ref);
> > -	if (err) {
> > -		message = "writing ring-ref";
> > -		goto abort_transaction;
> > +	if (info->nr_ring_pages == 1) {
> > +		err = xenbus_printf(xbt, dev->nodename,
> > +				    "ring-ref", "%u", info->ring_ref[0]);
> > +		if (err) {
> > +			message = "writing ring-ref";
> > +			goto abort_transaction;
> > +		}
> > +	} else {
> > +		err = xenbus_printf(xbt, dev->nodename,
> > +				    "ring-page-order", "%u", ring_page_order);
> > +		if (err) {
> > +			message = "writing ring-page-order";
> > +			goto abort_transaction;
> > +		}
> > +
> > +		for (i = 0; i < info->nr_ring_pages; i++) {
> > +			char ring_ref_name[RINGREF_NAME_LEN];
> > +
> > +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-
> ref%u", i);
> > +			err = xenbus_printf(xbt, dev->nodename,
> ring_ref_name,
> > +					    "%u", info->ring_ref[i]);
> > +			if (err) {
> > +				message = "writing ring-ref";
> > +				goto abort_transaction;
> > +			}
> > +		}
> >  	}
> >  	err = xenbus_printf(xbt, dev->nodename,
> >  			    "event-channel", "%u", info->evtchn);
> > @@ -1338,6 +1390,9 @@ again:
> >  		goto destroy_blkring;
> >  	}
> >
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++)
> > +		info->shadow[i].req.u.rw.id = i+1;
> > +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
> >  	xenbus_switch_state(dev, XenbusStateInitialised);
> >
> >  	return 0;
> > @@ -1361,7 +1416,7 @@ again:
> >  static int blkfront_probe(struct xenbus_device *dev,
> >  			  const struct xenbus_device_id *id)
> >  {
> > -	int err, vdevice, i;
> > +	int err, vdevice;
> >  	struct blkfront_info *info;
> >
> >  	/* FIXME: Use dynamic device id if this is not set. */
> > @@ -1422,10 +1477,6 @@ static int blkfront_probe(struct xenbus_device
> *dev,
> >  	info->connected = BLKIF_STATE_DISCONNECTED;
> >  	INIT_WORK(&info->work, blkif_restart_queue);
> >
> > -	for (i = 0; i < BLK_RING_SIZE; i++)
> > -		info->shadow[i].req.u.rw.id = i+1;
> > -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> > -
> >  	/* Front end dir is a number, which is used as the id. */
> >  	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL,
> 0);
> >  	dev_set_drvdata(&dev->dev, info);
> > @@ -1469,10 +1520,10 @@ static int blkif_recover(struct blkfront_info
> *info)
> >
> >  	/* Stage 2: Set up free list. */
> >  	memset(&info->shadow, 0, sizeof(info->shadow));
> > -	for (i = 0; i < BLK_RING_SIZE; i++)
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++)
> >  		info->shadow[i].req.u.rw.id = i+1;
> >  	info->shadow_free = info->ring.req_prod_pvt;
> > -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> > +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
> >
> >  	rc = blkfront_setup_indirect(info);
> >  	if (rc) {
> > @@ -1484,7 +1535,7 @@ static int blkif_recover(struct blkfront_info *info)
> >  	blk_queue_max_segments(info->rq, segs);
> >  	bio_list_init(&bio_list);
> >  	INIT_LIST_HEAD(&requests);
> > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> >  		/* Not in use? */
> >  		if (!copy[i].request)
> >  			continue;
> > @@ -1690,7 +1741,7 @@ static int blkfront_setup_indirect(struct
> blkfront_info *info)
> >  		segs = info->max_indirect_segments;
> >  	}
> >
> > -	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) *
> BLK_RING_SIZE);
> > +	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) *
> BLK_RING_SIZE(info));
> >  	if (err)
> >  		goto out_of_memory;
> >
> > @@ -1700,7 +1751,7 @@ static int blkfront_setup_indirect(struct
> blkfront_info *info)
> >  		 * grants, we need to allocate a set of pages that can be
> >  		 * used for mapping indirect grefs
> >  		 */
> > -		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
> > +		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE(info);
> >
> >  		BUG_ON(!list_empty(&info->indirect_pages));
> >  		for (i = 0; i < num; i++) {
> > @@ -1711,7 +1762,7 @@ static int blkfront_setup_indirect(struct
> blkfront_info *info)
> >  		}
> >  	}
> >
> > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> >  		info->shadow[i].grants_used = kzalloc(
> >  			sizeof(info->shadow[i].grants_used[0]) * segs,
> >  			GFP_NOIO);
> > @@ -1733,7 +1784,7 @@ static int blkfront_setup_indirect(struct
> blkfront_info *info)
> >  	return 0;
> >
> >  out_of_memory:
> > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> >  		kfree(info->shadow[i].grants_used);
> >  		info->shadow[i].grants_used = NULL;
> >  		kfree(info->shadow[i].sg);
> > @@ -2089,6 +2140,12 @@ static int __init xlblk_init(void)
> >  	if (!xen_domain())
> >  		return -ENODEV;
> >
> > +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
> {
> > +		pr_info("Invalid max_ring_order (%d), will use default max:
> %d.\n",
> > +			xen_blkif_max_ring_order,
> XENBUS_MAX_RING_PAGE_ORDER);
> > +		xen_blkif_max_ring_order = 0;
> > +	}
> > +
> >  	if (!xen_has_pv_disk_devices())
> >  		return -ENODEV;
> >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]     ` <9AAE0902D5BC7E449B7C8E4E778ABCD0259410F3@AMSPEX01CL01.citrite.net>
@ 2015-06-09 13:39       ` Konrad Rzeszutek Wilk
       [not found]       ` <20150609133938.GA15200@x230>
  1 sibling, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-09 13:39 UTC (permalink / raw)
  To: Paul Durrant
  Cc: linux-kernel, xen-devel, Julien Grall, justing, David Vrabel,
	Roger Pau Monne

On Tue, Jun 09, 2015 at 08:52:53AM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Bob Liu [mailto:bob.liu@oracle.com]
> > Sent: 09 June 2015 09:50
> > To: Bob Liu
> > Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
> > konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
> > kernel@vger.kernel.org
> > Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
> > 
> > 
> > On 06/03/2015 01:40 PM, Bob Liu wrote:
> > > Extend xen/block to support a multi-page ring, so that more requests can be
> > > issued by using more than one page as the request ring between blkfront
> > > and the backend.
> > > As a result, performance can improve significantly.
> > >
> > > We got some impressive improvements on our high-end iscsi storage cluster
> > > backend. When using 64 pages as the ring, the IOPS increased about 15 times
> > > in the throughput test and more than doubled in the latency test.
> > >
> > > The reason is that the limit on outstanding requests is 32 when using only a
> > > one-page ring, but in our case the iscsi lun was spread across about 100
> > > physical drives, so 32 was really not enough to keep them busy.
> > >
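
(For reference, the arithmetic behind those numbers can be sketched as below,
assuming the usual 4KB pages, a 64-byte blkif shared-ring header and a
112-byte request/response slot; like __CONST_RING_SIZE(), the count is
rounded down to a power of two. The helper is illustrative only, not part of
the patch.)

	#include <stdio.h>

	/*
	 * Rough model of __CONST_RING_SIZE(blkif, nr_pages * PAGE_SIZE).
	 * 4096, 64 and 112 are assumed sizes: page, sring header, and
	 * request/response union, respectively.
	 */
	static unsigned int blkif_ring_entries(unsigned int nr_pages)
	{
		unsigned int n = (nr_pages * 4096 - 64) / 112;
		unsigned int pow2 = 1;

		while (pow2 * 2 <= n)	/* round down to a power of two */
			pow2 *= 2;
		return pow2;
	}

	int main(void)
	{
		printf("1 page   -> %u slots\n", blkif_ring_entries(1));  /* 32 */
		printf("64 pages -> %u slots\n", blkif_ring_entries(64)); /* 2048 */
		return 0;
	}

Under those assumptions a single page caps the ring at 32 in-flight requests,
while an order-6 (64-page) ring allows 2048, which is what makes the wide
iscsi backend described above worthwhile.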
> > > Changes in v2:
> > >  - Rebased to 4.0-rc6.
> > >  - Document on how multi-page ring feature working to linux io/blkif.h.
> > >
> > > Changes in v3:
> > >  - Remove changes to linux io/blkif.h and follow the protocol defined
> > >    in io/blkif.h of XEN tree.
> > >  - Rebased to 4.1-rc3
> > >
> > > Changes in v4:
> > >  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
> > >  - A few comments from Roger.
> > >
> > > Changes in v5:
> > >  - Clarify with 4k granularity to comment
> > >  - Address more comments from Roger
> > >
> > > Signed-off-by: Bob Liu <bob.liu@oracle.com>
> > 
> > Also tested the Windows PV driver, which works fine when the multi-page
> > ring feature was enabled in the Linux backend.
> > http://www.xenproject.org/downloads/windows-pv-drivers.html
> > 
> 
> Great! Thanks for verifying that :-)

Woot! Bob, could you please repost the blkif.h patch for the Xen tree
and mention the testing part in it? I think this
was the only big 'what if?!' question holding this up.


Roger, I put them (patches) on devel/for-jens-4.2 on

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git

I think these two patches:
drivers: xen-blkback: delay pending_req allocation to connect_ring
xen/block: add multi-page ring support

are the only ones that haven't been Acked by you (or maybe they
have and I missed the Ack?)


> 
>   Paul
> 
> > Regards,
> > -Bob
> > 
> > > ---
> > >  drivers/block/xen-blkback/blkback.c |   13 ++++
> > >  drivers/block/xen-blkback/common.h  |    2 +
> > >  drivers/block/xen-blkback/xenbus.c  |   89 +++++++++++++++++------
> > >  drivers/block/xen-blkfront.c        |  135 +++++++++++++++++++++++++----
> > ------
> > >  4 files changed, 180 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-
> > blkback/blkback.c
> > > index 713fc9f..2126842 100644
> > > --- a/drivers/block/xen-blkback/blkback.c
> > > +++ b/drivers/block/xen-blkback/blkback.c
> > > @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
> > >                   "Maximum number of grants to map persistently");
> > >
> > >  /*
> > > + * Maximum order of pages to be used for the shared ring between front
> > and
> > > + * backend, 4KB page granularity is used.
> > > + */
> > > +unsigned int xen_blkif_max_ring_order =
> > XENBUS_MAX_RING_PAGE_ORDER;
> > > +module_param_named(max_ring_page_order,
> > xen_blkif_max_ring_order, int, S_IRUGO);
> > > +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
> > to be used for the shared ring");
> > > +/*
> > >   * The LRU mechanism to clean the lists of persistent grants needs to
> > >   * be executed periodically. The time interval between consecutive
> > executions
> > >   * of the purge mechanism is set in ms.
> > > @@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
> > >  	if (!xen_domain())
> > >  		return -ENODEV;
> > >
> > > +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
> > {
> > > +		pr_info("Invalid max_ring_order (%d), will use default max:
> > %d.\n",
> > > +			xen_blkif_max_ring_order,
> > XENBUS_MAX_RING_PAGE_ORDER);
> > > +		xen_blkif_max_ring_order =
> > XENBUS_MAX_RING_PAGE_ORDER;
> > > +	}
> > > +
> > >  	rc = xen_blkif_interface_init();
> > >  	if (rc)
> > >  		goto failed_init;
> > > diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-
> > blkback/common.h
> > > index 043f13b..8ccc49d 100644
> > > --- a/drivers/block/xen-blkback/common.h
> > > +++ b/drivers/block/xen-blkback/common.h
> > > @@ -44,6 +44,7 @@
> > >  #include <xen/interface/io/blkif.h>
> > >  #include <xen/interface/io/protocols.h>
> > >
> > > +extern unsigned int xen_blkif_max_ring_order;
> > >  /*
> > >   * This is the maximum number of segments that would be allowed in
> > indirect
> > >   * requests. This value will also be passed to the frontend.
> > > @@ -320,6 +321,7 @@ struct xen_blkif {
> > >  	struct work_struct	free_work;
> > >  	/* Thread shutdown wait queue. */
> > >  	wait_queue_head_t	shutdown_wq;
> > > +	unsigned int nr_ring_pages;
> > >  };
> > >
> > >  struct seg_buf {
> > > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-
> > blkback/xenbus.c
> > > index c212d41..deb3f00 100644
> > > --- a/drivers/block/xen-blkback/xenbus.c
> > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > @@ -25,6 +25,7 @@
> > >
> > >  /* Enlarge the array size in order to fully show blkback name. */
> > >  #define BLKBACK_NAME_LEN (20)
> > > +#define RINGREF_NAME_LEN (20)
> > >
> > >  struct backend_info {
> > >  	struct xenbus_device	*dev;
> > > @@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t
> > domid)
> > >  	return blkif;
> > >  }
> > >
> > > -static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
> > > -			 unsigned int evtchn)
> > > +static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
> > > +			 unsigned int nr_grefs, unsigned int evtchn)
> > >  {
> > >  	int err;
> > >
> > > @@ -165,7 +166,7 @@ static int xen_blkif_map(struct xen_blkif *blkif,
> > grant_ref_t gref,
> > >  	if (blkif->irq)
> > >  		return 0;
> > >
> > > -	err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
> > > +	err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
> > >  				     &blkif->blk_ring);
> > >  	if (err < 0)
> > >  		return err;
> > > @@ -175,21 +176,21 @@ static int xen_blkif_map(struct xen_blkif *blkif,
> > grant_ref_t gref,
> > >  	{
> > >  		struct blkif_sring *sring;
> > >  		sring = (struct blkif_sring *)blkif->blk_ring;
> > > -		BACK_RING_INIT(&blkif->blk_rings.native, sring,
> > PAGE_SIZE);
> > > +		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE
> > * nr_grefs);
> > >  		break;
> > >  	}
> > >  	case BLKIF_PROTOCOL_X86_32:
> > >  	{
> > >  		struct blkif_x86_32_sring *sring_x86_32;
> > >  		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
> > > -		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
> > PAGE_SIZE);
> > > +		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
> > PAGE_SIZE * nr_grefs);
> > >  		break;
> > >  	}
> > >  	case BLKIF_PROTOCOL_X86_64:
> > >  	{
> > >  		struct blkif_x86_64_sring *sring_x86_64;
> > >  		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
> > > -		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
> > PAGE_SIZE);
> > > +		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
> > PAGE_SIZE * nr_grefs);
> > >  		break;
> > >  	}
> > >  	default:
> > > @@ -270,7 +271,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
> > >  		i++;
> > >  	}
> > >
> > > -	WARN_ON(i != XEN_BLKIF_REQS_PER_PAGE);
> > > +	WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif-
> > >nr_ring_pages));
> > >
> > >  	kmem_cache_free(xen_blkif_cachep, blkif);
> > >  }
> > > @@ -555,6 +556,11 @@ static int xen_blkbk_probe(struct xenbus_device
> > *dev,
> > >  	if (err)
> > >  		goto fail;
> > >
> > > +	err = xenbus_printf(XBT_NIL, dev->nodename, "max-ring-page-
> > order", "%u",
> > > +			    xen_blkif_max_ring_order);
> > > +	if (err)
> > > +		pr_warn("%s write out 'max-ring-page-order' failed\n",
> > __func__);
> > > +
> > >  	err = xenbus_switch_state(dev, XenbusStateInitWait);
> > >  	if (err)
> > >  		goto fail;
> > > @@ -818,8 +824,8 @@ again:
> > >  static int connect_ring(struct backend_info *be)
> > >  {
> > >  	struct xenbus_device *dev = be->dev;
> > > -	unsigned long ring_ref;
> > > -	unsigned int evtchn;
> > > +	unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
> > > +	unsigned int evtchn, nr_grefs, ring_page_order;
> > >  	unsigned int pers_grants;
> > >  	char protocol[64] = "";
> > >  	struct pending_req *req, *n;
> > > @@ -827,14 +833,57 @@ static int connect_ring(struct backend_info *be)
> > >
> > >  	pr_debug("%s %s\n", __func__, dev->otherend);
> > >
> > > -	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
> > > -			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
> > > -	if (err) {
> > > -		xenbus_dev_fatal(dev, err,
> > > -				 "reading %s/ring-ref and event-channel",
> > > +	err = xenbus_scanf(XBT_NIL, dev->otherend, "event-channel",
> > "%u",
> > > +			  &evtchn);
> > > +	if (err != 1) {
> > > +		err = -EINVAL;
> > > +		xenbus_dev_fatal(dev, err, "reading %s/event-channel",
> > >  				 dev->otherend);
> > >  		return err;
> > >  	}
> > > +	pr_info("event-channel %u\n", evtchn);
> > > +
> > > +	err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order",
> > "%u",
> > > +			  &ring_page_order);
> > > +	if (err != 1) {
> > > +		err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref",
> > > +				  "%u", &ring_ref[0]);
> > > +		if (err != 1) {
> > > +			err = -EINVAL;
> > > +			xenbus_dev_fatal(dev, err, "reading %s/ring-ref",
> > > +					 dev->otherend);
> > > +			return err;
> > > +		}
> > > +		nr_grefs = 1;
> > > +		pr_info("%s:using single page: ring-ref %d\n", dev-
> > >otherend,
> > > +			ring_ref[0]);
> > > +	} else {
> > > +		unsigned int i;
> > > +
> > > +		if (ring_page_order > xen_blkif_max_ring_order) {
> > > +			err = -EINVAL;
> > > +			xenbus_dev_fatal(dev, err, "%s/request %d ring
> > page order exceed max:%d",
> > > +					 dev->otherend, ring_page_order,
> > > +					 xen_blkif_max_ring_order);
> > > +			return err;
> > > +		}
> > > +
> > > +		nr_grefs = 1 << ring_page_order;
> > > +		for (i = 0; i < nr_grefs; i++) {
> > > +			char ring_ref_name[RINGREF_NAME_LEN];
> > > +
> > > +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-
> > ref%u", i);
> > > +			err = xenbus_scanf(XBT_NIL, dev->otherend,
> > ring_ref_name,
> > > +					   "%u", &ring_ref[i]);
> > > +			if (err != 1) {
> > > +				err = -EINVAL;
> > > +				xenbus_dev_fatal(dev, err, "reading %s/%s",
> > > +						 dev->otherend,
> > ring_ref_name);
> > > +				return err;
> > > +			}
> > > +			pr_info("ring-ref%u: %u\n", i, ring_ref[i]);
> > > +		}
> > > +	}
> > >
> > >  	be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
> > >  	err = xenbus_gather(XBT_NIL, dev->otherend, "protocol",
> > > @@ -859,12 +908,13 @@ static int connect_ring(struct backend_info *be)
> > >
> > >  	be->blkif->vbd.feature_gnt_persistent = pers_grants;
> > >  	be->blkif->vbd.overflow_max_grants = 0;
> > > +	be->blkif->nr_ring_pages = nr_grefs;
> > >
> > > -	pr_info("ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
> > > -		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> > > +	pr_info("ring-pages:%d, event-channel %d, protocol %d (%s) %s\n",
> > > +		nr_grefs, evtchn, be->blkif->blk_protocol, protocol,
> > >  		pers_grants ? "persistent grants" : "");
> > >
> > > -	for (i = 0; i < XEN_BLKIF_REQS_PER_PAGE; i++) {
> > > +	for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
> > >  		req = kzalloc(sizeof(*req), GFP_KERNEL);
> > >  		if (!req)
> > >  			goto fail;
> > > @@ -883,10 +933,9 @@ static int connect_ring(struct backend_info *be)
> > >  	}
> > >
> > >  	/* Map the shared frame, irq etc. */
> > > -	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> > > +	err = xen_blkif_map(be->blkif, ring_ref, nr_grefs, evtchn);
> > >  	if (err) {
> > > -		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
> > > -				 ring_ref, evtchn);
> > > +		xenbus_dev_fatal(dev, err, "mapping ring-ref port %u",
> > evtchn);
> > >  		return err;
> > >  	}
> > >
> > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > > index 88e23fd..d3c1a95 100644
> > > --- a/drivers/block/xen-blkfront.c
> > > +++ b/drivers/block/xen-blkfront.c
> > > @@ -98,7 +98,21 @@ static unsigned int xen_blkif_max_segments = 32;
> > >  module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
> > >  MODULE_PARM_DESC(max, "Maximum amount of segments in indirect
> > requests (default is 32)");
> > >
> > > -#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
> > > +/*
> > > + * Maximum order of pages to be used for the shared ring between front
> > and
> > > + * backend, 4KB page granularity is used.
> > > + */
> > > +static unsigned int xen_blkif_max_ring_order;
> > > +module_param_named(max_ring_page_order,
> > xen_blkif_max_ring_order, int, S_IRUGO);
> > > +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
> > to be used for the shared ring");
> > > +
> > > +#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE *
> > (info)->nr_ring_pages)
> > > +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE *
> > XENBUS_MAX_RING_PAGES)
> > > +/*
> > > + * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
> > > + * characters are enough. Define to 20 to keep consist with backend.
> > > + */
> > > +#define RINGREF_NAME_LEN (20)
> > >
> > >  /*
> > >   * We have one of these per vbd, whether ide, scsi or 'other'.  They
> > > @@ -114,13 +128,14 @@ struct blkfront_info
> > >  	int vdevice;
> > >  	blkif_vdev_t handle;
> > >  	enum blkif_state connected;
> > > -	int ring_ref;
> > > +	int ring_ref[XENBUS_MAX_RING_PAGES];
> > > +	unsigned int nr_ring_pages;
> > >  	struct blkif_front_ring ring;
> > >  	unsigned int evtchn, irq;
> > >  	struct request_queue *rq;
> > >  	struct work_struct work;
> > >  	struct gnttab_free_callback callback;
> > > -	struct blk_shadow shadow[BLK_RING_SIZE];
> > > +	struct blk_shadow shadow[BLK_MAX_RING_SIZE];
> > >  	struct list_head grants;
> > >  	struct list_head indirect_pages;
> > >  	unsigned int persistent_gnts_c;
> > > @@ -139,8 +154,6 @@ static unsigned int nr_minors;
> > >  static unsigned long *minors;
> > >  static DEFINE_SPINLOCK(minor_lock);
> > >
> > > -#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
> > > -	(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
> > >  #define GRANT_INVALID_REF	0
> > >
> > >  #define PARTS_PER_DISK		16
> > > @@ -170,7 +183,7 @@ static int blkfront_setup_indirect(struct
> > blkfront_info *info);
> > >  static int get_id_from_freelist(struct blkfront_info *info)
> > >  {
> > >  	unsigned long free = info->shadow_free;
> > > -	BUG_ON(free >= BLK_RING_SIZE);
> > > +	BUG_ON(free >= BLK_RING_SIZE(info));
> > >  	info->shadow_free = info->shadow[free].req.u.rw.id;
> > >  	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
> > >  	return free;
> > > @@ -983,7 +996,7 @@ static void blkif_free(struct blkfront_info *info, int
> > suspend)
> > >  		}
> > >  	}
> > >
> > > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> > >  		/*
> > >  		 * Clear persistent grants present in requests already
> > >  		 * on the shared ring
> > > @@ -1033,12 +1046,15 @@ free_shadow:
> > >  	flush_work(&info->work);
> > >
> > >  	/* Free resources associated with old device channel. */
> > > -	if (info->ring_ref != GRANT_INVALID_REF) {
> > > -		gnttab_end_foreign_access(info->ring_ref, 0,
> > > -					  (unsigned long)info->ring.sring);
> > > -		info->ring_ref = GRANT_INVALID_REF;
> > > -		info->ring.sring = NULL;
> > > +	for (i = 0; i < info->nr_ring_pages; i++) {
> > > +		if (info->ring_ref[i] != GRANT_INVALID_REF) {
> > > +			gnttab_end_foreign_access(info->ring_ref[i], 0, 0);
> > > +			info->ring_ref[i] = GRANT_INVALID_REF;
> > > +		}
> > >  	}
> > > +	free_pages((unsigned long)info->ring.sring, get_order(info-
> > >nr_ring_pages * PAGE_SIZE));
> > > +	info->ring.sring = NULL;
> > > +
> > >  	if (info->irq)
> > >  		unbind_from_irqhandler(info->irq, info);
> > >  	info->evtchn = info->irq = 0;
> > > @@ -1157,7 +1173,7 @@ static irqreturn_t blkif_interrupt(int irq, void
> > *dev_id)
> > >  		 * never have given to it (we stamp it up to BLK_RING_SIZE -
> > >  		 * look in get_id_from_freelist.
> > >  		 */
> > > -		if (id >= BLK_RING_SIZE) {
> > > +		if (id >= BLK_RING_SIZE(info)) {
> > >  			WARN(1, "%s: response to %s has incorrect id
> > (%ld)\n",
> > >  			     info->gd->disk_name, op_name(bret->operation),
> > id);
> > >  			/* We can't safely get the 'struct request' as
> > > @@ -1245,26 +1261,30 @@ static int setup_blkring(struct xenbus_device
> > *dev,
> > >  			 struct blkfront_info *info)
> > >  {
> > >  	struct blkif_sring *sring;
> > > -	grant_ref_t gref;
> > > -	int err;
> > > +	int err, i;
> > > +	unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE;
> > > +	grant_ref_t gref[XENBUS_MAX_RING_PAGES];
> > >
> > > -	info->ring_ref = GRANT_INVALID_REF;
> > > +	for (i = 0; i < info->nr_ring_pages; i++)
> > > +		info->ring_ref[i] = GRANT_INVALID_REF;
> > >
> > > -	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO |
> > __GFP_HIGH);
> > > +	sring = (struct blkif_sring *)__get_free_pages(GFP_NOIO |
> > __GFP_HIGH,
> > > +						       get_order(ring_size));
> > >  	if (!sring) {
> > >  		xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring");
> > >  		return -ENOMEM;
> > >  	}
> > >  	SHARED_RING_INIT(sring);
> > > -	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
> > > +	FRONT_RING_INIT(&info->ring, sring, ring_size);
> > >
> > > -	err = xenbus_grant_ring(dev, info->ring.sring, 1, &gref);
> > > +	err = xenbus_grant_ring(dev, info->ring.sring, info->nr_ring_pages,
> > gref);
> > >  	if (err < 0) {
> > > -		free_page((unsigned long)sring);
> > > +		free_pages((unsigned long)sring, get_order(ring_size));
> > >  		info->ring.sring = NULL;
> > >  		goto fail;
> > >  	}
> > > -	info->ring_ref = gref;
> > > +	for (i = 0; i < info->nr_ring_pages; i++)
> > > +		info->ring_ref[i] = gref[i];
> > >
> > >  	err = xenbus_alloc_evtchn(dev, &info->evtchn);
> > >  	if (err)
> > > @@ -1292,7 +1312,18 @@ static int talk_to_blkback(struct xenbus_device
> > *dev,
> > >  {
> > >  	const char *message = NULL;
> > >  	struct xenbus_transaction xbt;
> > > -	int err;
> > > +	int err, i;
> > > +	unsigned int max_page_order = 0;
> > > +	unsigned int ring_page_order = 0;
> > > +
> > > +	err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
> > > +			   "max-ring-page-order", "%u", &max_page_order);
> > > +	if (err != 1)
> > > +		info->nr_ring_pages = 1;
> > > +	else {
> > > +		ring_page_order = min(xen_blkif_max_ring_order,
> > max_page_order);
> > > +		info->nr_ring_pages = 1 << ring_page_order;
> > > +	}
> > >
> > >  	/* Create shared ring, alloc event channel. */
> > >  	err = setup_blkring(dev, info);
> > > @@ -1306,11 +1337,32 @@ again:
> > >  		goto destroy_blkring;
> > >  	}
> > >
> > > -	err = xenbus_printf(xbt, dev->nodename,
> > > -			    "ring-ref", "%u", info->ring_ref);
> > > -	if (err) {
> > > -		message = "writing ring-ref";
> > > -		goto abort_transaction;
> > > +	if (info->nr_ring_pages == 1) {
> > > +		err = xenbus_printf(xbt, dev->nodename,
> > > +				    "ring-ref", "%u", info->ring_ref[0]);
> > > +		if (err) {
> > > +			message = "writing ring-ref";
> > > +			goto abort_transaction;
> > > +		}
> > > +	} else {
> > > +		err = xenbus_printf(xbt, dev->nodename,
> > > +				    "ring-page-order", "%u", ring_page_order);
> > > +		if (err) {
> > > +			message = "writing ring-page-order";
> > > +			goto abort_transaction;
> > > +		}
> > > +
> > > +		for (i = 0; i < info->nr_ring_pages; i++) {
> > > +			char ring_ref_name[RINGREF_NAME_LEN];
> > > +
> > > +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-
> > ref%u", i);
> > > +			err = xenbus_printf(xbt, dev->nodename,
> > ring_ref_name,
> > > +					    "%u", info->ring_ref[i]);
> > > +			if (err) {
> > > +				message = "writing ring-ref";
> > > +				goto abort_transaction;
> > > +			}
> > > +		}
> > >  	}
> > >  	err = xenbus_printf(xbt, dev->nodename,
> > >  			    "event-channel", "%u", info->evtchn);
> > > @@ -1338,6 +1390,9 @@ again:
> > >  		goto destroy_blkring;
> > >  	}
> > >
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++)
> > > +		info->shadow[i].req.u.rw.id = i+1;
> > > +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
> > >  	xenbus_switch_state(dev, XenbusStateInitialised);
> > >
> > >  	return 0;
> > > @@ -1361,7 +1416,7 @@ again:
> > >  static int blkfront_probe(struct xenbus_device *dev,
> > >  			  const struct xenbus_device_id *id)
> > >  {
> > > -	int err, vdevice, i;
> > > +	int err, vdevice;
> > >  	struct blkfront_info *info;
> > >
> > >  	/* FIXME: Use dynamic device id if this is not set. */
> > > @@ -1422,10 +1477,6 @@ static int blkfront_probe(struct xenbus_device
> > *dev,
> > >  	info->connected = BLKIF_STATE_DISCONNECTED;
> > >  	INIT_WORK(&info->work, blkif_restart_queue);
> > >
> > > -	for (i = 0; i < BLK_RING_SIZE; i++)
> > > -		info->shadow[i].req.u.rw.id = i+1;
> > > -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> > > -
> > >  	/* Front end dir is a number, which is used as the id. */
> > >  	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL,
> > 0);
> > >  	dev_set_drvdata(&dev->dev, info);
> > > @@ -1469,10 +1520,10 @@ static int blkif_recover(struct blkfront_info
> > *info)
> > >
> > >  	/* Stage 2: Set up free list. */
> > >  	memset(&info->shadow, 0, sizeof(info->shadow));
> > > -	for (i = 0; i < BLK_RING_SIZE; i++)
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++)
> > >  		info->shadow[i].req.u.rw.id = i+1;
> > >  	info->shadow_free = info->ring.req_prod_pvt;
> > > -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
> > > +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
> > >
> > >  	rc = blkfront_setup_indirect(info);
> > >  	if (rc) {
> > > @@ -1484,7 +1535,7 @@ static int blkif_recover(struct blkfront_info *info)
> > >  	blk_queue_max_segments(info->rq, segs);
> > >  	bio_list_init(&bio_list);
> > >  	INIT_LIST_HEAD(&requests);
> > > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> > >  		/* Not in use? */
> > >  		if (!copy[i].request)
> > >  			continue;
> > > @@ -1690,7 +1741,7 @@ static int blkfront_setup_indirect(struct
> > blkfront_info *info)
> > >  		segs = info->max_indirect_segments;
> > >  	}
> > >
> > > -	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) *
> > BLK_RING_SIZE);
> > > +	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) *
> > BLK_RING_SIZE(info));
> > >  	if (err)
> > >  		goto out_of_memory;
> > >
> > > @@ -1700,7 +1751,7 @@ static int blkfront_setup_indirect(struct
> > blkfront_info *info)
> > >  		 * grants, we need to allocate a set of pages that can be
> > >  		 * used for mapping indirect grefs
> > >  		 */
> > > -		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
> > > +		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE(info);
> > >
> > >  		BUG_ON(!list_empty(&info->indirect_pages));
> > >  		for (i = 0; i < num; i++) {
> > > @@ -1711,7 +1762,7 @@ static int blkfront_setup_indirect(struct
> > blkfront_info *info)
> > >  		}
> > >  	}
> > >
> > > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> > >  		info->shadow[i].grants_used = kzalloc(
> > >  			sizeof(info->shadow[i].grants_used[0]) * segs,
> > >  			GFP_NOIO);
> > > @@ -1733,7 +1784,7 @@ static int blkfront_setup_indirect(struct
> > blkfront_info *info)
> > >  	return 0;
> > >
> > >  out_of_memory:
> > > -	for (i = 0; i < BLK_RING_SIZE; i++) {
> > > +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
> > >  		kfree(info->shadow[i].grants_used);
> > >  		info->shadow[i].grants_used = NULL;
> > >  		kfree(info->shadow[i].sg);
> > > @@ -2089,6 +2140,12 @@ static int __init xlblk_init(void)
> > >  	if (!xen_domain())
> > >  		return -ENODEV;
> > >
> > > +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
> > {
> > > +		pr_info("Invalid max_ring_order (%d), will use default max:
> > %d.\n",
> > > +			xen_blkif_max_ring_order,
> > XENBUS_MAX_RING_PAGE_ORDER);
> > > +		xen_blkif_max_ring_order = 0;
> > > +	}
> > > +
> > >  	if (!xen_has_pv_disk_devices())
> > >  		return -ENODEV;
> > >
> > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]       ` <20150609133938.GA15200@x230>
@ 2015-06-09 13:48         ` Bob Liu
  2015-06-09 14:07         ` Roger Pau Monné
       [not found]         ` <5576F325.5050304@citrix.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Bob Liu @ 2015-06-09 13:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, xen-devel, Julien Grall, justing, Paul Durrant,
	David Vrabel, Roger Pau Monne


On 06/09/2015 09:39 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 09, 2015 at 08:52:53AM +0000, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Bob Liu [mailto:bob.liu@oracle.com]
>>> Sent: 09 June 2015 09:50
>>> To: Bob Liu
>>> Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
>>> konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
>>> kernel@vger.kernel.org
>>> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
>>>
>>>
>>> On 06/03/2015 01:40 PM, Bob Liu wrote:
>>>> Extend xen/block to support a multi-page ring, so that more requests can be
>>>> issued by using more than one page as the request ring between blkfront
>>>> and the backend.
>>>> As a result, performance can improve significantly.
>>>>
>>>> We got some impressive improvements on our high-end iscsi storage cluster
>>>> backend. When using 64 pages as the ring, the IOPS increased about 15 times
>>>> in the throughput test and more than doubled in the latency test.
>>>>
>>>> The reason is that the limit on outstanding requests is 32 when using only a
>>>> one-page ring, but in our case the iscsi lun was spread across about 100
>>>> physical drives, so 32 was really not enough to keep them busy.
>>>>
>>>> Changes in v2:
>>>>  - Rebased to 4.0-rc6.
>>>>  - Document on how multi-page ring feature working to linux io/blkif.h.
>>>>
>>>> Changes in v3:
>>>>  - Remove changes to linux io/blkif.h and follow the protocol defined
>>>>    in io/blkif.h of XEN tree.
>>>>  - Rebased to 4.1-rc3
>>>>
>>>> Changes in v4:
>>>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
>>>>  - A few comments from Roger.
>>>>
>>>> Changes in v5:
>>>>  - Clarify with 4k granularity to comment
>>>>  - Address more comments from Roger
>>>>
>>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>>
>>> Also tested the Windows PV driver, which works fine when the multi-page
>>> ring feature was enabled in the Linux backend.
>>> http://www.xenproject.org/downloads/windows-pv-drivers.html
>>>
>>
>> Great! Thanks for verifying that :-)
> 
> Woot! Bob, could you please repost the blkif.h patch for the Xen tree
> and mention the testing part in it? I think this
> was the only big 'what if?!' question holding this up.
> 

There are no more changes to blkif.h in the Xen tree; I followed exactly
the protocol already defined there, which is why the Windows PV driver also works well.
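
For reference, the xenstore handshake involved (as implemented by the quoted
hunks; paths abbreviated, values shown for an order-2 ring) is roughly:

	backend:  .../max-ring-page-order = "2"    (written in xen_blkbk_probe)
	frontend: .../ring-page-order     = "2"    (written in talk_to_blkback)
	          .../ring-ref0 ... ring-ref3     (one grant reference per page)
	          .../event-channel       = "<port>"

A frontend that does not understand ring-page-order simply writes the old
single ring-ref key, and connect_ring() falls back to reading that, so
existing frontends keep working unchanged.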

> 
> Roger, I put them (patches) on devel/for-jens-4.2 on
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> 
> I think these two patches:
> drivers: xen-blkback: delay pending_req allocation to connect_ring
> xen/block: add multi-page ring support
> 
> are the only ones that haven't been Acked by you (or maybe they
> have and I missed the Ack?)
> 

Thank you!
-Bob

> 
>>
>>   Paul
>>
>>> Regards,
>>> -Bob
>>>
>>>> ---
>>>>  drivers/block/xen-blkback/blkback.c |   13 ++++
>>>>  drivers/block/xen-blkback/common.h  |    2 +
>>>>  drivers/block/xen-blkback/xenbus.c  |   89 +++++++++++++++++------
>>>>  drivers/block/xen-blkfront.c        |  135 +++++++++++++++++++++++++----
>>> ------
>>>>  4 files changed, 180 insertions(+), 59 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-
>>> blkback/blkback.c
>>>> index 713fc9f..2126842 100644
>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>> @@ -84,6 +84,13 @@ MODULE_PARM_DESC(max_persistent_grants,
>>>>                   "Maximum number of grants to map persistently");
>>>>
>>>>  /*
>>>> + * Maximum order of pages to be used for the shared ring between front
>>> and
>>>> + * backend, 4KB page granularity is used.
>>>> + */
>>>> +unsigned int xen_blkif_max_ring_order =
>>> XENBUS_MAX_RING_PAGE_ORDER;
>>>> +module_param_named(max_ring_page_order,
>>> xen_blkif_max_ring_order, int, S_IRUGO);
>>>> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
>>> to be used for the shared ring");
>>>> +/*
>>>>   * The LRU mechanism to clean the lists of persistent grants needs to
>>>>   * be executed periodically. The time interval between consecutive
>>> executions
>>>>   * of the purge mechanism is set in ms.
>>>> @@ -1438,6 +1445,12 @@ static int __init xen_blkif_init(void)
>>>>  	if (!xen_domain())
>>>>  		return -ENODEV;
>>>>
>>>> +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER)
>>> {
>>>> +		pr_info("Invalid max_ring_order (%d), will use default max:
>>> %d.\n",
>>>> +			xen_blkif_max_ring_order,
>>> XENBUS_MAX_RING_PAGE_ORDER);
>>>> +		xen_blkif_max_ring_order =
>>> XENBUS_MAX_RING_PAGE_ORDER;
>>>> +	}
>>>> +
>>>>  	rc = xen_blkif_interface_init();
>>>>  	if (rc)
>>>>  		goto failed_init;
>>>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-
>>> blkback/common.h
>>>> index 043f13b..8ccc49d 100644
>>>> --- a/drivers/block/xen-blkback/common.h
>>>> +++ b/drivers/block/xen-blkback/common.h
>>>> @@ -44,6 +44,7 @@
>>>>  #include <xen/interface/io/blkif.h>
>>>>  #include <xen/interface/io/protocols.h>
>>>>
>>>> +extern unsigned int xen_blkif_max_ring_order;
>>>>  /*
>>>>   * This is the maximum number of segments that would be allowed in
>>> indirect
>>>>   * requests. This value will also be passed to the frontend.
>>>> @@ -320,6 +321,7 @@ struct xen_blkif {
>>>>  	struct work_struct	free_work;
>>>>  	/* Thread shutdown wait queue. */
>>>>  	wait_queue_head_t	shutdown_wq;
>>>> +	unsigned int nr_ring_pages;
>>>>  };
>>>>
>>>>  struct seg_buf {
>>>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-
>>> blkback/xenbus.c
>>>> index c212d41..deb3f00 100644
>>>> --- a/drivers/block/xen-blkback/xenbus.c
>>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>>> @@ -25,6 +25,7 @@
>>>>
>>>>  /* Enlarge the array size in order to fully show blkback name. */
>>>>  #define BLKBACK_NAME_LEN (20)
>>>> +#define RINGREF_NAME_LEN (20)
>>>>
>>>>  struct backend_info {
>>>>  	struct xenbus_device	*dev;
>>>> @@ -156,8 +157,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t
>>> domid)
>>>>  	return blkif;
>>>>  }
>>>>
>>>> -static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref,
>>>> -			 unsigned int evtchn)
>>>> +static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
>>>> +			 unsigned int nr_grefs, unsigned int evtchn)
>>>>  {
>>>>  	int err;
>>>>
>>>> @@ -165,7 +166,7 @@ static int xen_blkif_map(struct xen_blkif *blkif,
>>> grant_ref_t gref,
>>>>  	if (blkif->irq)
>>>>  		return 0;
>>>>
>>>> -	err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1,
>>>> +	err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs,
>>>>  				     &blkif->blk_ring);
>>>>  	if (err < 0)
>>>>  		return err;
>>>> @@ -175,21 +176,21 @@ static int xen_blkif_map(struct xen_blkif *blkif,
>>> grant_ref_t gref,
>>>>  	{
>>>>  		struct blkif_sring *sring;
>>>>  		sring = (struct blkif_sring *)blkif->blk_ring;
>>>> -		BACK_RING_INIT(&blkif->blk_rings.native, sring,
>>> PAGE_SIZE);
>>>> +		BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE
>>> * nr_grefs);
>>>>  		break;
>>>>  	}
>>>>  	case BLKIF_PROTOCOL_X86_32:
>>>>  	{
>>>>  		struct blkif_x86_32_sring *sring_x86_32;
>>>>  		sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
>>>> -		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
>>> PAGE_SIZE);
>>>> +		BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
>>> PAGE_SIZE * nr_grefs);
>>>>  		break;
>>>>  	}
>>>>  	case BLKIF_PROTOCOL_X86_64:
>>>>  	{
>>>>  		struct blkif_x86_64_sring *sring_x86_64;
>>>>  		sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
>>>> -		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
>>> PAGE_SIZE);
>>>> +		BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
>>> PAGE_SIZE * nr_grefs);
>>>>  		break;
>>>>  	}
>>>>  	default:
>>>> @@ -270,7 +271,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
>>>>  		i++;
>>>>  	}
>>>>
>>>> -	WARN_ON(i != XEN_BLKIF_REQS_PER_PAGE);
>>>> +	WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif-
>>>> nr_ring_pages));
>>>>
>>>>  	kmem_cache_free(xen_blkif_cachep, blkif);
>>>>  }
>>>> @@ -555,6 +556,11 @@ static int xen_blkbk_probe(struct xenbus_device
>>> *dev,
>>>>  	if (err)
>>>>  		goto fail;
>>>>
>>>> +	err = xenbus_printf(XBT_NIL, dev->nodename, "max-ring-page-
>>> order", "%u",
>>>> +			    xen_blkif_max_ring_order);
>>>> +	if (err)
>>>> +		pr_warn("%s write out 'max-ring-page-order' failed\n",
>>> __func__);
>>>> +
>>>>  	err = xenbus_switch_state(dev, XenbusStateInitWait);
>>>>  	if (err)
>>>>  		goto fail;
>>>> @@ -818,8 +824,8 @@ again:
>>>>  static int connect_ring(struct backend_info *be)
>>>>  {
>>>>  	struct xenbus_device *dev = be->dev;
>>>> -	unsigned long ring_ref;
>>>> -	unsigned int evtchn;
>>>> +	unsigned int ring_ref[XENBUS_MAX_RING_PAGES];
>>>> +	unsigned int evtchn, nr_grefs, ring_page_order;
>>>>  	unsigned int pers_grants;
>>>>  	char protocol[64] = "";
>>>>  	struct pending_req *req, *n;
>>>> @@ -827,14 +833,57 @@ static int connect_ring(struct backend_info *be)
>>>>
>>>>  	pr_debug("%s %s\n", __func__, dev->otherend);
>>>>
>>>> -	err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
>>>> -			    &ring_ref, "event-channel", "%u", &evtchn, NULL);
>>>> -	if (err) {
>>>> -		xenbus_dev_fatal(dev, err,
>>>> -				 "reading %s/ring-ref and event-channel",
>>>> +	err = xenbus_scanf(XBT_NIL, dev->otherend, "event-channel",
>>> "%u",
>>>> +			  &evtchn);
>>>> +	if (err != 1) {
>>>> +		err = -EINVAL;
>>>> +		xenbus_dev_fatal(dev, err, "reading %s/event-channel",
>>>>  				 dev->otherend);
>>>>  		return err;
>>>>  	}
>>>> +	pr_info("event-channel %u\n", evtchn);
>>>> +
>>>> +	err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order",
>>> "%u",
>>>> +			  &ring_page_order);
>>>> +	if (err != 1) {
>>>> +		err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref",
>>>> +				  "%u", &ring_ref[0]);
>>>> +		if (err != 1) {
>>>> +			err = -EINVAL;
>>>> +			xenbus_dev_fatal(dev, err, "reading %s/ring-ref",
>>>> +					 dev->otherend);
>>>> +			return err;
>>>> +		}
>>>> +		nr_grefs = 1;
>>>> +		pr_info("%s:using single page: ring-ref %d\n", dev-
>>>> otherend,
>>>> +			ring_ref[0]);
>>>> +	} else {
>>>> +		unsigned int i;
>>>> +
>>>> +		if (ring_page_order > xen_blkif_max_ring_order) {
>>>> +			err = -EINVAL;
>>>> +			xenbus_dev_fatal(dev, err, "%s/request %d ring
>>> page order exceed max:%d",
>>>> +					 dev->otherend, ring_page_order,
>>>> +					 xen_blkif_max_ring_order);
>>>> +			return err;
>>>> +		}
>>>> +
>>>> +		nr_grefs = 1 << ring_page_order;
>>>> +		for (i = 0; i < nr_grefs; i++) {
>>>> +			char ring_ref_name[RINGREF_NAME_LEN];
>>>> +
>>>> +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-
>>> ref%u", i);
>>>> +			err = xenbus_scanf(XBT_NIL, dev->otherend,
>>> ring_ref_name,
>>>> +					   "%u", &ring_ref[i]);
>>>> +			if (err != 1) {
>>>> +				err = -EINVAL;
>>>> +				xenbus_dev_fatal(dev, err, "reading %s/%s",
>>>> +						 dev->otherend,
>>> ring_ref_name);
>>>> +				return err;
>>>> +			}
>>>> +			pr_info("ring-ref%u: %u\n", i, ring_ref[i]);
>>>> +		}
>>>> +	}
>>>>
>>>>  	be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
>>>>  	err = xenbus_gather(XBT_NIL, dev->otherend, "protocol",
>>>> @@ -859,12 +908,13 @@ static int connect_ring(struct backend_info *be)
>>>>
>>>>  	be->blkif->vbd.feature_gnt_persistent = pers_grants;
>>>>  	be->blkif->vbd.overflow_max_grants = 0;
>>>> +	be->blkif->nr_ring_pages = nr_grefs;
>>>>
>>>> -	pr_info("ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
>>>> -		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>>>> +	pr_info("ring-pages:%d, event-channel %d, protocol %d (%s) %s\n",
>>>> +		nr_grefs, evtchn, be->blkif->blk_protocol, protocol,
>>>>  		pers_grants ? "persistent grants" : "");
>>>>
>>>> -	for (i = 0; i < XEN_BLKIF_REQS_PER_PAGE; i++) {
>>>> +	for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
>>>>  		req = kzalloc(sizeof(*req), GFP_KERNEL);
>>>>  		if (!req)
>>>>  			goto fail;
>>>> @@ -883,10 +933,9 @@ static int connect_ring(struct backend_info *be)
>>>>  	}
>>>>
>>>>  	/* Map the shared frame, irq etc. */
>>>> -	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
>>>> +	err = xen_blkif_map(be->blkif, ring_ref, nr_grefs, evtchn);
>>>>  	if (err) {
>>>> -		xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
>>>> -				 ring_ref, evtchn);
>>>> +		xenbus_dev_fatal(dev, err, "mapping ring-ref port %u",
>>> evtchn);
>>>>  		return err;
>>>>  	}
>>>>
>>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>>> index 88e23fd..d3c1a95 100644
>>>> --- a/drivers/block/xen-blkfront.c
>>>> +++ b/drivers/block/xen-blkfront.c
>>>> @@ -98,7 +98,21 @@ static unsigned int xen_blkif_max_segments = 32;
>>>>  module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
>>>>  MODULE_PARM_DESC(max, "Maximum amount of segments in indirect
>>> requests (default is 32)");
>>>>
>>>> -#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
>>>> +/*
>>>> + * Maximum order of pages to be used for the shared ring between front
>>> and
>>>> + * backend, 4KB page granularity is used.
>>>> + */
>>>> +static unsigned int xen_blkif_max_ring_order;
>>>> +module_param_named(max_ring_page_order,
>>> xen_blkif_max_ring_order, int, S_IRUGO);
>>>> +MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages
>>> to be used for the shared ring");
>>>> +
>>>> +#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE *
>>> (info)->nr_ring_pages)
>>>> +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE *
>>> XENBUS_MAX_RING_PAGES)
>>>> +/*
>>>> + * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
>>>> + * characters are enough. Define to 20 to keep consist with backend.
>>>> + */
>>>> +#define RINGREF_NAME_LEN (20)
>>>>
>>>>  /*
>>>>   * We have one of these per vbd, whether ide, scsi or 'other'.  They
>>>> @@ -114,13 +128,14 @@ struct blkfront_info
>>>>  	int vdevice;
>>>>  	blkif_vdev_t handle;
>>>>  	enum blkif_state connected;
>>>> -	int ring_ref;
>>>> +	int ring_ref[XENBUS_MAX_RING_PAGES];
>>>> +	unsigned int nr_ring_pages;
>>>>  	struct blkif_front_ring ring;
>>>>  	unsigned int evtchn, irq;
>>>>  	struct request_queue *rq;
>>>>  	struct work_struct work;
>>>>  	struct gnttab_free_callback callback;
>>>> -	struct blk_shadow shadow[BLK_RING_SIZE];
>>>> +	struct blk_shadow shadow[BLK_MAX_RING_SIZE];
>>>>  	struct list_head grants;
>>>>  	struct list_head indirect_pages;
>>>>  	unsigned int persistent_gnts_c;
>>>> @@ -139,8 +154,6 @@ static unsigned int nr_minors;
>>>>  static unsigned long *minors;
>>>>  static DEFINE_SPINLOCK(minor_lock);
>>>>
>>>> -#define MAXIMUM_OUTSTANDING_BLOCK_REQS \
>>>> -	(BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE)
>>>>  #define GRANT_INVALID_REF	0
>>>>
>>>>  #define PARTS_PER_DISK		16
>>>> @@ -170,7 +183,7 @@ static int blkfront_setup_indirect(struct
>>> blkfront_info *info);
>>>>  static int get_id_from_freelist(struct blkfront_info *info)
>>>>  {
>>>>  	unsigned long free = info->shadow_free;
>>>> -	BUG_ON(free >= BLK_RING_SIZE);
>>>> +	BUG_ON(free >= BLK_RING_SIZE(info));
>>>>  	info->shadow_free = info->shadow[free].req.u.rw.id;
>>>>  	info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
>>>>  	return free;
>>>> @@ -983,7 +996,7 @@ static void blkif_free(struct blkfront_info *info, int
>>> suspend)
>>>>  		}
>>>>  	}
>>>>
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++) {
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
>>>>  		/*
>>>>  		 * Clear persistent grants present in requests already
>>>>  		 * on the shared ring
>>>> @@ -1033,12 +1046,15 @@ free_shadow:
>>>>  	flush_work(&info->work);
>>>>
>>>>  	/* Free resources associated with old device channel. */
>>>> -	if (info->ring_ref != GRANT_INVALID_REF) {
>>>> -		gnttab_end_foreign_access(info->ring_ref, 0,
>>>> -					  (unsigned long)info->ring.sring);
>>>> -		info->ring_ref = GRANT_INVALID_REF;
>>>> -		info->ring.sring = NULL;
>>>> +	for (i = 0; i < info->nr_ring_pages; i++) {
>>>> +		if (info->ring_ref[i] != GRANT_INVALID_REF) {
>>>> +			gnttab_end_foreign_access(info->ring_ref[i], 0, 0);
>>>> +			info->ring_ref[i] = GRANT_INVALID_REF;
>>>> +		}
>>>>  	}
>>>> +	free_pages((unsigned long)info->ring.sring, get_order(info-
>>>> nr_ring_pages * PAGE_SIZE));
>>>> +	info->ring.sring = NULL;
>>>> +
>>>>  	if (info->irq)
>>>>  		unbind_from_irqhandler(info->irq, info);
>>>>  	info->evtchn = info->irq = 0;
>>>> @@ -1157,7 +1173,7 @@ static irqreturn_t blkif_interrupt(int irq, void
>>> *dev_id)
>>>>  		 * never have given to it (we stamp it up to BLK_RING_SIZE -
>>>>  		 * look in get_id_from_freelist.
>>>>  		 */
>>>> -		if (id >= BLK_RING_SIZE) {
>>>> +		if (id >= BLK_RING_SIZE(info)) {
>>>>  			WARN(1, "%s: response to %s has incorrect id
>>> (%ld)\n",
>>>>  			     info->gd->disk_name, op_name(bret->operation),
>>> id);
>>>>  			/* We can't safely get the 'struct request' as
>>>> @@ -1245,26 +1261,30 @@ static int setup_blkring(struct xenbus_device
>>> *dev,
>>>>  			 struct blkfront_info *info)
>>>>  {
>>>>  	struct blkif_sring *sring;
>>>> -	grant_ref_t gref;
>>>> -	int err;
>>>> +	int err, i;
>>>> +	unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE;
>>>> +	grant_ref_t gref[XENBUS_MAX_RING_PAGES];
>>>>
>>>> -	info->ring_ref = GRANT_INVALID_REF;
>>>> +	for (i = 0; i < info->nr_ring_pages; i++)
>>>> +		info->ring_ref[i] = GRANT_INVALID_REF;
>>>>
>>>> -	sring = (struct blkif_sring *)__get_free_page(GFP_NOIO |
>>> __GFP_HIGH);
>>>> +	sring = (struct blkif_sring *)__get_free_pages(GFP_NOIO |
>>> __GFP_HIGH,
>>>> +						       get_order(ring_size));
>>>>  	if (!sring) {
>>>>  		xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring");
>>>>  		return -ENOMEM;
>>>>  	}
>>>>  	SHARED_RING_INIT(sring);
>>>> -	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
>>>> +	FRONT_RING_INIT(&info->ring, sring, ring_size);
>>>>
>>>> -	err = xenbus_grant_ring(dev, info->ring.sring, 1, &gref);
>>>> +	err = xenbus_grant_ring(dev, info->ring.sring, info->nr_ring_pages,
>>> gref);
>>>>  	if (err < 0) {
>>>> -		free_page((unsigned long)sring);
>>>> +		free_pages((unsigned long)sring, get_order(ring_size));
>>>>  		info->ring.sring = NULL;
>>>>  		goto fail;
>>>>  	}
>>>> -	info->ring_ref = gref;
>>>> +	for (i = 0; i < info->nr_ring_pages; i++)
>>>> +		info->ring_ref[i] = gref[i];
>>>>
>>>>  	err = xenbus_alloc_evtchn(dev, &info->evtchn);
>>>>  	if (err)
>>>> @@ -1292,7 +1312,18 @@ static int talk_to_blkback(struct xenbus_device
>>> *dev,
>>>>  {
>>>>  	const char *message = NULL;
>>>>  	struct xenbus_transaction xbt;
>>>> -	int err;
>>>> +	int err, i;
>>>> +	unsigned int max_page_order = 0;
>>>> +	unsigned int ring_page_order = 0;
>>>> +
>>>> +	err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
>>>> +			   "max-ring-page-order", "%u", &max_page_order);
>>>> +	if (err != 1)
>>>> +		info->nr_ring_pages = 1;
>>>> +	else {
>>>> +		ring_page_order = min(xen_blkif_max_ring_order,
>>> max_page_order);
>>>> +		info->nr_ring_pages = 1 << ring_page_order;
>>>> +	}
>>>>
>>>>  	/* Create shared ring, alloc event channel. */
>>>>  	err = setup_blkring(dev, info);
>>>> @@ -1306,11 +1337,32 @@ again:
>>>>  		goto destroy_blkring;
>>>>  	}
>>>>
>>>> -	err = xenbus_printf(xbt, dev->nodename,
>>>> -			    "ring-ref", "%u", info->ring_ref);
>>>> -	if (err) {
>>>> -		message = "writing ring-ref";
>>>> -		goto abort_transaction;
>>>> +	if (info->nr_ring_pages == 1) {
>>>> +		err = xenbus_printf(xbt, dev->nodename,
>>>> +				    "ring-ref", "%u", info->ring_ref[0]);
>>>> +		if (err) {
>>>> +			message = "writing ring-ref";
>>>> +			goto abort_transaction;
>>>> +		}
>>>> +	} else {
>>>> +		err = xenbus_printf(xbt, dev->nodename,
>>>> +				    "ring-page-order", "%u", ring_page_order);
>>>> +		if (err) {
>>>> +			message = "writing ring-page-order";
>>>> +			goto abort_transaction;
>>>> +		}
>>>> +
>>>> +		for (i = 0; i < info->nr_ring_pages; i++) {
>>>> +			char ring_ref_name[RINGREF_NAME_LEN];
>>>> +
>>>> +			snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>>>> +			err = xenbus_printf(xbt, dev->nodename, ring_ref_name,
>>>> +					    "%u", info->ring_ref[i]);
>>>> +			if (err) {
>>>> +				message = "writing ring-ref";
>>>> +				goto abort_transaction;
>>>> +			}
>>>> +		}
>>>>  	}
>>>>  	err = xenbus_printf(xbt, dev->nodename,
>>>>  			    "event-channel", "%u", info->evtchn);
>>>> @@ -1338,6 +1390,9 @@ again:
>>>>  		goto destroy_blkring;
>>>>  	}
>>>>
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++)
>>>> +		info->shadow[i].req.u.rw.id = i+1;
>>>> +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
>>>>  	xenbus_switch_state(dev, XenbusStateInitialised);
>>>>
>>>>  	return 0;
>>>> @@ -1361,7 +1416,7 @@ again:
>>>>  static int blkfront_probe(struct xenbus_device *dev,
>>>>  			  const struct xenbus_device_id *id)
>>>>  {
>>>> -	int err, vdevice, i;
>>>> +	int err, vdevice;
>>>>  	struct blkfront_info *info;
>>>>
>>>>  	/* FIXME: Use dynamic device id if this is not set. */
>>>> @@ -1422,10 +1477,6 @@ static int blkfront_probe(struct xenbus_device *dev,
>>>>  	info->connected = BLKIF_STATE_DISCONNECTED;
>>>>  	INIT_WORK(&info->work, blkif_restart_queue);
>>>>
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++)
>>>> -		info->shadow[i].req.u.rw.id = i+1;
>>>> -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
>>>> -
>>>>  	/* Front end dir is a number, which is used as the id. */
>>>>  	info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
>>>>  	dev_set_drvdata(&dev->dev, info);
>>>> @@ -1469,10 +1520,10 @@ static int blkif_recover(struct blkfront_info *info)
>>>>
>>>>  	/* Stage 2: Set up free list. */
>>>>  	memset(&info->shadow, 0, sizeof(info->shadow));
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++)
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++)
>>>>  		info->shadow[i].req.u.rw.id = i+1;
>>>>  	info->shadow_free = info->ring.req_prod_pvt;
>>>> -	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
>>>> +	info->shadow[BLK_RING_SIZE(info)-1].req.u.rw.id = 0x0fffffff;
>>>>
>>>>  	rc = blkfront_setup_indirect(info);
>>>>  	if (rc) {
>>>> @@ -1484,7 +1535,7 @@ static int blkif_recover(struct blkfront_info *info)
>>>>  	blk_queue_max_segments(info->rq, segs);
>>>>  	bio_list_init(&bio_list);
>>>>  	INIT_LIST_HEAD(&requests);
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++) {
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
>>>>  		/* Not in use? */
>>>>  		if (!copy[i].request)
>>>>  			continue;
>>>> @@ -1690,7 +1741,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>>>>  		segs = info->max_indirect_segments;
>>>>  	}
>>>>
>>>> -	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
>>>> +	err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE(info));
>>>>  	if (err)
>>>>  		goto out_of_memory;
>>>>
>>>> @@ -1700,7 +1751,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>>>>  		 * grants, we need to allocate a set of pages that can be
>>>>  		 * used for mapping indirect grefs
>>>>  		 */
>>>> -		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;
>>>> +		int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE(info);
>>>>
>>>>  		BUG_ON(!list_empty(&info->indirect_pages));
>>>>  		for (i = 0; i < num; i++) {
>>>> @@ -1711,7 +1762,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>>>>  		}
>>>>  	}
>>>>
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++) {
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
>>>>  		info->shadow[i].grants_used = kzalloc(
>>>>  			sizeof(info->shadow[i].grants_used[0]) * segs,
>>>>  			GFP_NOIO);
>>>> @@ -1733,7 +1784,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>>>>  	return 0;
>>>>
>>>>  out_of_memory:
>>>> -	for (i = 0; i < BLK_RING_SIZE; i++) {
>>>> +	for (i = 0; i < BLK_RING_SIZE(info); i++) {
>>>>  		kfree(info->shadow[i].grants_used);
>>>>  		info->shadow[i].grants_used = NULL;
>>>>  		kfree(info->shadow[i].sg);
>>>> @@ -2089,6 +2140,12 @@ static int __init xlblk_init(void)
>>>>  	if (!xen_domain())
>>>>  		return -ENODEV;
>>>>
>>>> +	if (xen_blkif_max_ring_order > XENBUS_MAX_RING_PAGE_ORDER) {
>>>> +		pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
>>>> +			xen_blkif_max_ring_order, XENBUS_MAX_RING_PAGE_ORDER);
>>>> +		xen_blkif_max_ring_order = 0;
>>>> +	}
>>>> +
>>>>  	if (!xen_has_pv_disk_devices())
>>>>  		return -ENODEV;
>>>>
>>>>
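
For reference, a sketch of the xenstore layout this negotiation ends up
with, using a ring-page-order of 2 purely as an example (the grant
references and event channel port are placeholders, not values taken from
the patch):

  backend dir (info->xbdev->otherend):
    max-ring-page-order = "3"        largest order the backend accepts

  frontend dir (dev->nodename):
    ring-page-order     = "2"        min(xen_blkif_max_ring_order, backend max)
    ring-ref0           = "<gref0>"
    ring-ref1           = "<gref1>"
    ring-ref2           = "<gref2>"
    ring-ref3           = "<gref3>"
    event-channel       = "<port>"

A frontend that ends up with nr_ring_pages == 1 keeps writing the single
"ring-ref" key instead, preserving the existing single-page behaviour.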

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]       ` <20150609133938.GA15200@x230>
  2015-06-09 13:48         ` Bob Liu
@ 2015-06-09 14:07         ` Roger Pau Monné
       [not found]         ` <5576F325.5050304@citrix.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Roger Pau Monné @ 2015-06-09 14:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Paul Durrant
  Cc: Rafal Mielniczuk, linux-kernel, Marcus Granado, xen-devel,
	Julien Grall, justing, David Vrabel

On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 09, 2015 at 08:52:53AM +0000, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Bob Liu [mailto:bob.liu@oracle.com]
>>> Sent: 09 June 2015 09:50
>>> To: Bob Liu
>>> Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
>>> konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
>>> kernel@vger.kernel.org
>>> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
>>>
>>>
>>> On 06/03/2015 01:40 PM, Bob Liu wrote:
>>>> Extend xen/block to support multi-page ring, so that more requests can be
>>>> issued by using more than one pages as the request ring between blkfront
>>>> and backend.
>>>> As a result, the performance can get improved significantly.
>>>>
>>>> We got some impressive improvements on our highend iscsi storage cluster
>>>> backend. If using 64 pages as the ring, the IOPS increased about 15 times
>>>> for the throughput testing and above doubled for the latency testing.
>>>>
>>>> The reason was the limit on outstanding requests is 32 if use only one-page
>>>> ring, but in our case the iscsi lun was spread across about 100 physical
>>>> drives, 32 was really not enough to keep them busy.
>>>>
>>>> Changes in v2:
>>>>  - Rebased to 4.0-rc6.
>>>>  - Document on how multi-page ring feature working to linux io/blkif.h.
>>>>
>>>> Changes in v3:
>>>>  - Remove changes to linux io/blkif.h and follow the protocol defined
>>>>    in io/blkif.h of XEN tree.
>>>>  - Rebased to 4.1-rc3
>>>>
>>>> Changes in v4:
>>>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
>>>>  - A few comments from Roger.
>>>>
>>>> Changes in v5:
>>>>  - Clarify with 4k granularity to comment
>>>>  - Address more comments from Roger
>>>>
>>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>>
>>> Also tested the windows PV driver which also works fine when multi-page
>>> ring feature
>>> was enabled in Linux backend.
>>> http://www.xenproject.org/downloads/windows-pv-drivers.html
>>>
>>
>> Great! Thanks for verifying that :-)
> 
> Woot! Bob, could you repost the blkif.h patch for the Xen tree
> please and also mention the testing part in it please? I think this
> was the only big 'what if?!' question holding this up.
> 
> 
> Roger, I put them (patches) on devel/for-jens-4.2 on
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> 
> I think these two patches:
> drivers: xen-blkback: delay pending_req allocation to connect_ring
> xen/block: add multi-page ring support
> 
> are the only ones that haven't been Acked by you (or maybe they
> have and I missed the Ack?)

Hello,

I was waiting to Ack those because the XenServer storage performance
folks found out that these patches cause a performance regression on
some of their tests. I'm adding them to the conversation so they can
provide more details about the issues they found, and whether we should
hold off on pushing these patches or not.

Roger.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]         ` <5576F325.5050304@citrix.com>
@ 2015-06-09 14:21           ` Konrad Rzeszutek Wilk
       [not found]           ` <20150609142126.GJ15200@x230>
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-09 14:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Rafal Mielniczuk, linux-kernel, Marcus Granado, xen-devel,
	Julien Grall, justing, Paul Durrant, David Vrabel

On Tue, Jun 09, 2015 at 04:07:33PM +0200, Roger Pau Monné wrote:
> On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jun 09, 2015 at 08:52:53AM +0000, Paul Durrant wrote:
> >>> -----Original Message-----
> >>> From: Bob Liu [mailto:bob.liu@oracle.com]
> >>> Sent: 09 June 2015 09:50
> >>> To: Bob Liu
> >>> Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
> >>> konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
> >>> kernel@vger.kernel.org
> >>> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
> >>>
> >>>
> >>> On 06/03/2015 01:40 PM, Bob Liu wrote:
> >>>> Extend xen/block to support multi-page ring, so that more requests can be
> >>>> issued by using more than one pages as the request ring between blkfront
> >>>> and backend.
> >>>> As a result, the performance can get improved significantly.
> >>>>
> >>>> We got some impressive improvements on our highend iscsi storage cluster
> >>>> backend. If using 64 pages as the ring, the IOPS increased about 15 times
> >>>> for the throughput testing and above doubled for the latency testing.
> >>>>
> >>>> The reason was the limit on outstanding requests is 32 if use only one-page
> >>>> ring, but in our case the iscsi lun was spread across about 100 physical
> >>>> drives, 32 was really not enough to keep them busy.
> >>>>
> >>>> Changes in v2:
> >>>>  - Rebased to 4.0-rc6.
> >>>>  - Document on how multi-page ring feature working to linux io/blkif.h.
> >>>>
> >>>> Changes in v3:
> >>>>  - Remove changes to linux io/blkif.h and follow the protocol defined
> >>>>    in io/blkif.h of XEN tree.
> >>>>  - Rebased to 4.1-rc3
> >>>>
> >>>> Changes in v4:
> >>>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
> >>>>  - A few comments from Roger.
> >>>>
> >>>> Changes in v5:
> >>>>  - Clarify with 4k granularity to comment
> >>>>  - Address more comments from Roger
> >>>>
> >>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> >>>
> >>> Also tested the windows PV driver which also works fine when multi-page
> >>> ring feature
> >>> was enabled in Linux backend.
> >>> http://www.xenproject.org/downloads/windows-pv-drivers.html
> >>>
> >>
> >> Great! Thanks for verifying that :-)
> > 
> > Woot! Bob, could you repost the blkif.h patch for the Xen tree
> > please and also mention the testing part in it please? I think this
> > was the only big 'what if?!' question holding this up.
> > 
> > 
> > Roger, I put them (patches) on devel/for-jens-4.2 on
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> > 
> > I think these two patches:
> > drivers: xen-blkback: delay pending_req allocation to connect_ring
> > xen/block: add multi-page ring support
> > 
> > are the only ones that haven't been Acked by you (or maybe they
> > have and I missed the Ack?)
> 
> Hello,
> 
> I was waiting to Ack those because the XenServer storage performance
> folks found out that these patches cause a performance regression on
> some of their tests. I'm adding them to the conversation so they can

This is with multi-page enabled or with the patches but multi-page
disabled (baseline)?

> provide more details about the issues they found, and whether we should
> hold off on pushing these patches or not.

Or surely fix whatever is causing this.
> 
> Roger.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]           ` <20150609142126.GJ15200@x230>
@ 2015-06-19 20:12             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-19 20:12 UTC (permalink / raw)
  Cc: Rafal Mielniczuk, linux-kernel, Marcus Granado, xen-devel,
	Julien Grall, justing, Paul Durrant, David Vrabel,
	Roger Pau Monné

On Tue, Jun 09, 2015 at 10:21:27AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 09, 2015 at 04:07:33PM +0200, Roger Pau Monné wrote:
> > On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Jun 09, 2015 at 08:52:53AM +0000, Paul Durrant wrote:
> > >>> -----Original Message-----
> > >>> From: Bob Liu [mailto:bob.liu@oracle.com]
> > >>> Sent: 09 June 2015 09:50
> > >>> To: Bob Liu
> > >>> Cc: xen-devel@lists.xen.org; David Vrabel; justing@spectralogic.com;
> > >>> konrad.wilk@oracle.com; Roger Pau Monne; Paul Durrant; Julien Grall; linux-
> > >>> kernel@vger.kernel.org
> > >>> Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
> > >>>
> > >>>
> > >>> On 06/03/2015 01:40 PM, Bob Liu wrote:
> > >>>> Extend xen/block to support multi-page ring, so that more requests can be
> > >>>> issued by using more than one pages as the request ring between blkfront
> > >>>> and backend.
> > >>>> As a result, the performance can get improved significantly.
> > >>>>
> > >>>> We got some impressive improvements on our highend iscsi storage cluster
> > >>>> backend. If using 64 pages as the ring, the IOPS increased about 15 times
> > >>>> for the throughput testing and above doubled for the latency testing.
> > >>>>
> > >>>> The reason was the limit on outstanding requests is 32 if use only one-page
> > >>>> ring, but in our case the iscsi lun was spread across about 100 physical
> > >>>> drives, 32 was really not enough to keep them busy.
> > >>>>
> > >>>> Changes in v2:
> > >>>>  - Rebased to 4.0-rc6.
> > >>>>  - Document on how multi-page ring feature working to linux io/blkif.h.
> > >>>>
> > >>>> Changes in v3:
> > >>>>  - Remove changes to linux io/blkif.h and follow the protocol defined
> > >>>>    in io/blkif.h of XEN tree.
> > >>>>  - Rebased to 4.1-rc3
> > >>>>
> > >>>> Changes in v4:
> > >>>>  - Turn to use 'ring-page-order' and 'max-ring-page-order'.
> > >>>>  - A few comments from Roger.
> > >>>>
> > >>>> Changes in v5:
> > >>>>  - Clarify with 4k granularity to comment
> > >>>>  - Address more comments from Roger
> > >>>>
> > >>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> > >>>
> > >>> Also tested the windows PV driver which also works fine when multi-page
> > >>> ring feature
> > >>> was enabled in Linux backend.
> > >>> http://www.xenproject.org/downloads/windows-pv-drivers.html
> > >>>
> > >>
> > >> Great! Thanks for verifying that :-)
> > > 
> > > Woot! Bob, could you repost the blkif.h patch for the Xen tree
> > > please and also mention the testing part in it please? I think this
> > > was the only big 'what if?!' question holding this up.
> > > 
> > > 
> > > Roger, I put them (patches) on devel/for-jens-4.2 on
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> > > 
> > > I think these two patches:
> > > drivers: xen-blkback: delay pending_req allocation to connect_ring
> > > xen/block: add multi-page ring support
> > > 
> > > are the only ones that haven't been Acked by you (or maybe they
> > > have and I missed the Ack?)
> > 
> > Hello,
> > 
> > I was waiting to Ack those because the XenServer storage performance
> > folks found out that these patches cause a performance regression on
> > some of their tests. I'm adding them to the conversation so they can
> 
> This is with multi-page enabled or with the patches but multi-page
> disabled (baseline)?
> 
> > provide more details about the issues they found, and whether we should
> > hold off on pushing these patches or not.
> 
> Or surely fix whatever is causing this.


ping?


> > 
> > Roger.
> > 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]         ` <5576F325.5050304@citrix.com>
  2015-06-09 14:21           ` Konrad Rzeszutek Wilk
       [not found]           ` <20150609142126.GJ15200@x230>
@ 2015-06-22  1:20           ` Bob Liu
       [not found]           ` <558762C4.2000002@oracle.com>
  3 siblings, 0 replies; 16+ messages in thread
From: Bob Liu @ 2015-06-22  1:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Rafal Mielniczuk, linux-kernel, Marcus Granado, xen-devel,
	Julien Grall, justing, Paul Durrant, David Vrabel


On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
> On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
...
>> Roger, I put them (patches) on devel/for-jens-4.2 on
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>>
>> I think these two patches:
>> drivers: xen-blkback: delay pending_req allocation to connect_ring
>> xen/block: add multi-page ring support
>>
>> are the only ones that haven't been Acked by you (or maybe they
>> have and I missed the Ack?)
> 
> Hello,
> 
> I was waiting to Ack those because the XenServer storage performance
> folks found out that these patches cause a performance regression on
> some of their tests. I'm adding them to the conversation so they can
> provide more details about the issues they found, and whether we should
> hold off on pushing these patches or not.
> 

Hey,

Are there any updates? What's the performance regression problem?

Thanks,
-Bob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]           ` <558762C4.2000002@oracle.com>
@ 2015-06-22 13:47             ` Konrad Rzeszutek Wilk
  2015-06-22 13:56             ` [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising Konrad Rzeszutek Wilk
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-22 13:47 UTC (permalink / raw)
  To: Bob Liu
  Cc: Rafal Mielniczuk, linux-kernel, Marcus Granado, xen-devel,
	Julien Grall, justing, Paul Durrant, David Vrabel,
	Roger Pau Monné

On Mon, Jun 22, 2015 at 09:20:04AM +0800, Bob Liu wrote:
> 
> On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
> > On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> ...
> >> Roger, I put them (patches) on devel/for-jens-4.2 on
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> >>
> >> I think these two patches:
> >> drivers: xen-blkback: delay pending_req allocation to connect_ring
> >> xen/block: add multi-page ring support
> >>
> >> are the only ones that haven't been Acked by you (or maybe they
> >> have and I missed the Ack?)
> > 
> > Hello,
> > 
> > I was waiting to Ack those because the XenServer storage performance
> > folks found out that these patches cause a performance regression on
> > some of their tests. I'm adding them to the conversation so they can
> > provide more details about the issues they found, and whether we should
> > hold off on pushing these patches or not.
> > 
> 
> Hey,
> 
> Are there any updates? What's the performance regression problem?

Roger updated me on IRC and it was cleared up by doing more fio jobs.

Bob, let me also post your patch.
> 
> Thanks,
> -Bob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
       [not found]           ` <558762C4.2000002@oracle.com>
  2015-06-22 13:47             ` Konrad Rzeszutek Wilk
@ 2015-06-22 13:56             ` Konrad Rzeszutek Wilk
       [not found]             ` <1434981360-29794-1-git-send-email-konrad.wilk@oracle.com>
  2015-06-23 12:51             ` [PATCH 3/3] xen/block: add multi-page ring support Marcus Granado
  3 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-22 13:56 UTC (permalink / raw)
  To: linux-kernel, xen-devel, roger.pau
  Cc: Rafal.Mielniczuk, Konrad Rzeszutek Wilk, marcus.granado,
	julien.grall, justing, Paul.Durrant, david.vrabel

From: Bob Liu <bob.liu@oracle.com>

Patch 69b91ede5cab843dcf345c28bd1f4b5a99dacd9b
"drivers: xen-blkback: delay pending_req allocation to connect_ring"
exposed a problem that Xen blkfront has. There is a race
between XenStored and the drivers such that we can see two:

vbd vbd-268440320: blkfront:blkback_changed to state 2.
vbd vbd-268440320: blkfront:blkback_changed to state 2.
vbd vbd-268440320: blkfront:blkback_changed to state 4.

state changes to XenbusStateInitWait ('2'). The end result is that
blkback_changed() receives two notifications and calls setup_blkring() twice.

The backend driver, however, may act only on the ring references from the
first setup_blkring(), which are out of date by then (or it may read them
while they are being overwritten with the new ring-ref values).

The end result is that the ring ends up being incorrectly set.

Reported-and-Tested-by: Robert Butera <robert.butera@oracle.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/block/xen-blkfront.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d3c1a95..fc770b7 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1951,6 +1951,8 @@ static void blkback_changed(struct xenbus_device *dev,
 
 	switch (backend_state) {
 	case XenbusStateInitWait:
+		if (dev->state != XenbusStateInitialising)
+			break;
 		if (talk_to_blkback(dev, info)) {
 			kfree(info);
 			dev_set_drvdata(&dev->dev, NULL);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
       [not found]             ` <1434981360-29794-1-git-send-email-konrad.wilk@oracle.com>
@ 2015-06-22 14:06               ` Konrad Rzeszutek Wilk
  2015-06-23  6:23               ` Jan Beulich
       [not found]               ` <5589179F0200007800087E6E@mail.emea.novell.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-22 14:06 UTC (permalink / raw)
  To: linux-kernel, xen-devel, roger.pau
  Cc: Rafal.Mielniczuk, marcus.granado, julien.grall, justing,
	Paul.Durrant, david.vrabel

On Mon, Jun 22, 2015 at 09:56:00AM -0400, Konrad Rzeszutek Wilk wrote:
> From: Bob Liu <bob.liu@oracle.com>
> 
> Patch 69b91ede5cab843dcf345c28bd1f4b5a99dacd9b
> "drivers: xen-blkback: delay pending_req allocation to connect_ring"
> exposed a problem that Xen blkfront has. There is a race
> between XenStored and the drivers such that we can see two:
> 
> vbd vbd-268440320: blkfront:blkback_changed to state 2.
> vbd vbd-268440320: blkfront:blkback_changed to state 2.
> vbd vbd-268440320: blkfront:blkback_changed to state 4.
> 
> state changes to XenbusStateInitWait ('2'). The end result is that
> blkback_changed() receives two notifications and calls setup_blkring() twice.
> 
> The backend driver, however, may act only on the ring references from the
> first setup_blkring(), which are out of date by then (or it may read them
> while they are being overwritten with the new ring-ref values).
> 
> The end result is that the ring ends up being incorrectly set.

I forgot to mention - this check exists in xen-netfront, though the
commit for it does not mention the double state changes. Both the
Xen PCI frontend and backend are OK. So I added this to the commit message:

"The other drivers in the tree have such checks already in."

> 
> Reported-and-Tested-by: Robert Butera <robert.butera@oracle.com>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/block/xen-blkfront.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index d3c1a95..fc770b7 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -1951,6 +1951,8 @@ static void blkback_changed(struct xenbus_device *dev,
>  
>  	switch (backend_state) {
>  	case XenbusStateInitWait:
> +		if (dev->state != XenbusStateInitialising)
> +			break;
>  		if (talk_to_blkback(dev, info)) {
>  			kfree(info);
>  			dev_set_drvdata(&dev->dev, NULL);
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
       [not found]             ` <1434981360-29794-1-git-send-email-konrad.wilk@oracle.com>
  2015-06-22 14:06               ` Konrad Rzeszutek Wilk
@ 2015-06-23  6:23               ` Jan Beulich
       [not found]               ` <5589179F0200007800087E6E@mail.emea.novell.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2015-06-23  6:23 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Rafal.Mielniczuk, linux-kernel, marcus.granado, julien.grall,
	justing, Paul.Durrant, david.vrabel, xen-devel, roger.pau

>>> On 22.06.15 at 15:56, <konrad.wilk@oracle.com> wrote:
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -1951,6 +1951,8 @@ static void blkback_changed(struct xenbus_device *dev,
>  
>  	switch (backend_state) {
>  	case XenbusStateInitWait:
> +		if (dev->state != XenbusStateInitialising)
> +			break;

If, as you say in a subsequent reply, all other drivers already have
such a check, wouldn't it make more sense to put this into xenbus'
backend_changed() or even xenbus_otherend_changed()?

Jan
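
As a rough illustration of that suggestion (the helper name and placement
are invented for this sketch and only the xenbus types from <xen/xenbus.h>
are assumed; this is not actual xenbus code), the per-driver guard from the
blkfront patch could move into common code along these lines:

static bool xenbus_ignore_otherend_state(struct xenbus_device *dev,
					 enum xenbus_state backend_state)
{
	/*
	 * A repeated InitWait notification arriving after the frontend has
	 * already left Initialising would otherwise make the driver redo
	 * its ring setup - the race described in the patch above.
	 */
	return backend_state == XenbusStateInitWait &&
	       dev->state != XenbusStateInitialising;
}

blkback_changed() in xen-blkfront (and the equivalent check Konrad mentions
in xen-netfront) would then no longer need to open-code the test.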

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
       [not found]               ` <5589179F0200007800087E6E@mail.emea.novell.com>
@ 2015-06-23 11:57                 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-23 11:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Rafal.Mielniczuk, linux-kernel, marcus.granado, julien.grall,
	justing, Paul.Durrant, david.vrabel, xen-devel, roger.pau

On June 23, 2015 2:23:59 AM EDT, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 22.06.15 at 15:56, <konrad.wilk@oracle.com> wrote:
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -1951,6 +1951,8 @@ static void blkback_changed(struct
>xenbus_device *dev,
>>  
>>  	switch (backend_state) {
>>  	case XenbusStateInitWait:
>> +		if (dev->state != XenbusStateInitialising)
>> +			break;
>
>If, as you say in a subsequent reply, all other drivers already have
>such a check, wouldn't it make more sense to put this into xenbus'
>backend_changed() or even xenbus_otherend_changed()?

Good idea. Let me spin a cleanup patch for the race handling - but put this in the Jens branch as the merge window is happening right now.



>
>Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] xen/block: add multi-page ring support
       [not found]           ` <558762C4.2000002@oracle.com>
                               ` (2 preceding siblings ...)
       [not found]             ` <1434981360-29794-1-git-send-email-konrad.wilk@oracle.com>
@ 2015-06-23 12:51             ` Marcus Granado
  3 siblings, 0 replies; 16+ messages in thread
From: Marcus Granado @ 2015-06-23 12:51 UTC (permalink / raw)
  To: Bob Liu, Roger Pau Monné
  Cc: Rafal Mielniczuk, Jonathan Davies, linux-kernel, xen-devel,
	Julien Grall, justing, Paul Durrant, David Vrabel

On 22/06/15 02:20, Bob Liu wrote:
>
> On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
>> On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> ...
>>> Roger, I put them (patches) on devel/for-jens-4.2 on
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>>>
>>> I think these two patches:
>>> drivers: xen-blkback: delay pending_req allocation to connect_ring
>>> xen/block: add multi-page ring support
>>>
>>> are the only ones that haven't been Acked by you (or maybe they
>>> have and I missed the Ack?)
>>
>> Hello,
>>
>> I was waiting to Ack those because the XenServer storage performance
>> folks found out that these patches cause a performance regression on
>> some of their tests. I'm adding them to the conversation so they can
>> provide more details about the issues they found, and whether we should
>> hold off on pushing these patches or not.
>>
>
> Hey,
>
> Are there any updates? What's the performance regression problem?
>

Hi,

We spent the last two weeks finishing measurements on the multi-page
ring v5 patches under a range of conditions.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with a back-ported multi-page ring v5 
applied to our dom0 kernel 3.10.

- using a recent Ubuntu 15.04 kernel 3.19 with the v5 frontend applied, to be 
used as the guest

- using a micron RealSSD P320h as the underlying local storage on a Dell 
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. 
We used direct_io to skip caching in the guest and ran fio for 60s for a 
number of block sizes ranging from 512 bytes to 4MiB. We also tried pure 
random and pure sequential reads. Random reads were used to counteract 
read-ahead prefetching at the underlying storage.

We noticed that using large (>16) queue depths in fio would saturate 
individual vcpus in the guest, so to better utilise the cpu resources in 
the guest, we chose to (a) fix the queue depth to 4 for each fio thread, 
(b) increase the guest vcpus to 16 and (c) vary the number of fio 
threads from 1 to 64.

We were interested in observing storage iops and throughput for 
different values of in-flight requests (= io depth * fio threads) 
generated by the guest. Our expectation was that iops and throughput 
with single-page and multi-page rings would be the same up to 32 
in-flight requests (the number of requests that fit in a single-page 
ring), and then the single-page ring case would flat-line with >32 
in-flight requests, whereas the multi-page ring case would continue to 
show improvements until hitting some other bottleneck. The effect should 
be more visible when using requests with smaller block sizes because the 
measurements are less likely to be affected by memory copy delays or 
large data transfer delays to storage.
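
As a back-of-envelope check of that expectation, taking the figure of 32
requests per ring page used earlier in the thread:

  in-flight capacity ~= 32 requests/page * nr_ring_pages
  1-page ring  -> ~32 outstanding requests
  8-page ring  -> ~256 outstanding requests

so the 1-page configuration should stop scaling somewhere above 32 in-flight
requests, while the 8-page ring has headroom up to roughly 256, the largest
in_flight value exercised in the table below.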

These are the results we got for the conditions above with 4KiB blocks 
and random reads:

fio_threads  io_depth  in_flight   1-page_IOPS  8-page_IOPS
     1            4         4           19K          19K
     4            4        16           89K          89K
     8            4        32          149K         149K
    16            4        64          131K         198K
    32            4       128          127K         208K
    64            4       256          132K         209K

We believe that this data shows that there's a clear improvement when 
using multi-page rings when there are more than 32 in-flight requests. 
We observed similar improvements when writing, and across all small 
block sizes. For block sizes >=16KiB, the results were similar between 
single- and multi-page rings, and we attribute that to bottlenecks when 
transferring large amounts of data that is not present with smaller 
block sizes.

Another reason for using random reads in the synthetic fio tests above 
is that we noticed that when sequential reads are used there were some 
anomalies that we believe would affect a fair comparison:

(A)- in some situations with sequential read, we observed a decreasing 
number of merges in the guest (according to 'iostat -x -m 1') with small 
block sizes <=4KiB when increasing the number of ring pages. There were 
no merges whenever in_flight < ring_pages * 32. With larger in_flight 
requests (>=128) -- visible with both 8 fio_threads x 32 io_depth and 32 
fio_threads x 8 io_depth -- storage throughput with 1 page was around 
25% better than with 8 pages. This is the regression that Roger was 
talking about previously in this discussion. It seems related to the 
merges of requests occurring much more frequently with 1 page than with 
8 pages. During the measurements, the average request queue size in 
iostat always had a value similar to the number of requests in the ring. 
I would appreciate potential explanations of why the guest kernel would 
behave like that. We believe that this regression is a corner-case that 
would be difficult to spot in a real-world load, where random reads are 
interspersed with sequential reads of many different block sizes and io 
depths, and we only spotted it because of our synthetic load with fio 
used a wide range of parameters with sequential reads. It may also be 
specific to the way that Linux handles this situation.

(B)- in other situations with sequential read (block sizes between 8KiB 
and 128KiB), we observed the storage throughput with 1 page was around 
50% worse than with 8 pages. Again, this seems related to the existence 
of merges with 1 page but not with 8 pages, and I would appreciate 
potential explanations.

For sequential reads, arguably the performance difference spotted in (A) 
is counterbalanced by the performance difference in (B), and they 
cancel each other out if all block sizes are considered together. For 
random reads, 8-page rings were similar or superior to 1-page rings in 
all tested conditions.

All considered, we believe that the multi-page ring patches improve the 
storage performance (apart from case (A)) and therefore should be good 
to merge.

Marcus

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-06-23 12:51 UTC | newest]

Thread overview: 16+ messages
     [not found] <1433310003-13089-1-git-send-email-bob.liu@oracle.com>
2015-06-03  5:40 ` [PATCH 2/3] driver: xen-blkfront: move talk_to_blkback to a more suitable place Bob Liu
2015-06-03  5:40 ` [PATCH 3/3] xen/block: add multi-page ring support Bob Liu
     [not found] ` <1433310003-13089-3-git-send-email-bob.liu@oracle.com>
2015-06-09  8:50   ` Bob Liu
     [not found]   ` <5576A8C0.8000804@oracle.com>
2015-06-09  8:52     ` Paul Durrant
     [not found]     ` <9AAE0902D5BC7E449B7C8E4E778ABCD0259410F3@AMSPEX01CL01.citrite.net>
2015-06-09 13:39       ` Konrad Rzeszutek Wilk
     [not found]       ` <20150609133938.GA15200@x230>
2015-06-09 13:48         ` Bob Liu
2015-06-09 14:07         ` Roger Pau Monné
     [not found]         ` <5576F325.5050304@citrix.com>
2015-06-09 14:21           ` Konrad Rzeszutek Wilk
     [not found]           ` <20150609142126.GJ15200@x230>
2015-06-19 20:12             ` Konrad Rzeszutek Wilk
2015-06-22  1:20           ` Bob Liu
     [not found]           ` <558762C4.2000002@oracle.com>
2015-06-22 13:47             ` Konrad Rzeszutek Wilk
2015-06-22 13:56             ` [PATCH] drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising Konrad Rzeszutek Wilk
     [not found]             ` <1434981360-29794-1-git-send-email-konrad.wilk@oracle.com>
2015-06-22 14:06               ` Konrad Rzeszutek Wilk
2015-06-23  6:23               ` Jan Beulich
     [not found]               ` <5589179F0200007800087E6E@mail.emea.novell.com>
2015-06-23 11:57                 ` Konrad Rzeszutek Wilk
2015-06-23 12:51             ` [PATCH 3/3] xen/block: add multi-page ring support Marcus Granado
