* [PATCH RFC 00/12] xen-block: indirect descriptors
@ 2013-02-28 10:28 Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr Roger Pau Monne
                   ` (12 more replies)
  0 siblings, 13 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel

This series contains the initial implementation of indirect 
descriptors for Linux blkback/blkfront.

Patches 1, 2, 3, 4 and 5 are bug fixes and minor optimizations.

Patch 6 contains an LRU implementation for blkback that will be needed 
when using indirect descriptors (since we are no longer able to 
persistently map all possible grants blkfront might use).

Patch 7 extends the print stats function in blkback to also report 
persistent grant usage.

Patches 8, 9, 10 and 11 are preparatory work for the indirect 
descriptors implementation; they mainly make blkback use dynamic memory 
and remove the shared blkbk structure, so that each blkback instance has 
its own list of free requests, pages, handles and so on.

Finally patch 12 contains the indirect descriptors implementation.

I've also pushed this series to the following git repository:

git://xenbits.xen.org/people/royger/linux.git xen-block-indirect

The performance benefit of this series can be seen in the following graph:

http://xenbits.xen.org/people/royger/plot_indirect.png

Thanks for the review, Roger.



* [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:58   ` [Xen-devel] " Jan Beulich
  2013-02-28 10:28 ` [PATCH RFC 02/12] xen-blkback: fix foreach_grant_safe to handle empty lists Roger Pau Monne
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

The dev_bus_addr returned by the grant map operation is just the mfn of
the passed page, so there's no need to store it in the persistent grant
entry: we can always recompute it as long as we keep the page.

This reduces the memory overhead of persistent grants in blkback.
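
As an illustration, the address that used to be cached in dev_bus_addr
can be recomputed from the grant's page whenever it is needed; this is
the computation the hunk below switches to:

	/* Recompute the segment buffer address from the persistent grant's
	 * page instead of reading a cached dev_bus_addr. */
	seg[i].buf = (pfn_to_mfn(page_to_pfn(persistent_gnts[i]->page))
			<< PAGE_SHIFT) |
		     (req->u.rw.seg[i].first_sect << 9);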

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |    7 +++----
 drivers/block/xen-blkback/common.h  |    1 -
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index de1f319..d40beb3 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -621,9 +621,7 @@ static int xen_blkbk_map(struct blkif_request *req,
 				 * If this is a new persistent grant
 				 * save the handler
 				 */
-				persistent_gnts[i]->handle = map[j].handle;
-				persistent_gnts[i]->dev_bus_addr =
-					map[j++].dev_bus_addr;
+				persistent_gnts[i]->handle = map[j++].handle;
 			}
 			pending_handle(pending_req, i) =
 				persistent_gnts[i]->handle;
@@ -631,7 +629,8 @@ static int xen_blkbk_map(struct blkif_request *req,
 			if (ret)
 				continue;
 
-			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
+			seg[i].buf = pfn_to_mfn(page_to_pfn(
+				persistent_gnts[i]->page)) << PAGE_SHIFT |
 				(req->u.rw.seg[i].first_sect << 9);
 		} else {
 			pending_handle(pending_req, i) = map[j].handle;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 6072390..f338f8a 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -172,7 +172,6 @@ struct persistent_gnt {
 	struct page *page;
 	grant_ref_t gnt;
 	grant_handle_t handle;
-	uint64_t dev_bus_addr;
 	struct rb_node node;
 };
 
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 02/12] xen-blkback: fix foreach_grant_safe to handle empty lists
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 03/12] xen-blkfront: switch from llist to list Roger Pau Monne
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

We may use foreach_grant_safe in the future with empty lists, so make
sure we can handle them.
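
For reference, a minimal sketch of the kind of caller this is meant to
support (mirroring how blkback walks its tree of persistent grants,
here assuming a possibly empty rb_root):

	struct persistent_gnt *persistent_gnt;
	struct rb_node *n;

	/* With the fix, an empty tree simply means the body never runs;
	 * before it, rb_next() was called on a NULL node and dereferenced
	 * it. */
	foreach_grant_safe(persistent_gnt, n, &blkif->persistent_gnts, node) {
		rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
		kfree(persistent_gnt);
	}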

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index d40beb3..415a0c7 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -164,7 +164,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 #define foreach_grant_safe(pos, n, rbtree, node) \
 	for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node), \
-	     (n) = rb_next(&(pos)->node); \
+	     (n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL; \
 	     &(pos)->node != NULL; \
 	     (pos) = container_of(n, typeof(*(pos)), node), \
 	     (n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 03/12] xen-blkfront: switch from llist to list
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 02/12] xen-blkback: fix foreach_grant_safe to handle empty lists Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests Roger Pau Monne
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Replace the use of llist with list.

llist_for_each_entry_safe can trigger a bug in GCC 4.1, so it's best
to remove it and use a doubly linked list, which is used extensively
in the kernel already.

Specifically this bug can be triggered by hot-unplugging a disk, either
by doing xm block-detach or by a save/restore cycle.

BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffffa0047223>] blkif_free+0x63/0x130 [xen_blkfront]
The crash call trace is:
	...
bad_area_nosemaphore+0x13/0x20
do_page_fault+0x25e/0x4b0
page_fault+0x25/0x30
? blkif_free+0x63/0x130 [xen_blkfront]
blkfront_resume+0x46/0xa0 [xen_blkfront]
xenbus_dev_resume+0x6c/0x140
pm_op+0x192/0x1b0
device_resume+0x82/0x1e0
dpm_resume+0xc9/0x1a0
dpm_resume_end+0x15/0x30
do_suspend+0x117/0x1e0

When drilling down to the assembler code, newer GCC produces:
.L29:
        cmpq    $-16, %r12      #, persistent_gnt check
        je      .L30    	#, out of the loop
.L25:
	... code in the loop
        testq   %r13, %r13      # n
        je      .L29    	#, back to the top of the loop
        cmpq    $-16, %r12      #, persistent_gnt check
        movq    16(%r12), %r13  # <variable>.node.next, n
        jne     .L25    	#,	back to the top of the loop
.L30:

While on GCC 4.1, it is:
L78:
	... code in the loop
	testq   %r13, %r13      # n
        je      .L78    #,	back to the top of the loop
        movq    16(%rbx), %r13  # <variable>.node.next, n
        jmp     .L78    #,	back to the top of the loop

Which basically means that the loop exit condition, instead of
being:

	&(pos)->member != NULL;

is:
	;

which makes the loop unbounded.
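
For clarity, the removed llist iteration expands to something of this
shape (a simplified sketch, not the exact kernel macro; all_gnts stands
for the node chain returned by llist_del_all(), as in the removed code):

	struct grant *pos, *n;

	/* Simplified expansion of the llist-based loop: the exit test
	 * compares &pos->node against NULL, but GCC 4.1 assumes a pointer
	 * derived from container_of() can never be NULL and drops the
	 * test, leaving the loop with no termination condition. */
	for (pos = llist_entry(all_gnts, struct grant, node);
	     &pos->node != NULL &&
	     (n = llist_entry(pos->node.next, struct grant, node), true);
	     pos = n) {
		/* ... free pos ... */
	}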

Since we always manipulate the list while holding the io_lock, there's
no need for additional locking (the llist used previously was safe to
use concurrently without additional locking).

Should be backported to 3.8 stable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[Part of the description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkfront.c |   41 ++++++++++++++++++-----------------------
 1 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index c3dae2e..2e39eaf 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -44,7 +44,7 @@
 #include <linux/mutex.h>
 #include <linux/scatterlist.h>
 #include <linux/bitmap.h>
-#include <linux/llist.h>
+#include <linux/list.h>
 
 #include <xen/xen.h>
 #include <xen/xenbus.h>
@@ -68,7 +68,7 @@ enum blkif_state {
 struct grant {
 	grant_ref_t gref;
 	unsigned long pfn;
-	struct llist_node node;
+	struct list_head node;
 };
 
 struct blk_shadow {
@@ -105,7 +105,7 @@ struct blkfront_info
 	struct work_struct work;
 	struct gnttab_free_callback callback;
 	struct blk_shadow shadow[BLK_RING_SIZE];
-	struct llist_head persistent_gnts;
+	struct list_head persistent_gnts;
 	unsigned int persistent_gnts_c;
 	unsigned long shadow_free;
 	unsigned int feature_flush;
@@ -371,10 +371,11 @@ static int blkif_queue_request(struct request *req)
 			lsect = fsect + (sg->length >> 9) - 1;
 
 			if (info->persistent_gnts_c) {
-				BUG_ON(llist_empty(&info->persistent_gnts));
-				gnt_list_entry = llist_entry(
-					llist_del_first(&info->persistent_gnts),
-					struct grant, node);
+				BUG_ON(list_empty(&info->persistent_gnts));
+				gnt_list_entry = list_first_entry(
+				                      &info->persistent_gnts,
+				                      struct grant, node);
+				list_del(&gnt_list_entry->node);
 
 				ref = gnt_list_entry->gref;
 				buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
@@ -790,9 +791,8 @@ static void blkif_restart_queue(struct work_struct *work)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
-	struct llist_node *all_gnts;
-	struct grant *persistent_gnt, *tmp;
-	struct llist_node *n;
+	struct grant *persistent_gnt;
+	struct grant *n;
 
 	/* Prevent new requests being issued until we fix things up. */
 	spin_lock_irq(&info->io_lock);
@@ -804,20 +804,15 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 
 	/* Remove all persistent grants */
 	if (info->persistent_gnts_c) {
-		all_gnts = llist_del_all(&info->persistent_gnts);
-		persistent_gnt = llist_entry(all_gnts, typeof(*(persistent_gnt)), node);
-		while (persistent_gnt) {
+		list_for_each_entry_safe(persistent_gnt, n,
+		                         &info->persistent_gnts, node) {
+			list_del(&persistent_gnt->node);
 			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
 			__free_page(pfn_to_page(persistent_gnt->pfn));
-			tmp = persistent_gnt;
-			n = persistent_gnt->node.next;
-			if (n)
-				persistent_gnt = llist_entry(n, typeof(*(persistent_gnt)), node);
-			else
-				persistent_gnt = NULL;
-			kfree(tmp);
+			kfree(persistent_gnt);
+			info->persistent_gnts_c--;
 		}
-		info->persistent_gnts_c = 0;
+		BUG_ON(info->persistent_gnts_c != 0);
 	}
 
 	/* No more gnttab callback work. */
@@ -875,7 +870,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 	}
 	/* Add the persistent grant into the list of free grants */
 	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
-		llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
+		list_add(&s->grants_used[i]->node, &info->persistent_gnts);
 		info->persistent_gnts_c++;
 	}
 }
@@ -1171,7 +1166,7 @@ static int blkfront_probe(struct xenbus_device *dev,
 	spin_lock_init(&info->io_lock);
 	info->xbdev = dev;
 	info->vdevice = vdevice;
-	init_llist_head(&info->persistent_gnts);
+	INIT_LIST_HEAD(&info->persistent_gnts);
 	info->persistent_gnts_c = 0;
 	info->connected = BLKIF_STATE_DISCONNECTED;
 	INIT_WORK(&info->work, blkif_restart_queue);
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (2 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 03/12] xen-blkfront: switch from llist to list Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-03-04 19:39   ` Konrad Rzeszutek Wilk
  2013-02-28 10:28 ` [PATCH RFC 05/12] xen-blkfront: remove frame list from blk_shadow Roger Pau Monne
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

This prevents us from having to call alloc_page while we are preparing
the request. Since blkfront was calling alloc_page with a spinlock held,
it had to use GFP_ATOMIC, which can fail when requesting a large number
of pages because it draws from the emergency memory pools.

Allocating all the pages at init time avoids calling alloc_page in the
request path altogether, and with it the possible failures.
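
In short, the allocation moves from the request path to setup time; the
two new helpers introduced below end up being used like this (a
condensed sketch of the calls in the patch):

	/* Setup (after the ring is set up): pre-allocate one page per
	 * possible in-flight segment, using GFP_NOIO outside any lock. */
	err = fill_grant_buffer(info,
				BLK_RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST);
	if (err)
		goto out;

	/* Request path (blkif_queue_request, io_lock held): just take a
	 * pre-allocated grant/page pair, no page allocation happens here. */
	gnt_list_entry = get_grant(&gref_head, info);
	ref = gnt_list_entry->gref;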

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
 1 files changed, 79 insertions(+), 41 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e39eaf..5ba6b87 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
 	return 0;
 }
 
+static int fill_grant_buffer(struct blkfront_info *info, int num)
+{
+	struct page *granted_page;
+	struct grant *gnt_list_entry, *n;
+	int i = 0;
+
+	while(i < num) {
+		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
+		if (!gnt_list_entry)
+			goto out_of_memory;
+
+		granted_page = alloc_page(GFP_NOIO);
+		if (!granted_page) {
+			kfree(gnt_list_entry);
+			goto out_of_memory;
+		}
+
+		gnt_list_entry->pfn = page_to_pfn(granted_page);
+		gnt_list_entry->gref = GRANT_INVALID_REF;
+		list_add(&gnt_list_entry->node, &info->persistent_gnts);
+		i++;
+	}
+
+	return 0;
+
+out_of_memory:
+	list_for_each_entry_safe(gnt_list_entry, n,
+	                         &info->persistent_gnts, node) {
+		list_del(&gnt_list_entry->node);
+		__free_page(pfn_to_page(gnt_list_entry->pfn));
+		kfree(gnt_list_entry);
+		i--;
+	}
+	BUG_ON(i != 0);
+	return -ENOMEM;
+}
+
+static struct grant *get_grant(grant_ref_t *gref_head,
+                               struct blkfront_info *info)
+{
+	struct grant *gnt_list_entry;
+	unsigned long buffer_mfn;
+
+	BUG_ON(list_empty(&info->persistent_gnts));
+	gnt_list_entry = list_first_entry(&info->persistent_gnts, struct grant,
+	                                  node);
+	list_del(&gnt_list_entry->node);
+
+	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
+		info->persistent_gnts_c--;
+		return gnt_list_entry;
+	}
+
+	/* Assign a gref to this page */
+	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
+	BUG_ON(gnt_list_entry->gref == -ENOSPC);
+	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
+	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
+	                                info->xbdev->otherend_id,
+	                                buffer_mfn, 0);
+	return gnt_list_entry;
+}
+
 static const char *op_name(int op)
 {
 	static const char *const names[] = {
@@ -306,7 +369,6 @@ static int blkif_queue_request(struct request *req)
 	 */
 	bool new_persistent_gnts;
 	grant_ref_t gref_head;
-	struct page *granted_page;
 	struct grant *gnt_list_entry = NULL;
 	struct scatterlist *sg;
 
@@ -370,42 +432,9 @@ static int blkif_queue_request(struct request *req)
 			fsect = sg->offset >> 9;
 			lsect = fsect + (sg->length >> 9) - 1;
 
-			if (info->persistent_gnts_c) {
-				BUG_ON(list_empty(&info->persistent_gnts));
-				gnt_list_entry = list_first_entry(
-				                      &info->persistent_gnts,
-				                      struct grant, node);
-				list_del(&gnt_list_entry->node);
-
-				ref = gnt_list_entry->gref;
-				buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
-				info->persistent_gnts_c--;
-			} else {
-				ref = gnttab_claim_grant_reference(&gref_head);
-				BUG_ON(ref == -ENOSPC);
-
-				gnt_list_entry =
-					kmalloc(sizeof(struct grant),
-							 GFP_ATOMIC);
-				if (!gnt_list_entry)
-					return -ENOMEM;
-
-				granted_page = alloc_page(GFP_ATOMIC);
-				if (!granted_page) {
-					kfree(gnt_list_entry);
-					return -ENOMEM;
-				}
-
-				gnt_list_entry->pfn =
-					page_to_pfn(granted_page);
-				gnt_list_entry->gref = ref;
-
-				buffer_mfn = pfn_to_mfn(page_to_pfn(
-								granted_page));
-				gnttab_grant_foreign_access_ref(ref,
-					info->xbdev->otherend_id,
-					buffer_mfn, 0);
-			}
+			gnt_list_entry = get_grant(&gref_head, info);
+			ref = gnt_list_entry->gref;
+			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
 
 			info->shadow[id].grants_used[i] = gnt_list_entry;
 
@@ -803,17 +832,20 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		blk_stop_queue(info->rq);
 
 	/* Remove all persistent grants */
-	if (info->persistent_gnts_c) {
+	if (!list_empty(&info->persistent_gnts)) {
 		list_for_each_entry_safe(persistent_gnt, n,
 		                         &info->persistent_gnts, node) {
 			list_del(&persistent_gnt->node);
-			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+			if (persistent_gnt->gref != GRANT_INVALID_REF) {
+				gnttab_end_foreign_access(persistent_gnt->gref,
+				                          0, 0UL);
+				info->persistent_gnts_c--;
+			}
 			__free_page(pfn_to_page(persistent_gnt->pfn));
 			kfree(persistent_gnt);
-			info->persistent_gnts_c--;
 		}
-		BUG_ON(info->persistent_gnts_c != 0);
 	}
+	BUG_ON(info->persistent_gnts_c != 0);
 
 	/* No more gnttab callback work. */
 	gnttab_cancel_free_callback(&info->callback);
@@ -1088,6 +1120,12 @@ again:
 		goto destroy_blkring;
 	}
 
+	/* Allocate memory for grants */
+	err = fill_grant_buffer(info, BLK_RING_SIZE *
+	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
+	if (err)
+		goto out;
+
 	xenbus_switch_state(dev, XenbusStateInitialised);
 
 	return 0;
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 05/12] xen-blkfront: remove frame list from blk_shadow
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (3 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants Roger Pau Monne
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

We already have the frame (the pfn of the granted page) stored inside
struct grant, so there's no need to keep an additional list of mapped
frames for a specific request. This reduces memory usage in blkfront.
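
For illustration, the user of the removed array in the recovery path can
take the mfn straight from the stored grant, which is what the
blkif_recover hunk below does:

	/* Recover path: re-grant access using the pfn kept in struct grant,
	 * instead of the per-request frame[] copy. */
	gnttab_grant_foreign_access_ref(req->u.rw.seg[j].gref,
					info->xbdev->otherend_id,
					pfn_to_mfn(copy[i].grants_used[j]->pfn),
					0);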

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkfront.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5ba6b87..4d81fcc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -74,7 +74,6 @@ struct grant {
 struct blk_shadow {
 	struct blkif_request req;
 	struct request *request;
-	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 };
 
@@ -356,7 +355,6 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
 static int blkif_queue_request(struct request *req)
 {
 	struct blkfront_info *info = req->rq_disk->private_data;
-	unsigned long buffer_mfn;
 	struct blkif_request *ring_req;
 	unsigned long id;
 	unsigned int fsect, lsect;
@@ -434,7 +432,6 @@ static int blkif_queue_request(struct request *req)
 
 			gnt_list_entry = get_grant(&gref_head, info);
 			ref = gnt_list_entry->gref;
-			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
 
 			info->shadow[id].grants_used[i] = gnt_list_entry;
 
@@ -465,7 +462,6 @@ static int blkif_queue_request(struct request *req)
 				kunmap_atomic(shared_data);
 			}
 
-			info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
 			ring_req->u.rw.seg[i] =
 					(struct blkif_request_segment) {
 						.gref       = ref,
@@ -1269,7 +1265,7 @@ static int blkif_recover(struct blkfront_info *info)
 				gnttab_grant_foreign_access_ref(
 					req->u.rw.seg[j].gref,
 					info->xbdev->otherend_id,
-					pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
+					pfn_to_mfn(copy[i].grants_used[j]->pfn),
 					0);
 		}
 		info->shadow[req->u.rw.id].req = *req;
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (4 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 05/12] xen-blkfront: remove frame list from blk_shadow Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-03-04 20:10   ` Konrad Rzeszutek Wilk
  2013-02-28 10:28 ` [PATCH RFC 07/12] xen-blkback: print stats about " Roger Pau Monne
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

This mechanism allows blkback to change the number of grants
persistently mapped at run time.

The algorithm uses a simple LRU mechanism that removes (if needed) the
persistent grants that have not been used since the last LRU run; if
all grants have been used, it removes the first grants in the list
that are not currently in use.

The algorithm has several parameters that can be tuned by the user
from sysfs:

 * max_persistent_grants: maximum number of grants that will be
   persistently mapped.
 * lru_interval: minimum interval (in ms) at which the LRU should be
   run.
 * lru_num_clean: number of persistent grants to remove when executing
   the LRU.
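
Putting these tunables together, the purge size requested from each LRU
pass is derived roughly as follows (a sketch of the computation added to
xen_blkif_schedule in this patch):

	/* How many grants one LRU pass tries to drop: enough to get back
	 * under max_persistent_grants, plus lru_num_clean as extra margin,
	 * but never more than the number of grants currently mapped. */
	rq_purge = blkif->persistent_gnt_c - xen_blkif_max_pgrants +
		   xen_blkif_lru_num_clean;
	if (rq_purge > blkif->persistent_gnt_c)
		rq_purge = blkif->persistent_gnt_c;
	purged = purge_persistent_gnt(&blkif->persistent_gnts, rq_purge);
	blkif->persistent_gnt_c -= purged;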

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |  207 +++++++++++++++++++++++++++--------
 drivers/block/xen-blkback/common.h  |    4 +
 drivers/block/xen-blkback/xenbus.c  |    1 +
 3 files changed, 166 insertions(+), 46 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 415a0c7..c14b736 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -63,6 +63,44 @@ static int xen_blkif_reqs = 64;
 module_param_named(reqs, xen_blkif_reqs, int, 0);
 MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
 
+/*
+ * Maximum number of grants to map persistently in blkback. For maximum
+ * performance this should be the total numbers of grants that can be used
+ * to fill the ring, but since this might become too high, specially with
+ * the use of indirect descriptors, we set it to a value that provides good
+ * performance without using too much memory.
+ *
+ * When the list of persistent grants is full we clean it using a LRU
+ * algorithm.
+ */
+
+static int xen_blkif_max_pgrants = 352;
+module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
+MODULE_PARM_DESC(max_persistent_grants,
+                 "Maximum number of grants to map persistently");
+
+/*
+ * The LRU mechanism to clean the lists of persistent grants needs to
+ * be executed periodically. The time interval between consecutive executions
+ * of the purge mechanism is set in ms.
+ */
+
+static int xen_blkif_lru_interval = 100;
+module_param_named(lru_interval, xen_blkif_lru_interval, int, 0644);
+MODULE_PARM_DESC(lru_interval,
+"Execution interval (in ms) of the LRU mechanism to clean the list of persistent grants");
+
+/*
+ * When the persistent grants list is full we will remove unused grants
+ * from the list. The number of grants to be removed at each LRU execution
+ * can be set dynamically.
+ */
+
+static int xen_blkif_lru_num_clean = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
+MODULE_PARM_DESC(lru_num_clean,
+"Number of persistent grants to unmap when the list is full");
+
 /* Run-time switchable: /sys/module/blkback/parameters/ */
 static unsigned int log_stats;
 module_param(log_stats, int, 0644);
@@ -81,7 +119,7 @@ struct pending_req {
 	unsigned short		operation;
 	int			status;
 	struct list_head	free_list;
-	DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 };
 
 #define BLKBACK_INVALID_HANDLE (~0)
@@ -102,36 +140,6 @@ struct xen_blkbk {
 static struct xen_blkbk *blkbk;
 
 /*
- * Maximum number of grant pages that can be mapped in blkback.
- * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
- * pages that blkback will persistently map.
- * Currently, this is:
- * RING_SIZE = 32 (for all known ring types)
- * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
- * sizeof(struct persistent_gnt) = 48
- * So the maximum memory used to store the grants is:
- * 32 * 11 * 48 = 16896 bytes
- */
-static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
-{
-	switch (protocol) {
-	case BLKIF_PROTOCOL_NATIVE:
-		return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
-			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
-	case BLKIF_PROTOCOL_X86_32:
-		return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
-			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
-	case BLKIF_PROTOCOL_X86_64:
-		return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
-			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
-	default:
-		BUG();
-	}
-	return 0;
-}
-
-
-/*
  * Little helpful macro to figure out the index and virtual address of the
  * pending_pages[..]. For each 'pending_req' we have have up to
  * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
@@ -251,6 +259,76 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
 	BUG_ON(num != 0);
 }
 
+static int purge_persistent_gnt(struct rb_root *root, int num)
+{
+	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct persistent_gnt *persistent_gnt;
+	struct rb_node *n;
+	int ret, segs_to_unmap = 0;
+	int requested_num = num;
+	int preserve_used = 1;
+
+	pr_debug("Requested the purge of %d persistent grants\n", num);
+
+purge_list:
+	foreach_grant_safe(persistent_gnt, n, root, node) {
+		BUG_ON(persistent_gnt->handle ==
+			BLKBACK_INVALID_HANDLE);
+
+		if (persistent_gnt->flags & PERSISTENT_GNT_ACTIVE)
+			continue;
+		if (preserve_used &&
+		    (persistent_gnt->flags & PERSISTENT_GNT_USED))
+			continue;
+
+		gnttab_set_unmap_op(&unmap[segs_to_unmap],
+			(unsigned long) pfn_to_kaddr(page_to_pfn(
+				persistent_gnt->page)),
+			GNTMAP_host_map,
+			persistent_gnt->handle);
+
+		pages[segs_to_unmap] = persistent_gnt->page;
+
+		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+			ret = gnttab_unmap_refs(unmap, NULL, pages,
+				segs_to_unmap);
+			BUG_ON(ret);
+			free_xenballooned_pages(segs_to_unmap, pages);
+			segs_to_unmap = 0;
+		}
+
+		rb_erase(&persistent_gnt->node, root);
+		kfree(persistent_gnt);
+		if (--num == 0)
+			goto finished;
+	}
+	/*
+	 * If we get here it means we also need to start cleaning
+	 * grants that were used since last purge in order to cope
+	 * with the requested num
+	 */
+	if (preserve_used) {
+		pr_debug("Still missing %d purged frames\n", num);
+		preserve_used = 0;
+		goto purge_list;
+	}
+finished:
+	if (segs_to_unmap > 0) {
+		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
+		BUG_ON(ret);
+		free_xenballooned_pages(segs_to_unmap, pages);
+	}
+	/* Finally remove the "used" flag from all the persistent grants */
+	foreach_grant_safe(persistent_gnt, n, root, node) {
+		BUG_ON(persistent_gnt->handle ==
+			BLKBACK_INVALID_HANDLE);
+		persistent_gnt->flags &= ~PERSISTENT_GNT_USED;
+	}
+	pr_debug("Purged %d/%d\n", (requested_num - num), requested_num);
+	return (requested_num - num);
+}
+
 /*
  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
  */
@@ -397,6 +475,8 @@ int xen_blkif_schedule(void *arg)
 {
 	struct xen_blkif *blkif = arg;
 	struct xen_vbd *vbd = &blkif->vbd;
+	int rq_purge, purged;
+	unsigned long timeout;
 
 	xen_blkif_get(blkif);
 
@@ -406,13 +486,21 @@ int xen_blkif_schedule(void *arg)
 		if (unlikely(vbd->size != vbd_sz(vbd)))
 			xen_vbd_resize(blkif);
 
-		wait_event_interruptible(
+		timeout = msecs_to_jiffies(xen_blkif_lru_interval);
+
+		timeout = wait_event_interruptible_timeout(
 			blkif->wq,
-			blkif->waiting_reqs || kthread_should_stop());
-		wait_event_interruptible(
+			blkif->waiting_reqs || kthread_should_stop(),
+			timeout);
+		if (timeout == 0)
+			goto purge_gnt_list;
+		timeout = wait_event_interruptible_timeout(
 			blkbk->pending_free_wq,
 			!list_empty(&blkbk->pending_free) ||
-			kthread_should_stop());
+			kthread_should_stop(),
+			timeout);
+		if (timeout == 0)
+			goto purge_gnt_list;
 
 		blkif->waiting_reqs = 0;
 		smp_mb(); /* clear flag *before* checking for work */
@@ -420,6 +508,32 @@ int xen_blkif_schedule(void *arg)
 		if (do_block_io_op(blkif))
 			blkif->waiting_reqs = 1;
 
+purge_gnt_list:
+		if (blkif->vbd.feature_gnt_persistent &&
+		    time_after(jiffies, blkif->next_lru)) {
+			/* Clean the list of persistent grants */
+			if (blkif->persistent_gnt_c > xen_blkif_max_pgrants ||
+			    (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
+			     blkif->vbd.overflow_max_grants)) {
+				rq_purge = blkif->persistent_gnt_c -
+				           xen_blkif_max_pgrants +
+				           xen_blkif_lru_num_clean;
+				rq_purge = rq_purge > blkif->persistent_gnt_c ?
+				           blkif->persistent_gnt_c : rq_purge;
+				purged = purge_persistent_gnt(
+					  &blkif->persistent_gnts, rq_purge);
+				if (purged != rq_purge)
+					pr_debug(DRV_PFX " unable to meet persistent grants purge requirements for device %#x, domain %u, requested %d done %d\n",
+					         blkif->domid,
+					         blkif->vbd.handle,
+					         rq_purge, purged);
+				blkif->persistent_gnt_c -= purged;
+				blkif->vbd.overflow_max_grants = 0;
+			}
+			blkif->next_lru = jiffies +
+			        msecs_to_jiffies(xen_blkif_lru_interval);
+		}
+
 		if (log_stats && time_after(jiffies, blkif->st_print))
 			print_stats(blkif);
 	}
@@ -453,13 +567,18 @@ static void xen_blkbk_unmap(struct pending_req *req)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct persistent_gnt *persistent_gnt;
 	unsigned int i, invcount = 0;
 	grant_handle_t handle;
 	int ret;
 
 	for (i = 0; i < req->nr_pages; i++) {
-		if (!test_bit(i, req->unmap_seg))
+		if (req->persistent_gnts[i] != NULL) {
+			persistent_gnt = req->persistent_gnts[i];
+			persistent_gnt->flags |= PERSISTENT_GNT_USED;
+			persistent_gnt->flags &= ~PERSISTENT_GNT_ACTIVE;
 			continue;
+		}
 		handle = pending_handle(req, i);
 		if (handle == BLKBACK_INVALID_HANDLE)
 			continue;
@@ -480,8 +599,8 @@ static int xen_blkbk_map(struct blkif_request *req,
 			 struct page *pages[])
 {
 	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
 	struct persistent_gnt *persistent_gnt = NULL;
 	struct xen_blkif *blkif = pending_req->blkif;
 	phys_addr_t addr = 0;
@@ -494,9 +613,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 
 	use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
 
-	BUG_ON(blkif->persistent_gnt_c >
-		   max_mapped_grant_pages(pending_req->blkif->blk_protocol));
-
 	/*
 	 * Fill out preq.nr_sects with proper amount of sectors, and setup
 	 * assign map[..] with the PFN of the page in our domain with the
@@ -516,9 +632,9 @@ static int xen_blkbk_map(struct blkif_request *req,
 			 * the grant is already mapped
 			 */
 			new_map = false;
+			persistent_gnt->flags |= PERSISTENT_GNT_ACTIVE;
 		} else if (use_persistent_gnts &&
-			   blkif->persistent_gnt_c <
-			   max_mapped_grant_pages(blkif->blk_protocol)) {
+			   blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
 			/*
 			 * We are using persistent grants, the grant is
 			 * not mapped but we have room for it
@@ -536,6 +652,7 @@ static int xen_blkbk_map(struct blkif_request *req,
 			}
 			persistent_gnt->gnt = req->u.rw.seg[i].gref;
 			persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
+			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
 
 			pages_to_gnt[segs_to_map] =
 				persistent_gnt->page;
@@ -547,7 +664,7 @@ static int xen_blkbk_map(struct blkif_request *req,
 			blkif->persistent_gnt_c++;
 			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
 				 persistent_gnt->gnt, blkif->persistent_gnt_c,
-				 max_mapped_grant_pages(blkif->blk_protocol));
+				 xen_blkif_max_pgrants);
 		} else {
 			/*
 			 * We are either using persistent grants and
@@ -557,7 +674,7 @@ static int xen_blkbk_map(struct blkif_request *req,
 			if (use_persistent_gnts &&
 				!blkif->vbd.overflow_max_grants) {
 				blkif->vbd.overflow_max_grants = 1;
-				pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
+				pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
 					 blkif->domid, blkif->vbd.handle);
 			}
 			new_map = true;
@@ -595,7 +712,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 	 * so that when we access vaddr(pending_req,i) it has the contents of
 	 * the page from the other domain.
 	 */
-	bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
 	for (i = 0, j = 0; i < nseg; i++) {
 		if (!persistent_gnts[i] ||
 		    persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
@@ -634,7 +750,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 				(req->u.rw.seg[i].first_sect << 9);
 		} else {
 			pending_handle(pending_req, i) = map[j].handle;
-			bitmap_set(pending_req->unmap_seg, i, 1);
 
 			if (ret) {
 				j++;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index f338f8a..bd44d75 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -167,11 +167,14 @@ struct xen_vbd {
 
 struct backend_info;
 
+#define PERSISTENT_GNT_ACTIVE	0x1
+#define PERSISTENT_GNT_USED		0x2
 
 struct persistent_gnt {
 	struct page *page;
 	grant_ref_t gnt;
 	grant_handle_t handle;
+	uint8_t flags;
 	struct rb_node node;
 };
 
@@ -204,6 +207,7 @@ struct xen_blkif {
 	/* tree to store persistent grants */
 	struct rb_root		persistent_gnts;
 	unsigned int		persistent_gnt_c;
+	unsigned long		next_lru;
 
 	/* statistics */
 	unsigned long		st_print;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 5e237f6..abb399a 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -116,6 +116,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 	init_completion(&blkif->drain_complete);
 	atomic_set(&blkif->drain, 0);
 	blkif->st_print = jiffies;
+	blkif->next_lru = jiffies;
 	init_waitqueue_head(&blkif->waiting_to_free);
 	blkif->persistent_gnts.rb_node = NULL;
 
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 07/12] xen-blkback: print stats about persistent grants
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (5 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings Roger Pau Monne
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c14b736..b5e7495 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -460,10 +460,11 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
 static void print_stats(struct xen_blkif *blkif)
 {
 	pr_info("xen-blkback (%s): oo %3d  |  rd %4d  |  wr %4d  |  f %4d"
-		 "  |  ds %4d\n",
+		 "  |  ds %4d | pg: %4d/%4d\n",
 		 current->comm, blkif->st_oo_req,
 		 blkif->st_rd_req, blkif->st_wr_req,
-		 blkif->st_f_req, blkif->st_ds_req);
+		 blkif->st_f_req, blkif->st_ds_req,
+		 blkif->persistent_gnt_c, xen_blkif_max_pgrants);
 	blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
 	blkif->st_rd_req = 0;
 	blkif->st_wr_req = 0;
-- 
1.7.7.5 (Apple Git-26)



* [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (6 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 07/12] xen-blkback: print stats about " Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-03-04 20:22   ` Konrad Rzeszutek Wilk
  2013-02-28 10:28 ` [PATCH RFC 09/12] xen-blkback: move pending handles list from blkbk to pending_req Roger Pau Monne
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Using balloon pages for all granted pages allows us to simplify the
logic in blkback, especially in the xen_blkbk_map function, since now
we can decide whether to map a grant persistently after we have
actually mapped it. This could not be done before because persistent
grants used ballooned pages while non-persistent grants used pages
from the kernel.

This patch also introduces several changes. The first one is that the
list of free pages is no longer global: each blkback instance now has
its own list of free pages that can be used to map grants. Also, a
run-time parameter (max_buffer_pages) has been added in order to tune
the maximum number of free pages each blkback instance will keep in
its buffer.
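
A condensed view of the per-backend page pool added here (helper names
as in the patch): pages are taken from the pool when available,
ballooned in otherwise, and the pool is periodically trimmed back to
max_buffer_pages:

	struct page *page;

	/* Map path: grab a free page from this blkif's pool, falling back
	 * to ballooning a new one in if the pool is empty. */
	if (get_free_page(blkif, &page))
		goto out_of_memory;

	/* Unmap path: return pages to the pool instead of freeing them. */
	put_free_pages(blkif, &page, 1);

	/* From the backend kthread: shrink the pool down to the run-time
	 * limit set via the max_buffer_pages module parameter. */
	remove_free_pages(blkif, xen_blkif_max_buffer_pages);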

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |  278 +++++++++++++++++++----------------
 drivers/block/xen-blkback/common.h  |    5 +
 drivers/block/xen-blkback/xenbus.c  |    3 +
 3 files changed, 159 insertions(+), 127 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index b5e7495..ba27fc3 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -101,6 +101,21 @@ module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
 MODULE_PARM_DESC(lru_num_clean,
 "Number of persistent grants to unmap when the list is full");
 
+/*
+ * Maximum number of unused free pages to keep in the internal buffer.
+ * Setting this to a value too low will reduce memory used in each backend,
+ * but can have a performance penalty.
+ *
+ * A sane value is xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST, but can
+ * be set to a lower value that might degrade performance on some intensive
+ * IO workloads.
+ */
+
+static int xen_blkif_max_buffer_pages = 1024;
+module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
+MODULE_PARM_DESC(max_buffer_pages,
+"Maximum number of free pages to keep in each block backend buffer");
+
 /* Run-time switchable: /sys/module/blkback/parameters/ */
 static unsigned int log_stats;
 module_param(log_stats, int, 0644);
@@ -120,6 +135,7 @@ struct pending_req {
 	int			status;
 	struct list_head	free_list;
 	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 };
 
 #define BLKBACK_INVALID_HANDLE (~0)
@@ -131,8 +147,6 @@ struct xen_blkbk {
 	/* And its spinlock. */
 	spinlock_t		pending_free_lock;
 	wait_queue_head_t	pending_free_wq;
-	/* The list of all pages that are available. */
-	struct page		**pending_pages;
 	/* And the grant handles that are available. */
 	grant_handle_t		*pending_grant_handles;
 };
@@ -151,14 +165,66 @@ static inline int vaddr_pagenr(struct pending_req *req, int seg)
 		BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
 }
 
-#define pending_page(req, seg) pending_pages[vaddr_pagenr(req, seg)]
+static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkif->free_pages_lock, flags);
+	if (list_empty(&blkif->free_pages)) {
+		BUG_ON(blkif->free_pages_num != 0);
+		spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+		return alloc_xenballooned_pages(1, page, false);
+	}
+	BUG_ON(blkif->free_pages_num == 0);
+	page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+	list_del(&page[0]->lru);
+	blkif->free_pages_num--;
+	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+
+	return 0;
+}
+
+static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
+                                  int num)
+{
+	unsigned long flags;
+	int i;
+
+	spin_lock_irqsave(&blkif->free_pages_lock, flags);
+	for (i = 0; i < num; i++)
+		list_add(&page[i]->lru, &blkif->free_pages);
+	blkif->free_pages_num += num;
+	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+}
 
-static inline unsigned long vaddr(struct pending_req *req, int seg)
+static inline void remove_free_pages(struct xen_blkif *blkif, int num)
 {
-	unsigned long pfn = page_to_pfn(blkbk->pending_page(req, seg));
-	return (unsigned long)pfn_to_kaddr(pfn);
+	/* Remove requested pages in batches of 10 */
+	struct page *page[10];
+	unsigned long flags;
+	int num_pages = 0;
+
+	spin_lock_irqsave(&blkif->free_pages_lock, flags);
+	while (blkif->free_pages_num > num) {
+		BUG_ON(list_empty(&blkif->free_pages));
+		page[num_pages] = list_first_entry(&blkif->free_pages,
+		                                   struct page, lru);
+		list_del(&page[num_pages]->lru);
+		blkif->free_pages_num--;
+		if (++num_pages == 10) {
+			spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+			free_xenballooned_pages(num_pages, page);
+			spin_lock_irqsave(&blkif->free_pages_lock, flags);
+			num_pages = 0;
+		}
+	}
+	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+	if (num_pages != 0)
+		free_xenballooned_pages(num_pages, page);
 }
 
+#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
+
 #define pending_handle(_req, _seg) \
 	(blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)])
 
@@ -178,7 +244,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	     (n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)
 
 
-static void add_persistent_gnt(struct rb_root *root,
+static int add_persistent_gnt(struct rb_root *root,
 			       struct persistent_gnt *persistent_gnt)
 {
 	struct rb_node **new = &(root->rb_node), *parent = NULL;
@@ -194,14 +260,15 @@ static void add_persistent_gnt(struct rb_root *root,
 		else if (persistent_gnt->gnt > this->gnt)
 			new = &((*new)->rb_right);
 		else {
-			pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
-			BUG();
+			pr_alert_ratelimited(DRV_PFX " trying to add a gref that's already in the tree\n");
+			return -EINVAL;
 		}
 	}
 
 	/* Add new node and rebalance tree. */
 	rb_link_node(&(persistent_gnt->node), parent, new);
 	rb_insert_color(&(persistent_gnt->node), root);
+	return 0;
 }
 
 static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
@@ -223,7 +290,8 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
 	return NULL;
 }
 
-static void free_persistent_gnts(struct rb_root *root, unsigned int num)
+static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
+                                 unsigned int num)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -248,7 +316,7 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			free_xenballooned_pages(segs_to_unmap, pages);
+			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 
@@ -259,7 +327,8 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
 	BUG_ON(num != 0);
 }
 
-static int purge_persistent_gnt(struct rb_root *root, int num)
+static int purge_persistent_gnt(struct xen_blkif *blkif, struct rb_root *root,
+                                int num)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -294,7 +363,7 @@ purge_list:
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
 				segs_to_unmap);
 			BUG_ON(ret);
-			free_xenballooned_pages(segs_to_unmap, pages);
+			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
 		}
 
@@ -317,7 +386,7 @@ finished:
 	if (segs_to_unmap > 0) {
 		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
 		BUG_ON(ret);
-		free_xenballooned_pages(segs_to_unmap, pages);
+		put_free_pages(blkif, pages, segs_to_unmap);
 	}
 	/* Finally remove the "used" flag from all the persistent grants */
 	foreach_grant_safe(persistent_gnt, n, root, node) {
@@ -521,7 +590,7 @@ purge_gnt_list:
 				           xen_blkif_lru_num_clean;
 				rq_purge = rq_purge > blkif->persistent_gnt_c ?
 				           blkif->persistent_gnt_c : rq_purge;
-				purged = purge_persistent_gnt(
+				purged = purge_persistent_gnt(blkif,
 					  &blkif->persistent_gnts, rq_purge);
 				if (purged != rq_purge)
 					pr_debug(DRV_PFX " unable to meet persistent grants purge requirements for device %#x, domain %u, requested %d done %d\n",
@@ -535,13 +604,17 @@ purge_gnt_list:
 			        msecs_to_jiffies(xen_blkif_lru_interval);
 		}
 
+		remove_free_pages(blkif, xen_blkif_max_buffer_pages);
+
 		if (log_stats && time_after(jiffies, blkif->st_print))
 			print_stats(blkif);
 	}
 
+	remove_free_pages(blkif, 0);
+
 	/* Free all persistent grant pages */
 	if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
-		free_persistent_gnts(&blkif->persistent_gnts,
+		free_persistent_gnts(blkif, &blkif->persistent_gnts,
 			blkif->persistent_gnt_c);
 
 	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
@@ -571,6 +644,7 @@ static void xen_blkbk_unmap(struct pending_req *req)
 	struct persistent_gnt *persistent_gnt;
 	unsigned int i, invcount = 0;
 	grant_handle_t handle;
+	struct xen_blkif *blkif = req->blkif;
 	int ret;
 
 	for (i = 0; i < req->nr_pages; i++) {
@@ -581,17 +655,18 @@ static void xen_blkbk_unmap(struct pending_req *req)
 			continue;
 		}
 		handle = pending_handle(req, i);
+		pages[invcount] = req->pages[i];
 		if (handle == BLKBACK_INVALID_HANDLE)
 			continue;
-		gnttab_set_unmap_op(&unmap[invcount], vaddr(req, i),
+		gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[invcount]),
 				    GNTMAP_host_map, handle);
 		pending_handle(req, i) = BLKBACK_INVALID_HANDLE;
-		pages[invcount] = virt_to_page(vaddr(req, i));
 		invcount++;
 	}
 
 	ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
 	BUG_ON(ret);
+	put_free_pages(blkif, pages, invcount);
 }
 
 static int xen_blkbk_map(struct blkif_request *req,
@@ -606,7 +681,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 	struct xen_blkif *blkif = pending_req->blkif;
 	phys_addr_t addr = 0;
 	int i, j;
-	bool new_map;
 	int nseg = req->u.rw.nr_segments;
 	int segs_to_map = 0;
 	int ret = 0;
@@ -632,69 +706,17 @@ static int xen_blkbk_map(struct blkif_request *req,
 			 * We are using persistent grants and
 			 * the grant is already mapped
 			 */
-			new_map = false;
 			persistent_gnt->flags |= PERSISTENT_GNT_ACTIVE;
-		} else if (use_persistent_gnts &&
-			   blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
-			/*
-			 * We are using persistent grants, the grant is
-			 * not mapped but we have room for it
-			 */
-			new_map = true;
-			persistent_gnt = kmalloc(
-				sizeof(struct persistent_gnt),
-				GFP_KERNEL);
-			if (!persistent_gnt)
-				return -ENOMEM;
-			if (alloc_xenballooned_pages(1, &persistent_gnt->page,
-			    false)) {
-				kfree(persistent_gnt);
-				return -ENOMEM;
-			}
-			persistent_gnt->gnt = req->u.rw.seg[i].gref;
-			persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
-			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
-
-			pages_to_gnt[segs_to_map] =
-				persistent_gnt->page;
-			addr = (unsigned long) pfn_to_kaddr(
-				page_to_pfn(persistent_gnt->page));
-
-			add_persistent_gnt(&blkif->persistent_gnts,
-				persistent_gnt);
-			blkif->persistent_gnt_c++;
-			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
-				 persistent_gnt->gnt, blkif->persistent_gnt_c,
-				 xen_blkif_max_pgrants);
-		} else {
-			/*
-			 * We are either using persistent grants and
-			 * hit the maximum limit of grants mapped,
-			 * or we are not using persistent grants.
-			 */
-			if (use_persistent_gnts &&
-				!blkif->vbd.overflow_max_grants) {
-				blkif->vbd.overflow_max_grants = 1;
-				pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
-					 blkif->domid, blkif->vbd.handle);
-			}
-			new_map = true;
-			pages[i] = blkbk->pending_page(pending_req, i);
-			addr = vaddr(pending_req, i);
-			pages_to_gnt[segs_to_map] =
-				blkbk->pending_page(pending_req, i);
-		}
-
-		if (persistent_gnt) {
 			pages[i] = persistent_gnt->page;
 			persistent_gnts[i] = persistent_gnt;
 		} else {
+			if (get_free_page(blkif, &pages[i]))
+				goto out_of_memory;
+			addr = vaddr(pages[i]);
+			pages_to_gnt[segs_to_map] = pages[i];
 			persistent_gnts[i] = NULL;
-		}
-
-		if (new_map) {
 			flags = GNTMAP_host_map;
-			if (!persistent_gnt &&
+			if (!use_persistent_gnts &&
 			    (pending_req->operation != BLKIF_OP_READ))
 				flags |= GNTMAP_readonly;
 			gnttab_set_map_op(&map[segs_to_map++], addr,
@@ -714,54 +736,71 @@ static int xen_blkbk_map(struct blkif_request *req,
 	 * the page from the other domain.
 	 */
 	for (i = 0, j = 0; i < nseg; i++) {
-		if (!persistent_gnts[i] ||
-		    persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
+		if (!persistent_gnts[i]) {
 			/* This is a newly mapped grant */
 			BUG_ON(j >= segs_to_map);
 			if (unlikely(map[j].status != 0)) {
 				pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
-				map[j].handle = BLKBACK_INVALID_HANDLE;
+				pending_handle(pending_req, i) =
+					BLKBACK_INVALID_HANDLE;
 				ret |= 1;
-				if (persistent_gnts[i]) {
-					rb_erase(&persistent_gnts[i]->node,
-						 &blkif->persistent_gnts);
-					blkif->persistent_gnt_c--;
-					kfree(persistent_gnts[i]);
-					persistent_gnts[i] = NULL;
-				}
+				j++;
+				continue;
 			}
+			pending_handle(pending_req, i) = map[j].handle;
 		}
-		if (persistent_gnts[i]) {
-			if (persistent_gnts[i]->handle ==
-			    BLKBACK_INVALID_HANDLE) {
+		if (persistent_gnts[i])
+			goto next;
+		if (use_persistent_gnts &&
+		    blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
+			/*
+			 * We are using persistent grants, the grant is
+			 * not mapped but we have room for it
+			 */
+			persistent_gnt = kmalloc(sizeof(struct persistent_gnt),
+				                 GFP_KERNEL);
+			if (!persistent_gnt) {
 				/*
-				 * If this is a new persistent grant
-				 * save the handler
+				 * If we don't have enough memory to
+				 * allocate the persistent_gnt struct
+				 * map this grant non-persistenly
 				 */
-				persistent_gnts[i]->handle = map[j++].handle;
-			}
-			pending_handle(pending_req, i) =
-				persistent_gnts[i]->handle;
-
-			if (ret)
-				continue;
-
-			seg[i].buf = pfn_to_mfn(page_to_pfn(
-				persistent_gnts[i]->page)) << PAGE_SHIFT |
-				(req->u.rw.seg[i].first_sect << 9);
-		} else {
-			pending_handle(pending_req, i) = map[j].handle;
-
-			if (ret) {
 				j++;
-				continue;
+				goto next;
 			}
-
-			seg[i].buf = map[j++].dev_bus_addr |
-				(req->u.rw.seg[i].first_sect << 9);
+			persistent_gnt->gnt = map[j].ref;
+			persistent_gnt->handle = map[j].handle;
+			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
+			persistent_gnt->page = pages[i];
+			if (add_persistent_gnt(&blkif->persistent_gnts,
+			                       persistent_gnt)) {
+				kfree(persistent_gnt);
+				goto next;
+			}
+			blkif->persistent_gnt_c++;
+			persistent_gnts[i] = persistent_gnt;
+			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
+				 persistent_gnt->gnt, blkif->persistent_gnt_c,
+				 xen_blkif_max_pgrants);
+			j++;
+			goto next;
+		}
+		if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
+			blkif->vbd.overflow_max_grants = 1;
+			pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
+			         blkif->domid, blkif->vbd.handle);
 		}
+		j++;
+next:
+		seg[i].buf = pfn_to_mfn(page_to_pfn(pages[i])) << PAGE_SHIFT |
+		             (req->u.rw.seg[i].first_sect << 9);
 	}
 	return ret;
+
+out_of_memory:
+	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
+	put_free_pages(blkif, pages_to_gnt, segs_to_map);
+	return -ENOMEM;
 }
 
 static int dispatch_discard_io(struct xen_blkif *blkif,
@@ -962,7 +1001,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	int operation;
 	struct blk_plug plug;
 	bool drain = false;
-	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page **pages = pending_req->pages;
 
 	switch (req->operation) {
 	case BLKIF_OP_READ:
@@ -1193,22 +1232,14 @@ static int __init xen_blkif_init(void)
 					xen_blkif_reqs, GFP_KERNEL);
 	blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
 					mmap_pages, GFP_KERNEL);
-	blkbk->pending_pages         = kzalloc(sizeof(blkbk->pending_pages[0]) *
-					mmap_pages, GFP_KERNEL);
 
-	if (!blkbk->pending_reqs || !blkbk->pending_grant_handles ||
-	    !blkbk->pending_pages) {
+	if (!blkbk->pending_reqs || !blkbk->pending_grant_handles) {
 		rc = -ENOMEM;
 		goto out_of_memory;
 	}
 
 	for (i = 0; i < mmap_pages; i++) {
 		blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
-		blkbk->pending_pages[i] = alloc_page(GFP_KERNEL);
-		if (blkbk->pending_pages[i] == NULL) {
-			rc = -ENOMEM;
-			goto out_of_memory;
-		}
 	}
 	rc = xen_blkif_interface_init();
 	if (rc)
@@ -1233,13 +1264,6 @@ static int __init xen_blkif_init(void)
  failed_init:
 	kfree(blkbk->pending_reqs);
 	kfree(blkbk->pending_grant_handles);
-	if (blkbk->pending_pages) {
-		for (i = 0; i < mmap_pages; i++) {
-			if (blkbk->pending_pages[i])
-				__free_page(blkbk->pending_pages[i]);
-		}
-		kfree(blkbk->pending_pages);
-	}
 	kfree(blkbk);
 	blkbk = NULL;
 	return rc;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index bd44d75..604bd30 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -209,6 +209,11 @@ struct xen_blkif {
 	unsigned int		persistent_gnt_c;
 	unsigned long		next_lru;
 
+	/* buffer of free pages to map grant refs */
+	spinlock_t		free_pages_lock;
+	int			free_pages_num;
+	struct list_head	free_pages;
+
 	/* statistics */
 	unsigned long		st_print;
 	int			st_rd_req;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index abb399a..d7926ec 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -119,6 +119,9 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 	blkif->next_lru = jiffies;
 	init_waitqueue_head(&blkif->waiting_to_free);
 	blkif->persistent_gnts.rb_node = NULL;
+	spin_lock_init(&blkif->free_pages_lock);
+	INIT_LIST_HEAD(&blkif->free_pages);
+	blkif->free_pages_num = 0;
 
 	return blkif;
 }
-- 
1.7.7.5 (Apple Git-26)


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RFC 09/12] xen-blkback: move pending handles list from blkbk to pending_req
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (7 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 11:07   ` [Xen-devel] " Jan Beulich
  2013-02-28 10:28 ` [PATCH RFC 10/12] xen-blkback: make the queue of free requests per backend Roger Pau Monne
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Moving grant ref handles from blkbk to pending_req will allow us to
get rid of the shared blkbk structure.
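
For illustration, a minimal standalone sketch (not part of this patch) of
the lookup change; the struct layout and array sizes are simplified
stand-ins for the real driver structures:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t grant_handle_t;
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
#define XEN_BLKIF_REQS 64

struct pending_req {
	/* other fields elided */
	grant_handle_t grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

static struct pending_req pending_reqs[XEN_BLKIF_REQS];

/* Before: one global array shared by all requests, indexed by the
 * position of the request in pending_reqs[] plus the segment number. */
static grant_handle_t pending_grant_handles[XEN_BLKIF_REQS *
					    BLKIF_MAX_SEGMENTS_PER_REQUEST];

static grant_handle_t old_pending_handle(struct pending_req *req, int seg)
{
	int idx = (int)(req - pending_reqs) *
		  BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
	return pending_grant_handles[idx];
}

/* After: the handles live inside the request itself. */
static grant_handle_t new_pending_handle(struct pending_req *req, int seg)
{
	return req->grant_handles[seg];
}

int main(void)
{
	printf("%u %u\n", old_pending_handle(&pending_reqs[3], 2),
	       new_pending_handle(&pending_reqs[3], 2));
	return 0;
}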

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |   16 ++++------------
 1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index ba27fc3..c43de8a 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -136,6 +136,7 @@ struct pending_req {
 	struct list_head	free_list;
 	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 };
 
 #define BLKBACK_INVALID_HANDLE (~0)
@@ -147,8 +148,6 @@ struct xen_blkbk {
 	/* And its spinlock. */
 	spinlock_t		pending_free_lock;
 	wait_queue_head_t	pending_free_wq;
-	/* And the grant handles that are available. */
-	grant_handle_t		*pending_grant_handles;
 };
 
 static struct xen_blkbk *blkbk;
@@ -226,7 +225,7 @@ static inline void remove_free_pages(struct xen_blkif *blkif, int num)
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
 #define pending_handle(_req, _seg) \
-	(blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)])
+	(_req->grant_handles[_seg])
 
 
 static int do_block_io_op(struct xen_blkif *blkif);
@@ -1214,7 +1213,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 static int __init xen_blkif_init(void)
 {
-	int i, mmap_pages;
+	int i;
 	int rc = 0;
 
 	if (!xen_domain())
@@ -1226,21 +1225,15 @@ static int __init xen_blkif_init(void)
 		return -ENOMEM;
 	}
 
-	mmap_pages = xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST;
 
 	blkbk->pending_reqs          = kzalloc(sizeof(blkbk->pending_reqs[0]) *
 					xen_blkif_reqs, GFP_KERNEL);
-	blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
-					mmap_pages, GFP_KERNEL);
 
-	if (!blkbk->pending_reqs || !blkbk->pending_grant_handles) {
+	if (!blkbk->pending_reqs) {
 		rc = -ENOMEM;
 		goto out_of_memory;
 	}
 
-	for (i = 0; i < mmap_pages; i++) {
-		blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
-	}
 	rc = xen_blkif_interface_init();
 	if (rc)
 		goto failed_init;
@@ -1263,7 +1256,6 @@ static int __init xen_blkif_init(void)
 	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
  failed_init:
 	kfree(blkbk->pending_reqs);
-	kfree(blkbk->pending_grant_handles);
 	kfree(blkbk);
 	blkbk = NULL;
 	return rc;
-- 
1.7.7.5 (Apple Git-26)


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RFC 10/12] xen-blkback: make the queue of free requests per backend
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (8 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 09/12] xen-blkback: move pending handles list from blkbk to pending_req Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 11:08   ` [Xen-devel] " Jan Beulich
  2013-02-28 10:28 ` [PATCH RFC 11/12] xen-blkback: expand map/unmap functions Roger Pau Monne
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Remove the last dependency on the shared blkbk structure by moving the
list of free requests into each blkif. This also reduces contention on
the list of available requests, since every backend now has its own
list instead of sharing a global one.
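
A standalone model (not driver code) of the per-backend pool this patch
introduces: each backend owns its own free list of pending_req
structures, so allocations for one device never contend with another.
The real driver protects the list with a per-backend spinlock and wait
queue, which are omitted here:

#include <stdio.h>

#define XEN_BLKIF_REQS 32	/* one slot per ring entry */

struct pending_req {
	struct pending_req *next;	/* free-list link */
	int id;
};

struct backend {
	struct pending_req pool[XEN_BLKIF_REQS];
	struct pending_req *free_head;	/* per-backend free list */
};

static void backend_init(struct backend *b)
{
	b->free_head = NULL;
	for (int i = 0; i < XEN_BLKIF_REQS; i++) {
		b->pool[i].id = i;
		b->pool[i].next = b->free_head;
		b->free_head = &b->pool[i];
	}
}

static struct pending_req *alloc_req(struct backend *b)
{
	struct pending_req *req = b->free_head;
	if (req)
		b->free_head = req->next;
	return req;
}

static void free_req(struct backend *b, struct pending_req *req)
{
	req->next = b->free_head;
	b->free_head = req;
}

int main(void)
{
	struct backend a;
	struct pending_req *r;

	backend_init(&a);
	r = alloc_req(&a);
	printf("got request slot %d\n", r ? r->id : -1);
	free_req(&a, r);
	return 0;
}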

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |  123 +++++++----------------------------
 drivers/block/xen-blkback/common.h  |   27 ++++++++
 drivers/block/xen-blkback/xenbus.c  |   17 +++++
 3 files changed, 67 insertions(+), 100 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c43de8a..04ad2aa 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -50,18 +50,14 @@
 #include "common.h"
 
 /*
- * These are rather arbitrary. They are fairly large because adjacent requests
- * pulled from a communication ring are quite likely to end up being part of
- * the same scatter/gather request at the disc.
- *
- * ** TRY INCREASING 'xen_blkif_reqs' IF WRITE SPEEDS SEEM TOO LOW **
- *
- * This will increase the chances of being able to write whole tracks.
- * 64 should be enough to keep us competitive with Linux.
+ * This is the number of requests that will be pre-allocated for each backend.
+ * For better performance this is set to RING_SIZE (32), so requests
+ * in the ring will never have to wait for a free pending_req.
  */
-static int xen_blkif_reqs = 64;
+
+int xen_blkif_reqs = 32;
 module_param_named(reqs, xen_blkif_reqs, int, 0);
-MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
+MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate per backend");
 
 /*
  * Maximum number of grants to map persistently in blkback. For maximum
@@ -120,50 +116,8 @@ MODULE_PARM_DESC(max_buffer_pages,
 static unsigned int log_stats;
 module_param(log_stats, int, 0644);
 
-/*
- * Each outstanding request that we've passed to the lower device layers has a
- * 'pending_req' allocated to it. Each buffer_head that completes decrements
- * the pendcnt towards zero. When it hits zero, the specified domain has a
- * response queued for it, with the saved 'id' passed back.
- */
-struct pending_req {
-	struct xen_blkif	*blkif;
-	u64			id;
-	int			nr_pages;
-	atomic_t		pendcnt;
-	unsigned short		operation;
-	int			status;
-	struct list_head	free_list;
-	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-};
-
 #define BLKBACK_INVALID_HANDLE (~0)
 
-struct xen_blkbk {
-	struct pending_req	*pending_reqs;
-	/* List of all 'pending_req' available */
-	struct list_head	pending_free;
-	/* And its spinlock. */
-	spinlock_t		pending_free_lock;
-	wait_queue_head_t	pending_free_wq;
-};
-
-static struct xen_blkbk *blkbk;
-
-/*
- * Little helpful macro to figure out the index and virtual address of the
- * pending_pages[..]. For each 'pending_req' we have have up to
- * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
- * 10 and would index in the pending_pages[..].
- */
-static inline int vaddr_pagenr(struct pending_req *req, int seg)
-{
-	return (req - blkbk->pending_reqs) *
-		BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
-}
-
 static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
 {
 	unsigned long flags;
@@ -400,18 +354,18 @@ finished:
 /*
  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
  */
-static struct pending_req *alloc_req(void)
+static struct pending_req *alloc_req(struct xen_blkif *blkif)
 {
 	struct pending_req *req = NULL;
 	unsigned long flags;
 
-	spin_lock_irqsave(&blkbk->pending_free_lock, flags);
-	if (!list_empty(&blkbk->pending_free)) {
-		req = list_entry(blkbk->pending_free.next, struct pending_req,
+	spin_lock_irqsave(&blkif->pending_free_lock, flags);
+	if (!list_empty(&blkif->pending_free)) {
+		req = list_entry(blkif->pending_free.next, struct pending_req,
 				 free_list);
 		list_del(&req->free_list);
 	}
-	spin_unlock_irqrestore(&blkbk->pending_free_lock, flags);
+	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
 	return req;
 }
 
@@ -419,17 +373,17 @@ static struct pending_req *alloc_req(void)
  * Return the 'pending_req' structure back to the freepool. We also
  * wake up the thread if it was waiting for a free page.
  */
-static void free_req(struct pending_req *req)
+static void free_req(struct xen_blkif *blkif, struct pending_req *req)
 {
 	unsigned long flags;
 	int was_empty;
 
-	spin_lock_irqsave(&blkbk->pending_free_lock, flags);
-	was_empty = list_empty(&blkbk->pending_free);
-	list_add(&req->free_list, &blkbk->pending_free);
-	spin_unlock_irqrestore(&blkbk->pending_free_lock, flags);
+	spin_lock_irqsave(&blkif->pending_free_lock, flags);
+	was_empty = list_empty(&blkif->pending_free);
+	list_add(&req->free_list, &blkif->pending_free);
+	spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
 	if (was_empty)
-		wake_up(&blkbk->pending_free_wq);
+		wake_up(&blkif->pending_free_wq);
 }
 
 /*
@@ -564,8 +518,8 @@ int xen_blkif_schedule(void *arg)
 		if (timeout == 0)
 			goto purge_gnt_list;
 		timeout = wait_event_interruptible_timeout(
-			blkbk->pending_free_wq,
-			!list_empty(&blkbk->pending_free) ||
+			blkif->pending_free_wq,
+			!list_empty(&blkif->pending_free) ||
 			kthread_should_stop(),
 			timeout);
 		if (timeout == 0)
@@ -886,7 +840,7 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 			if (atomic_read(&pending_req->blkif->drain))
 				complete(&pending_req->blkif->drain_complete);
 		}
-		free_req(pending_req);
+		free_req(pending_req->blkif, pending_req);
 	}
 }
 
@@ -929,7 +883,7 @@ __do_block_io_op(struct xen_blkif *blkif)
 			break;
 		}
 
-		pending_req = alloc_req();
+		pending_req = alloc_req(blkif);
 		if (NULL == pending_req) {
 			blkif->st_oo_req++;
 			more_to_do = 1;
@@ -954,7 +908,7 @@ __do_block_io_op(struct xen_blkif *blkif)
 		/* Apply all sanity checks to /private copy/ of request. */
 		barrier();
 		if (unlikely(req.operation == BLKIF_OP_DISCARD)) {
-			free_req(pending_req);
+			free_req(blkif, pending_req);
 			if (dispatch_discard_io(blkif, &req))
 				break;
 		} else if (dispatch_rw_block_io(blkif, &req, pending_req))
@@ -1157,7 +1111,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
  fail_response:
 	/* Haven't submitted any bio's yet. */
 	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
-	free_req(pending_req);
+	free_req(blkif, pending_req);
 	msleep(1); /* back off a bit */
 	return -EIO;
 
@@ -1213,51 +1167,20 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 
 static int __init xen_blkif_init(void)
 {
-	int i;
 	int rc = 0;
 
 	if (!xen_domain())
 		return -ENODEV;
 
-	blkbk = kzalloc(sizeof(struct xen_blkbk), GFP_KERNEL);
-	if (!blkbk) {
-		pr_alert(DRV_PFX "%s: out of memory!\n", __func__);
-		return -ENOMEM;
-	}
-
-
-	blkbk->pending_reqs          = kzalloc(sizeof(blkbk->pending_reqs[0]) *
-					xen_blkif_reqs, GFP_KERNEL);
-
-	if (!blkbk->pending_reqs) {
-		rc = -ENOMEM;
-		goto out_of_memory;
-	}
-
 	rc = xen_blkif_interface_init();
 	if (rc)
 		goto failed_init;
 
-	INIT_LIST_HEAD(&blkbk->pending_free);
-	spin_lock_init(&blkbk->pending_free_lock);
-	init_waitqueue_head(&blkbk->pending_free_wq);
-
-	for (i = 0; i < xen_blkif_reqs; i++)
-		list_add_tail(&blkbk->pending_reqs[i].free_list,
-			      &blkbk->pending_free);
-
 	rc = xen_blkif_xenbus_init();
 	if (rc)
 		goto failed_init;
 
-	return 0;
-
- out_of_memory:
-	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
  failed_init:
-	kfree(blkbk->pending_reqs);
-	kfree(blkbk);
-	blkbk = NULL;
 	return rc;
 }
 
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 604bd30..0b0ad3f 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -214,6 +214,14 @@ struct xen_blkif {
 	int			free_pages_num;
 	struct list_head	free_pages;
 
+	/* Allocation of pending_reqs */
+	struct pending_req	*pending_reqs;
+	/* List of all 'pending_req' available */
+	struct list_head	pending_free;
+	/* And its spinlock. */
+	spinlock_t		pending_free_lock;
+	wait_queue_head_t	pending_free_wq;
+
 	/* statistics */
 	unsigned long		st_print;
 	int			st_rd_req;
@@ -227,6 +235,25 @@ struct xen_blkif {
 	wait_queue_head_t	waiting_to_free;
 };
 
+/*
+ * Each outstanding request that we've passed to the lower device layers has a
+ * 'pending_req' allocated to it. Each buffer_head that completes decrements
+ * the pendcnt towards zero. When it hits zero, the specified domain has a
+ * response queued for it, with the saved 'id' passed back.
+ */
+struct pending_req {
+	struct xen_blkif	*blkif;
+	u64			id;
+	int			nr_pages;
+	atomic_t		pendcnt;
+	unsigned short		operation;
+	int			status;
+	struct list_head	free_list;
+	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+
 
 #define vbd_sz(_v)	((_v)->bdev->bd_part ? \
 			 (_v)->bdev->bd_part->nr_sects : \
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index d7926ec..8f929cb 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -30,6 +30,8 @@ struct backend_info {
 	char			*mode;
 };
 
+extern int xen_blkif_reqs;
+
 static struct kmem_cache *xen_blkif_cachep;
 static void connect(struct backend_info *);
 static int connect_ring(struct backend_info *);
@@ -104,6 +106,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
 static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 {
 	struct xen_blkif *blkif;
+	int i;
 
 	blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
 	if (!blkif)
@@ -122,6 +125,19 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 	spin_lock_init(&blkif->free_pages_lock);
 	INIT_LIST_HEAD(&blkif->free_pages);
 	blkif->free_pages_num = 0;
+	blkif->pending_reqs = kzalloc(sizeof(blkif->pending_reqs[0]) *
+	                              xen_blkif_reqs, GFP_KERNEL);
+	if (!blkif->pending_reqs) {
+		kmem_cache_free(xen_blkif_cachep, blkif);
+		return ERR_PTR(-ENOMEM);
+	}
+	INIT_LIST_HEAD(&blkif->pending_free);
+	spin_lock_init(&blkif->pending_free_lock);
+	init_waitqueue_head(&blkif->pending_free_wq);
+
+	for (i = 0; i < xen_blkif_reqs; i++)
+		list_add_tail(&blkif->pending_reqs[i].free_list,
+			      &blkif->pending_free);
 
 	return blkif;
 }
@@ -204,6 +220,7 @@ static void xen_blkif_free(struct xen_blkif *blkif)
 {
 	if (!atomic_dec_and_test(&blkif->refcnt))
 		BUG();
+	kfree(blkif->pending_reqs);
 	kmem_cache_free(xen_blkif_cachep, blkif);
 }
 
-- 
1.7.7.5 (Apple Git-26)


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RFC 11/12] xen-blkback: expand map/unmap functions
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (9 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 10/12] xen-blkback: make the queue of free requests per backend Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 10:28 ` [PATCH RFC 12/12] xen-block: implement indirect descriptors Roger Pau Monne
  2013-02-28 10:49 ` [Xen-devel] [PATCH RFC 00/12] xen-block: " Jan Beulich
  12 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Preparatory change for implementing indirect descriptors. Change
xen_blkbk_{map/unmap} so they can map/unmap an arbitrary number of
grants (previously they were limited to
BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req
in the map/unmap functions, so grants can be mapped and unmapped
without having to pass a pending_req.
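
A standalone sketch of the batching idea behind the reworked helpers,
assuming the grant-table interface still only takes up to
BLKIF_MAX_SEGMENTS_PER_REQUEST operations at a time; flush_batch() is a
stand-in for gnttab_map_refs(), not a real API:

#include <stdio.h>

#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

/* Stand-in for gnttab_map_refs() on a batch of up to 11 operations. */
static void flush_batch(const int batch[], int n)
{
	printf("mapping %d grants starting with gref %d\n", n, batch[0]);
}

/* Map an arbitrary number of grant references in fixed-size batches. */
static void map_grants(const int grefs[], int num)
{
	int batch[BLKIF_MAX_SEGMENTS_PER_REQUEST];
	int segs_to_map = 0;

	for (int i = 0; i < num; i++) {
		batch[segs_to_map++] = grefs[i];
		if (segs_to_map == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
			flush_batch(batch, segs_to_map);
			segs_to_map = 0;
		}
	}
	if (segs_to_map)
		flush_batch(batch, segs_to_map);
}

int main(void)
{
	int grefs[30];

	for (int i = 0; i < 30; i++)
		grefs[i] = i;
	map_grants(grefs, 30);	/* flushed as 11 + 11 + 8 */
	return 0;
}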

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |  134 ++++++++++++++++++++++-------------
 1 files changed, 85 insertions(+), 49 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 04ad2aa..0fa30db 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -178,10 +178,6 @@ static inline void remove_free_pages(struct xen_blkif *blkif, int num)
 
 #define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
 
-#define pending_handle(_req, _seg) \
-	(_req->grant_handles[_seg])
-
-
 static int do_block_io_op(struct xen_blkif *blkif);
 static int dispatch_rw_block_io(struct xen_blkif *blkif,
 				struct blkif_request *req,
@@ -590,53 +586,60 @@ struct seg_buf {
  * Unmap the grant references, and also remove the M2P over-rides
  * used in the 'pending_req'.
  */
-static void xen_blkbk_unmap(struct pending_req *req)
+static void xen_blkbk_unmap(struct xen_blkif *blkif,
+                            grant_handle_t handles[],
+                            struct page *pages[],
+                            struct persistent_gnt *persistent_gnts[],
+                            int num)
 {
 	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page *unmap_pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct persistent_gnt *persistent_gnt;
 	unsigned int i, invcount = 0;
-	grant_handle_t handle;
-	struct xen_blkif *blkif = req->blkif;
 	int ret;
 
-	for (i = 0; i < req->nr_pages; i++) {
-		if (req->persistent_gnts[i] != NULL) {
-			persistent_gnt = req->persistent_gnts[i];
+	for (i = 0; i < num; i++) {
+		if (persistent_gnts[i] != NULL) {
+			persistent_gnt = persistent_gnts[i];
 			persistent_gnt->flags |= PERSISTENT_GNT_USED;
 			persistent_gnt->flags &= ~PERSISTENT_GNT_ACTIVE;
 			continue;
 		}
-		handle = pending_handle(req, i);
-		pages[invcount] = req->pages[i];
-		if (handle == BLKBACK_INVALID_HANDLE)
+		if (handles[i] == BLKBACK_INVALID_HANDLE)
 			continue;
-		gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[invcount]),
-				    GNTMAP_host_map, handle);
-		pending_handle(req, i) = BLKBACK_INVALID_HANDLE;
-		invcount++;
+		unmap_pages[invcount] = pages[i];
+		gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[i]),
+				    GNTMAP_host_map, handles[i]);
+		handles[i] = BLKBACK_INVALID_HANDLE;
+		if (++invcount == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+			ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
+			                        invcount);
+			BUG_ON(ret);
+			put_free_pages(blkif, unmap_pages, invcount);
+			invcount = 0;
+		}
+	}
+	if (invcount) {
+		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
+		BUG_ON(ret);
+		put_free_pages(blkif, unmap_pages, invcount);
 	}
-
-	ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
-	BUG_ON(ret);
-	put_free_pages(blkif, pages, invcount);
 }
 
-static int xen_blkbk_map(struct blkif_request *req,
-			 struct pending_req *pending_req,
-			 struct seg_buf seg[],
-			 struct page *pages[])
+static int xen_blkbk_map(struct xen_blkif *blkif, grant_ref_t grefs[],
+			 struct persistent_gnt *persistent_gnts[],
+			 grant_handle_t handles[],
+			 struct page *pages[],
+			 int num, bool ro)
 {
 	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
 	struct persistent_gnt *persistent_gnt = NULL;
-	struct xen_blkif *blkif = pending_req->blkif;
 	phys_addr_t addr = 0;
 	int i, j;
-	int nseg = req->u.rw.nr_segments;
 	int segs_to_map = 0;
 	int ret = 0;
+	int last_map = 0, map_until = 0;
 	int use_persistent_gnts;
 
 	use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
@@ -646,13 +649,14 @@ static int xen_blkbk_map(struct blkif_request *req,
 	 * assign map[..] with the PFN of the page in our domain with the
 	 * corresponding grant reference for each page.
 	 */
-	for (i = 0; i < nseg; i++) {
+again:
+	for (i = map_until; i < num; i++) {
 		uint32_t flags;
 
 		if (use_persistent_gnts)
 			persistent_gnt = get_persistent_gnt(
 				&blkif->persistent_gnts,
-				req->u.rw.seg[i].gref);
+				grefs[i]);
 
 		if (persistent_gnt) {
 			/*
@@ -669,13 +673,15 @@ static int xen_blkbk_map(struct blkif_request *req,
 			pages_to_gnt[segs_to_map] = pages[i];
 			persistent_gnts[i] = NULL;
 			flags = GNTMAP_host_map;
-			if (!use_persistent_gnts &&
-			    (pending_req->operation != BLKIF_OP_READ))
+			if (!use_persistent_gnts && ro)
 				flags |= GNTMAP_readonly;
 			gnttab_set_map_op(&map[segs_to_map++], addr,
-					  flags, req->u.rw.seg[i].gref,
+					  flags, grefs[i],
 					  blkif->domid);
 		}
+		map_until = i + 1;
+		if (segs_to_map == BLKIF_MAX_SEGMENTS_PER_REQUEST)
+			break;
 	}
 
 	if (segs_to_map) {
@@ -688,22 +694,20 @@ static int xen_blkbk_map(struct blkif_request *req,
 	 * so that when we access vaddr(pending_req,i) it has the contents of
 	 * the page from the other domain.
 	 */
-	for (i = 0, j = 0; i < nseg; i++) {
+	for (i = last_map, j = 0; i < map_until; i++) {
 		if (!persistent_gnts[i]) {
 			/* This is a newly mapped grant */
 			BUG_ON(j >= segs_to_map);
 			if (unlikely(map[j].status != 0)) {
 				pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
-				pending_handle(pending_req, i) =
-					BLKBACK_INVALID_HANDLE;
+				handles[i] = BLKBACK_INVALID_HANDLE;
 				ret |= 1;
-				j++;
-				continue;
+				goto next;
 			}
-			pending_handle(pending_req, i) = map[j].handle;
+			handles[i] = map[j].handle;
 		}
 		if (persistent_gnts[i])
-			goto next;
+			continue;
 		if (use_persistent_gnts &&
 		    blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
 			/*
@@ -718,7 +722,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 				 * allocate the persistent_gnt struct
 				 * map this grant non-persistenly
 				 */
-				j++;
 				goto next;
 			}
 			persistent_gnt->gnt = map[j].ref;
@@ -735,7 +738,6 @@ static int xen_blkbk_map(struct blkif_request *req,
 			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
 				 persistent_gnt->gnt, blkif->persistent_gnt_c,
 				 xen_blkif_max_pgrants);
-			j++;
 			goto next;
 		}
 		if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
@@ -743,11 +745,14 @@ static int xen_blkbk_map(struct blkif_request *req,
 			pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
 			         blkif->domid, blkif->vbd.handle);
 		}
-		j++;
 next:
-		seg[i].buf = pfn_to_mfn(page_to_pfn(pages[i])) << PAGE_SHIFT |
-		             (req->u.rw.seg[i].first_sect << 9);
+		j++;
 	}
+	segs_to_map = 0;
+	last_map = map_until;
+	if (map_until != num)
+		goto again;
+
 	return ret;
 
 out_of_memory:
@@ -756,6 +761,32 @@ out_of_memory:
 	return -ENOMEM;
 }
 
+static int xen_blkbk_map_seg(struct blkif_request *req,
+			     struct pending_req *pending_req,
+			     struct seg_buf seg[],
+			     struct page *pages[])
+{
+	int i, rc;
+	grant_ref_t grefs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+
+	for (i = 0; i < req->u.rw.nr_segments; i++)
+		grefs[i] = req->u.rw.seg[i].gref;
+
+	rc = xen_blkbk_map(pending_req->blkif, grefs,
+	                   pending_req->persistent_gnts,
+	                   pending_req->grant_handles, pending_req->pages,
+	                   req->u.rw.nr_segments,
+	                   (pending_req->operation != BLKIF_OP_READ));
+	if (rc)
+		return rc;
+
+	for (i = 0; i < req->u.rw.nr_segments; i++)
+		seg[i].buf = pfn_to_mfn(page_to_pfn(pending_req->pages[i]))
+		             << PAGE_SHIFT | (req->u.rw.seg[i].first_sect << 9);
+
+	return 0;
+}
+
 static int dispatch_discard_io(struct xen_blkif *blkif,
 				struct blkif_request *req)
 {
@@ -832,7 +863,10 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	 * the proper response on the ring.
 	 */
 	if (atomic_dec_and_test(&pending_req->pendcnt)) {
-		xen_blkbk_unmap(pending_req);
+		xen_blkbk_unmap(pending_req->blkif, pending_req->grant_handles,
+		                pending_req->pages,
+		                pending_req->persistent_gnts,
+		                pending_req->nr_pages);
 		make_response(pending_req->blkif, pending_req->id,
 			      pending_req->operation, pending_req->status);
 		xen_blkif_put(pending_req->blkif);
@@ -1040,7 +1074,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	 * the hypercall to unmap the grants - that is all done in
 	 * xen_blkbk_unmap.
 	 */
-	if (xen_blkbk_map(req, pending_req, seg, pages))
+	if (xen_blkbk_map_seg(req, pending_req, seg, pages))
 		goto fail_flush;
 
 	/*
@@ -1107,7 +1141,9 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	return 0;
 
  fail_flush:
-	xen_blkbk_unmap(pending_req);
+	xen_blkbk_unmap(blkif, pending_req->grant_handles,
+	                pending_req->pages, pending_req->persistent_gnts,
+	                pending_req->nr_pages);
  fail_response:
 	/* Haven't submitted any bio's yet. */
 	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
-- 
1.7.7.5 (Apple Git-26)


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (10 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 11/12] xen-blkback: expand map/unmap functions Roger Pau Monne
@ 2013-02-28 10:28 ` Roger Pau Monne
  2013-02-28 11:19   ` [Xen-devel] " Jan Beulich
                     ` (2 more replies)
  2013-02-28 10:49 ` [Xen-devel] [PATCH RFC 00/12] xen-block: " Jan Beulich
  12 siblings, 3 replies; 51+ messages in thread
From: Roger Pau Monne @ 2013-02-28 10:28 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Roger Pau Monne, Konrad Rzeszutek Wilk

Indirect descriptors introduce a new block operation
(BLKIF_OP_INDIRECT) that passes grant references instead of segments
in the request. The frames referenced by these grants are filled with
arrays of blkif_request_segment_aligned, so we can send many more
segments in a single request.

The proposed implementation sets the maximum number of indirect
segments (carried in frames filled with blkif_request_segment_aligned
entries) to 256 in the backend and 64 in the frontend. The frontend
value has been chosen experimentally, and the backend value has been
set to a sane value that leaves room to expand the frontend maximum
if needed.
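
Back-of-the-envelope arithmetic for those limits, assuming 4 KiB pages
and an 8-byte blkif_request_segment_aligned entry (gref, first_sect,
last_sect plus padding); both values are assumptions for illustration,
the authoritative definitions are in the patch below:

#include <stdio.h>

#define PAGE_SIZE		4096	/* assumed */
#define SEG_ENTRY_SIZE		8	/* assumed entry size */
#define SEGS_PER_INDIRECT_FRAME	(PAGE_SIZE / SEG_ENTRY_SIZE)	/* 512 */
#define INDIRECT_GREFS(segs) \
	(((segs) + SEGS_PER_INDIRECT_FRAME - 1) / SEGS_PER_INDIRECT_FRAME)

int main(void)
{
	/* With these defaults both ends fit in a single indirect frame. */
	printf("backend limit (256 segments): %d indirect frame(s)\n",
	       INDIRECT_GREFS(256));
	printf("frontend limit (64 segments): %d indirect frame(s)\n",
	       INDIRECT_GREFS(64));
	/* 256 segments of 4 KiB each allow up to 1 MiB per request,
	 * against 44 KiB with the old 11-segment limit. */
	return 0;
}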

The migration code has changed from the previous implementation, in
which we simply remapped the segments on the shared ring. Now the
maximum number of segments allowed in a request can change depending
on the backend, so we have to requeue all the requests in the ring and
in the queue, splitting their bios whenever they exceed the new
maximum number of segments.
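
The completion fan-in used when an original bio has to be split is
sketched below as a standalone model (not the driver code): every clone
decrements a shared counter, the original is completed only when the
last clone finishes, and the first error wins:

#include <stdatomic.h>
#include <stdio.h>

struct split_ctx {
	atomic_int pending;	/* clones still in flight */
	int err;		/* first error reported by any clone */
};

static void clone_done(struct split_ctx *ctx, int error)
{
	if (error && !ctx->err)
		ctx->err = error;
	/* Last clone to finish completes the original request. */
	if (atomic_fetch_sub(&ctx->pending, 1) == 1)
		printf("original bio completed, status %d\n", ctx->err);
}

int main(void)
{
	struct split_ctx ctx;

	ctx.err = 0;
	atomic_init(&ctx.pending, 3);	/* pretend the bio was split in three */
	clone_done(&ctx, 0);
	clone_done(&ctx, -5);		/* one clone fails with -EIO */
	clone_done(&ctx, 0);
	return 0;
}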

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
---
 drivers/block/xen-blkback/blkback.c |  129 +++++++---
 drivers/block/xen-blkback/common.h  |   80 ++++++-
 drivers/block/xen-blkback/xenbus.c  |    8 +
 drivers/block/xen-blkfront.c        |  498 +++++++++++++++++++++++++++++------
 include/xen/interface/io/blkif.h    |   25 ++
 5 files changed, 622 insertions(+), 118 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 0fa30db..98eb16b 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -70,7 +70,7 @@ MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate per backend");
  * algorithm.
  */
 
-static int xen_blkif_max_pgrants = 352;
+static int xen_blkif_max_pgrants = 1024;
 module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
 MODULE_PARM_DESC(max_persistent_grants,
                  "Maximum number of grants to map persistently");
@@ -578,10 +578,6 @@ purge_gnt_list:
 	return 0;
 }
 
-struct seg_buf {
-	unsigned long buf;
-	unsigned int nsec;
-};
 /*
  * Unmap the grant references, and also remove the M2P over-rides
  * used in the 'pending_req'.
@@ -761,32 +757,79 @@ out_of_memory:
 	return -ENOMEM;
 }
 
-static int xen_blkbk_map_seg(struct blkif_request *req,
-			     struct pending_req *pending_req,
+static int xen_blkbk_map_seg(struct pending_req *pending_req,
 			     struct seg_buf seg[],
 			     struct page *pages[])
 {
 	int i, rc;
-	grant_ref_t grefs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 
-	for (i = 0; i < req->u.rw.nr_segments; i++)
-		grefs[i] = req->u.rw.seg[i].gref;
-
-	rc = xen_blkbk_map(pending_req->blkif, grefs,
+	rc = xen_blkbk_map(pending_req->blkif, pending_req->grefs,
 	                   pending_req->persistent_gnts,
 	                   pending_req->grant_handles, pending_req->pages,
-	                   req->u.rw.nr_segments,
+	                   pending_req->nr_pages,
 	                   (pending_req->operation != BLKIF_OP_READ));
 	if (rc)
 		return rc;
 
-	for (i = 0; i < req->u.rw.nr_segments; i++)
-		seg[i].buf = pfn_to_mfn(page_to_pfn(pending_req->pages[i]))
-		             << PAGE_SHIFT | (req->u.rw.seg[i].first_sect << 9);
+	for (i = 0; i < pending_req->nr_pages; i++)
+		seg[i].buf |= pfn_to_mfn(page_to_pfn(pending_req->pages[i]))
+		             << PAGE_SHIFT;
 
 	return 0;
 }
 
+static int xen_blkbk_parse_indirect(struct blkif_request *req,
+                                    struct pending_req *pending_req,
+                                    struct seg_buf seg[],
+                                    struct phys_req *preq)
+{
+	struct persistent_gnt **persistent =
+		pending_req->indirect_persistent_gnts;
+	struct page **pages = pending_req->indirect_pages;
+	struct xen_blkif *blkif = pending_req->blkif;
+	int indirect_grefs, rc, n, nseg, i;
+	struct blkif_request_segment_aligned *segments = NULL;
+
+	nseg = pending_req->nr_pages;
+	indirect_grefs = (nseg + SEGS_PER_INDIRECT_FRAME - 1) /
+		         SEGS_PER_INDIRECT_FRAME;
+
+	rc = xen_blkbk_map(blkif, req->u.indirect.indirect_grefs,
+	                   persistent, pending_req->indirect_handles,
+	                   pages, indirect_grefs, true);
+	if (rc)
+		goto unmap;
+
+	for (n = 0, i = 0; n < nseg; n++) {
+		if ((n % SEGS_PER_INDIRECT_FRAME) == 0) {
+			/* Map indirect segments */
+			if (segments)
+				kunmap_atomic(segments);
+			segments =
+				kmap_atomic(pages[n/SEGS_PER_INDIRECT_FRAME]);
+		}
+		i = n % SEGS_PER_INDIRECT_FRAME;
+		pending_req->grefs[n] = segments[i].gref;
+		seg[n].nsec = segments[i].last_sect -
+			segments[i].first_sect + 1;
+		seg[n].buf = segments[i].first_sect << 9;
+		if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) ||
+	    	    (segments[i].last_sect <
+	    	     segments[i].first_sect)) {
+			rc = -EINVAL;
+			goto unmap;
+		}
+		preq->nr_sects += seg[n].nsec;
+	}
+
+unmap:
+	if (segments)
+		kunmap_atomic(segments);
+	xen_blkbk_unmap(blkif, pending_req->indirect_handles,
+                        pages, persistent, indirect_grefs);
+	return rc;
+}
+
 static int dispatch_discard_io(struct xen_blkif *blkif,
 				struct blkif_request *req)
 {
@@ -980,17 +1023,21 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 				struct pending_req *pending_req)
 {
 	struct phys_req preq;
-	struct seg_buf seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct seg_buf *seg = pending_req->seg;
 	unsigned int nseg;
 	struct bio *bio = NULL;
-	struct bio *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct bio **biolist = pending_req->biolist;
 	int i, nbio = 0;
 	int operation;
 	struct blk_plug plug;
 	bool drain = false;
 	struct page **pages = pending_req->pages;
+	unsigned short req_operation;
+
+	req_operation = req->operation == BLKIF_OP_INDIRECT ?
+	                req->u.indirect.indirect_op : req->operation;
 
-	switch (req->operation) {
+	switch (req_operation) {
 	case BLKIF_OP_READ:
 		blkif->st_rd_req++;
 		operation = READ;
@@ -1012,33 +1059,49 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	}
 
 	/* Check that the number of segments is sane. */
-	nseg = req->u.rw.nr_segments;
+	nseg = req->operation == BLKIF_OP_INDIRECT ?
+	       req->u.indirect.nr_segments : req->u.rw.nr_segments;
 
 	if (unlikely(nseg == 0 && operation != WRITE_FLUSH) ||
-	    unlikely(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
+	    unlikely((req->operation != BLKIF_OP_INDIRECT) &&
+	             (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
+	    unlikely((req->operation == BLKIF_OP_INDIRECT) &&
+	             (nseg > MAX_INDIRECT_SEGMENTS))) {
 		pr_debug(DRV_PFX "Bad number of segments in request (%d)\n",
 			 nseg);
 		/* Haven't submitted any bio's yet. */
 		goto fail_response;
 	}
 
-	preq.sector_number = req->u.rw.sector_number;
 	preq.nr_sects      = 0;
 
 	pending_req->blkif     = blkif;
-	pending_req->id        = req->u.rw.id;
-	pending_req->operation = req->operation;
 	pending_req->status    = BLKIF_RSP_OKAY;
 	pending_req->nr_pages  = nseg;
 
-	for (i = 0; i < nseg; i++) {
-		seg[i].nsec = req->u.rw.seg[i].last_sect -
-			req->u.rw.seg[i].first_sect + 1;
-		if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
-		    (req->u.rw.seg[i].last_sect < req->u.rw.seg[i].first_sect))
+	if (req->operation != BLKIF_OP_INDIRECT) {
+		preq.dev               = req->u.rw.handle;
+		preq.sector_number     = req->u.rw.sector_number;
+		pending_req->id        = req->u.rw.id;
+		pending_req->operation = req->operation;
+		for (i = 0; i < nseg; i++) {
+			pending_req->grefs[i] = req->u.rw.seg[i].gref;
+			seg[i].nsec = req->u.rw.seg[i].last_sect -
+				req->u.rw.seg[i].first_sect + 1;
+			seg[i].buf = req->u.rw.seg[i].first_sect << 9;
+			if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
+		    	    (req->u.rw.seg[i].last_sect <
+		    	     req->u.rw.seg[i].first_sect))
+				goto fail_response;
+			preq.nr_sects += seg[i].nsec;
+		}
+	} else {
+		preq.dev               = req->u.indirect.handle;
+		preq.sector_number     = req->u.indirect.sector_number;
+		pending_req->id        = req->u.indirect.id;
+		pending_req->operation = req->u.indirect.indirect_op;
+		if (xen_blkbk_parse_indirect(req, pending_req, seg, &preq))
 			goto fail_response;
-		preq.nr_sects += seg[i].nsec;
-
 	}
 
 	if (xen_vbd_translate(&preq, blkif, operation) != 0) {
@@ -1074,7 +1137,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	 * the hypercall to unmap the grants - that is all done in
 	 * xen_blkbk_unmap.
 	 */
-	if (xen_blkbk_map_seg(req, pending_req, seg, pages))
+	if (xen_blkbk_map_seg(pending_req, seg, pages))
 		goto fail_flush;
 
 	/*
@@ -1146,7 +1209,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	                pending_req->nr_pages);
  fail_response:
 	/* Haven't submitted any bio's yet. */
-	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
+	make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
 	free_req(blkif, pending_req);
 	msleep(1); /* back off a bit */
 	return -EIO;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 0b0ad3f..d3656d2 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -50,6 +50,17 @@
 		 __func__, __LINE__, ##args)
 
 
+/*
+ * This is the maximum number of segments that would be allowed in indirect
+ * requests. This value will also be passed to the frontend.
+ */
+#define MAX_INDIRECT_SEGMENTS 256
+
+#define SEGS_PER_INDIRECT_FRAME \
+(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
+#define MAX_INDIRECT_GREFS \
+((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+
 /* Not a real protocol.  Used to generate ring structs which contain
  * the elements common to all protocols only.  This way we get a
  * compiler-checkable way to use common struct elements, so we can
@@ -77,11 +88,21 @@ struct blkif_x86_32_request_discard {
 	uint64_t       nr_sectors;
 } __attribute__((__packed__));
 
+struct blkif_x86_32_request_indirect {
+	uint8_t        indirect_op;
+	uint16_t       nr_segments;
+	uint64_t       id;
+	blkif_vdev_t   handle;
+	blkif_sector_t sector_number;
+	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
+} __attribute__((__packed__));
+
 struct blkif_x86_32_request {
 	uint8_t        operation;    /* BLKIF_OP_???                         */
 	union {
 		struct blkif_x86_32_request_rw rw;
 		struct blkif_x86_32_request_discard discard;
+		struct blkif_x86_32_request_indirect indirect;
 	} u;
 } __attribute__((__packed__));
 
@@ -113,11 +134,22 @@ struct blkif_x86_64_request_discard {
 	uint64_t       nr_sectors;
 } __attribute__((__packed__));
 
+struct blkif_x86_64_request_indirect {
+	uint8_t        indirect_op;
+	uint16_t       nr_segments;
+	uint32_t       _pad1;        /* offsetof(blkif_..,u.indirect.id)==8   */
+	uint64_t       id;
+	blkif_vdev_t   handle;
+	blkif_sector_t sector_number;
+	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
+} __attribute__((__packed__));
+
 struct blkif_x86_64_request {
 	uint8_t        operation;    /* BLKIF_OP_???                         */
 	union {
 		struct blkif_x86_64_request_rw rw;
 		struct blkif_x86_64_request_discard discard;
+		struct blkif_x86_64_request_indirect indirect;
 	} u;
 } __attribute__((__packed__));
 
@@ -235,6 +267,11 @@ struct xen_blkif {
 	wait_queue_head_t	waiting_to_free;
 };
 
+struct seg_buf {
+	unsigned long buf;
+	unsigned int nsec;
+};
+
 /*
  * Each outstanding request that we've passed to the lower device layers has a
  * 'pending_req' allocated to it. Each buffer_head that completes decrements
@@ -249,9 +286,16 @@ struct pending_req {
 	unsigned short		operation;
 	int			status;
 	struct list_head	free_list;
-	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct persistent_gnt	*persistent_gnts[MAX_INDIRECT_SEGMENTS];
+	struct page		*pages[MAX_INDIRECT_SEGMENTS];
+	grant_handle_t		grant_handles[MAX_INDIRECT_SEGMENTS];
+	grant_ref_t		grefs[MAX_INDIRECT_SEGMENTS];
+	/* Indirect descriptors */
+	struct persistent_gnt	*indirect_persistent_gnts[MAX_INDIRECT_GREFS];
+	struct page		*indirect_pages[MAX_INDIRECT_GREFS];
+	grant_handle_t		indirect_handles[MAX_INDIRECT_GREFS];
+	struct seg_buf		seg[MAX_INDIRECT_SEGMENTS];
+	struct bio		*biolist[MAX_INDIRECT_SEGMENTS];
 };
 
 
@@ -289,7 +333,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
 static inline void blkif_get_x86_32_req(struct blkif_request *dst,
 					struct blkif_x86_32_request *src)
 {
-	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j = MAX_INDIRECT_GREFS;
 	dst->operation = src->operation;
 	switch (src->operation) {
 	case BLKIF_OP_READ:
@@ -312,6 +356,19 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
 		dst->u.discard.sector_number = src->u.discard.sector_number;
 		dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
 		break;
+	case BLKIF_OP_INDIRECT:
+		dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
+		dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
+		dst->u.indirect.handle = src->u.indirect.handle;
+		dst->u.indirect.id = src->u.indirect.id;
+		dst->u.indirect.sector_number = src->u.indirect.sector_number;
+		barrier();
+		if (j > dst->u.indirect.nr_segments)
+			j = dst->u.indirect.nr_segments;
+		for (i = 0; i < j; i++)
+			dst->u.indirect.indirect_grefs[i] =
+				src->u.indirect.indirect_grefs[i];
+		break;
 	default:
 		break;
 	}
@@ -320,7 +377,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
 static inline void blkif_get_x86_64_req(struct blkif_request *dst,
 					struct blkif_x86_64_request *src)
 {
-	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j = MAX_INDIRECT_GREFS;
 	dst->operation = src->operation;
 	switch (src->operation) {
 	case BLKIF_OP_READ:
@@ -343,6 +400,19 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
 		dst->u.discard.sector_number = src->u.discard.sector_number;
 		dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
 		break;
+	case BLKIF_OP_INDIRECT:
+		dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
+		dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
+		dst->u.indirect.handle = src->u.indirect.handle;
+		dst->u.indirect.id = src->u.indirect.id;
+		dst->u.indirect.sector_number = src->u.indirect.sector_number;
+		barrier();
+		if (j > dst->u.indirect.nr_segments)
+			j = dst->u.indirect.nr_segments;
+		for (i = 0; i < j; i++)
+			dst->u.indirect.indirect_grefs[i] =
+				src->u.indirect.indirect_grefs[i];
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 8f929cb..9e16abb 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -700,6 +700,14 @@ again:
 		goto abort;
 	}
 
+	err = xenbus_printf(xbt, dev->nodename, "max-indirect-segments", "%u",
+	                    MAX_INDIRECT_SEGMENTS);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "writing %s/max-indirect-segments",
+				 dev->nodename);
+		goto abort;
+	}
+
 	err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
 			    (unsigned long long)vbd_sz(&be->blkif->vbd));
 	if (err) {
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 4d81fcc..074d302 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -74,12 +74,30 @@ struct grant {
 struct blk_shadow {
 	struct blkif_request req;
 	struct request *request;
-	struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct grant **grants_used;
+	struct grant **indirect_grants;
+};
+
+struct split_bio {
+	struct bio *bio;
+	atomic_t pending;
+	int err;
 };
 
 static DEFINE_MUTEX(blkfront_mutex);
 static const struct block_device_operations xlvbd_block_fops;
 
+/*
+ * Maximum number of segments in indirect requests, the actual value used by
+ * the frontend driver is the minimum of this value and the value provided
+ * by the backend driver.
+ */
+
+static int xen_blkif_max_segments = 64;
+module_param_named(max_segments, xen_blkif_max_segments, int, 0);
+MODULE_PARM_DESC(max_segments,
+"Maximum number of segments in indirect requests");
+
 #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
 
 /*
@@ -98,7 +116,7 @@ struct blkfront_info
 	enum blkif_state connected;
 	int ring_ref;
 	struct blkif_front_ring ring;
-	struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct scatterlist *sg;
 	unsigned int evtchn, irq;
 	struct request_queue *rq;
 	struct work_struct work;
@@ -114,6 +132,8 @@ struct blkfront_info
 	unsigned int discard_granularity;
 	unsigned int discard_alignment;
 	unsigned int feature_persistent:1;
+	unsigned int max_indirect_segments;
+	unsigned int sector_size;
 	int is_ready;
 };
 
@@ -142,6 +162,14 @@ static DEFINE_SPINLOCK(minor_lock);
 
 #define DEV_NAME	"xvd"	/* name in /dev */
 
+#define SEGS_PER_INDIRECT_FRAME \
+	(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
+#define INDIRECT_GREFS(_segs) \
+	((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+#define MIN(_a, _b) ((_a) < (_b) ? (_a) : (_b))
+
+static int blkfront_setup_indirect(struct blkfront_info *info);
+
 static int get_id_from_freelist(struct blkfront_info *info)
 {
 	unsigned long free = info->shadow_free;
@@ -358,7 +386,8 @@ static int blkif_queue_request(struct request *req)
 	struct blkif_request *ring_req;
 	unsigned long id;
 	unsigned int fsect, lsect;
-	int i, ref;
+	int i, ref, n;
+	struct blkif_request_segment_aligned *segments = NULL;
 
 	/*
 	 * Used to store if we are able to queue the request by just using
@@ -369,21 +398,27 @@ static int blkif_queue_request(struct request *req)
 	grant_ref_t gref_head;
 	struct grant *gnt_list_entry = NULL;
 	struct scatterlist *sg;
+	int nseg, max_grefs;
 
 	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
 		return 1;
 
-	/* Check if we have enought grants to allocate a requests */
-	if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+	max_grefs = info->max_indirect_segments ?
+	            info->max_indirect_segments +
+	            INDIRECT_GREFS(info->max_indirect_segments) :
+	            BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+	/* Check if we have enough grants to allocate a requests */
+	if (info->persistent_gnts_c < max_grefs) {
 		new_persistent_gnts = 1;
 		if (gnttab_alloc_grant_references(
-		    BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
+		    max_grefs - info->persistent_gnts_c,
 		    &gref_head) < 0) {
 			gnttab_request_free_callback(
 				&info->callback,
 				blkif_restart_queue_callback,
 				info,
-				BLKIF_MAX_SEGMENTS_PER_REQUEST);
+				max_grefs);
 			return 1;
 		}
 	} else
@@ -394,42 +429,82 @@ static int blkif_queue_request(struct request *req)
 	id = get_id_from_freelist(info);
 	info->shadow[id].request = req;
 
-	ring_req->u.rw.id = id;
-	ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
-	ring_req->u.rw.handle = info->handle;
-
-	ring_req->operation = rq_data_dir(req) ?
-		BLKIF_OP_WRITE : BLKIF_OP_READ;
-
-	if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
-		/*
-		 * Ideally we can do an unordered flush-to-disk. In case the
-		 * backend onlysupports barriers, use that. A barrier request
-		 * a superset of FUA, so we can implement it the same
-		 * way.  (It's also a FLUSH+FUA, since it is
-		 * guaranteed ordered WRT previous writes.)
-		 */
-		ring_req->operation = info->flush_op;
-	}
-
 	if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
 		/* id, sector_number and handle are set above. */
 		ring_req->operation = BLKIF_OP_DISCARD;
 		ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
+		ring_req->u.discard.id = id;
+		ring_req->u.discard.sector_number =
+			(blkif_sector_t)blk_rq_pos(req);
 		if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
 			ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
 		else
 			ring_req->u.discard.flag = 0;
 	} else {
-		ring_req->u.rw.nr_segments = blk_rq_map_sg(req->q, req,
-							   info->sg);
-		BUG_ON(ring_req->u.rw.nr_segments >
-		       BLKIF_MAX_SEGMENTS_PER_REQUEST);
-
-		for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
+		BUG_ON(info->max_indirect_segments == 0 &&
+		       req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
+		BUG_ON(info->max_indirect_segments &&
+		       req->nr_phys_segments > info->max_indirect_segments);
+		nseg = blk_rq_map_sg(req->q, req, info->sg);
+		if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+			/* Indirect OP */
+			ring_req->operation = BLKIF_OP_INDIRECT;
+			ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
+				BLKIF_OP_WRITE : BLKIF_OP_READ;
+			ring_req->u.indirect.id = id;
+			ring_req->u.indirect.sector_number =
+				(blkif_sector_t)blk_rq_pos(req);
+			ring_req->u.indirect.handle = info->handle;
+			if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
+		/*
+		 * Ideally we can do an unordered flush-to-disk. In case the
+		 * backend onlysupports barriers, use that. A barrier request
+		 * a superset of FUA, so we can implement it the same
+		 * way.  (It's also a FLUSH+FUA, since it is
+		 * guaranteed ordered WRT previous writes.)
+		 */
+				ring_req->u.indirect.indirect_op =
+					info->flush_op;
+			}
+			ring_req->u.indirect.nr_segments = nseg;
+		} else {
+			ring_req->u.rw.id = id;
+			ring_req->u.rw.sector_number =
+				(blkif_sector_t)blk_rq_pos(req);
+			ring_req->u.rw.handle = info->handle;
+			ring_req->operation = rq_data_dir(req) ?
+				BLKIF_OP_WRITE : BLKIF_OP_READ;
+			if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
+		/*
+		 * Ideally we can do an unordered flush-to-disk. In case the
+		 * backend onlysupports barriers, use that. A barrier request
+		 * a superset of FUA, so we can implement it the same
+		 * way.  (It's also a FLUSH+FUA, since it is
+		 * guaranteed ordered WRT previous writes.)
+		 */
+				ring_req->operation = info->flush_op;
+			}
+			ring_req->u.rw.nr_segments = nseg;
+		}
+		for_each_sg(info->sg, sg, nseg, i) {
 			fsect = sg->offset >> 9;
 			lsect = fsect + (sg->length >> 9) - 1;
 
+			if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
+			    (i % SEGS_PER_INDIRECT_FRAME == 0)) {
+				if (segments)
+					kunmap_atomic(segments);
+
+				n = i / SEGS_PER_INDIRECT_FRAME;
+				gnt_list_entry = get_grant(&gref_head, info);
+				info->shadow[id].indirect_grants[n] =
+					gnt_list_entry;
+				segments = kmap_atomic(
+					pfn_to_page(gnt_list_entry->pfn));
+				ring_req->u.indirect.indirect_grefs[n] =
+					gnt_list_entry->gref;
+			}
+
 			gnt_list_entry = get_grant(&gref_head, info);
 			ref = gnt_list_entry->gref;
 
@@ -461,13 +536,23 @@ static int blkif_queue_request(struct request *req)
 				kunmap_atomic(bvec_data);
 				kunmap_atomic(shared_data);
 			}
-
-			ring_req->u.rw.seg[i] =
-					(struct blkif_request_segment) {
-						.gref       = ref,
-						.first_sect = fsect,
-						.last_sect  = lsect };
+			if (ring_req->operation != BLKIF_OP_INDIRECT) {
+				ring_req->u.rw.seg[i] =
+						(struct blkif_request_segment) {
+							.gref       = ref,
+							.first_sect = fsect,
+							.last_sect  = lsect };
+			} else {
+				n = i % SEGS_PER_INDIRECT_FRAME;
+				segments[n] =
+					(struct blkif_request_segment_aligned) {
+							.gref       = ref,
+							.first_sect = fsect,
+							.last_sect  = lsect };
+			}
 		}
+		if (segments)
+			kunmap_atomic(segments);
 	}
 
 	info->ring.req_prod_pvt++;
@@ -542,7 +627,8 @@ wait:
 		flush_requests(info);
 }
 
-static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
+static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
+                                unsigned int segments)
 {
 	struct request_queue *rq;
 	struct blkfront_info *info = gd->private_data;
@@ -571,7 +657,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
 	blk_queue_max_segment_size(rq, PAGE_SIZE);
 
 	/* Ensure a merged request will fit in a single I/O ring slot. */
-	blk_queue_max_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+	blk_queue_max_segments(rq, segments);
 
 	/* Make sure buffer addresses are sector-aligned. */
 	blk_queue_dma_alignment(rq, 511);
@@ -588,13 +674,14 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
 static void xlvbd_flush(struct blkfront_info *info)
 {
 	blk_queue_flush(info->rq, info->feature_flush);
-	printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
+	printk(KERN_INFO "blkfront: %s: %s: %s %s %s\n",
 	       info->gd->disk_name,
 	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
 		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
 		"flush diskcache" : "barrier or flush"),
 	       info->feature_flush ? "enabled" : "disabled",
-	       info->feature_persistent ? "using persistent grants" : "");
+	       info->feature_persistent ? "using persistent grants" : "",
+	       info->max_indirect_segments ? "using indirect descriptors" : "");
 }
 
 static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
@@ -734,7 +821,9 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 	gd->driverfs_dev = &(info->xbdev->dev);
 	set_capacity(gd, capacity);
 
-	if (xlvbd_init_blk_queue(gd, sector_size)) {
+	if (xlvbd_init_blk_queue(gd, sector_size,
+	                         info->max_indirect_segments ? :
+	                         BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
 		del_gendisk(gd);
 		goto release;
 	}
@@ -818,6 +907,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 {
 	struct grant *persistent_gnt;
 	struct grant *n;
+	int i, j, segs;
 
 	/* Prevent new requests being issued until we fix things up. */
 	spin_lock_irq(&info->io_lock);
@@ -843,6 +933,47 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	}
 	BUG_ON(info->persistent_gnts_c != 0);
 
+	kfree(info->sg);
+	info->sg = NULL;
+	for (i = 0; i < BLK_RING_SIZE; i++) {
+		/*
+		 * Clear persistent grants present in requests already
+		 * on the shared ring
+		 */
+		if (!info->shadow[i].request)
+			goto free_shadow;
+
+		segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+		       info->shadow[i].req.u.indirect.nr_segments :
+		       info->shadow[i].req.u.rw.nr_segments;
+		for (j = 0; j < segs; j++) {
+			persistent_gnt = info->shadow[i].grants_used[j];
+			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+			__free_page(pfn_to_page(persistent_gnt->pfn));
+			kfree(persistent_gnt);
+		}
+
+		if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+			/*
+			 * If this is not an indirect operation don't try to
+			 * free indirect segments
+			 */
+			goto free_shadow;
+
+		for (j = 0; j < INDIRECT_GREFS(segs); j++) {
+			persistent_gnt = info->shadow[i].indirect_grants[j];
+			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+			__free_page(pfn_to_page(persistent_gnt->pfn));
+			kfree(persistent_gnt);
+		}
+
+free_shadow:
+		kfree(info->shadow[i].grants_used);
+		info->shadow[i].grants_used = NULL;
+		kfree(info->shadow[i].indirect_grants);
+		info->shadow[i].indirect_grants = NULL;
+	}
+
 	/* No more gnttab callback work. */
 	gnttab_cancel_free_callback(&info->callback);
 	spin_unlock_irq(&info->io_lock);
@@ -873,6 +1004,10 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 	char *bvec_data;
 	void *shared_data;
 	unsigned int offset = 0;
+	int nseg;
+
+	nseg = s->req.operation == BLKIF_OP_INDIRECT ?
+		s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
 
 	if (bret->operation == BLKIF_OP_READ) {
 		/*
@@ -885,7 +1020,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 			BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
 			if (bvec->bv_offset < offset)
 				i++;
-			BUG_ON(i >= s->req.u.rw.nr_segments);
+			BUG_ON(i >= nseg);
 			shared_data = kmap_atomic(
 				pfn_to_page(s->grants_used[i]->pfn));
 			bvec_data = bvec_kmap_irq(bvec, &flags);
@@ -897,10 +1032,17 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
 		}
 	}
 	/* Add the persistent grant into the list of free grants */
-	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
+	for (i = 0; i < nseg; i++) {
 		list_add(&s->grants_used[i]->node, &info->persistent_gnts);
 		info->persistent_gnts_c++;
 	}
+	if (s->req.operation == BLKIF_OP_INDIRECT) {
+		for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
+			list_add(&s->indirect_grants[i]->node,
+			         &info->persistent_gnts);
+			info->persistent_gnts_c++;
+		}
+	}
 }
 
 static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -1034,8 +1176,6 @@ static int setup_blkring(struct xenbus_device *dev,
 	SHARED_RING_INIT(sring);
 	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
 
-	sg_init_table(info->sg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
-
 	err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
 	if (err < 0) {
 		free_page((unsigned long)sring);
@@ -1116,12 +1256,6 @@ again:
 		goto destroy_blkring;
 	}
 
-	/* Allocate memory for grants */
-	err = fill_grant_buffer(info, BLK_RING_SIZE *
-	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
-	if (err)
-		goto out;
-
 	xenbus_switch_state(dev, XenbusStateInitialised);
 
 	return 0;
@@ -1223,13 +1357,84 @@ static int blkfront_probe(struct xenbus_device *dev,
 	return 0;
 }
 
+/*
+ * This is a clone of md_trim_bio, used to split a bio into smaller ones
+ */
+static void trim_bio(struct bio *bio, int offset, int size)
+{
+	/* 'bio' is a cloned bio which we need to trim to match
+	 * the given offset and size.
+	 * This requires adjusting bi_sector, bi_size, and bi_io_vec
+	 */
+	int i;
+	struct bio_vec *bvec;
+	int sofar = 0;
+
+	size <<= 9;
+	if (offset == 0 && size == bio->bi_size)
+		return;
+
+	bio->bi_sector += offset;
+	bio->bi_size = size;
+	offset <<= 9;
+	clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
+	while (bio->bi_idx < bio->bi_vcnt &&
+	       bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
+		/* remove this whole bio_vec */
+		offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
+		bio->bi_idx++;
+	}
+	if (bio->bi_idx < bio->bi_vcnt) {
+		bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
+		bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
+	}
+	/* avoid any complications with bi_idx being non-zero*/
+	if (bio->bi_idx) {
+		memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
+			(bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
+		bio->bi_vcnt -= bio->bi_idx;
+		bio->bi_idx = 0;
+	}
+	/* Make sure vcnt and last bv are not too big */
+	bio_for_each_segment(bvec, bio, i) {
+		if (sofar + bvec->bv_len > size)
+			bvec->bv_len = size - sofar;
+		if (bvec->bv_len == 0) {
+			bio->bi_vcnt = i;
+			break;
+		}
+		sofar += bvec->bv_len;
+	}
+}
+
+static void split_bio_end(struct bio *bio, int error)
+{
+	struct split_bio *split_bio = bio->bi_private;
+
+	if (error)
+		split_bio->err = error;
+
+	if (atomic_dec_and_test(&split_bio->pending)) {
+		split_bio->bio->bi_phys_segments = 0;
+		bio_endio(split_bio->bio, split_bio->err);
+		kfree(split_bio);
+	}
+	bio_put(bio);
+}
 
 static int blkif_recover(struct blkfront_info *info)
 {
 	int i;
-	struct blkif_request *req;
+	struct request *req, *n;
 	struct blk_shadow *copy;
-	int j;
+	int rc;
+	struct bio *bio, *cloned_bio;
+	struct bio_list bio_list, merge_bio;
+	unsigned int segs;
+	int pending, offset, size;
+	struct split_bio *split_bio;
+	struct list_head requests;
 
 	/* Stage 1: Make a safe copy of the shadow state. */
 	copy = kmalloc(sizeof(info->shadow),
@@ -1245,36 +1450,64 @@ static int blkif_recover(struct blkfront_info *info)
 	info->shadow_free = info->ring.req_prod_pvt;
 	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
 
-	/* Stage 3: Find pending requests and requeue them. */
+	rc = blkfront_setup_indirect(info);
+	if (rc) {
+		kfree(copy);
+		return rc;
+	}
+
+	segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
+	blk_queue_max_segments(info->rq, segs);
+	bio_list_init(&bio_list);
+	INIT_LIST_HEAD(&requests);
 	for (i = 0; i < BLK_RING_SIZE; i++) {
 		/* Not in use? */
 		if (!copy[i].request)
 			continue;
 
-		/* Grab a request slot and copy shadow state into it. */
-		req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
-		*req = copy[i].req;
-
-		/* We get a new request id, and must reset the shadow state. */
-		req->u.rw.id = get_id_from_freelist(info);
-		memcpy(&info->shadow[req->u.rw.id], &copy[i], sizeof(copy[i]));
-
-		if (req->operation != BLKIF_OP_DISCARD) {
-		/* Rewrite any grant references invalidated by susp/resume. */
-			for (j = 0; j < req->u.rw.nr_segments; j++)
-				gnttab_grant_foreign_access_ref(
-					req->u.rw.seg[j].gref,
-					info->xbdev->otherend_id,
-					pfn_to_mfn(copy[i].grants_used[j]->pfn),
-					0);
+		/*
+		 * Get the bios in the request so we can re-queue them.
+		 */
+		if (copy[i].request->cmd_flags &
+		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+			/*
+			 * Flush operations don't contain bios, so
+			 * we need to requeue the whole request
+			 */
+			list_add(&copy[i].request->queuelist, &requests);
+			continue;
 		}
-		info->shadow[req->u.rw.id].req = *req;
-
-		info->ring.req_prod_pvt++;
+		merge_bio.head = copy[i].request->bio;
+		merge_bio.tail = copy[i].request->biotail;
+		bio_list_merge(&bio_list, &merge_bio);
+		copy[i].request->bio = NULL;
+		blk_put_request(copy[i].request);
 	}
 
 	kfree(copy);
 
+	/*
+	 * Empty the queue, this is important because we might have
+	 * requests in the queue with more segments than what we
+	 * can handle now.
+	 */
+	spin_lock_irq(&info->io_lock);
+	while ((req = blk_fetch_request(info->rq)) != NULL) {
+		if (req->cmd_flags &
+		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+			list_add(&req->queuelist, &requests);
+			continue;
+		}
+		merge_bio.head = req->bio;
+		merge_bio.tail = req->biotail;
+		bio_list_merge(&bio_list, &merge_bio);
+		req->bio = NULL;
+		if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
+			pr_alert("diskcache flush request found!\n");
+		__blk_put_request(info->rq, req);
+	}
+	spin_unlock_irq(&info->io_lock);
+
 	xenbus_switch_state(info->xbdev, XenbusStateConnected);
 
 	spin_lock_irq(&info->io_lock);
@@ -1282,14 +1515,50 @@ static int blkif_recover(struct blkfront_info *info)
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	/* Send off requeued requests */
-	flush_requests(info);
-
 	/* Kick any other new requests queued since we resumed */
 	kick_pending_request_queues(info);
 
+	list_for_each_entry_safe(req, n, &requests, queuelist) {
+		/* Requeue pending requests (flush or discard) */
+		list_del_init(&req->queuelist);
+		BUG_ON(req->nr_phys_segments > segs);
+		blk_requeue_request(info->rq, req);
+	}
 	spin_unlock_irq(&info->io_lock);
 
+	while ((bio = bio_list_pop(&bio_list)) != NULL) {
+		/* Traverse the list of pending bios and re-queue them */
+		if (bio_segments(bio) > segs) {
+			/*
+			 * This bio has more segments than what we can
+			 * handle, we have to split it.
+			 */
+			pending = (bio_segments(bio) + segs - 1) / segs;
+			split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
+			BUG_ON(split_bio == NULL);
+			atomic_set(&split_bio->pending, pending);
+			split_bio->bio = bio;
+			for (i = 0; i < pending; i++) {
+				offset = (i * segs * PAGE_SIZE) >> 9;
+				size = MIN((segs * PAGE_SIZE) >> 9,
+				           (bio->bi_size >> 9) - offset);
+				cloned_bio = bio_clone(bio, GFP_NOIO);
+				BUG_ON(cloned_bio == NULL);
+				trim_bio(cloned_bio, offset, size);
+				cloned_bio->bi_private = split_bio;
+				cloned_bio->bi_end_io = split_bio_end;
+				submit_bio(cloned_bio->bi_rw, cloned_bio);
+			}
+			/*
+			 * Now we have to wait for all those smaller bios to
+			 * end, so we can also end the "parent" bio.
+			 */
+			continue;
+		}
+		/* We don't need to split this bio */
+		submit_bio(bio->bi_rw, bio);
+	}
+
 	return 0;
 }
 
@@ -1309,8 +1578,12 @@ static int blkfront_resume(struct xenbus_device *dev)
 	blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
 
 	err = talk_to_blkback(dev, info);
-	if (info->connected == BLKIF_STATE_SUSPENDED && !err)
-		err = blkif_recover(info);
+
+	/*
+	 * We have to wait for the backend to switch to
+	 * connected state, since we want to read which
+	 * features it supports.
+	 */
 
 	return err;
 }
@@ -1388,6 +1661,62 @@ static void blkfront_setup_discard(struct blkfront_info *info)
 	kfree(type);
 }
 
+static int blkfront_setup_indirect(struct blkfront_info *info)
+{
+	unsigned int indirect_segments, segs;
+	int err, i;
+
+	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+			    "max-indirect-segments", "%u", &indirect_segments,
+			    NULL);
+	if (err) {
+		info->max_indirect_segments = 0;
+		segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+	} else {
+		info->max_indirect_segments = MIN(indirect_segments,
+		                                  xen_blkif_max_segments);
+		segs = info->max_indirect_segments;
+	}
+	info->sg = kzalloc(sizeof(info->sg[0]) * segs, GFP_KERNEL);
+	if (info->sg == NULL)
+		goto out_of_memory;
+	sg_init_table(info->sg, segs);
+
+	err = fill_grant_buffer(info,
+	                        (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+	if (err)
+		goto out_of_memory;
+
+	for (i = 0; i < BLK_RING_SIZE; i++) {
+		info->shadow[i].grants_used = kzalloc(
+			sizeof(info->shadow[i].grants_used[0]) * segs,
+			GFP_NOIO);
+		if (info->max_indirect_segments)
+			info->shadow[i].indirect_grants = kzalloc(
+				sizeof(info->shadow[i].indirect_grants[0]) *
+				INDIRECT_GREFS(segs),
+				GFP_NOIO);
+		if ((info->shadow[i].grants_used == NULL) ||
+		     (info->max_indirect_segments &&
+		     (info->shadow[i].indirect_grants == NULL)))
+			goto out_of_memory;
+	}
+
+
+	return 0;
+
+out_of_memory:
+	kfree(info->sg);
+	info->sg = NULL;
+	for (i = 0; i < BLK_RING_SIZE; i++) {
+		kfree(info->shadow[i].grants_used);
+		info->shadow[i].grants_used = NULL;
+		kfree(info->shadow[i].indirect_grants);
+		info->shadow[i].indirect_grants = NULL;
+	}
+	return -ENOMEM;
+}
+
 /*
  * Invoked when the backend is finally 'ready' (and has told produced
  * the details about the physical device - #sectors, size, etc).
@@ -1415,8 +1744,9 @@ static void blkfront_connect(struct blkfront_info *info)
 		set_capacity(info->gd, sectors);
 		revalidate_disk(info->gd);
 
-		/* fall through */
+		return;
 	case BLKIF_STATE_SUSPENDED:
+		blkif_recover(info);
 		return;
 
 	default:
@@ -1437,6 +1767,7 @@ static void blkfront_connect(struct blkfront_info *info)
 				 info->xbdev->otherend);
 		return;
 	}
+	info->sector_size = sector_size;
 
 	info->feature_flush = 0;
 	info->flush_op = 0;
@@ -1484,6 +1815,13 @@ static void blkfront_connect(struct blkfront_info *info)
 	else
 		info->feature_persistent = persistent;
 
+	err = blkfront_setup_indirect(info);
+	if (err) {
+		xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
+				 info->xbdev->otherend);
+		return;
+	}
+
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
 	if (err) {
 		xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index 01c3d62..6d99849 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -102,6 +102,8 @@ typedef uint64_t blkif_sector_t;
  */
 #define BLKIF_OP_DISCARD           5
 
+#define BLKIF_OP_INDIRECT          6
+
 /*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
@@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
  */
 #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
 
+#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
+
+struct blkif_request_segment_aligned {
+	grant_ref_t gref;        /* reference to I/O buffer frame        */
+	/* @first_sect: first sector in frame to transfer (inclusive).   */
+	/* @last_sect: last sector in frame to transfer (inclusive).     */
+	uint8_t     first_sect, last_sect;
+	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
+} __attribute__((__packed__));
+
 struct blkif_request_rw {
 	uint8_t        nr_segments;  /* number of segments                   */
 	blkif_vdev_t   handle;       /* only for read/write requests         */
@@ -138,11 +150,24 @@ struct blkif_request_discard {
 	uint8_t        _pad3;
 } __attribute__((__packed__));
 
+struct blkif_request_indirect {
+	uint8_t        indirect_op;
+	uint16_t       nr_segments;
+#ifdef CONFIG_X86_64
+	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
+#endif
+	uint64_t       id;
+	blkif_vdev_t   handle;
+	blkif_sector_t sector_number;
+	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
+} __attribute__((__packed__));
+
 struct blkif_request {
 	uint8_t        operation;    /* BLKIF_OP_???                         */
 	union {
 		struct blkif_request_rw rw;
 		struct blkif_request_discard discard;
+		struct blkif_request_indirect indirect;
 	} u;
 } __attribute__((__packed__));
 
-- 
1.7.7.5 (Apple Git-26)


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 00/12] xen-block: indirect descriptors
  2013-02-28 10:28 [PATCH RFC 00/12] xen-block: indirect descriptors Roger Pau Monne
                   ` (11 preceding siblings ...)
  2013-02-28 10:28 ` [PATCH RFC 12/12] xen-block: implement indirect descriptors Roger Pau Monne
@ 2013-02-28 10:49 ` Jan Beulich
  2013-02-28 11:25   ` Roger Pau Monné
  12 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 10:49 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, linux-kernel

>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> This series contains the initial implementation of indirect 
> descriptors for Linux blkback/blkfront.
> 
> Patches 1, 2, 3, 4 and 5 are bug fixes and minor optimizations.
> 
> Patch 6 contains a LRU implementation for blkback that will be needed 
> when using indirect descriptors (since we are no longer able to map 
> all possible grants blkfront might use).

Considering this, ...

> Patch 7 is an addition to the print stats function in blkback in order 
> to print information regarding persistent grant usage.
> 
> Patches 8, 9, 10 and 11 are preparatory work for indirect descriptors 
> implementation, mainly make blkback use dynamic memory and remove the 
> shared blkbk structure, so each blkback instance has it's own list of 
> free requests, pages, handles and so on.
> 
> Finally patch 12 contains the indirect descriptors implementation.
> 
> I've also pushed this series to the following git repository:
> 
> git://xenbits.xen.org/people/royger/linux.git xen-block-indirect
> 
> Performance benefit of this series can be seen in the following graph:
> 
> http://xenbits.xen.org/people/royger/plot_indirect.png 

... would you happen to also have a comparison with using
indirect descriptors but not persistent grants? IOW I'm
wondering about the hit rate on the persistently mapped
grants, especially when blkfront really saturates the added
bandwidth.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr
  2013-02-28 10:28 ` [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr Roger Pau Monne
@ 2013-02-28 10:58   ` Jan Beulich
  2013-03-04 17:19     ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 10:58 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> dev_bus_addr returned in the grant ref map operation is the mfn of the
> passed page, there's no need to store it in the persistent grant
> entry, since we can always get it provided that we have the page.

Interesting that you come up with this, as I have a similar patch
pending (not posted yet), aiming at reducing the stack usage in
dispatch_rw_block_io(): seg[].buf is really unnecessary with the
dev_bus_addr storing removed, as the only reader of that field
can equally well use req->u.rw.seg[i].first_sect.

And then the biolist[] array really can be folded into a union
with the remaining seg[] one, as their usage scopes are easily
separable.

> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -621,9 +621,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>  				 * If this is a new persistent grant
>  				 * save the handler
>  				 */
> -				persistent_gnts[i]->handle = map[j].handle;
> -				persistent_gnts[i]->dev_bus_addr =
> -					map[j++].dev_bus_addr;
> +				persistent_gnts[i]->handle = map[j++].handle;
>  			}
>  			pending_handle(pending_req, i) =
>  				persistent_gnts[i]->handle;
> @@ -631,7 +629,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			if (ret)
>  				continue;
>  
> -			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
> +			seg[i].buf = pfn_to_mfn(page_to_pfn(
> +				persistent_gnts[i]->page)) << PAGE_SHIFT |

So why do you do this? The only reader masks the field with
~PAGE_MASK anyway.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 09/12] xen-blkback: move pending handles list from blkbk to pending_req
  2013-02-28 10:28 ` [PATCH RFC 09/12] xen-blkback: move pending handles list from blkbk to pending_req Roger Pau Monne
@ 2013-02-28 11:07   ` Jan Beulich
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 11:07 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> Moving grant ref handles from blkbk to pending_req will allow us to
> get rid of the shared blkbk structure.

At the expense of (slightly?) higher memory requirements?

> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -136,6 +136,7 @@ struct pending_req {
>  	struct list_head	free_list;
>  	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];

Adding yet another array here makes it even more desirable to
switch from multiple arrays to a singly array of a structure, thus
improving locality of the memory accesses involved in processing
an individual segment.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 10/12] xen-blkback: make the queue of free requests per backend
  2013-02-28 10:28 ` [PATCH RFC 10/12] xen-blkback: make the queue of free requests per backend Roger Pau Monne
@ 2013-02-28 11:08   ` Jan Beulich
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 11:08 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> @@ -122,6 +125,19 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>  	spin_lock_init(&blkif->free_pages_lock);
>  	INIT_LIST_HEAD(&blkif->free_pages);
>  	blkif->free_pages_num = 0;
> +	blkif->pending_reqs = kzalloc(sizeof(blkif->pending_reqs[0]) *
> +	                              xen_blkif_reqs, GFP_KERNEL);

kcalloc() is preferred in cases like this, I believe.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 10:28 ` [PATCH RFC 12/12] xen-block: implement indirect descriptors Roger Pau Monne
@ 2013-02-28 11:19   ` Jan Beulich
  2013-02-28 12:00     ` Roger Pau Monné
  2013-03-04 20:41   ` Konrad Rzeszutek Wilk
  2013-03-18 17:06   ` Roger Pau Monné
  2 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 11:19 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> @@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
>   */
>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>  
> +#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
> +
> +struct blkif_request_segment_aligned {
> +	grant_ref_t gref;        /* reference to I/O buffer frame        */
> +	/* @first_sect: first sector in frame to transfer (inclusive).   */
> +	/* @last_sect: last sector in frame to transfer (inclusive).     */
> +	uint8_t     first_sect, last_sect;
> +	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
> +} __attribute__((__packed__));

What's the __packed__ for here?

> +
>  struct blkif_request_rw {
>  	uint8_t        nr_segments;  /* number of segments                   */
>  	blkif_vdev_t   handle;       /* only for read/write requests         */
> @@ -138,11 +150,24 @@ struct blkif_request_discard {
>  	uint8_t        _pad3;
>  } __attribute__((__packed__));
>  
> +struct blkif_request_indirect {
> +	uint8_t        indirect_op;
> +	uint16_t       nr_segments;
> +#ifdef CONFIG_X86_64
> +	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
> +#endif

Either you want the structure be packed tightly (and you don't care
about misaligned fields), in which case you shouldn't need a padding
field. That's even more so as there's no padding between indirect_op
and nr_segments, so everything is misaligned anyway, and the
comment above is wrong too (offsetof() really ought to yield 7 in
that case).

Or you want the structure fields aligned, in which case you again
ought to drop the use of the __packed__ attribute and introduce
_all_ necessary padding fields.

> +	uint64_t       id;
> +	blkif_vdev_t   handle;
> +	blkif_sector_t sector_number;
> +	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> +} __attribute__((__packed__));

And then it would be quite nice for new features to no longer
require translation between a 32- and a 64-bit layout at all.

Plus, rather than introducing uninitialized padding fields, I'd
suggest using fields that are required to be zero initialized, to
allow giving them a meaning later.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 00/12] xen-block: indirect descriptors
  2013-02-28 10:49 ` [Xen-devel] [PATCH RFC 00/12] xen-block: " Jan Beulich
@ 2013-02-28 11:25   ` Roger Pau Monné
  2013-02-28 11:35     ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-02-28 11:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, linux-kernel

On 28/02/13 11:49, Jan Beulich wrote:
>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>> This series contains the initial implementation of indirect 
>> descriptors for Linux blkback/blkfront.
>>
>> Patches 1, 2, 3, 4 and 5 are bug fixes and minor optimizations.
>>
>> Patch 6 contains a LRU implementation for blkback that will be needed 
>> when using indirect descriptors (since we are no longer able to map 
>> all possible grants blkfront might use).
> 
> Considering this, ...
> 
>> Patch 7 is an addition to the print stats function in blkback in order 
>> to print information regarding persistent grant usage.
>>
>> Patches 8, 9, 10 and 11 are preparatory work for indirect descriptors 
>> implementation, mainly make blkback use dynamic memory and remove the 
>> shared blkbk structure, so each blkback instance has it's own list of 
>> free requests, pages, handles and so on.
>>
>> Finally patch 12 contains the indirect descriptors implementation.
>>
>> I've also pushed this series to the following git repository:
>>
>> git://xenbits.xen.org/people/royger/linux.git xen-block-indirect
>>
>> Performance benefit of this series can be seen in the following graph:
>>
>> http://xenbits.xen.org/people/royger/plot_indirect.png 
> 
> ... would you happen to also have a comparison with using
> indirect descriptors but not persistent grants? IOW I'm
> wondering about the hit rate on the persistently mapped
> grants, especially when blkfront really saturates the added
> bandwidth.

This is the expanded graph that also contains indirect descriptors
without persistent grants:

http://xenbits.xen.org/people/royger/plot_indirect_nopers.png


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 00/12] xen-block: indirect descriptors
  2013-02-28 11:25   ` Roger Pau Monné
@ 2013-02-28 11:35     ` Jan Beulich
  2013-02-28 11:44       ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 11:35 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, linux-kernel

>>> On 28.02.13 at 12:25, Roger Pau Monné<roger.pau@citrix.com> wrote:
> This is the expanded graph that also contains indirect descriptors
> without persistent grants:
> 
> http://xenbits.xen.org/people/royger/plot_indirect_nopers.png 

Thanks. Interesting - this suggests an unexpectedly high hit rate.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 00/12] xen-block: indirect descriptors
  2013-02-28 11:35     ` Jan Beulich
@ 2013-02-28 11:44       ` Roger Pau Monné
  0 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-02-28 11:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, linux-kernel

On 28/02/13 12:35, Jan Beulich wrote:
>>>> On 28.02.13 at 12:25, Roger Pau Monné<roger.pau@citrix.com> wrote:
>> This is the expanded graph that also contains indirect descriptors
>> without persistent grants:
>>
>> http://xenbits.xen.org/people/royger/plot_indirect_nopers.png 
> 
> Thanks. Interesting - this suggests an unexpectedly high hit rate.

This graph is using the default values, so we are persistently mapping
1024 grants of the possible 2080 that could be used by blkfront ((64
segments + 1 indirect gref) * 32 = 2080).

blkfront stores grants using a stack, so it should try to reuse the same
grants as much as possible. On the other hand, blkback cleans unused
grants periodically, to try to have the most commonly used blkfront
grants mapped.
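
To illustrate the stack behaviour, this is the pattern from the patches
quoted elsewhere in this thread - grants are popped from and pushed back
to the head of the same list, so the most recently used ones are reused
first:

	/* get_grant(): pop the most recently returned grant */
	gnt_list_entry = list_first_entry(&info->persistent_gnts,
	                                  struct grant, node);
	list_del(&gnt_list_entry->node);

	/* blkif_completion(): push the grant back once the I/O is done */
	list_add(&s->grants_used[i]->node, &info->persistent_gnts);
	info->persistent_gnts_c++;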


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 11:19   ` [Xen-devel] " Jan Beulich
@ 2013-02-28 12:00     ` Roger Pau Monné
  2013-02-28 13:28       ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-02-28 12:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

On 28/02/13 12:19, Jan Beulich wrote:
>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>> @@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
>>   */
>>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>>  
>> +#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
>> +
>> +struct blkif_request_segment_aligned {
>> +	grant_ref_t gref;        /* reference to I/O buffer frame        */
>> +	/* @first_sect: first sector in frame to transfer (inclusive).   */
>> +	/* @last_sect: last sector in frame to transfer (inclusive).     */
>> +	uint8_t     first_sect, last_sect;
>> +	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
>> +} __attribute__((__packed__));
> 
> What's the __packed__ for here?

Yes, that's not needed.

> 
>> +
>>  struct blkif_request_rw {
>>  	uint8_t        nr_segments;  /* number of segments                   */
>>  	blkif_vdev_t   handle;       /* only for read/write requests         */
>> @@ -138,11 +150,24 @@ struct blkif_request_discard {
>>  	uint8_t        _pad3;
>>  } __attribute__((__packed__));
>>  
>> +struct blkif_request_indirect {
>> +	uint8_t        indirect_op;
>> +	uint16_t       nr_segments;
>> +#ifdef CONFIG_X86_64
>> +	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
>> +#endif
> 
> Either you want the structure be packed tightly (and you don't care
> about misaligned fields), in which case you shouldn't need a padding
> field. That's even more so as there's no padding between indirect_op
> and nr_segments, so everything is misaligned anyway, and the
> comment above is wrong too (offsetof() really ought to yield 7 in
> that case).

This padding is there because we want the "id" field at the same position
as in blkif_request_rw, so we need the padding for it to match both the 32
and 64 bit blkif_request_rw structures. This avoids having to add an
"if (req.op == BLKIF_OP_INDIRECT)..." check when we only need to get the
id of the request.
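
A quick user-space sketch of the layout in question (the _pad1 in the rw
variant mirrors the existing 64-bit blkif_request_rw layout, the rest
follows this patch; trailing fields are omitted since they don't affect
the offset of "id"):

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>

	typedef uint16_t blkif_vdev_t;

	struct rw64 {                  /* 64-bit blkif_request_rw, head only */
		uint8_t        nr_segments;
		blkif_vdev_t   handle;
		uint32_t       _pad1;  /* present on 64-bit builds */
		uint64_t       id;
	} __attribute__((__packed__));

	struct ind64 {                 /* 64-bit blkif_request_indirect, head only */
		uint8_t        indirect_op;
		uint16_t       nr_segments;
		uint32_t       _pad1;  /* the padding being discussed */
		uint64_t       id;
	} __attribute__((__packed__));

	struct req64 {
		uint8_t operation;     /* the 1 byte outside the union */
		union {
			struct rw64 rw;
			struct ind64 indirect;
		} u;
	} __attribute__((__packed__));

	int main(void)
	{
		/* both print 8, so the backend can read "id" without
		 * looking at the operation type first */
		printf("%zu %zu\n", offsetof(struct req64, u.rw.id),
		       offsetof(struct req64, u.indirect.id));
		return 0;
	}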

The comment is indeed wrong; I copied it from blkif_request_discard and
forgot to change the offset.

> 
> Or you want the structure fields aligned, in which case you again
> ought to drop the use of the __packed__ attribute and introduce
> _all_ necessary padding fields.
> 
>> +	uint64_t       id;
>> +	blkif_vdev_t   handle;
>> +	blkif_sector_t sector_number;
>> +	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
>> +} __attribute__((__packed__));
> 
> And then it would be quite nice for new features to no longer
> require translation between a 32- and a 64-bit layout at all.

The translation is caused by the id issue described above.

> Plus, rather than introducing uninitialized padding fields, I'd
> suggest using fields that are required to be zero initialized, to
> allow giving them a meaning later.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 12:00     ` Roger Pau Monné
@ 2013-02-28 13:28       ` Jan Beulich
  2013-03-04 20:44         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-02-28 13:28 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 28.02.13 at 13:00, Roger Pau Monné<roger.pau@citrix.com> wrote:
> On 28/02/13 12:19, Jan Beulich wrote:
>>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>>> @@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
>>>   */
>>>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>>>  
>>> +#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
>>> +
>>> +struct blkif_request_segment_aligned {
>>> +	grant_ref_t gref;        /* reference to I/O buffer frame        */
>>> +	/* @first_sect: first sector in frame to transfer (inclusive).   */
>>> +	/* @last_sect: last sector in frame to transfer (inclusive).     */
>>> +	uint8_t     first_sect, last_sect;
>>> +	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
>>> +} __attribute__((__packed__));
>> 
>> What's the __packed__ for here?
> 
> Yes, that's not needed.
> 
>> 
>>> +
>>>  struct blkif_request_rw {
>>>  	uint8_t        nr_segments;  /* number of segments                   */
>>>  	blkif_vdev_t   handle;       /* only for read/write requests         */
>>> @@ -138,11 +150,24 @@ struct blkif_request_discard {
>>>  	uint8_t        _pad3;
>>>  } __attribute__((__packed__));
>>>  
>>> +struct blkif_request_indirect {
>>> +	uint8_t        indirect_op;
>>> +	uint16_t       nr_segments;
>>> +#ifdef CONFIG_X86_64
>>> +	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
>>> +#endif
>> 
>> Either you want the structure be packed tightly (and you don't care
>> about misaligned fields), in which case you shouldn't need a padding
>> field. That's even more so as there's no padding between indirect_op
>> and nr_segments, so everything is misaligned anyway, and the
>> comment above is wrong too (offsetof() really ought to yield 7 in
>> that case).
> 
> This padding is because we want to have the "id" field at the same
> position as blkif_request_rw, so we need to add the padding for it to
> match 32 & 64 bit blkif_request_rw structures, this prevents adding some
> "if (req.op == BLKIF_OP_INDIRECT)..." if we only need to get the id of
> the request.

Oh, right, that's desirable of course.

> The comment is indeed wrong, I've copied it from blkif_request_discard
> and forgot to change the offset

But the offset stated there then is right after all - I forgot that
there is a 1-byte field outside the union (the way this is being done
in the upstream Linux header is really ugly imo, but I guess Jeremy
and/or Konrad liked it that way). That's also why the packed
attribute is needed here.

But you will probably want to switch sector_number and handle, so
that sector_number becomes aligned, and add another 16-bit
padding field between handle and indirect_grefs[].

I also wonder whether "indirect_op" wouldn't be better named "actual_op"
or just "op".

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr
  2013-02-28 10:58   ` [Xen-devel] " Jan Beulich
@ 2013-03-04 17:19     ` Roger Pau Monné
  2013-03-05  8:06       ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-04 17:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

On 28/02/13 11:58, Jan Beulich wrote:
>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>> dev_bus_addr returned in the grant ref map operation is the mfn of the
>> passed page, there's no need to store it in the persistent grant
>> entry, since we can always get it provided that we have the page.
> 
> Interesting that you come up with this, as I have a similar patch
> pending (not posted yet), aiming at reducing the stack usage in
> dispatch_rw_block_io(): seg[].buf is really unnecessary with the
> dev_bus_addr storing removed, as the only reader of that field
> can equally well use req->u.rw.seg[i].first_sect.

Well, it can if we are not using indirect descriptors. Once we start using
indirect descriptors the segments live inside a gref frame, so it's quite
convenient to store first_sect in a separate array; this way we can map the
indirect segments, copy whatever data we need from them and unmap them,
without keeping them around for the whole lifetime of the request.
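
Roughly, the idea looks like this (a sketch only; SEGS_PER_INDIRECT_FRAME,
indirect_pages[] and the seg[] field names are illustrative, the real code
is in patch 12):

	struct blkif_request_segment_aligned *segments = NULL;
	int i;

	for (i = 0; i < nseg; i++) {
		if ((i % SEGS_PER_INDIRECT_FRAME) == 0) {
			if (segments)
				kunmap_atomic(segments);
			/* indirect gref page was mapped beforehand */
			segments = kmap_atomic(
				indirect_pages[i / SEGS_PER_INDIRECT_FRAME]);
		}
		/* copy only what we need, so the page can be unmapped */
		seg[i].gref       = segments[i % SEGS_PER_INDIRECT_FRAME].gref;
		seg[i].first_sect = segments[i % SEGS_PER_INDIRECT_FRAME].first_sect;
		seg[i].last_sect  = segments[i % SEGS_PER_INDIRECT_FRAME].last_sect;
	}
	if (segments)
		kunmap_atomic(segments);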

> 
> And then the biolist[] array really can be folded into a union
> with the remaining seg[] one, as their usage scopes are easily
> separable.

Could we leave that for a further patch? I would like to avoid messing
any more with blkback, as I'm already touching a lot of bits with this
patch series.

> 
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -621,9 +621,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>>  				 * If this is a new persistent grant
>>  				 * save the handler
>>  				 */
>> -				persistent_gnts[i]->handle = map[j].handle;
>> -				persistent_gnts[i]->dev_bus_addr =
>> -					map[j++].dev_bus_addr;
>> +				persistent_gnts[i]->handle = map[j++].handle;
>>  			}
>>  			pending_handle(pending_req, i) =
>>  				persistent_gnts[i]->handle;
>> @@ -631,7 +629,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>>  			if (ret)
>>  				continue;
>>  
>> -			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>> +			seg[i].buf = pfn_to_mfn(page_to_pfn(
>> +				persistent_gnts[i]->page)) << PAGE_SHIFT |
> 
> So why do you do this? The only reader masks the field with
> ~PAGE_MASK anyway.

Yes, I only need to store first_sect.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-02-28 10:28 ` [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests Roger Pau Monne
@ 2013-03-04 19:39   ` Konrad Rzeszutek Wilk
  2013-03-05 11:04     ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-04 19:39 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: linux-kernel, xen-devel

On Thu, Feb 28, 2013 at 11:28:47AM +0100, Roger Pau Monne wrote:
> This prevents us from having to call alloc_page while we are preparing
> the request. Since blkfront was calling alloc_page with a spinlock
> held we used GFP_ATOMIC, which can fail if we are requesting a lot of
> pages since it is using the emergency memory pools.
> 
> Allocating all the pages at init prevents us from having to call
> alloc_page, thus preventing possible failures.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: xen-devel@lists.xen.org
> ---
>  drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
>  1 files changed, 79 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 2e39eaf..5ba6b87 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
>  	return 0;
>  }
>  
> +static int fill_grant_buffer(struct blkfront_info *info, int num)
> +{
> +	struct page *granted_page;
> +	struct grant *gnt_list_entry, *n;
> +	int i = 0;
> +
> +	while(i < num) {
> +		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);

GFP_NORMAL ?

> +		if (!gnt_list_entry)
> +			goto out_of_memory;

Hmm, I guess another patch could convert this to a fail-safe mechanism,
meaning that if we fail here we just cap the maximum number of grants we
have at 'i'.
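
Something along these lines, perhaps - just a sketch of that capping idea,
based on the fill_grant_buffer() quoted in this hunk (the out_of_memory
unwind would then go away and the caller would treat the return value as
the number of grants actually preallocated):

	static int fill_grant_buffer(struct blkfront_info *info, int num)
	{
		struct grant *gnt_list_entry;
		struct page *granted_page;
		int i;

		for (i = 0; i < num; i++) {
			gnt_list_entry = kzalloc(sizeof(*gnt_list_entry), GFP_NOIO);
			if (!gnt_list_entry)
				break;
			granted_page = alloc_page(GFP_NOIO);
			if (!granted_page) {
				kfree(gnt_list_entry);
				break;
			}
			gnt_list_entry->pfn = page_to_pfn(granted_page);
			gnt_list_entry->gref = GRANT_INVALID_REF;
			list_add(&gnt_list_entry->node, &info->persistent_gnts);
		}

		return i; /* cap at whatever we managed to allocate */
	}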


> +
> +		granted_page = alloc_page(GFP_NOIO);

GFP_NORMAL

> +		if (!granted_page) {
> +			kfree(gnt_list_entry);
> +			goto out_of_memory;
> +		}
> +
> +		gnt_list_entry->pfn = page_to_pfn(granted_page);
> +		gnt_list_entry->gref = GRANT_INVALID_REF;
> +		list_add(&gnt_list_entry->node, &info->persistent_gnts);
> +		i++;
> +	}
> +
> +	return 0;
> +
> +out_of_memory:
> +	list_for_each_entry_safe(gnt_list_entry, n,
> +	                         &info->persistent_gnts, node) {
> +		list_del(&gnt_list_entry->node);
> +		__free_page(pfn_to_page(gnt_list_entry->pfn));
> +		kfree(gnt_list_entry);
> +		i--;
> +	}
> +	BUG_ON(i != 0);
> +	return -ENOMEM;
> +}
> +
> +static struct grant *get_grant(grant_ref_t *gref_head,
> +                               struct blkfront_info *info)
> +{
> +	struct grant *gnt_list_entry;
> +	unsigned long buffer_mfn;
> +
> +	BUG_ON(list_empty(&info->persistent_gnts));
> +	gnt_list_entry = list_first_entry(&info->persistent_gnts, struct grant,
> +	                                  node);
> +	list_del(&gnt_list_entry->node);
> +
> +	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
> +		info->persistent_gnts_c--;
> +		return gnt_list_entry;
> +	}
> +
> +	/* Assign a gref to this page */
> +	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
> +	BUG_ON(gnt_list_entry->gref == -ENOSPC);
> +	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> +	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
> +	                                info->xbdev->otherend_id,
> +	                                buffer_mfn, 0);
> +	return gnt_list_entry;
> +}
> +
>  static const char *op_name(int op)
>  {
>  	static const char *const names[] = {
> @@ -306,7 +369,6 @@ static int blkif_queue_request(struct request *req)
>  	 */
>  	bool new_persistent_gnts;
>  	grant_ref_t gref_head;
> -	struct page *granted_page;
>  	struct grant *gnt_list_entry = NULL;
>  	struct scatterlist *sg;
>  
> @@ -370,42 +432,9 @@ static int blkif_queue_request(struct request *req)
>  			fsect = sg->offset >> 9;
>  			lsect = fsect + (sg->length >> 9) - 1;
>  
> -			if (info->persistent_gnts_c) {
> -				BUG_ON(list_empty(&info->persistent_gnts));
> -				gnt_list_entry = list_first_entry(
> -				                      &info->persistent_gnts,
> -				                      struct grant, node);
> -				list_del(&gnt_list_entry->node);
> -
> -				ref = gnt_list_entry->gref;
> -				buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> -				info->persistent_gnts_c--;
> -			} else {
> -				ref = gnttab_claim_grant_reference(&gref_head);
> -				BUG_ON(ref == -ENOSPC);
> -
> -				gnt_list_entry =
> -					kmalloc(sizeof(struct grant),
> -							 GFP_ATOMIC);
> -				if (!gnt_list_entry)
> -					return -ENOMEM;
> -
> -				granted_page = alloc_page(GFP_ATOMIC);
> -				if (!granted_page) {
> -					kfree(gnt_list_entry);
> -					return -ENOMEM;
> -				}
> -
> -				gnt_list_entry->pfn =
> -					page_to_pfn(granted_page);
> -				gnt_list_entry->gref = ref;
> -
> -				buffer_mfn = pfn_to_mfn(page_to_pfn(
> -								granted_page));
> -				gnttab_grant_foreign_access_ref(ref,
> -					info->xbdev->otherend_id,
> -					buffer_mfn, 0);
> -			}
> +			gnt_list_entry = get_grant(&gref_head, info);
> +			ref = gnt_list_entry->gref;
> +			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>  
>  			info->shadow[id].grants_used[i] = gnt_list_entry;
>  
> @@ -803,17 +832,20 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  		blk_stop_queue(info->rq);
>  
>  	/* Remove all persistent grants */
> -	if (info->persistent_gnts_c) {
> +	if (!list_empty(&info->persistent_gnts)) {
>  		list_for_each_entry_safe(persistent_gnt, n,
>  		                         &info->persistent_gnts, node) {
>  			list_del(&persistent_gnt->node);
> -			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> +			if (persistent_gnt->gref != GRANT_INVALID_REF) {
> +				gnttab_end_foreign_access(persistent_gnt->gref,
> +				                          0, 0UL);
> +				info->persistent_gnts_c--;
> +			}
>  			__free_page(pfn_to_page(persistent_gnt->pfn));
>  			kfree(persistent_gnt);
> -			info->persistent_gnts_c--;
>  		}
> -		BUG_ON(info->persistent_gnts_c != 0);
>  	}
> +	BUG_ON(info->persistent_gnts_c != 0);

So if the guest _never_ sent any I/Os and just attached/detached the device - won't
we fail here?
>  
>  	/* No more gnttab callback work. */
>  	gnttab_cancel_free_callback(&info->callback);
> @@ -1088,6 +1120,12 @@ again:
>  		goto destroy_blkring;
>  	}
>  
> +	/* Allocate memory for grants */
> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +	if (err)
> +		goto out;

That looks to be in the wrong function - the talk_to_blkback function is
for talking to blkback, not for doing initialization-type operations.

Also I think this means that on resume we would try to allocate the
grants again?
> +
>  	xenbus_switch_state(dev, XenbusStateInitialised);
>  
>  	return 0;
> -- 
> 1.7.7.5 (Apple Git-26)
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants
  2013-02-28 10:28 ` [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants Roger Pau Monne
@ 2013-03-04 20:10   ` Konrad Rzeszutek Wilk
  2013-03-05 18:10     ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-04 20:10 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: linux-kernel, xen-devel

On Thu, Feb 28, 2013 at 11:28:49AM +0100, Roger Pau Monne wrote:
> This mechanism allows blkback to change the number of grants
> persistently mapped at run time.
> 
> The algorithm uses a simple LRU mechanism that removes (if needed) the
> persistent grants that have not been used since the last LRU run, or
> if all grants have been used it removes the first grants in the list
> (that are not in use).
> 
> The algorithm has several parameters that can be tuned by the user
> from sysfs:
> 
>  * max_persistent_grants: maximum number of grants that will be
>    persistently mapped.
>  * lru_interval: minimum interval (in ms) at which the LRU should be
>    run
>  * lru_num_clean: number of persistent grants to remove when executing
>    the LRU.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: xen-devel@lists.xen.org
> ---
>  drivers/block/xen-blkback/blkback.c |  207 +++++++++++++++++++++++++++--------
>  drivers/block/xen-blkback/common.h  |    4 +
>  drivers/block/xen-blkback/xenbus.c  |    1 +
>  3 files changed, 166 insertions(+), 46 deletions(-)

You should also add an entry to Documentation/ABI/stable/sysfs-bus-xen-backend.

> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 415a0c7..c14b736 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -63,6 +63,44 @@ static int xen_blkif_reqs = 64;
>  module_param_named(reqs, xen_blkif_reqs, int, 0);
>  MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
>  
> +/*
> + * Maximum number of grants to map persistently in blkback. For maximum
> + * performance this should be the total numbers of grants that can be used
> + * to fill the ring, but since this might become too high, specially with
> + * the use of indirect descriptors, we set it to a value that provides good
> + * performance without using too much memory.
> + *
> + * When the list of persistent grants is full we clean it using a LRU
> + * algorithm.
> + */
> +
> +static int xen_blkif_max_pgrants = 352;

And a little blurb saying why 352.

> +module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
> +MODULE_PARM_DESC(max_persistent_grants,
> +                 "Maximum number of grants to map persistently");
> +
> +/*
> + * The LRU mechanism to clean the lists of persistent grants needs to
> + * be executed periodically. The time interval between consecutive executions
> + * of the purge mechanism is set in ms.
> + */
> +
> +static int xen_blkif_lru_interval = 100;

So every 100 ms? What is the benefit of having the user modify this? Would
it be better if there was a watermark system in xen-blkfront to automatically
figure this out? (This could be a TODO of course)

> +module_param_named(lru_interval, xen_blkif_lru_interval, int, 0644);
> +MODULE_PARM_DESC(lru_interval,
> +"Execution interval (in ms) of the LRU mechanism to clean the list of persistent grants");
> +
> +/*
> + * When the persistent grants list is full we will remove unused grants
> + * from the list. The number of grants to be removed at each LRU execution
> + * can be set dynamically.
> + */
> +
> +static int xen_blkif_lru_num_clean = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
> +MODULE_PARM_DESC(lru_num_clean,
> +"Number of persistent grants to unmap when the list is full");

Again, what does that mean to the system admin? Why would they need to
modify that value?


Now if this is a debug-related knob for development, then this could all be
done in debugfs.

> +
>  /* Run-time switchable: /sys/module/blkback/parameters/ */
>  static unsigned int log_stats;
>  module_param(log_stats, int, 0644);
> @@ -81,7 +119,7 @@ struct pending_req {
>  	unsigned short		operation;
>  	int			status;
>  	struct list_head	free_list;
> -	DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  };
>  
>  #define BLKBACK_INVALID_HANDLE (~0)
> @@ -102,36 +140,6 @@ struct xen_blkbk {
>  static struct xen_blkbk *blkbk;
>  
>  /*
> - * Maximum number of grant pages that can be mapped in blkback.
> - * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> - * pages that blkback will persistently map.
> - * Currently, this is:
> - * RING_SIZE = 32 (for all known ring types)
> - * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
> - * sizeof(struct persistent_gnt) = 48
> - * So the maximum memory used to store the grants is:
> - * 32 * 11 * 48 = 16896 bytes
> - */
> -static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> -{
> -	switch (protocol) {
> -	case BLKIF_PROTOCOL_NATIVE:
> -		return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> -			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
> -	case BLKIF_PROTOCOL_X86_32:
> -		return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> -			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
> -	case BLKIF_PROTOCOL_X86_64:
> -		return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> -			   BLKIF_MAX_SEGMENTS_PER_REQUEST;
> -	default:
> -		BUG();
> -	}
> -	return 0;
> -}
> -
> -
> -/*
>   * Little helpful macro to figure out the index and virtual address of the
>   * pending_pages[..]. For each 'pending_req' we have have up to
>   * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> @@ -251,6 +259,76 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
>  	BUG_ON(num != 0);
>  }
>  
> +static int purge_persistent_gnt(struct rb_root *root, int num)
> +{
> +	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct persistent_gnt *persistent_gnt;
> +	struct rb_node *n;
> +	int ret, segs_to_unmap = 0;
> +	int requested_num = num;
> +	int preserve_used = 1;

Boolean? And perhaps 'scan_dirty' ?


> +
> +	pr_debug("Requested the purge of %d persistent grants\n", num);
> +
> +purge_list:

This could be written a bit differently to also run outside xen_blkif_schedule
(so a new thread). This would require using the lock mechanism and converting
this big loop into two smaller loops:
 1) one quick loop that holds the lock - to take the items off the list,
 2) a second one to do the gnttab_set_unmap_op operations and all the heavy
    free_xenballooned_pages calls.

.. As this function ends up (presumably?) causing xen_blkif_schedule to be doing
this for some time at every LRU interval. Regardless of how utilized the ring is -
so if we are 100% busy - we should not need to call this function. But if we do,
then we end up walking the persistent_gnt tree twice - once with preserve_used set
to true, and the other with it set to false.

We don't really want that - so is there a way for xen_blkif_schedule to
do a quick determination of this caliber:


	if (RING_HAS_UNCONSUMED_REQUESTS(x) >= some_value)
		wake_up(&blkif->purgarator)

And the thread would just sit there until kicked into action?
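
Something like this, perhaps - a very rough sketch of the two-phase idea
(the purge_wq, purge_requested and pgnt_lock members, the remove_node list
hook and the pick_purge_victims() helper are all made up for illustration):

	static int purge_thread(void *arg)
	{
		struct xen_blkif *blkif = arg;
		struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
		struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
		struct persistent_gnt *g, *n;
		LIST_HEAD(victims);
		int segs = 0;

		while (!kthread_should_stop()) {
			wait_event_interruptible(blkif->purge_wq,
				blkif->purge_requested || kthread_should_stop());

			/* 1) short critical section: only pick the victims */
			spin_lock(&blkif->pgnt_lock);
			pick_purge_victims(blkif, &victims);
			spin_unlock(&blkif->pgnt_lock);

			/* 2) heavy lifting (unmap + balloon) without the lock */
			list_for_each_entry_safe(g, n, &victims, remove_node) {
				gnttab_set_unmap_op(&unmap[segs],
					(unsigned long)pfn_to_kaddr(page_to_pfn(g->page)),
					GNTMAP_host_map, g->handle);
				pages[segs] = g->page;
				if (++segs == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
					BUG_ON(gnttab_unmap_refs(unmap, NULL, pages, segs));
					free_xenballooned_pages(segs, pages);
					segs = 0;
				}
				list_del(&g->remove_node);
				kfree(g);
			}
			if (segs) {
				BUG_ON(gnttab_unmap_refs(unmap, NULL, pages, segs));
				free_xenballooned_pages(segs, pages);
				segs = 0;
			}
		}
		return 0;
	}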
					

> +	foreach_grant_safe(persistent_gnt, n, root, node) {
> +		BUG_ON(persistent_gnt->handle ==
> +			BLKBACK_INVALID_HANDLE);
> +
> +		if (persistent_gnt->flags & PERSISTENT_GNT_ACTIVE)
> +			continue;
> +		if (preserve_used &&
> +		    (persistent_gnt->flags & PERSISTENT_GNT_USED))

Is that similar to DIRTY on pagetables?

> +			continue;
> +
> +		gnttab_set_unmap_op(&unmap[segs_to_unmap],
> +			(unsigned long) pfn_to_kaddr(page_to_pfn(
> +				persistent_gnt->page)),
> +			GNTMAP_host_map,
> +			persistent_gnt->handle);
> +
> +		pages[segs_to_unmap] = persistent_gnt->page;
> +
> +		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +			ret = gnttab_unmap_refs(unmap, NULL, pages,
> +				segs_to_unmap);
> +			BUG_ON(ret);
> +			free_xenballooned_pages(segs_to_unmap, pages);
> +			segs_to_unmap = 0;
> +		}
> +
> +		rb_erase(&persistent_gnt->node, root);
> +		kfree(persistent_gnt);
> +		if (--num == 0)
> +			goto finished;
> +	}
> +	/*
> +	 * If we get here it means we also need to start cleaning
> +	 * grants that were used since last purge in order to cope
> +	 * with the requested num
> +	 */
> +	if (preserve_used) {
> +		pr_debug("Still missing %d purged frames\n", num);
> +		preserve_used = 0;
> +		goto purge_list;
> +	}
> +finished:
> +	if (segs_to_unmap > 0) {
> +		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
> +		BUG_ON(ret);
> +		free_xenballooned_pages(segs_to_unmap, pages);
> +	}
> +	/* Finally remove the "used" flag from all the persistent grants */
> +	foreach_grant_safe(persistent_gnt, n, root, node) {
> +		BUG_ON(persistent_gnt->handle ==
> +			BLKBACK_INVALID_HANDLE);
> +		persistent_gnt->flags &= ~PERSISTENT_GNT_USED;
> +	}
> +	pr_debug("Purged %d/%d\n", (requested_num - num), requested_num);
> +	return (requested_num - num);
> +}
> +
>  /*
>   * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>   */
> @@ -397,6 +475,8 @@ int xen_blkif_schedule(void *arg)
>  {
>  	struct xen_blkif *blkif = arg;
>  	struct xen_vbd *vbd = &blkif->vbd;
> +	int rq_purge, purged;
> +	unsigned long timeout;
>  
>  	xen_blkif_get(blkif);
>  
> @@ -406,13 +486,21 @@ int xen_blkif_schedule(void *arg)
>  		if (unlikely(vbd->size != vbd_sz(vbd)))
>  			xen_vbd_resize(blkif);
>  
> -		wait_event_interruptible(
> +		timeout = msecs_to_jiffies(xen_blkif_lru_interval);
> +
> +		timeout = wait_event_interruptible_timeout(
>  			blkif->wq,
> -			blkif->waiting_reqs || kthread_should_stop());
> -		wait_event_interruptible(
> +			blkif->waiting_reqs || kthread_should_stop(),
> +			timeout);
> +		if (timeout == 0)
> +			goto purge_gnt_list;
> +		timeout = wait_event_interruptible_timeout(
>  			blkbk->pending_free_wq,
>  			!list_empty(&blkbk->pending_free) ||
> -			kthread_should_stop());
> +			kthread_should_stop(),
> +			timeout);
> +		if (timeout == 0)
> +			goto purge_gnt_list;
>  
>  		blkif->waiting_reqs = 0;
>  		smp_mb(); /* clear flag *before* checking for work */
> @@ -420,6 +508,32 @@ int xen_blkif_schedule(void *arg)
>  		if (do_block_io_op(blkif))
>  			blkif->waiting_reqs = 1;
>  
> +purge_gnt_list:
> +		if (blkif->vbd.feature_gnt_persistent &&
> +		    time_after(jiffies, blkif->next_lru)) {
> +			/* Clean the list of persistent grants */
> +			if (blkif->persistent_gnt_c > xen_blkif_max_pgrants ||
> +			    (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
> +			     blkif->vbd.overflow_max_grants)) {
> +				rq_purge = blkif->persistent_gnt_c -
> +				           xen_blkif_max_pgrants +
> +				           xen_blkif_lru_num_clean;

You can make this line longer than 80 characters.
> +				rq_purge = rq_purge > blkif->persistent_gnt_c ?
> +				           blkif->persistent_gnt_c : rq_purge;
> +				purged = purge_persistent_gnt(
> +					  &blkif->persistent_gnts, rq_purge);
> +				if (purged != rq_purge)
> +					pr_debug(DRV_PFX " unable to meet persistent grants purge requirements for device %#x, domain %u, requested %d done %d\n",
> +					         blkif->domid,
> +					         blkif->vbd.handle,
> +					         rq_purge, purged);
> +				blkif->persistent_gnt_c -= purged;
> +				blkif->vbd.overflow_max_grants = 0;
> +			}
> +			blkif->next_lru = jiffies +
> +			        msecs_to_jiffies(xen_blkif_lru_interval);
> +		}
> +
>  		if (log_stats && time_after(jiffies, blkif->st_print))
>  			print_stats(blkif);
>  	}
> @@ -453,13 +567,18 @@ static void xen_blkbk_unmap(struct pending_req *req)
>  {
>  	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct persistent_gnt *persistent_gnt;
>  	unsigned int i, invcount = 0;
>  	grant_handle_t handle;
>  	int ret;
>  
>  	for (i = 0; i < req->nr_pages; i++) {
> -		if (!test_bit(i, req->unmap_seg))
> +		if (req->persistent_gnts[i] != NULL) {
> +			persistent_gnt = req->persistent_gnts[i];
> +			persistent_gnt->flags |= PERSISTENT_GNT_USED;
> +			persistent_gnt->flags &= ~PERSISTENT_GNT_ACTIVE;
>  			continue;
> +		}
>  		handle = pending_handle(req, i);
>  		if (handle == BLKBACK_INVALID_HANDLE)
>  			continue;
> @@ -480,8 +599,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			 struct page *pages[])
>  {
>  	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -	struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
>  	struct persistent_gnt *persistent_gnt = NULL;
>  	struct xen_blkif *blkif = pending_req->blkif;
>  	phys_addr_t addr = 0;
> @@ -494,9 +613,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>  
>  	use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>  
> -	BUG_ON(blkif->persistent_gnt_c >
> -		   max_mapped_grant_pages(pending_req->blkif->blk_protocol));
> -
>  	/*
>  	 * Fill out preq.nr_sects with proper amount of sectors, and setup
>  	 * assign map[..] with the PFN of the page in our domain with the
> @@ -516,9 +632,9 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			 * the grant is already mapped
>  			 */
>  			new_map = false;
> +			persistent_gnt->flags |= PERSISTENT_GNT_ACTIVE;
>  		} else if (use_persistent_gnts &&
> -			   blkif->persistent_gnt_c <
> -			   max_mapped_grant_pages(blkif->blk_protocol)) {
> +			   blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
>  			/*
>  			 * We are using persistent grants, the grant is
>  			 * not mapped but we have room for it
> @@ -536,6 +652,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			}
>  			persistent_gnt->gnt = req->u.rw.seg[i].gref;
>  			persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
> +			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
>  
>  			pages_to_gnt[segs_to_map] =
>  				persistent_gnt->page;
> @@ -547,7 +664,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			blkif->persistent_gnt_c++;
>  			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>  				 persistent_gnt->gnt, blkif->persistent_gnt_c,
> -				 max_mapped_grant_pages(blkif->blk_protocol));
> +				 xen_blkif_max_pgrants);
>  		} else {
>  			/*
>  			 * We are either using persistent grants and
> @@ -557,7 +674,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			if (use_persistent_gnts &&
>  				!blkif->vbd.overflow_max_grants) {
>  				blkif->vbd.overflow_max_grants = 1;
> -				pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> +				pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>  					 blkif->domid, blkif->vbd.handle);
>  			}
>  			new_map = true;
> @@ -595,7 +712,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>  	 * so that when we access vaddr(pending_req,i) it has the contents of
>  	 * the page from the other domain.
>  	 */
> -	bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
>  	for (i = 0, j = 0; i < nseg; i++) {
>  		if (!persistent_gnts[i] ||
>  		    persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
> @@ -634,7 +750,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>  				(req->u.rw.seg[i].first_sect << 9);
>  		} else {
>  			pending_handle(pending_req, i) = map[j].handle;
> -			bitmap_set(pending_req->unmap_seg, i, 1);
>  
>  			if (ret) {
>  				j++;
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index f338f8a..bd44d75 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -167,11 +167,14 @@ struct xen_vbd {
>  
>  struct backend_info;
>  
> +#define PERSISTENT_GNT_ACTIVE	0x1
> +#define PERSISTENT_GNT_USED		0x2
>  
>  struct persistent_gnt {
>  	struct page *page;
>  	grant_ref_t gnt;
>  	grant_handle_t handle;
> +	uint8_t flags;
>  	struct rb_node node;
>  };
>  
> @@ -204,6 +207,7 @@ struct xen_blkif {
>  	/* tree to store persistent grants */
>  	struct rb_root		persistent_gnts;
>  	unsigned int		persistent_gnt_c;
> +	unsigned long		next_lru;
>  
>  	/* statistics */
>  	unsigned long		st_print;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 5e237f6..abb399a 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -116,6 +116,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>  	init_completion(&blkif->drain_complete);
>  	atomic_set(&blkif->drain, 0);
>  	blkif->st_print = jiffies;
> +	blkif->next_lru = jiffies;
>  	init_waitqueue_head(&blkif->waiting_to_free);
>  	blkif->persistent_gnts.rb_node = NULL;
>  
> -- 
> 1.7.7.5 (Apple Git-26)
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings
  2013-02-28 10:28 ` [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings Roger Pau Monne
@ 2013-03-04 20:22   ` Konrad Rzeszutek Wilk
  2013-03-26 17:30     ` Roger Pau Monné
  2013-03-26 17:48     ` Roger Pau Monné
  0 siblings, 2 replies; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-04 20:22 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: linux-kernel, xen-devel

On Thu, Feb 28, 2013 at 11:28:51AM +0100, Roger Pau Monne wrote:
> Using balloon pages for all granted pages allows us to simplify the
> logic in blkback, specially in the xen_blkbk_map function, since now

especially

> we can decide if we want to map a grant persistently or not after we
> have actually mapped it. This could not be done before because
> persistent grants used ballooned pages, and non-persistent grants used
                                          ^^^whereas
> pages from the kernel.
> 
> This patch also introduces several changes: the first one is that the
> list of free pages is no longer global; now each blkback instance has
> its own list of free pages that can be used to map grants. Also, a
> run-time parameter (max_buffer_pages) has been added in order to tune
> the maximum number of free pages each blkback instance will keep in
> its buffer.
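
Since the patch registers the parameter with mode 0644, the cap should also
be switchable at run time through the module's parameters directory in
sysfs, like blkback's other run-time tunables; the exact path depends on how
the module is named/built.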
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Cc: xen-devel@lists.xen.org
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/block/xen-blkback/blkback.c |  278 +++++++++++++++++++----------------
>  drivers/block/xen-blkback/common.h  |    5 +
>  drivers/block/xen-blkback/xenbus.c  |    3 +

You also need a Documentation/ABI/sysfs-stable/... file.

>  3 files changed, 159 insertions(+), 127 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index b5e7495..ba27fc3 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -101,6 +101,21 @@ module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
>  MODULE_PARM_DESC(lru_num_clean,
>  "Number of persistent grants to unmap when the list is full");
>  
> +/*
> + * Maximum number of unused free pages to keep in the internal buffer.
> + * Setting this to a value too low will reduce memory used in each backend,
> + * but can have a performance penalty.
> + *
> + * A sane value is xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST, but can

Should that value be the default value then?
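(For reference, assuming the current defaults of xen_blkif_reqs = 64 and
BLKIF_MAX_SEGMENTS_PER_REQUEST = 11, that product would be 704 pages, a bit
below the 1024 chosen below.)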

> + * be set to a lower value that might degrade performance on some intensive
> + * IO workloads.
> + */
> +
> +static int xen_blkif_max_buffer_pages = 1024;
> +module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
> +MODULE_PARM_DESC(max_buffer_pages,
> +"Maximum number of free pages to keep in each block backend buffer");
> +
>  /* Run-time switchable: /sys/module/blkback/parameters/ */
>  static unsigned int log_stats;
>  module_param(log_stats, int, 0644);
> @@ -120,6 +135,7 @@ struct pending_req {
>  	int			status;
>  	struct list_head	free_list;
>  	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  };
>  
>  #define BLKBACK_INVALID_HANDLE (~0)
> @@ -131,8 +147,6 @@ struct xen_blkbk {
>  	/* And its spinlock. */
>  	spinlock_t		pending_free_lock;
>  	wait_queue_head_t	pending_free_wq;
> -	/* The list of all pages that are available. */
> -	struct page		**pending_pages;
>  	/* And the grant handles that are available. */
>  	grant_handle_t		*pending_grant_handles;
>  };
> @@ -151,14 +165,66 @@ static inline int vaddr_pagenr(struct pending_req *req, int seg)
>  		BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
>  }
>  
> -#define pending_page(req, seg) pending_pages[vaddr_pagenr(req, seg)]
> +static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&blkif->free_pages_lock, flags);
> +	if (list_empty(&blkif->free_pages)) {
> +		BUG_ON(blkif->free_pages_num != 0);
> +		spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
> +		return alloc_xenballooned_pages(1, page, false);
> +	}
> +	BUG_ON(blkif->free_pages_num == 0);
> +	page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
> +	list_del(&page[0]->lru);
> +	blkif->free_pages_num--;
> +	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
> +
> +	return 0;
> +}
> +
> +static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
> +                                  int num)
> +{
> +	unsigned long flags;
> +	int i;
> +
> +	spin_lock_irqsave(&blkif->free_pages_lock, flags);
> +	for (i = 0; i < num; i++)
> +		list_add(&page[i]->lru, &blkif->free_pages);
> +	blkif->free_pages_num += num;
> +	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
> +}
>  
> -static inline unsigned long vaddr(struct pending_req *req, int seg)
> +static inline void remove_free_pages(struct xen_blkif *blkif, int num)

Perhaps 'shrink_free_pagepool'?

>  {
> -	unsigned long pfn = page_to_pfn(blkbk->pending_page(req, seg));
> -	return (unsigned long)pfn_to_kaddr(pfn);
> +	/* Remove requested pages in batches of 10 */
> +	struct page *page[10];

Hrmp. #define!

> +	unsigned long flags;
> +	int num_pages = 0;

unsigned int

> +
> +	spin_lock_irqsave(&blkif->free_pages_lock, flags);
> +	while (blkif->free_pages_num > num) {
> +		BUG_ON(list_empty(&blkif->free_pages));
> +		page[num_pages] = list_first_entry(&blkif->free_pages,
> +		                                   struct page, lru);
> +		list_del(&page[num_pages]->lru);
> +		blkif->free_pages_num--;
> +		if (++num_pages == 10) {
> +			spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
> +			free_xenballooned_pages(num_pages, page);
> +			spin_lock_irqsave(&blkif->free_pages_lock, flags);
> +			num_pages = 0;
> +		}
> +	}
> +	spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
> +	if (num_pages != 0)
> +		free_xenballooned_pages(num_pages, page);
>  }
>  
> +#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
> +
>  #define pending_handle(_req, _seg) \
>  	(blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)])
>  
> @@ -178,7 +244,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
>  	     (n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)
>  
>  
> -static void add_persistent_gnt(struct rb_root *root,
> +static int add_persistent_gnt(struct rb_root *root,
>  			       struct persistent_gnt *persistent_gnt)
>  {
>  	struct rb_node **new = &(root->rb_node), *parent = NULL;
> @@ -194,14 +260,15 @@ static void add_persistent_gnt(struct rb_root *root,
>  		else if (persistent_gnt->gnt > this->gnt)
>  			new = &((*new)->rb_right);
>  		else {
> -			pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
> -			BUG();
> +			pr_alert_ratelimited(DRV_PFX " trying to add a gref that's already in the tree\n");
> +			return -EINVAL;

That looks like a separate bug-fix patch? Especially the pr_alert_ratelimited
part?

>  		}
>  	}
>  
>  	/* Add new node and rebalance tree. */
>  	rb_link_node(&(persistent_gnt->node), parent, new);
>  	rb_insert_color(&(persistent_gnt->node), root);
> +	return 0;
>  }
>  
>  static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
> @@ -223,7 +290,8 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
>  	return NULL;
>  }
>  
> -static void free_persistent_gnts(struct rb_root *root, unsigned int num)
> +static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
> +                                 unsigned int num)
>  {
>  	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> @@ -248,7 +316,7 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
>  			ret = gnttab_unmap_refs(unmap, NULL, pages,
>  				segs_to_unmap);
>  			BUG_ON(ret);
> -			free_xenballooned_pages(segs_to_unmap, pages);
> +			put_free_pages(blkif, pages, segs_to_unmap);
>  			segs_to_unmap = 0;
>  		}
>  
> @@ -259,7 +327,8 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
>  	BUG_ON(num != 0);
>  }
>  
> -static int purge_persistent_gnt(struct rb_root *root, int num)
> +static int purge_persistent_gnt(struct xen_blkif *blkif, struct rb_root *root,
> +                                int num)
>  {
>  	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> @@ -294,7 +363,7 @@ purge_list:
>  			ret = gnttab_unmap_refs(unmap, NULL, pages,
>  				segs_to_unmap);
>  			BUG_ON(ret);
> -			free_xenballooned_pages(segs_to_unmap, pages);
> +			put_free_pages(blkif, pages, segs_to_unmap);
>  			segs_to_unmap = 0;
>  		}
>  
> @@ -317,7 +386,7 @@ finished:
>  	if (segs_to_unmap > 0) {
>  		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
>  		BUG_ON(ret);
> -		free_xenballooned_pages(segs_to_unmap, pages);
> +		put_free_pages(blkif, pages, segs_to_unmap);
>  	}
>  	/* Finally remove the "used" flag from all the persistent grants */
>  	foreach_grant_safe(persistent_gnt, n, root, node) {
> @@ -521,7 +590,7 @@ purge_gnt_list:
>  				           xen_blkif_lru_num_clean;
>  				rq_purge = rq_purge > blkif->persistent_gnt_c ?
>  				           blkif->persistent_gnt_c : rq_purge;
> -				purged = purge_persistent_gnt(
> +				purged = purge_persistent_gnt(blkif,
>  					  &blkif->persistent_gnts, rq_purge);
>  				if (purged != rq_purge)
>  					pr_debug(DRV_PFX " unable to meet persistent grants purge requirements for device %#x, domain %u, requested %d done %d\n",
> @@ -535,13 +604,17 @@ purge_gnt_list:
>  			        msecs_to_jiffies(xen_blkif_lru_interval);
>  		}
>  
> +		remove_free_pages(blkif, xen_blkif_max_buffer_pages);
> +
>  		if (log_stats && time_after(jiffies, blkif->st_print))
>  			print_stats(blkif);
>  	}
>  
> +	remove_free_pages(blkif, 0);

What purpose does that have?

> +
>  	/* Free all persistent grant pages */
>  	if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
> -		free_persistent_gnts(&blkif->persistent_gnts,
> +		free_persistent_gnts(blkif, &blkif->persistent_gnts,
>  			blkif->persistent_gnt_c);
>  
>  	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> @@ -571,6 +644,7 @@ static void xen_blkbk_unmap(struct pending_req *req)
>  	struct persistent_gnt *persistent_gnt;
>  	unsigned int i, invcount = 0;
>  	grant_handle_t handle;
> +	struct xen_blkif *blkif = req->blkif;
>  	int ret;
>  
>  	for (i = 0; i < req->nr_pages; i++) {
> @@ -581,17 +655,18 @@ static void xen_blkbk_unmap(struct pending_req *req)
>  			continue;
>  		}
>  		handle = pending_handle(req, i);
> +		pages[invcount] = req->pages[i];
>  		if (handle == BLKBACK_INVALID_HANDLE)
>  			continue;
> -		gnttab_set_unmap_op(&unmap[invcount], vaddr(req, i),
> +		gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[invcount]),
>  				    GNTMAP_host_map, handle);
>  		pending_handle(req, i) = BLKBACK_INVALID_HANDLE;
> -		pages[invcount] = virt_to_page(vaddr(req, i));
>  		invcount++;
>  	}
>  
>  	ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
>  	BUG_ON(ret);
> +	put_free_pages(blkif, pages, invcount);
>  }
>  
>  static int xen_blkbk_map(struct blkif_request *req,
> @@ -606,7 +681,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>  	struct xen_blkif *blkif = pending_req->blkif;
>  	phys_addr_t addr = 0;
>  	int i, j;
> -	bool new_map;
>  	int nseg = req->u.rw.nr_segments;
>  	int segs_to_map = 0;
>  	int ret = 0;
> @@ -632,69 +706,17 @@ static int xen_blkbk_map(struct blkif_request *req,
>  			 * We are using persistent grants and
>  			 * the grant is already mapped
>  			 */
> -			new_map = false;
>  			persistent_gnt->flags |= PERSISTENT_GNT_ACTIVE;
> -		} else if (use_persistent_gnts &&
> -			   blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
> -			/*
> -			 * We are using persistent grants, the grant is
> -			 * not mapped but we have room for it
> -			 */
> -			new_map = true;
> -			persistent_gnt = kmalloc(
> -				sizeof(struct persistent_gnt),
> -				GFP_KERNEL);
> -			if (!persistent_gnt)
> -				return -ENOMEM;
> -			if (alloc_xenballooned_pages(1, &persistent_gnt->page,
> -			    false)) {
> -				kfree(persistent_gnt);
> -				return -ENOMEM;
> -			}
> -			persistent_gnt->gnt = req->u.rw.seg[i].gref;
> -			persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
> -			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
> -
> -			pages_to_gnt[segs_to_map] =
> -				persistent_gnt->page;
> -			addr = (unsigned long) pfn_to_kaddr(
> -				page_to_pfn(persistent_gnt->page));
> -
> -			add_persistent_gnt(&blkif->persistent_gnts,
> -				persistent_gnt);
> -			blkif->persistent_gnt_c++;
> -			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> -				 persistent_gnt->gnt, blkif->persistent_gnt_c,
> -				 xen_blkif_max_pgrants);
> -		} else {
> -			/*
> -			 * We are either using persistent grants and
> -			 * hit the maximum limit of grants mapped,
> -			 * or we are not using persistent grants.
> -			 */
> -			if (use_persistent_gnts &&
> -				!blkif->vbd.overflow_max_grants) {
> -				blkif->vbd.overflow_max_grants = 1;
> -				pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> -					 blkif->domid, blkif->vbd.handle);
> -			}
> -			new_map = true;
> -			pages[i] = blkbk->pending_page(pending_req, i);
> -			addr = vaddr(pending_req, i);
> -			pages_to_gnt[segs_to_map] =
> -				blkbk->pending_page(pending_req, i);
> -		}
> -
> -		if (persistent_gnt) {
>  			pages[i] = persistent_gnt->page;
>  			persistent_gnts[i] = persistent_gnt;
>  		} else {
> +			if (get_free_page(blkif, &pages[i]))
> +				goto out_of_memory;
> +			addr = vaddr(pages[i]);
> +			pages_to_gnt[segs_to_map] = pages[i];
>  			persistent_gnts[i] = NULL;
> -		}
> -
> -		if (new_map) {
>  			flags = GNTMAP_host_map;
> -			if (!persistent_gnt &&
> +			if (!use_persistent_gnts &&
>  			    (pending_req->operation != BLKIF_OP_READ))
>  				flags |= GNTMAP_readonly;
>  			gnttab_set_map_op(&map[segs_to_map++], addr,
> @@ -714,54 +736,71 @@ static int xen_blkbk_map(struct blkif_request *req,
>  	 * the page from the other domain.
>  	 */
>  	for (i = 0, j = 0; i < nseg; i++) {
> -		if (!persistent_gnts[i] ||
> -		    persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
> +		if (!persistent_gnts[i]) {
>  			/* This is a newly mapped grant */
>  			BUG_ON(j >= segs_to_map);
>  			if (unlikely(map[j].status != 0)) {
>  				pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> -				map[j].handle = BLKBACK_INVALID_HANDLE;
> +				pending_handle(pending_req, i) =
> +					BLKBACK_INVALID_HANDLE;
>  				ret |= 1;
> -				if (persistent_gnts[i]) {
> -					rb_erase(&persistent_gnts[i]->node,
> -						 &blkif->persistent_gnts);
> -					blkif->persistent_gnt_c--;
> -					kfree(persistent_gnts[i]);
> -					persistent_gnts[i] = NULL;
> -				}
> +				j++;
> +				continue;
>  			}
> +			pending_handle(pending_req, i) = map[j].handle;
>  		}
> -		if (persistent_gnts[i]) {
> -			if (persistent_gnts[i]->handle ==
> -			    BLKBACK_INVALID_HANDLE) {
> +		if (persistent_gnts[i])
> +			goto next;
> +		if (use_persistent_gnts &&
> +		    blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
> +			/*
> +			 * We are using persistent grants, the grant is
> +			 * not mapped but we have room for it
> +			 */
> +			persistent_gnt = kmalloc(sizeof(struct persistent_gnt),
> +				                 GFP_KERNEL);
> +			if (!persistent_gnt) {
>  				/*
> -				 * If this is a new persistent grant
> -				 * save the handler
> +				 * If we don't have enough memory to
> +				 * allocate the persistent_gnt struct
> +				 * map this grant non-persistenly
>  				 */
> -				persistent_gnts[i]->handle = map[j++].handle;
> -			}
> -			pending_handle(pending_req, i) =
> -				persistent_gnts[i]->handle;
> -
> -			if (ret)
> -				continue;
> -
> -			seg[i].buf = pfn_to_mfn(page_to_pfn(
> -				persistent_gnts[i]->page)) << PAGE_SHIFT |
> -				(req->u.rw.seg[i].first_sect << 9);
> -		} else {
> -			pending_handle(pending_req, i) = map[j].handle;
> -
> -			if (ret) {
>  				j++;
> -				continue;
> +				goto next;
>  			}
> -
> -			seg[i].buf = map[j++].dev_bus_addr |
> -				(req->u.rw.seg[i].first_sect << 9);
> +			persistent_gnt->gnt = map[j].ref;
> +			persistent_gnt->handle = map[j].handle;
> +			persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
> +			persistent_gnt->page = pages[i];
> +			if (add_persistent_gnt(&blkif->persistent_gnts,
> +			                       persistent_gnt)) {
> +				kfree(persistent_gnt);
> +				goto next;
> +			}
> +			blkif->persistent_gnt_c++;
> +			persistent_gnts[i] = persistent_gnt;
> +			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> +				 persistent_gnt->gnt, blkif->persistent_gnt_c,
> +				 xen_blkif_max_pgrants);
> +			j++;
> +			goto next;
> +		}
> +		if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
> +			blkif->vbd.overflow_max_grants = 1;
> +			pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> +			         blkif->domid, blkif->vbd.handle);
>  		}
> +		j++;
> +next:
> +		seg[i].buf = pfn_to_mfn(page_to_pfn(pages[i])) << PAGE_SHIFT |
> +		             (req->u.rw.seg[i].first_sect << 9);
>  	}
>  	return ret;
> +
> +out_of_memory:
> +	pr_alert(DRV_PFX "%s: out of memory\n", __func__);
> +	put_free_pages(blkif, pages_to_gnt, segs_to_map);
> +	return -ENOMEM;
>  }
>  
>  static int dispatch_discard_io(struct xen_blkif *blkif,
> @@ -962,7 +1001,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	int operation;
>  	struct blk_plug plug;
>  	bool drain = false;
> -	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct page **pages = pending_req->pages;
>  
>  	switch (req->operation) {
>  	case BLKIF_OP_READ:
> @@ -1193,22 +1232,14 @@ static int __init xen_blkif_init(void)
>  					xen_blkif_reqs, GFP_KERNEL);
>  	blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
>  					mmap_pages, GFP_KERNEL);
> -	blkbk->pending_pages         = kzalloc(sizeof(blkbk->pending_pages[0]) *
> -					mmap_pages, GFP_KERNEL);
>  
> -	if (!blkbk->pending_reqs || !blkbk->pending_grant_handles ||
> -	    !blkbk->pending_pages) {
> +	if (!blkbk->pending_reqs || !blkbk->pending_grant_handles) {
>  		rc = -ENOMEM;
>  		goto out_of_memory;
>  	}
>  
>  	for (i = 0; i < mmap_pages; i++) {
>  		blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
> -		blkbk->pending_pages[i] = alloc_page(GFP_KERNEL);
> -		if (blkbk->pending_pages[i] == NULL) {
> -			rc = -ENOMEM;
> -			goto out_of_memory;
> -		}
>  	}
>  	rc = xen_blkif_interface_init();
>  	if (rc)
> @@ -1233,13 +1264,6 @@ static int __init xen_blkif_init(void)
>   failed_init:
>  	kfree(blkbk->pending_reqs);
>  	kfree(blkbk->pending_grant_handles);
> -	if (blkbk->pending_pages) {
> -		for (i = 0; i < mmap_pages; i++) {
> -			if (blkbk->pending_pages[i])
> -				__free_page(blkbk->pending_pages[i]);
> -		}
> -		kfree(blkbk->pending_pages);
> -	}
>  	kfree(blkbk);
>  	blkbk = NULL;
>  	return rc;
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index bd44d75..604bd30 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -209,6 +209,11 @@ struct xen_blkif {
>  	unsigned int		persistent_gnt_c;
>  	unsigned long		next_lru;
>  
> +	/* buffer of free pages to map grant refs */
> +	spinlock_t		free_pages_lock;
> +	int			free_pages_num;
> +	struct list_head	free_pages;
> +
>  	/* statistics */
>  	unsigned long		st_print;
>  	int			st_rd_req;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index abb399a..d7926ec 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -119,6 +119,9 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>  	blkif->next_lru = jiffies;
>  	init_waitqueue_head(&blkif->waiting_to_free);
>  	blkif->persistent_gnts.rb_node = NULL;
> +	spin_lock_init(&blkif->free_pages_lock);
> +	INIT_LIST_HEAD(&blkif->free_pages);
> +	blkif->free_pages_num = 0;
>  
>  	return blkif;
>  }
> -- 
> 1.7.7.5 (Apple Git-26)
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 10:28 ` [PATCH RFC 12/12] xen-block: implement indirect descriptors Roger Pau Monne
  2013-02-28 11:19   ` [Xen-devel] " Jan Beulich
@ 2013-03-04 20:41   ` Konrad Rzeszutek Wilk
  2013-03-05 17:07     ` Roger Pau Monné
  2013-03-18 17:06   ` Roger Pau Monné
  2 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-04 20:41 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: linux-kernel, xen-devel

On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
> Indirect descriptors introduce a new block operation
> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> in the request. These grant references are filled with arrays of
> blkif_request_segment_aligned; this way we can send more segments in a
> request.
> 
> The proposed implementation sets the maximum number of indirect grefs
> (frames filled with blkif_request_segment_aligned) to 256 in the
> backend and 64 in the frontend. The value in the frontend has been
> chosen experimentally, and the backend value has been set to a sane
> value that allows expanding the maximum number of indirect descriptors
> in the frontend if needed.
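
For scale: assuming 4 KiB pages and the 8-byte blkif_request_segment_aligned
used below, each indirect frame holds 4096 / 8 = 512 segments, so both limits
above (256 segments in the backend via MAX_INDIRECT_SEGMENTS, 64 in the
frontend via the max_segments parameter) fit in a single indirect frame per
request.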

So we are still using a similar format of the form:

<gref, first_sec, last_sect, pad>, etc.

Why not utilize a layout that fits with the bio sg? That way
we might not even have to do the bio_alloc call and instead can
set up a bio (and bio-list) with the appropriate offsets/list?

Meaning that the format of the indirect descriptors is:

<gref, offset, next_index, pad>

We already know what the first_sec and last_sect are - they
are basically: sector_number +  nr_segments * (whatever the sector size is) + offset
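
A minimal sketch of what such an alternative entry could look like (the
struct and field names below are illustrative assumptions, not anything
defined in this series):

struct blkif_request_segment_indexed {
	grant_ref_t gref;       /* grant reference of the data page             */
	uint16_t    offset;     /* byte offset of the data within the page      */
	uint16_t    next_index; /* index of the next entry in the indirect list */
} __attribute__((__packed__));

The backend would then derive each entry's sector range from sector_number,
the running segment count and the sector size, as suggested above.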



> 
> The migration code has changed from the previous implementation, in
> which we simply remapped the segments on the shared ring. Now the
> maximum number of segments allowed in a request can change depending
> on the backend, so we have to requeue all the requests in the ring and
> in the queue and split the bios in them if they are bigger than the
> new maximum number of segments.
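
For example, assuming the frontend ends up limited to 64 indirect segments
after resume, a requeued bio carrying 256 segments would be split into
ceil(256 / 64) = 4 clones via trim_bio(), each completed through
split_bio_end() before the parent bio is ended.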
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: xen-devel@lists.xen.org
> ---
>  drivers/block/xen-blkback/blkback.c |  129 +++++++---
>  drivers/block/xen-blkback/common.h  |   80 ++++++-
>  drivers/block/xen-blkback/xenbus.c  |    8 +
>  drivers/block/xen-blkfront.c        |  498 +++++++++++++++++++++++++++++------
>  include/xen/interface/io/blkif.h    |   25 ++
>  5 files changed, 622 insertions(+), 118 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 0fa30db..98eb16b 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -70,7 +70,7 @@ MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate per backend");
>   * algorithm.
>   */
>  
> -static int xen_blkif_max_pgrants = 352;
> +static int xen_blkif_max_pgrants = 1024;
>  module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
>  MODULE_PARM_DESC(max_persistent_grants,
>                   "Maximum number of grants to map persistently");
> @@ -578,10 +578,6 @@ purge_gnt_list:
>  	return 0;
>  }
>  
> -struct seg_buf {
> -	unsigned long buf;
> -	unsigned int nsec;
> -};
>  /*
>   * Unmap the grant references, and also remove the M2P over-rides
>   * used in the 'pending_req'.
> @@ -761,32 +757,79 @@ out_of_memory:
>  	return -ENOMEM;
>  }
>  
> -static int xen_blkbk_map_seg(struct blkif_request *req,
> -			     struct pending_req *pending_req,
> +static int xen_blkbk_map_seg(struct pending_req *pending_req,
>  			     struct seg_buf seg[],
>  			     struct page *pages[])
>  {
>  	int i, rc;
> -	grant_ref_t grefs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  
> -	for (i = 0; i < req->u.rw.nr_segments; i++)
> -		grefs[i] = req->u.rw.seg[i].gref;
> -
> -	rc = xen_blkbk_map(pending_req->blkif, grefs,
> +	rc = xen_blkbk_map(pending_req->blkif, pending_req->grefs,
>  	                   pending_req->persistent_gnts,
>  	                   pending_req->grant_handles, pending_req->pages,
> -	                   req->u.rw.nr_segments,
> +	                   pending_req->nr_pages,
>  	                   (pending_req->operation != BLKIF_OP_READ));
>  	if (rc)
>  		return rc;
>  
> -	for (i = 0; i < req->u.rw.nr_segments; i++)
> -		seg[i].buf = pfn_to_mfn(page_to_pfn(pending_req->pages[i]))
> -		             << PAGE_SHIFT | (req->u.rw.seg[i].first_sect << 9);
> +	for (i = 0; i < pending_req->nr_pages; i++)
> +		seg[i].buf |= pfn_to_mfn(page_to_pfn(pending_req->pages[i]))
> +		             << PAGE_SHIFT;
>  
>  	return 0;
>  }
>  
> +static int xen_blkbk_parse_indirect(struct blkif_request *req,
> +                                    struct pending_req *pending_req,
> +                                    struct seg_buf seg[],
> +                                    struct phys_req *preq)
> +{
> +	struct persistent_gnt **persistent =
> +		pending_req->indirect_persistent_gnts;
> +	struct page **pages = pending_req->indirect_pages;
> +	struct xen_blkif *blkif = pending_req->blkif;
> +	int indirect_grefs, rc, n, nseg, i;
> +	struct blkif_request_segment_aligned *segments = NULL;
> +
> +	nseg = pending_req->nr_pages;
> +	indirect_grefs = (nseg + SEGS_PER_INDIRECT_FRAME - 1) /
> +		         SEGS_PER_INDIRECT_FRAME;
> +
> +	rc = xen_blkbk_map(blkif, req->u.indirect.indirect_grefs,
> +	                   persistent, pending_req->indirect_handles,
> +	                   pages, indirect_grefs, true);
> +	if (rc)
> +		goto unmap;
> +
> +	for (n = 0, i = 0; n < nseg; n++) {
> +		if ((n % SEGS_PER_INDIRECT_FRAME) == 0) {
> +			/* Map indirect segments */
> +			if (segments)
> +				kunmap_atomic(segments);
> +			segments =
> +				kmap_atomic(pages[n/SEGS_PER_INDIRECT_FRAME]);
> +		}
> +		i = n % SEGS_PER_INDIRECT_FRAME;
> +		pending_req->grefs[n] = segments[i].gref;
> +		seg[n].nsec = segments[i].last_sect -
> +			segments[i].first_sect + 1;
> +		seg[n].buf = segments[i].first_sect << 9;
> +		if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) ||
> +	    	    (segments[i].last_sect <
> +	    	     segments[i].first_sect)) {
> +			rc = -EINVAL;
> +			goto unmap;
> +		}
> +		preq->nr_sects += seg[n].nsec;
> +	}
> +
> +unmap:
> +	if (segments)
> +		kunmap_atomic(segments);
> +	xen_blkbk_unmap(blkif, pending_req->indirect_handles,
> +                        pages, persistent, indirect_grefs);
> +	return rc;
> +}
> +
>  static int dispatch_discard_io(struct xen_blkif *blkif,
>  				struct blkif_request *req)
>  {
> @@ -980,17 +1023,21 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  				struct pending_req *pending_req)
>  {
>  	struct phys_req preq;
> -	struct seg_buf seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct seg_buf *seg = pending_req->seg;
>  	unsigned int nseg;
>  	struct bio *bio = NULL;
> -	struct bio *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct bio **biolist = pending_req->biolist;
>  	int i, nbio = 0;
>  	int operation;
>  	struct blk_plug plug;
>  	bool drain = false;
>  	struct page **pages = pending_req->pages;
> +	unsigned short req_operation;
> +
> +	req_operation = req->operation == BLKIF_OP_INDIRECT ?
> +	                req->u.indirect.indirect_op : req->operation;
>  
> -	switch (req->operation) {
> +	switch (req_operation) {
>  	case BLKIF_OP_READ:
>  		blkif->st_rd_req++;
>  		operation = READ;
> @@ -1012,33 +1059,49 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	}
>  
>  	/* Check that the number of segments is sane. */
> -	nseg = req->u.rw.nr_segments;
> +	nseg = req->operation == BLKIF_OP_INDIRECT ?
> +	       req->u.indirect.nr_segments : req->u.rw.nr_segments;
>  
>  	if (unlikely(nseg == 0 && operation != WRITE_FLUSH) ||
> -	    unlikely(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
> +	    unlikely((req->operation != BLKIF_OP_INDIRECT) &&
> +	             (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
> +	    unlikely((req->operation == BLKIF_OP_INDIRECT) &&
> +	             (nseg > MAX_INDIRECT_SEGMENTS))) {
>  		pr_debug(DRV_PFX "Bad number of segments in request (%d)\n",
>  			 nseg);
>  		/* Haven't submitted any bio's yet. */
>  		goto fail_response;
>  	}
>  
> -	preq.sector_number = req->u.rw.sector_number;
>  	preq.nr_sects      = 0;
>  
>  	pending_req->blkif     = blkif;
> -	pending_req->id        = req->u.rw.id;
> -	pending_req->operation = req->operation;
>  	pending_req->status    = BLKIF_RSP_OKAY;
>  	pending_req->nr_pages  = nseg;
>  
> -	for (i = 0; i < nseg; i++) {
> -		seg[i].nsec = req->u.rw.seg[i].last_sect -
> -			req->u.rw.seg[i].first_sect + 1;
> -		if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
> -		    (req->u.rw.seg[i].last_sect < req->u.rw.seg[i].first_sect))
> +	if (req->operation != BLKIF_OP_INDIRECT) {
> +		preq.dev               = req->u.rw.handle;
> +		preq.sector_number     = req->u.rw.sector_number;
> +		pending_req->id        = req->u.rw.id;
> +		pending_req->operation = req->operation;
> +		for (i = 0; i < nseg; i++) {
> +			pending_req->grefs[i] = req->u.rw.seg[i].gref;
> +			seg[i].nsec = req->u.rw.seg[i].last_sect -
> +				req->u.rw.seg[i].first_sect + 1;
> +			seg[i].buf = req->u.rw.seg[i].first_sect << 9;
> +			if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
> +		    	    (req->u.rw.seg[i].last_sect <
> +		    	     req->u.rw.seg[i].first_sect))
> +				goto fail_response;
> +			preq.nr_sects += seg[i].nsec;
> +		}
> +	} else {
> +		preq.dev               = req->u.indirect.handle;
> +		preq.sector_number     = req->u.indirect.sector_number;
> +		pending_req->id        = req->u.indirect.id;
> +		pending_req->operation = req->u.indirect.indirect_op;
> +		if (xen_blkbk_parse_indirect(req, pending_req, seg, &preq))
>  			goto fail_response;
> -		preq.nr_sects += seg[i].nsec;
> -
>  	}
>  
>  	if (xen_vbd_translate(&preq, blkif, operation) != 0) {
> @@ -1074,7 +1137,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	 * the hypercall to unmap the grants - that is all done in
>  	 * xen_blkbk_unmap.
>  	 */
> -	if (xen_blkbk_map_seg(req, pending_req, seg, pages))
> +	if (xen_blkbk_map_seg(pending_req, seg, pages))
>  		goto fail_flush;
>  
>  	/*
> @@ -1146,7 +1209,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	                pending_req->nr_pages);
>   fail_response:
>  	/* Haven't submitted any bio's yet. */
> -	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
> +	make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
>  	free_req(blkif, pending_req);
>  	msleep(1); /* back off a bit */
>  	return -EIO;
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 0b0ad3f..d3656d2 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -50,6 +50,17 @@
>  		 __func__, __LINE__, ##args)
>  
>  
> +/*
> + * This is the maximum number of segments that would be allowed in indirect
> + * requests. This value will also be passed to the frontend.
> + */
> +#define MAX_INDIRECT_SEGMENTS 256
> +
> +#define SEGS_PER_INDIRECT_FRAME \
> +(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
> +#define MAX_INDIRECT_GREFS \
> +((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
> +
>  /* Not a real protocol.  Used to generate ring structs which contain
>   * the elements common to all protocols only.  This way we get a
>   * compiler-checkable way to use common struct elements, so we can
> @@ -77,11 +88,21 @@ struct blkif_x86_32_request_discard {
>  	uint64_t       nr_sectors;
>  } __attribute__((__packed__));
>  
> +struct blkif_x86_32_request_indirect {
> +	uint8_t        indirect_op;
> +	uint16_t       nr_segments;
> +	uint64_t       id;
> +	blkif_vdev_t   handle;
> +	blkif_sector_t sector_number;
> +	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> +} __attribute__((__packed__));
> +
>  struct blkif_x86_32_request {
>  	uint8_t        operation;    /* BLKIF_OP_???                         */
>  	union {
>  		struct blkif_x86_32_request_rw rw;
>  		struct blkif_x86_32_request_discard discard;
> +		struct blkif_x86_32_request_indirect indirect;
>  	} u;
>  } __attribute__((__packed__));
>  
> @@ -113,11 +134,22 @@ struct blkif_x86_64_request_discard {
>  	uint64_t       nr_sectors;
>  } __attribute__((__packed__));
>  
> +struct blkif_x86_64_request_indirect {
> +	uint8_t        indirect_op;
> +	uint16_t       nr_segments;
> +	uint32_t       _pad1;        /* offsetof(blkif_..,u.indirect.id)==8   */
> +	uint64_t       id;
> +	blkif_vdev_t   handle;
> +	blkif_sector_t sector_number;
> +	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> +} __attribute__((__packed__));
> +
>  struct blkif_x86_64_request {
>  	uint8_t        operation;    /* BLKIF_OP_???                         */
>  	union {
>  		struct blkif_x86_64_request_rw rw;
>  		struct blkif_x86_64_request_discard discard;
> +		struct blkif_x86_64_request_indirect indirect;
>  	} u;
>  } __attribute__((__packed__));
>  
> @@ -235,6 +267,11 @@ struct xen_blkif {
>  	wait_queue_head_t	waiting_to_free;
>  };
>  
> +struct seg_buf {
> +	unsigned long buf;
> +	unsigned int nsec;
> +};
> +
>  /*
>   * Each outstanding request that we've passed to the lower device layers has a
>   * 'pending_req' allocated to it. Each buffer_head that completes decrements
> @@ -249,9 +286,16 @@ struct pending_req {
>  	unsigned short		operation;
>  	int			status;
>  	struct list_head	free_list;
> -	struct persistent_gnt	*persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -	struct page		*pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -	grant_handle_t		grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct persistent_gnt	*persistent_gnts[MAX_INDIRECT_SEGMENTS];
> +	struct page		*pages[MAX_INDIRECT_SEGMENTS];
> +	grant_handle_t		grant_handles[MAX_INDIRECT_SEGMENTS];
> +	grant_ref_t		grefs[MAX_INDIRECT_SEGMENTS];
> +	/* Indirect descriptors */
> +	struct persistent_gnt	*indirect_persistent_gnts[MAX_INDIRECT_GREFS];
> +	struct page		*indirect_pages[MAX_INDIRECT_GREFS];
> +	grant_handle_t		indirect_handles[MAX_INDIRECT_GREFS];
> +	struct seg_buf		seg[MAX_INDIRECT_SEGMENTS];
> +	struct bio		*biolist[MAX_INDIRECT_SEGMENTS];
>  };
>  
>  
> @@ -289,7 +333,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
>  static inline void blkif_get_x86_32_req(struct blkif_request *dst,
>  					struct blkif_x86_32_request *src)
>  {
> -	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j = MAX_INDIRECT_GREFS;
>  	dst->operation = src->operation;
>  	switch (src->operation) {
>  	case BLKIF_OP_READ:
> @@ -312,6 +356,19 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
>  		dst->u.discard.sector_number = src->u.discard.sector_number;
>  		dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
>  		break;
> +	case BLKIF_OP_INDIRECT:
> +		dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
> +		dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
> +		dst->u.indirect.handle = src->u.indirect.handle;
> +		dst->u.indirect.id = src->u.indirect.id;
> +		dst->u.indirect.sector_number = src->u.indirect.sector_number;
> +		barrier();
> +		if (j > dst->u.indirect.nr_segments)
> +			j = dst->u.indirect.nr_segments;
> +		for (i = 0; i < j; i++)
> +			dst->u.indirect.indirect_grefs[i] =
> +				src->u.indirect.indirect_grefs[i];
> +		break;
>  	default:
>  		break;
>  	}
> @@ -320,7 +377,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
>  static inline void blkif_get_x86_64_req(struct blkif_request *dst,
>  					struct blkif_x86_64_request *src)
>  {
> -	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j = MAX_INDIRECT_GREFS;
>  	dst->operation = src->operation;
>  	switch (src->operation) {
>  	case BLKIF_OP_READ:
> @@ -343,6 +400,19 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
>  		dst->u.discard.sector_number = src->u.discard.sector_number;
>  		dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
>  		break;
> +	case BLKIF_OP_INDIRECT:
> +		dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
> +		dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
> +		dst->u.indirect.handle = src->u.indirect.handle;
> +		dst->u.indirect.id = src->u.indirect.id;
> +		dst->u.indirect.sector_number = src->u.indirect.sector_number;
> +		barrier();
> +		if (j > dst->u.indirect.nr_segments)
> +			j = dst->u.indirect.nr_segments;
> +		for (i = 0; i < j; i++)
> +			dst->u.indirect.indirect_grefs[i] =
> +				src->u.indirect.indirect_grefs[i];
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 8f929cb..9e16abb 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -700,6 +700,14 @@ again:
>  		goto abort;
>  	}
>  
> +	err = xenbus_printf(xbt, dev->nodename, "max-indirect-segments", "%u",
> +	                    MAX_INDIRECT_SEGMENTS);
> +	if (err) {
> +		xenbus_dev_fatal(dev, err, "writing %s/max-indirect-segments",
> +				 dev->nodename);
> +		goto abort;
> +	}
> +
>  	err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
>  			    (unsigned long long)vbd_sz(&be->blkif->vbd));
>  	if (err) {
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 4d81fcc..074d302 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -74,12 +74,30 @@ struct grant {
>  struct blk_shadow {
>  	struct blkif_request req;
>  	struct request *request;
> -	struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct grant **grants_used;
> +	struct grant **indirect_grants;
> +};
> +
> +struct split_bio {
> +	struct bio *bio;
> +	atomic_t pending;
> +	int err;
>  };
>  
>  static DEFINE_MUTEX(blkfront_mutex);
>  static const struct block_device_operations xlvbd_block_fops;
>  
> +/*
> + * Maximum number of segments in indirect requests, the actual value used by
> + * the frontend driver is the minimum of this value and the value provided
> + * by the backend driver.
> + */
> +
> +static int xen_blkif_max_segments = 64;
> +module_param_named(max_segments, xen_blkif_max_segments, int, 0);
> +MODULE_PARM_DESC(max_segments,
> +"Maximum number of segments in indirect requests");
> +
>  #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
>  
>  /*
> @@ -98,7 +116,7 @@ struct blkfront_info
>  	enum blkif_state connected;
>  	int ring_ref;
>  	struct blkif_front_ring ring;
> -	struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct scatterlist *sg;
>  	unsigned int evtchn, irq;
>  	struct request_queue *rq;
>  	struct work_struct work;
> @@ -114,6 +132,8 @@ struct blkfront_info
>  	unsigned int discard_granularity;
>  	unsigned int discard_alignment;
>  	unsigned int feature_persistent:1;
> +	unsigned int max_indirect_segments;
> +	unsigned int sector_size;
>  	int is_ready;
>  };
>  
> @@ -142,6 +162,14 @@ static DEFINE_SPINLOCK(minor_lock);
>  
>  #define DEV_NAME	"xvd"	/* name in /dev */
>  
> +#define SEGS_PER_INDIRECT_FRAME \
> +	(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
> +#define INDIRECT_GREFS(_segs) \
> +	((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
> +#define MIN(_a, _b) ((_a) < (_b) ? (_a) : (_b))
> +
> +static int blkfront_setup_indirect(struct blkfront_info *info);
> +
>  static int get_id_from_freelist(struct blkfront_info *info)
>  {
>  	unsigned long free = info->shadow_free;
> @@ -358,7 +386,8 @@ static int blkif_queue_request(struct request *req)
>  	struct blkif_request *ring_req;
>  	unsigned long id;
>  	unsigned int fsect, lsect;
> -	int i, ref;
> +	int i, ref, n;
> +	struct blkif_request_segment_aligned *segments = NULL;
>  
>  	/*
>  	 * Used to store if we are able to queue the request by just using
> @@ -369,21 +398,27 @@ static int blkif_queue_request(struct request *req)
>  	grant_ref_t gref_head;
>  	struct grant *gnt_list_entry = NULL;
>  	struct scatterlist *sg;
> +	int nseg, max_grefs;
>  
>  	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>  		return 1;
>  
> -	/* Check if we have enought grants to allocate a requests */
> -	if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +	max_grefs = info->max_indirect_segments ?
> +	            info->max_indirect_segments +
> +	            INDIRECT_GREFS(info->max_indirect_segments) :
> +	            BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +
> +	/* Check if we have enough grants to allocate a requests */
> +	if (info->persistent_gnts_c < max_grefs) {
>  		new_persistent_gnts = 1;
>  		if (gnttab_alloc_grant_references(
> -		    BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
> +		    max_grefs - info->persistent_gnts_c,
>  		    &gref_head) < 0) {
>  			gnttab_request_free_callback(
>  				&info->callback,
>  				blkif_restart_queue_callback,
>  				info,
> -				BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +				max_grefs);
>  			return 1;
>  		}
>  	} else
> @@ -394,42 +429,82 @@ static int blkif_queue_request(struct request *req)
>  	id = get_id_from_freelist(info);
>  	info->shadow[id].request = req;
>  
> -	ring_req->u.rw.id = id;
> -	ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
> -	ring_req->u.rw.handle = info->handle;
> -
> -	ring_req->operation = rq_data_dir(req) ?
> -		BLKIF_OP_WRITE : BLKIF_OP_READ;
> -
> -	if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
> -		/*
> -		 * Ideally we can do an unordered flush-to-disk. In case the
> -		 * backend onlysupports barriers, use that. A barrier request
> -		 * a superset of FUA, so we can implement it the same
> -		 * way.  (It's also a FLUSH+FUA, since it is
> -		 * guaranteed ordered WRT previous writes.)
> -		 */
> -		ring_req->operation = info->flush_op;
> -	}
> -
>  	if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
>  		/* id, sector_number and handle are set above. */
>  		ring_req->operation = BLKIF_OP_DISCARD;
>  		ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
> +		ring_req->u.discard.id = id;
> +		ring_req->u.discard.sector_number =
> +			(blkif_sector_t)blk_rq_pos(req);
>  		if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
>  			ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
>  		else
>  			ring_req->u.discard.flag = 0;
>  	} else {
> -		ring_req->u.rw.nr_segments = blk_rq_map_sg(req->q, req,
> -							   info->sg);
> -		BUG_ON(ring_req->u.rw.nr_segments >
> -		       BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -
> -		for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> +		BUG_ON(info->max_indirect_segments == 0 &&
> +		       req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +		BUG_ON(info->max_indirect_segments &&
> +		       req->nr_phys_segments > info->max_indirect_segments);
> +		nseg = blk_rq_map_sg(req->q, req, info->sg);
> +		if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +			/* Indirect OP */
> +			ring_req->operation = BLKIF_OP_INDIRECT;
> +			ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
> +				BLKIF_OP_WRITE : BLKIF_OP_READ;
> +			ring_req->u.indirect.id = id;
> +			ring_req->u.indirect.sector_number =
> +				(blkif_sector_t)blk_rq_pos(req);
> +			ring_req->u.indirect.handle = info->handle;
> +			if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
> +		/*
> +		 * Ideally we can do an unordered flush-to-disk. In case the
> +		 * backend onlysupports barriers, use that. A barrier request
> +		 * a superset of FUA, so we can implement it the same
> +		 * way.  (It's also a FLUSH+FUA, since it is
> +		 * guaranteed ordered WRT previous writes.)
> +		 */
> +				ring_req->u.indirect.indirect_op =
> +					info->flush_op;
> +			}
> +			ring_req->u.indirect.nr_segments = nseg;
> +		} else {
> +			ring_req->u.rw.id = id;
> +			ring_req->u.rw.sector_number =
> +				(blkif_sector_t)blk_rq_pos(req);
> +			ring_req->u.rw.handle = info->handle;
> +			ring_req->operation = rq_data_dir(req) ?
> +				BLKIF_OP_WRITE : BLKIF_OP_READ;
> +			if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
> +		/*
> +		 * Ideally we can do an unordered flush-to-disk. In case the
> +		 * backend onlysupports barriers, use that. A barrier request
> +		 * a superset of FUA, so we can implement it the same
> +		 * way.  (It's also a FLUSH+FUA, since it is
> +		 * guaranteed ordered WRT previous writes.)
> +		 */
> +				ring_req->operation = info->flush_op;
> +			}
> +			ring_req->u.rw.nr_segments = nseg;
> +		}
> +		for_each_sg(info->sg, sg, nseg, i) {
>  			fsect = sg->offset >> 9;
>  			lsect = fsect + (sg->length >> 9) - 1;
>  
> +			if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
> +			    (i % SEGS_PER_INDIRECT_FRAME == 0)) {
> +				if (segments)
> +					kunmap_atomic(segments);
> +
> +				n = i / SEGS_PER_INDIRECT_FRAME;
> +				gnt_list_entry = get_grant(&gref_head, info);
> +				info->shadow[id].indirect_grants[n] =
> +					gnt_list_entry;
> +				segments = kmap_atomic(
> +					pfn_to_page(gnt_list_entry->pfn));
> +				ring_req->u.indirect.indirect_grefs[n] =
> +					gnt_list_entry->gref;
> +			}
> +
>  			gnt_list_entry = get_grant(&gref_head, info);
>  			ref = gnt_list_entry->gref;
>  
> @@ -461,13 +536,23 @@ static int blkif_queue_request(struct request *req)
>  				kunmap_atomic(bvec_data);
>  				kunmap_atomic(shared_data);
>  			}
> -
> -			ring_req->u.rw.seg[i] =
> -					(struct blkif_request_segment) {
> -						.gref       = ref,
> -						.first_sect = fsect,
> -						.last_sect  = lsect };
> +			if (ring_req->operation != BLKIF_OP_INDIRECT) {
> +				ring_req->u.rw.seg[i] =
> +						(struct blkif_request_segment) {
> +							.gref       = ref,
> +							.first_sect = fsect,
> +							.last_sect  = lsect };
> +			} else {
> +				n = i % SEGS_PER_INDIRECT_FRAME;
> +				segments[n] =
> +					(struct blkif_request_segment_aligned) {
> +							.gref       = ref,
> +							.first_sect = fsect,
> +							.last_sect  = lsect };
> +			}
>  		}
> +		if (segments)
> +			kunmap_atomic(segments);
>  	}
>  
>  	info->ring.req_prod_pvt++;
> @@ -542,7 +627,8 @@ wait:
>  		flush_requests(info);
>  }
>  
> -static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> +static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
> +                                unsigned int segments)
>  {
>  	struct request_queue *rq;
>  	struct blkfront_info *info = gd->private_data;
> @@ -571,7 +657,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  	blk_queue_max_segment_size(rq, PAGE_SIZE);
>  
>  	/* Ensure a merged request will fit in a single I/O ring slot. */
> -	blk_queue_max_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +	blk_queue_max_segments(rq, segments);
>  
>  	/* Make sure buffer addresses are sector-aligned. */
>  	blk_queue_dma_alignment(rq, 511);
> @@ -588,13 +674,14 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  static void xlvbd_flush(struct blkfront_info *info)
>  {
>  	blk_queue_flush(info->rq, info->feature_flush);
> -	printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
> +	printk(KERN_INFO "blkfront: %s: %s: %s %s %s\n",
>  	       info->gd->disk_name,
>  	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>  		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>  		"flush diskcache" : "barrier or flush"),
>  	       info->feature_flush ? "enabled" : "disabled",
> -	       info->feature_persistent ? "using persistent grants" : "");
> +	       info->feature_persistent ? "using persistent grants" : "",
> +	       info->max_indirect_segments ? "using indirect descriptors" : "");
>  }
>  
>  static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> @@ -734,7 +821,9 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
>  	gd->driverfs_dev = &(info->xbdev->dev);
>  	set_capacity(gd, capacity);
>  
> -	if (xlvbd_init_blk_queue(gd, sector_size)) {
> +	if (xlvbd_init_blk_queue(gd, sector_size,
> +	                         info->max_indirect_segments ? :
> +	                         BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
>  		del_gendisk(gd);
>  		goto release;
>  	}
> @@ -818,6 +907,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  {
>  	struct grant *persistent_gnt;
>  	struct grant *n;
> +	int i, j, segs;
>  
>  	/* Prevent new requests being issued until we fix things up. */
>  	spin_lock_irq(&info->io_lock);
> @@ -843,6 +933,47 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	}
>  	BUG_ON(info->persistent_gnts_c != 0);
>  
> +	kfree(info->sg);
> +	info->sg = NULL;
> +	for (i = 0; i < BLK_RING_SIZE; i++) {
> +		/*
> +		 * Clear persistent grants present in requests already
> +		 * on the shared ring
> +		 */
> +		if (!info->shadow[i].request)
> +			goto free_shadow;
> +
> +		segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
> +		       info->shadow[i].req.u.indirect.nr_segments :
> +		       info->shadow[i].req.u.rw.nr_segments;
> +		for (j = 0; j < segs; j++) {
> +			persistent_gnt = info->shadow[i].grants_used[j];
> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> +			__free_page(pfn_to_page(persistent_gnt->pfn));
> +			kfree(persistent_gnt);
> +		}
> +
> +		if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
> +			/*
> +			 * If this is not an indirect operation don't try to
> +			 * free indirect segments
> +			 */
> +			goto free_shadow;
> +
> +		for (j = 0; j < INDIRECT_GREFS(segs); j++) {
> +			persistent_gnt = info->shadow[i].indirect_grants[j];
> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> +			__free_page(pfn_to_page(persistent_gnt->pfn));
> +			kfree(persistent_gnt);
> +		}
> +
> +free_shadow:
> +		kfree(info->shadow[i].grants_used);
> +		info->shadow[i].grants_used = NULL;
> +		kfree(info->shadow[i].indirect_grants);
> +		info->shadow[i].indirect_grants = NULL;
> +	}
> +
>  	/* No more gnttab callback work. */
>  	gnttab_cancel_free_callback(&info->callback);
>  	spin_unlock_irq(&info->io_lock);
> @@ -873,6 +1004,10 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>  	char *bvec_data;
>  	void *shared_data;
>  	unsigned int offset = 0;
> +	int nseg;
> +
> +	nseg = s->req.operation == BLKIF_OP_INDIRECT ?
> +		s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
>  
>  	if (bret->operation == BLKIF_OP_READ) {
>  		/*
> @@ -885,7 +1020,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>  			BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
>  			if (bvec->bv_offset < offset)
>  				i++;
> -			BUG_ON(i >= s->req.u.rw.nr_segments);
> +			BUG_ON(i >= nseg);
>  			shared_data = kmap_atomic(
>  				pfn_to_page(s->grants_used[i]->pfn));
>  			bvec_data = bvec_kmap_irq(bvec, &flags);
> @@ -897,10 +1032,17 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>  		}
>  	}
>  	/* Add the persistent grant into the list of free grants */
> -	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> +	for (i = 0; i < nseg; i++) {
>  		list_add(&s->grants_used[i]->node, &info->persistent_gnts);
>  		info->persistent_gnts_c++;
>  	}
> +	if (s->req.operation == BLKIF_OP_INDIRECT) {
> +		for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
> +			list_add(&s->indirect_grants[i]->node,
> +			         &info->persistent_gnts);
> +			info->persistent_gnts_c++;
> +		}
> +	}
>  }
>  
>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -1034,8 +1176,6 @@ static int setup_blkring(struct xenbus_device *dev,
>  	SHARED_RING_INIT(sring);
>  	FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
>  
> -	sg_init_table(info->sg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -
>  	err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
>  	if (err < 0) {
>  		free_page((unsigned long)sring);
> @@ -1116,12 +1256,6 @@ again:
>  		goto destroy_blkring;
>  	}
>  
> -	/* Allocate memory for grants */
> -	err = fill_grant_buffer(info, BLK_RING_SIZE *
> -	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -	if (err)
> -		goto out;
> -
>  	xenbus_switch_state(dev, XenbusStateInitialised);
>  
>  	return 0;
> @@ -1223,13 +1357,84 @@ static int blkfront_probe(struct xenbus_device *dev,
>  	return 0;
>  }
>  
> +/*
> + * This is a clone of md_trim_bio, used to split a bio into smaller ones
> + */
> +static void trim_bio(struct bio *bio, int offset, int size)
> +{
> +	/* 'bio' is a cloned bio which we need to trim to match
> +	 * the given offset and size.
> +	 * This requires adjusting bi_sector, bi_size, and bi_io_vec
> +	 */
> +	int i;
> +	struct bio_vec *bvec;
> +	int sofar = 0;
> +
> +	size <<= 9;
> +	if (offset == 0 && size == bio->bi_size)
> +		return;
> +
> +	bio->bi_sector += offset;
> +	bio->bi_size = size;
> +	offset <<= 9;
> +	clear_bit(BIO_SEG_VALID, &bio->bi_flags);
> +
> +	while (bio->bi_idx < bio->bi_vcnt &&
> +	       bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
> +		/* remove this whole bio_vec */
> +		offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
> +		bio->bi_idx++;
> +	}
> +	if (bio->bi_idx < bio->bi_vcnt) {
> +		bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
> +		bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
> +	}
> +	/* avoid any complications with bi_idx being non-zero*/
> +	if (bio->bi_idx) {
> +		memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
> +			(bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
> +		bio->bi_vcnt -= bio->bi_idx;
> +		bio->bi_idx = 0;
> +	}
> +	/* Make sure vcnt and last bv are not too big */
> +	bio_for_each_segment(bvec, bio, i) {
> +		if (sofar + bvec->bv_len > size)
> +			bvec->bv_len = size - sofar;
> +		if (bvec->bv_len == 0) {
> +			bio->bi_vcnt = i;
> +			break;
> +		}
> +		sofar += bvec->bv_len;
> +	}
> +}
> +
> +static void split_bio_end(struct bio *bio, int error)
> +{
> +	struct split_bio *split_bio = bio->bi_private;
> +
> +	if (error)
> +		split_bio->err = error;
> +
> +	if (atomic_dec_and_test(&split_bio->pending)) {
> +		split_bio->bio->bi_phys_segments = 0;
> +		bio_endio(split_bio->bio, split_bio->err);
> +		kfree(split_bio);
> +	}
> +	bio_put(bio);
> +}
>  
>  static int blkif_recover(struct blkfront_info *info)
>  {
>  	int i;
> -	struct blkif_request *req;
> +	struct request *req, *n;
>  	struct blk_shadow *copy;
> -	int j;
> +	int rc;
> +	struct bio *bio, *cloned_bio;
> +	struct bio_list bio_list, merge_bio;
> +	unsigned int segs;
> +	int pending, offset, size;
> +	struct split_bio *split_bio;
> +	struct list_head requests;
>  
>  	/* Stage 1: Make a safe copy of the shadow state. */
>  	copy = kmalloc(sizeof(info->shadow),
> @@ -1245,36 +1450,64 @@ static int blkif_recover(struct blkfront_info *info)
>  	info->shadow_free = info->ring.req_prod_pvt;
>  	info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
>  
> -	/* Stage 3: Find pending requests and requeue them. */
> +	rc = blkfront_setup_indirect(info);
> +	if (rc) {
> +		kfree(copy);
> +		return rc;
> +	}
> +
> +	segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	blk_queue_max_segments(info->rq, segs);
> +	bio_list_init(&bio_list);
> +	INIT_LIST_HEAD(&requests);
>  	for (i = 0; i < BLK_RING_SIZE; i++) {
>  		/* Not in use? */
>  		if (!copy[i].request)
>  			continue;
>  
> -		/* Grab a request slot and copy shadow state into it. */
> -		req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> -		*req = copy[i].req;
> -
> -		/* We get a new request id, and must reset the shadow state. */
> -		req->u.rw.id = get_id_from_freelist(info);
> -		memcpy(&info->shadow[req->u.rw.id], &copy[i], sizeof(copy[i]));
> -
> -		if (req->operation != BLKIF_OP_DISCARD) {
> -		/* Rewrite any grant references invalidated by susp/resume. */
> -			for (j = 0; j < req->u.rw.nr_segments; j++)
> -				gnttab_grant_foreign_access_ref(
> -					req->u.rw.seg[j].gref,
> -					info->xbdev->otherend_id,
> -					pfn_to_mfn(copy[i].grants_used[j]->pfn),
> -					0);
> +		/*
> +		 * Get the bios in the request so we can re-queue them.
> +		 */
> +		if (copy[i].request->cmd_flags &
> +		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
> +			/*
> +			 * Flush operations don't contain bios, so
> +			 * we need to requeue the whole request
> +			 */
> +			list_add(&copy[i].request->queuelist, &requests);
> +			continue;
>  		}
> -		info->shadow[req->u.rw.id].req = *req;
> -
> -		info->ring.req_prod_pvt++;
> +		merge_bio.head = copy[i].request->bio;
> +		merge_bio.tail = copy[i].request->biotail;
> +		bio_list_merge(&bio_list, &merge_bio);
> +		copy[i].request->bio = NULL;
> +		blk_put_request(copy[i].request);
>  	}
>  
>  	kfree(copy);
>  
> +	/*
> +	 * Empty the queue, this is important because we might have
> +	 * requests in the queue with more segments than what we
> +	 * can handle now.
> +	 */
> +	spin_lock_irq(&info->io_lock);
> +	while ((req = blk_fetch_request(info->rq)) != NULL) {
> +		if (req->cmd_flags &
> +		    (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
> +			list_add(&req->queuelist, &requests);
> +			continue;
> +		}
> +		merge_bio.head = req->bio;
> +		merge_bio.tail = req->biotail;
> +		bio_list_merge(&bio_list, &merge_bio);
> +		req->bio = NULL;
> +		if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
> +			pr_alert("diskcache flush request found!\n");
> +		__blk_put_request(info->rq, req);
> +	}
> +	spin_unlock_irq(&info->io_lock);
> +
>  	xenbus_switch_state(info->xbdev, XenbusStateConnected);
>  
>  	spin_lock_irq(&info->io_lock);
> @@ -1282,14 +1515,50 @@ static int blkif_recover(struct blkfront_info *info)
>  	/* Now safe for us to use the shared ring */
>  	info->connected = BLKIF_STATE_CONNECTED;
>  
> -	/* Send off requeued requests */
> -	flush_requests(info);
> -
>  	/* Kick any other new requests queued since we resumed */
>  	kick_pending_request_queues(info);
>  
> +	list_for_each_entry_safe(req, n, &requests, queuelist) {
> +		/* Requeue pending requests (flush or discard) */
> +		list_del_init(&req->queuelist);
> +		BUG_ON(req->nr_phys_segments > segs);
> +		blk_requeue_request(info->rq, req);
> +	}
>  	spin_unlock_irq(&info->io_lock);
>  
> +	while ((bio = bio_list_pop(&bio_list)) != NULL) {
> +		/* Traverse the list of pending bios and re-queue them */
> +		if (bio_segments(bio) > segs) {
> +			/*
> +			 * This bio has more segments than what we can
> +			 * handle, we have to split it.
> +			 */
> +			pending = (bio_segments(bio) + segs - 1) / segs;
> +			split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
> +			BUG_ON(split_bio == NULL);
> +			atomic_set(&split_bio->pending, pending);
> +			split_bio->bio = bio;
> +			for (i = 0; i < pending; i++) {
> +				offset = (i * segs * PAGE_SIZE) >> 9;
> +				size = MIN((segs * PAGE_SIZE) >> 9,
> +				           (bio->bi_size >> 9) - offset);
> +				cloned_bio = bio_clone(bio, GFP_NOIO);
> +				BUG_ON(cloned_bio == NULL);
> +				trim_bio(cloned_bio, offset, size);
> +				cloned_bio->bi_private = split_bio;
> +				cloned_bio->bi_end_io = split_bio_end;
> +				submit_bio(cloned_bio->bi_rw, cloned_bio);
> +			}
> +			/*
> +			 * Now we have to wait for all those smaller bios to
> +			 * end, so we can also end the "parent" bio.
> +			 */
> +			continue;
> +		}
> +		/* We don't need to split this bio */
> +		submit_bio(bio->bi_rw, bio);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -1309,8 +1578,12 @@ static int blkfront_resume(struct xenbus_device *dev)
>  	blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
>  
>  	err = talk_to_blkback(dev, info);
> -	if (info->connected == BLKIF_STATE_SUSPENDED && !err)
> -		err = blkif_recover(info);
> +
> +	/*
> +	 * We have to wait for the backend to switch to
> +	 * connected state, since we want to read which
> +	 * features it supports.
> +	 */
>  
>  	return err;
>  }
> @@ -1388,6 +1661,62 @@ static void blkfront_setup_discard(struct blkfront_info *info)
>  	kfree(type);
>  }
>  
> +static int blkfront_setup_indirect(struct blkfront_info *info)
> +{
> +	unsigned int indirect_segments, segs;
> +	int err, i;
> +
> +	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> +			    "max-indirect-segments", "%u", &indirect_segments,
> +			    NULL);
> +	if (err) {
> +		info->max_indirect_segments = 0;
> +		segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	} else {
> +		info->max_indirect_segments = MIN(indirect_segments,
> +		                                  xen_blkif_max_segments);
> +		segs = info->max_indirect_segments;
> +	}
> +	info->sg = kzalloc(sizeof(info->sg[0]) * segs, GFP_KERNEL);
> +	if (info->sg == NULL)
> +		goto out_of_memory;
> +	sg_init_table(info->sg, segs);
> +
> +	err = fill_grant_buffer(info,
> +	                        (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
> +	if (err)
> +		goto out_of_memory;
> +
> +	for (i = 0; i < BLK_RING_SIZE; i++) {
> +		info->shadow[i].grants_used = kzalloc(
> +			sizeof(info->shadow[i].grants_used[0]) * segs,
> +			GFP_NOIO);
> +		if (info->max_indirect_segments)
> +			info->shadow[i].indirect_grants = kzalloc(
> +				sizeof(info->shadow[i].indirect_grants[0]) *
> +				INDIRECT_GREFS(segs),
> +				GFP_NOIO);
> +		if ((info->shadow[i].grants_used == NULL) ||
> +		     (info->max_indirect_segments &&
> +		     (info->shadow[i].indirect_grants == NULL)))
> +			goto out_of_memory;
> +	}
> +
> +
> +	return 0;
> +
> +out_of_memory:
> +	kfree(info->sg);
> +	info->sg = NULL;
> +	for (i = 0; i < BLK_RING_SIZE; i++) {
> +		kfree(info->shadow[i].grants_used);
> +		info->shadow[i].grants_used = NULL;
> +		kfree(info->shadow[i].indirect_grants);
> +		info->shadow[i].indirect_grants = NULL;
> +	}
> +	return -ENOMEM;
> +}
> +
>  /*
>   * Invoked when the backend is finally 'ready' (and has told produced
>   * the details about the physical device - #sectors, size, etc).
> @@ -1415,8 +1744,9 @@ static void blkfront_connect(struct blkfront_info *info)
>  		set_capacity(info->gd, sectors);
>  		revalidate_disk(info->gd);
>  
> -		/* fall through */
> +		return;
>  	case BLKIF_STATE_SUSPENDED:
> +		blkif_recover(info);
>  		return;
>  
>  	default:
> @@ -1437,6 +1767,7 @@ static void blkfront_connect(struct blkfront_info *info)
>  				 info->xbdev->otherend);
>  		return;
>  	}
> +	info->sector_size = sector_size;
>  
>  	info->feature_flush = 0;
>  	info->flush_op = 0;
> @@ -1484,6 +1815,13 @@ static void blkfront_connect(struct blkfront_info *info)
>  	else
>  		info->feature_persistent = persistent;
>  
> +	err = blkfront_setup_indirect(info);
> +	if (err) {
> +		xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
> +				 info->xbdev->otherend);
> +		return;
> +	}
> +
>  	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
>  	if (err) {
>  		xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
> diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
> index 01c3d62..6d99849 100644
> --- a/include/xen/interface/io/blkif.h
> +++ b/include/xen/interface/io/blkif.h
> @@ -102,6 +102,8 @@ typedef uint64_t blkif_sector_t;
>   */
>  #define BLKIF_OP_DISCARD           5
>  
> +#define BLKIF_OP_INDIRECT          6
> +
>  /*
>   * Maximum scatter/gather segments per request.
>   * This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
> @@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
>   */
>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>  
> +#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
> +
> +struct blkif_request_segment_aligned {
> +	grant_ref_t gref;        /* reference to I/O buffer frame        */
> +	/* @first_sect: first sector in frame to transfer (inclusive).   */
> +	/* @last_sect: last sector in frame to transfer (inclusive).     */
> +	uint8_t     first_sect, last_sect;
> +	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
> +} __attribute__((__packed__));
> +
>  struct blkif_request_rw {
>  	uint8_t        nr_segments;  /* number of segments                   */
>  	blkif_vdev_t   handle;       /* only for read/write requests         */
> @@ -138,11 +150,24 @@ struct blkif_request_discard {
>  	uint8_t        _pad3;
>  } __attribute__((__packed__));
>  
> +struct blkif_request_indirect {
> +	uint8_t        indirect_op;
> +	uint16_t       nr_segments;
> +#ifdef CONFIG_X86_64
> +	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
> +#endif
> +	uint64_t       id;
> +	blkif_vdev_t   handle;
> +	blkif_sector_t sector_number;
> +	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
> +} __attribute__((__packed__));
> +
>  struct blkif_request {
>  	uint8_t        operation;    /* BLKIF_OP_???                         */
>  	union {
>  		struct blkif_request_rw rw;
>  		struct blkif_request_discard discard;
> +		struct blkif_request_indirect indirect;
>  	} u;
>  } __attribute__((__packed__));
>  
> -- 
> 1.7.7.5 (Apple Git-26)
> 
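The bio splitting done in blkif_recover above is easier to follow with
concrete numbers. Below is a stand-alone sketch of just the arithmetic
(not part of the patch; the 64-segment limit and the bio size are made-up
examples, and one segment per full page is assumed):

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long segs = 64;                    /* example max segments per request */
	unsigned long bi_size = 1028UL * PAGE_SIZE; /* example bio payload: 1028 pages  */
	unsigned long nsegs = bi_size / PAGE_SIZE;  /* one segment per full page        */
	unsigned long pending = (nsegs + segs - 1) / segs;
	unsigned long i, offset, size;

	for (i = 0; i < pending; i++) {
		offset = (i * segs * PAGE_SIZE) >> 9;        /* in sectors */
		size = MIN((segs * PAGE_SIZE) >> 9,
		           (bi_size >> 9) - offset);         /* in sectors */
		printf("clone %lu: trim_bio(clone, %lu, %lu)\n", i, offset, size);
	}
	return 0;
}

Every clone covers (segs * PAGE_SIZE) >> 9 sectors except the last one,
which gets whatever remains; that is what the MIN() in the patch is for.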

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 13:28       ` Jan Beulich
@ 2013-03-04 20:44         ` Konrad Rzeszutek Wilk
  2013-03-05  8:11           ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-04 20:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, xen-devel, linux-kernel

On Thu, Feb 28, 2013 at 01:28:48PM +0000, Jan Beulich wrote:
> >>> On 28.02.13 at 13:00, Roger Pau Monné<roger.pau@citrix.com> wrote:
> > On 28/02/13 12:19, Jan Beulich wrote:
> >>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
> >>> @@ -109,6 +111,16 @@ typedef uint64_t blkif_sector_t;
> >>>   */
> >>>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
> >>>  
> >>> +#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8
> >>> +
> >>> +struct blkif_request_segment_aligned {
> >>> +	grant_ref_t gref;        /* reference to I/O buffer frame        */
> >>> +	/* @first_sect: first sector in frame to transfer (inclusive).   */
> >>> +	/* @last_sect: last sector in frame to transfer (inclusive).     */
> >>> +	uint8_t     first_sect, last_sect;
> >>> +	uint16_t    _pad; /* padding to make it 8 bytes, so it's cache-aligned */
> >>> +} __attribute__((__packed__));
> >> 
> >> What's the __packed__ for here?
> > 
> > Yes, that's not needed.
> > 
> >> 
> >>> +
> >>>  struct blkif_request_rw {
> >>>  	uint8_t        nr_segments;  /* number of segments                   */
> >>>  	blkif_vdev_t   handle;       /* only for read/write requests         */
> >>> @@ -138,11 +150,24 @@ struct blkif_request_discard {
> >>>  	uint8_t        _pad3;
> >>>  } __attribute__((__packed__));
> >>>  
> >>> +struct blkif_request_indirect {
> >>> +	uint8_t        indirect_op;
> >>> +	uint16_t       nr_segments;
> >>> +#ifdef CONFIG_X86_64
> >>> +	uint32_t       _pad1;        /* offsetof(blkif_...,u.indirect.id) == 8 */
> >>> +#endif
> >> 
> >> Either you want the structure be packed tightly (and you don't care
> >> about misaligned fields), in which case you shouldn't need a padding
> >> field. That's even more so as there's no padding between indirect_op
> >> and nr_segments, so everything is misaligned anyway, and the
> >> comment above is wrong too (offsetof() really ought to yield 7 in
> >> that case).
> > 
> > This padding is because we want to have the "id" field at the same
> > position as blkif_request_rw, so we need to add the padding for it to
> > match 32 & 64 bit blkif_request_rw structures, this prevents adding some
> > "if (req.op == BLKIF_OP_INDIRECT)..." if we only need to get the id of
> > the request.
> 
> Oh, right, that's desirable of course.
> 
> > The comment is indeed wrong, I've copied it from blkif_request_discard
> > and forgot to change the offset
> 
> But the offset stated there then is right after all - I forgot that
> there is a 1-byte field outside the union (the way this is being done
> in the upstream Linux header is really ugly imo, but I guess Jeremy
> and/or Konrad liked it that way). That's also why the packed
> attribute is needed here.

I am not particularly fond of it, as I keep on forgetting about the 1-byte field
as well. If you have a patch to clean it up I would love to see it.

> 
> But you will probably want to switch sector_number and handle, so
> that sector_number becomes aligned, and add another 16-bit
> padding field between handle and indirect_grefs[].
> 
> I also wonder whether "indirect_op" wouldn't better be named
> "actual_op" or just "op".

<nods> 'op' sounds good. With a comment saying it can do all of the BLKIF_OPS_..
except the BLKIF_OP_INDIRECT one. Though one could in theory chain
it that way for fun.

> 
> Jan
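For what it's worth, the padding being discussed is easy to verify
outside the kernel. A minimal user-space sketch (the typedefs are
stand-ins for the Xen ones, and the CONFIG_X86_64 padding is included
unconditionally, so this only models the 64-bit layout):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

typedef uint32_t grant_ref_t;
typedef uint16_t blkif_vdev_t;
typedef uint64_t blkif_sector_t;

#define BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST 8

struct blkif_request_rw {
	uint8_t        nr_segments;
	blkif_vdev_t   handle;
	uint32_t       _pad1;           /* 64-bit only in the real header */
	uint64_t       id;
	blkif_sector_t sector_number;
	/* seg[] omitted, it does not affect the offset of id */
} __attribute__((__packed__));

struct blkif_request_indirect {
	uint8_t        indirect_op;
	uint16_t       nr_segments;
	uint32_t       _pad1;           /* 64-bit only in the real header */
	uint64_t       id;
	blkif_vdev_t   handle;
	blkif_sector_t sector_number;
	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_GREFS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request {
	uint8_t operation;              /* the 1-byte field outside the union */
	union {
		struct blkif_request_rw       rw;
		struct blkif_request_indirect indirect;
	} u;
} __attribute__((__packed__));

int main(void)
{
	/* Both print 8 here; dropping the _pad1 fields models the 32-bit
	 * layout, where both drop to 4. */
	printf("u.rw.id       at %zu\n",
	       offsetof(struct blkif_request, u.rw.id));
	printf("u.indirect.id at %zu\n",
	       offsetof(struct blkif_request, u.indirect.id));
	return 0;
}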

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr
  2013-03-04 17:19     ` Roger Pau Monné
@ 2013-03-05  8:06       ` Jan Beulich
  2013-03-05 17:02         ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-03-05  8:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 04.03.13 at 18:19, Roger Pau Monné<roger.pau@citrix.com> wrote:
> On 28/02/13 11:58, Jan Beulich wrote:
>>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>> And then the biolist[] array really can be folded into a union
>> with the remaining seg[] one, as their usage scopes are easily
>> separable.
> 
> Could we leave that for a further patch? I would like to avoid messing
> any more with blkback, as I'm already touching a lot of bits with this
> patch series.

Fine by me, but ...

>>> @@ -631,7 +629,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>>>  			if (ret)
>>>  				continue;
>>>  
>>> -			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>>> +			seg[i].buf = pfn_to_mfn(page_to_pfn(
>>> +				persistent_gnts[i]->page)) << PAGE_SHIFT |
>> 
>> So why do you do this? The only reader masks the field with
>> ~PAGE_MASK anyway.
> 
> Yes, I only need to store first_sect.

... as you're touching this code anyway, and as it'll make the
code as well as the patch smaller, could you at least drop this
pointless storing of the page address (which otherwise I'd ask
you to properly parenthesize anyway)?

And iirc once that's dropped, the storing of first_sect ends up
being identical between the if and else bodies, so it could be
pulled out (further reducing code size, albeit at the price of a
marginally bigger patch).

Jan
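As a side illustration of why the stored page address is dead weight
here: the only consumer masks seg[i].buf with ~PAGE_MASK, which discards
everything but the in-page offset. A stand-alone sketch with made-up
example values:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
	uint64_t mfn = 0x123456;      /* arbitrary example frame number */
	unsigned int first_sect = 3;  /* sector offset inside the page  */

	uint64_t with_addr = (mfn << PAGE_SHIFT) | (first_sect << 9);
	uint64_t sect_only = (uint64_t)first_sect << 9;

	/* Both masked values come out identical, so storing the page
	 * address in seg[i].buf buys nothing. */
	printf("%#llx == %#llx\n",
	       (unsigned long long)(with_addr & ~PAGE_MASK),
	       (unsigned long long)(sect_only & ~PAGE_MASK));
	return 0;
}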

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-04 20:44         ` Konrad Rzeszutek Wilk
@ 2013-03-05  8:11           ` Jan Beulich
  2013-03-05 14:16             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2013-03-05  8:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: roger.pau, xen-devel, linux-kernel

>>> On 04.03.13 at 21:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> <nods> 'op' sounds good. With a comment saying it can do all of the 
> BLKIF_OPS_..
> > except the BLKIF_OP_INDIRECT one. Though one could in theory chain
> it that way for fun.

In fact I'd like to exclude chaining as well as BLKIF_OP_DISCARD here.
The former should - if useful for anything - be controlled by a
separate feature flag, and the latter is plain pointless to indirect.
And I reckon the same would apply to BLKIF_OP_FLUSH_DISKCACHE
and BLKIF_OP_RESERVED_1 - i.e. it might be better to state that
indirection is only permitted for normal I/O (read/write) ops.

Jan
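On the backend side that restriction boils down to a one-line check when
the inner opcode is decoded. A hedged sketch (the opcode values come from
the public blkif header, the helper name is made up):

#include <stdint.h>
#include <stdio.h>

#define BLKIF_OP_READ      0
#define BLKIF_OP_WRITE     1
#define BLKIF_OP_DISCARD   5
#define BLKIF_OP_INDIRECT  6

/* Only normal I/O may be indirected: no discard, no flush/barrier,
 * and in particular no nested BLKIF_OP_INDIRECT. */
static int valid_indirect_op(uint8_t op)
{
	return op == BLKIF_OP_READ || op == BLKIF_OP_WRITE;
}

int main(void)
{
	printf("read %d write %d discard %d indirect %d\n",
	       valid_indirect_op(BLKIF_OP_READ),
	       valid_indirect_op(BLKIF_OP_WRITE),
	       valid_indirect_op(BLKIF_OP_DISCARD),
	       valid_indirect_op(BLKIF_OP_INDIRECT));
	return 0;
}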


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-03-04 19:39   ` Konrad Rzeszutek Wilk
@ 2013-03-05 11:04     ` Roger Pau Monné
  2013-03-05 14:18       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 11:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 04/03/13 20:39, Konrad Rzeszutek Wilk wrote:
> On Thu, Feb 28, 2013 at 11:28:47AM +0100, Roger Pau Monne wrote:
>> This prevents us from having to call alloc_page while we are preparing
>> the request. Since blkfront was calling alloc_page with a spinlock
>> held we used GFP_ATOMIC, which can fail if we are requesting a lot of
>> pages since it is using the emergency memory pools.
>>
>> Allocating all the pages at init prevents us from having to call
>> alloc_page, thus preventing possible failures.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Cc: xen-devel@lists.xen.org
>> ---
>>  drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
>>  1 files changed, 79 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2e39eaf..5ba6b87 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
>>  	return 0;
>>  }
>>  
>> +static int fill_grant_buffer(struct blkfront_info *info, int num)
>> +{
>> +	struct page *granted_page;
>> +	struct grant *gnt_list_entry, *n;
>> +	int i = 0;
>> +
>> +	while(i < num) {
>> +		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
> 
> GFP_NORMAL ?

drivers/block/xen-blkfront.c:175: error: ‘GFP_NORMAL’ undeclared (first
use in this function)

Did you mean GFP_KERNEL? I think GFP_NOIO is more suitable, it can block
but no IO will be performed.

> 
>> +		if (!gnt_list_entry)
>> +			goto out_of_memory;
> 
> Hmm, I guess another patch could be to convert this to a fail-safe
> mechanism. Meaning if we fail here, we just cap our maximum amount of
> grants we have up to 'i'.
> 
> 
>> +
>> +		granted_page = alloc_page(GFP_NOIO);
> 
> GFP_NORMAL
> 
>> +		if (!granted_page) {
>> +			kfree(gnt_list_entry);
>> +			goto out_of_memory;
>> +		}
>> +
>> +		gnt_list_entry->pfn = page_to_pfn(granted_page);
>> +		gnt_list_entry->gref = GRANT_INVALID_REF;
>> +		list_add(&gnt_list_entry->node, &info->persistent_gnts);
>> +		i++;
>> +	}
>> +
>> +	return 0;
>> +
>> +out_of_memory:
>> +	list_for_each_entry_safe(gnt_list_entry, n,
>> +	                         &info->persistent_gnts, node) {
>> +		list_del(&gnt_list_entry->node);
>> +		__free_page(pfn_to_page(gnt_list_entry->pfn));
>> +		kfree(gnt_list_entry);
>> +		i--;
>> +	}
>> +	BUG_ON(i != 0);
>> +	return -ENOMEM;
>> +}
>> +
>> +static struct grant *get_grant(grant_ref_t *gref_head,
>> +                               struct blkfront_info *info)
>> +{
>> +	struct grant *gnt_list_entry;
>> +	unsigned long buffer_mfn;
>> +
>> +	BUG_ON(list_empty(&info->persistent_gnts));
>> +	gnt_list_entry = list_first_entry(&info->persistent_gnts, struct grant,
>> +	                                  node);
>> +	list_del(&gnt_list_entry->node);
>> +
>> +	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
>> +		info->persistent_gnts_c--;
>> +		return gnt_list_entry;
>> +	}
>> +
>> +	/* Assign a gref to this page */
>> +	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
>> +	BUG_ON(gnt_list_entry->gref == -ENOSPC);
>> +	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> +	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
>> +	                                info->xbdev->otherend_id,
>> +	                                buffer_mfn, 0);
>> +	return gnt_list_entry;
>> +}
>> +
>>  static const char *op_name(int op)
>>  {
>>  	static const char *const names[] = {
>> @@ -306,7 +369,6 @@ static int blkif_queue_request(struct request *req)
>>  	 */
>>  	bool new_persistent_gnts;
>>  	grant_ref_t gref_head;
>> -	struct page *granted_page;
>>  	struct grant *gnt_list_entry = NULL;
>>  	struct scatterlist *sg;
>>  
>> @@ -370,42 +432,9 @@ static int blkif_queue_request(struct request *req)
>>  			fsect = sg->offset >> 9;
>>  			lsect = fsect + (sg->length >> 9) - 1;
>>  
>> -			if (info->persistent_gnts_c) {
>> -				BUG_ON(list_empty(&info->persistent_gnts));
>> -				gnt_list_entry = list_first_entry(
>> -				                      &info->persistent_gnts,
>> -				                      struct grant, node);
>> -				list_del(&gnt_list_entry->node);
>> -
>> -				ref = gnt_list_entry->gref;
>> -				buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> -				info->persistent_gnts_c--;
>> -			} else {
>> -				ref = gnttab_claim_grant_reference(&gref_head);
>> -				BUG_ON(ref == -ENOSPC);
>> -
>> -				gnt_list_entry =
>> -					kmalloc(sizeof(struct grant),
>> -							 GFP_ATOMIC);
>> -				if (!gnt_list_entry)
>> -					return -ENOMEM;
>> -
>> -				granted_page = alloc_page(GFP_ATOMIC);
>> -				if (!granted_page) {
>> -					kfree(gnt_list_entry);
>> -					return -ENOMEM;
>> -				}
>> -
>> -				gnt_list_entry->pfn =
>> -					page_to_pfn(granted_page);
>> -				gnt_list_entry->gref = ref;
>> -
>> -				buffer_mfn = pfn_to_mfn(page_to_pfn(
>> -								granted_page));
>> -				gnttab_grant_foreign_access_ref(ref,
>> -					info->xbdev->otherend_id,
>> -					buffer_mfn, 0);
>> -			}
>> +			gnt_list_entry = get_grant(&gref_head, info);
>> +			ref = gnt_list_entry->gref;
>> +			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>>  
>>  			info->shadow[id].grants_used[i] = gnt_list_entry;
>>  
>> @@ -803,17 +832,20 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>  		blk_stop_queue(info->rq);
>>  
>>  	/* Remove all persistent grants */
>> -	if (info->persistent_gnts_c) {
>> +	if (!list_empty(&info->persistent_gnts)) {
>>  		list_for_each_entry_safe(persistent_gnt, n,
>>  		                         &info->persistent_gnts, node) {
>>  			list_del(&persistent_gnt->node);
>> -			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
>> +			if (persistent_gnt->gref != GRANT_INVALID_REF) {
>> +				gnttab_end_foreign_access(persistent_gnt->gref,
>> +				                          0, 0UL);
>> +				info->persistent_gnts_c--;
>> +			}
>>  			__free_page(pfn_to_page(persistent_gnt->pfn));
>>  			kfree(persistent_gnt);
>> -			info->persistent_gnts_c--;
>>  		}
>> -		BUG_ON(info->persistent_gnts_c != 0);
>>  	}
>> +	BUG_ON(info->persistent_gnts_c != 0);
> 
> So if the guest _never_ sent any I/Os and just attached/detached the device - won't
> we fail here?.

persistent_gnts_c is initialized to 0, so if we don't perform IO it
should still be 0 at this point. Since we have just cleaned the
persistent grants lists this should always be 0 at this point.

>>  
>>  	/* No more gnttab callback work. */
>>  	gnttab_cancel_free_callback(&info->callback);
>> @@ -1088,6 +1120,12 @@ again:
>>  		goto destroy_blkring;
>>  	}
>>  
>> +	/* Allocate memory for grants */
>> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
>> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> +	if (err)
>> +		goto out;
> 
> That looks to be in the wrong function - talk_to_blkback function is
> to talk to the blkback. Not do initialization type operations.

Yes, I know it's not the best place to place it. It's here mainly
because that's the only function that gets called by both driver
initialization and resume.

Last patch moves this to a more sensible place.

> 
> Also I think this means that on resume - we would try to allocate
> again the grants?

Yes, grants are cleaned on resume and reallocated.

>> +
>>  	xenbus_switch_state(dev, XenbusStateInitialised);
>>  
>>  	return 0;
>> -- 
>> 1.7.7.5 (Apple Git-26)
>>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-05  8:11           ` Jan Beulich
@ 2013-03-05 14:16             ` Konrad Rzeszutek Wilk
  2013-03-05 17:00               ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 14:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: roger.pau, xen-devel, linux-kernel

On Tue, Mar 05, 2013 at 08:11:19AM +0000, Jan Beulich wrote:
> >>> On 04.03.13 at 21:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > <nods> 'op' sounds good. With a comment saying it can do all of the 
> > BLKIF_OPS_..
> > except the BLKIF_OP_INDIRECT one. Though one could in theory chain
> > it that way for fun.
> 
> In fact I'd like to exclude chaining as well as BLKIF_OP_DISCARD here.
> The former should - if useful for anything - be controlled by a
> separate feature flag, and the latter is plain pointless to indirect.
> And I reckon the same would apply to BLKIF_OP_FLUSH_DISKCACHE
> and BLKIF_OP_RESERVED_1 - i.e. it might be better to state that
> indirection is only permitted for normal I/O (read/write) ops.

<nods> That makes sense. And also of course the new BLKIF_OP should
be documented in the Xen tree as well.

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-03-05 11:04     ` Roger Pau Monné
@ 2013-03-05 14:18       ` Konrad Rzeszutek Wilk
  2013-03-05 16:30         ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 14:18 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: linux-kernel, xen-devel

On Tue, Mar 05, 2013 at 12:04:41PM +0100, Roger Pau Monné wrote:
> On 04/03/13 20:39, Konrad Rzeszutek Wilk wrote:
> > On Thu, Feb 28, 2013 at 11:28:47AM +0100, Roger Pau Monne wrote:
> >> This prevents us from having to call alloc_page while we are preparing
> >> the request. Since blkfront was calling alloc_page with a spinlock
> >> held we used GFP_ATOMIC, which can fail if we are requesting a lot of
> >> pages since it is using the emergency memory pools.
> >>
> >> Allocating all the pages at init prevents us from having to call
> >> alloc_page, thus preventing possible failures.
> >>
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> Cc: xen-devel@lists.xen.org
> >> ---
> >>  drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
> >>  1 files changed, 79 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 2e39eaf..5ba6b87 100644
> >> --- a/drivers/block/xen-blkfront.c
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
> >>  	return 0;
> >>  }
> >>  
> >> +static int fill_grant_buffer(struct blkfront_info *info, int num)
> >> +{
> >> +	struct page *granted_page;
> >> +	struct grant *gnt_list_entry, *n;
> >> +	int i = 0;
> >> +
> >> +	while(i < num) {
> >> +		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
> > 
> > GFP_NORMAL ?
> 
> drivers/block/xen-blkfront.c:175: error: ‘GFP_NORMAL’ undeclared (first
> use in this function)
> 
> Did you mean GFP_KERNEL? I think GFP_NOIO is more suitable, it can block
> but no IO will be performed.

<sigh> I meant GFP_KERNEL. Sorry about the incorrect advice. The GFP_KERNEL
is the more general purpose pool - is there a good reason to use _NOIO?
This is after all during initialization when there is no IO using this driver.

> 
> > 
> >> +		if (!gnt_list_entry)
> >> +			goto out_of_memory;
> > 
> > Hmm, I guess another patch could be to convert this to a fail-safe
> > mechanism. Meaning if we fail here, we just cap our maximum amount of
> > grants we have up to 'i'.
> > 
> > 
> >> +
> >> +		granted_page = alloc_page(GFP_NOIO);
> > 
> > GFP_NORMAL

GFP_KERNEL of course.
> > 
> >> +		if (!granted_page) {
> >> +			kfree(gnt_list_entry);
> >> +			goto out_of_memory;
> >> +		}
> >> +
> >> +		gnt_list_entry->pfn = page_to_pfn(granted_page);
> >> +		gnt_list_entry->gref = GRANT_INVALID_REF;
> >> +		list_add(&gnt_list_entry->node, &info->persistent_gnts);
> >> +		i++;
> >> +	}
> >> +
> >> +	return 0;
> >> +
> >> +out_of_memory:
> >> +	list_for_each_entry_safe(gnt_list_entry, n,
> >> +	                         &info->persistent_gnts, node) {
> >> +		list_del(&gnt_list_entry->node);
> >> +		__free_page(pfn_to_page(gnt_list_entry->pfn));
> >> +		kfree(gnt_list_entry);
> >> +		i--;
> >> +	}
> >> +	BUG_ON(i != 0);
> >> +	return -ENOMEM;
> >> +}
> >> +
> >> +static struct grant *get_grant(grant_ref_t *gref_head,
> >> +                               struct blkfront_info *info)
> >> +{
> >> +	struct grant *gnt_list_entry;
> >> +	unsigned long buffer_mfn;
> >> +
> >> +	BUG_ON(list_empty(&info->persistent_gnts));
> >> +	gnt_list_entry = list_first_entry(&info->persistent_gnts, struct grant,
> >> +	                                  node);
> >> +	list_del(&gnt_list_entry->node);
> >> +
> >> +	if (gnt_list_entry->gref != GRANT_INVALID_REF) {
> >> +		info->persistent_gnts_c--;
> >> +		return gnt_list_entry;
> >> +	}
> >> +
> >> +	/* Assign a gref to this page */
> >> +	gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
> >> +	BUG_ON(gnt_list_entry->gref == -ENOSPC);
> >> +	buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> >> +	gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
> >> +	                                info->xbdev->otherend_id,
> >> +	                                buffer_mfn, 0);
> >> +	return gnt_list_entry;
> >> +}
> >> +
> >>  static const char *op_name(int op)
> >>  {
> >>  	static const char *const names[] = {
> >> @@ -306,7 +369,6 @@ static int blkif_queue_request(struct request *req)
> >>  	 */
> >>  	bool new_persistent_gnts;
> >>  	grant_ref_t gref_head;
> >> -	struct page *granted_page;
> >>  	struct grant *gnt_list_entry = NULL;
> >>  	struct scatterlist *sg;
> >>  
> >> @@ -370,42 +432,9 @@ static int blkif_queue_request(struct request *req)
> >>  			fsect = sg->offset >> 9;
> >>  			lsect = fsect + (sg->length >> 9) - 1;
> >>  
> >> -			if (info->persistent_gnts_c) {
> >> -				BUG_ON(list_empty(&info->persistent_gnts));
> >> -				gnt_list_entry = list_first_entry(
> >> -				                      &info->persistent_gnts,
> >> -				                      struct grant, node);
> >> -				list_del(&gnt_list_entry->node);
> >> -
> >> -				ref = gnt_list_entry->gref;
> >> -				buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> >> -				info->persistent_gnts_c--;
> >> -			} else {
> >> -				ref = gnttab_claim_grant_reference(&gref_head);
> >> -				BUG_ON(ref == -ENOSPC);
> >> -
> >> -				gnt_list_entry =
> >> -					kmalloc(sizeof(struct grant),
> >> -							 GFP_ATOMIC);
> >> -				if (!gnt_list_entry)
> >> -					return -ENOMEM;
> >> -
> >> -				granted_page = alloc_page(GFP_ATOMIC);
> >> -				if (!granted_page) {
> >> -					kfree(gnt_list_entry);
> >> -					return -ENOMEM;
> >> -				}
> >> -
> >> -				gnt_list_entry->pfn =
> >> -					page_to_pfn(granted_page);
> >> -				gnt_list_entry->gref = ref;
> >> -
> >> -				buffer_mfn = pfn_to_mfn(page_to_pfn(
> >> -								granted_page));
> >> -				gnttab_grant_foreign_access_ref(ref,
> >> -					info->xbdev->otherend_id,
> >> -					buffer_mfn, 0);
> >> -			}
> >> +			gnt_list_entry = get_grant(&gref_head, info);
> >> +			ref = gnt_list_entry->gref;
> >> +			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> >>  
> >>  			info->shadow[id].grants_used[i] = gnt_list_entry;
> >>  
> >> @@ -803,17 +832,20 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >>  		blk_stop_queue(info->rq);
> >>  
> >>  	/* Remove all persistent grants */
> >> -	if (info->persistent_gnts_c) {
> >> +	if (!list_empty(&info->persistent_gnts)) {
> >>  		list_for_each_entry_safe(persistent_gnt, n,
> >>  		                         &info->persistent_gnts, node) {
> >>  			list_del(&persistent_gnt->node);
> >> -			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> >> +			if (persistent_gnt->gref != GRANT_INVALID_REF) {
> >> +				gnttab_end_foreign_access(persistent_gnt->gref,
> >> +				                          0, 0UL);
> >> +				info->persistent_gnts_c--;
> >> +			}
> >>  			__free_page(pfn_to_page(persistent_gnt->pfn));
> >>  			kfree(persistent_gnt);
> >> -			info->persistent_gnts_c--;
> >>  		}
> >> -		BUG_ON(info->persistent_gnts_c != 0);
> >>  	}
> >> +	BUG_ON(info->persistent_gnts_c != 0);
> > 
> > So if the guest _never_ sent any I/Os and just attached/detached the device - won't
> > we fail here?.
> 
> persistent_gnts_c is initialized to 0, so if we don't perform IO it
> should still be 0 at this point. Since we have just cleaned the
> persistent grants lists this should always be 0 at this point.

OK.
> 
> >>  
> >>  	/* No more gnttab callback work. */
> >>  	gnttab_cancel_free_callback(&info->callback);
> >> @@ -1088,6 +1120,12 @@ again:
> >>  		goto destroy_blkring;
> >>  	}
> >>  
> >> +	/* Allocate memory for grants */
> >> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
> >> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> +	if (err)
> >> +		goto out;
> > 
> > That looks to be in the wrong function - talk_to_blkback function is
> > to talk to the blkback. Not do initialization type operations.
> 
> Yes, I know it's not the best place to place it. It's here mainly
> because that's the only function that gets called by both driver
> initialization and resume.
> 
> Last patch moves this to a more sensible place.

Let's make it part of this patch from the start. We still have two
months of time before the next merge window opens - so we have
time to make it nice and clean.

> 
> > 
> > Also I think this means that on resume - we would try to allocate
> > again the grants?
> 
> Yes, grants are cleaned on resume and reallocated.
> 
> >> +
> >>  	xenbus_switch_state(dev, XenbusStateInitialised);
> >>  
> >>  	return 0;
> >> -- 
> >> 1.7.7.5 (Apple Git-26)
> >>
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-03-05 14:18       ` Konrad Rzeszutek Wilk
@ 2013-03-05 16:30         ` Roger Pau Monné
  2013-03-05 21:53           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 16:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 05/03/13 15:18, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 05, 2013 at 12:04:41PM +0100, Roger Pau Monné wrote:
>> On 04/03/13 20:39, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Feb 28, 2013 at 11:28:47AM +0100, Roger Pau Monne wrote:
>>>> This prevents us from having to call alloc_page while we are preparing
>>>> the request. Since blkfront was calling alloc_page with a spinlock
>>>> held we used GFP_ATOMIC, which can fail if we are requesting a lot of
>>>> pages since it is using the emergency memory pools.
>>>>
>>>> Allocating all the pages at init prevents us from having to call
>>>> alloc_page, thus preventing possible failures.
>>>>
>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>> Cc: xen-devel@lists.xen.org
>>>> ---
>>>>  drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
>>>>  1 files changed, 79 insertions(+), 41 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>>> index 2e39eaf..5ba6b87 100644
>>>> --- a/drivers/block/xen-blkfront.c
>>>> +++ b/drivers/block/xen-blkfront.c
>>>> @@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +static int fill_grant_buffer(struct blkfront_info *info, int num)
>>>> +{
>>>> +	struct page *granted_page;
>>>> +	struct grant *gnt_list_entry, *n;
>>>> +	int i = 0;
>>>> +
>>>> +	while(i < num) {
>>>> +		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
>>>
>>> GFP_NORMAL ?
>>
>> drivers/block/xen-blkfront.c:175: error: ‘GFP_NORMAL’ undeclared (first
>> use in this function)
>>
>> Did you mean GFP_KERNEL? I think GFP_NOIO is more suitable, it can block
>> but no IO will be performed.
> 
> <sigh> I meant GFP_KERNEL. Sorry about the incorrect advice. The GFP_KERNEL
> is the more general purpose pool - is there a good reason to use _NOIO?
> This is after all during initialization when there is no IO using this driver.

We are already allocating memory using GFP_NOIO during setup
(setup_blkring and blkif_recover). The only reason I can think of where
_NOIO would help is if the kernel tries to swap memory pages to disk,
but if it has to swap pages at this point we probably won't be able to
set up blkfront correctly anyway, with either _NOIO or _KERNEL.

>>>>  
>>>>  	/* No more gnttab callback work. */
>>>>  	gnttab_cancel_free_callback(&info->callback);
>>>> @@ -1088,6 +1120,12 @@ again:
>>>>  		goto destroy_blkring;
>>>>  	}
>>>>  
>>>> +	/* Allocate memory for grants */
>>>> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
>>>> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>>> +	if (err)
>>>> +		goto out;
>>>
>>> That looks to be in the wrong function - talk_to_blkback function is
>>> to talk to the blkback. Not do initialization type operations.
>>
>> Yes, I know it's not the best place to place it. It's here mainly
>> because that's the only function that gets called by both driver
>> initialization and resume.
>>
>> Last patch moves this to a more sensible place.
> 
> Let's make it part of this patch from the start. We still have two
> months of time before the next merge window opens - so we have
> time to make it nice and clean.

I'm moving this to blkfront_setup_indirect in a later patch (because
this function doesn't yet exist at this point), but I can put it in a
more suitable place in this patch.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-05 14:16             ` Konrad Rzeszutek Wilk
@ 2013-03-05 17:00               ` Roger Pau Monné
  2013-03-05 21:45                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 17:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jan Beulich, xen-devel, linux-kernel

On 05/03/13 15:16, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 05, 2013 at 08:11:19AM +0000, Jan Beulich wrote:
>>>>> On 04.03.13 at 21:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>> <nods> 'op' sounds good. With a comment saying it can do all of the 
>>> BLKIF_OPS_..
>>> except the BLKIF_OP_INDIRECT one. Thought one could in theory chain
>>> it that way for fun.
>>
>> In fact I'd like to exclude chaining as well as BLKIF_OP_DISCARD here.
>> The former should - if useful for anything - be controlled by a
>> separate feature flag, and the latter is plain pointless to indirect.
>> And I reckon the same would apply to BLKIF_OP_FLUSH_DISKCACHE
>> and BLKIF_OP_RESERVED_1 - i.e. it might be better to state that
>> indirection is only permitted for normal I/O (read/write) ops.
> 
> <nods> That makes sense. And also of course the new BLKIF_OP should
> be documented in the Xen tree as well.

The only ops that can be done indirectly are _READ, _WRITE and
_BARRIER/_FLUSH. From the implementation in blkfront it seems like
_FLUSH/_BARRIER requests can indeed contain segments, but I haven't been
able to spot any _FLUSH op with segments on it. Can you confirm FLUSH
requests never contain bios?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 01/12] xen-blkback: don't store dev_bus_addr
  2013-03-05  8:06       ` Jan Beulich
@ 2013-03-05 17:02         ` Roger Pau Monné
  0 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 17:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

On 05/03/13 09:06, Jan Beulich wrote:
>>>> On 04.03.13 at 18:19, Roger Pau Monné<roger.pau@citrix.com> wrote:
>> On 28/02/13 11:58, Jan Beulich wrote:
>>>>>> On 28.02.13 at 11:28, Roger Pau Monne <roger.pau@citrix.com> wrote:
>>> And then the biolist[] array really can be folded into a union
>>> with the remaining seg[] one, as their usage scopes are easily
>>> separable.
>>
>> Could we leave that for a further patch? I would like to avoid messing
>> any more with blkback, as I'm already touching a lot of bits with this
>> patch series.
> 
> Fine by me, but ...
> 
>>>> @@ -631,7 +629,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>>>>  			if (ret)
>>>>  				continue;
>>>>  
>>>> -			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>>>> +			seg[i].buf = pfn_to_mfn(page_to_pfn(
>>>> +				persistent_gnts[i]->page)) << PAGE_SHIFT |
>>>
>>> So why do you do this? The only reader masks the field with
>>> ~PAGE_MASK anyway.
>>
>> Yes, I only need to store first_sect.
> 
> ... as you're touching this code anyway, and as it'll make the
> code as well as the patch smaller, could you at least drop this
> pointless storing of the page address (which otherwise I'd ask
> you to properly parenthesize anyway)?
> 
> And iirc once that's dropped, the storing of first_sect ends up
> being identical between the if and else bodies, so it could be
> pulled out (further reducing code size, albeit at the price of a
> marginally bigger patch).

Yes, I've already done that, thanks for the suggestion.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-04 20:41   ` Konrad Rzeszutek Wilk
@ 2013-03-05 17:07     ` Roger Pau Monné
  2013-03-05 21:46       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 17:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 04/03/13 21:41, Konrad Rzeszutek Wilk wrote:
> On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
>> Indirect descriptors introduce a new block operation
>> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
>> in the request. This grant references are filled with arrays of
>> blkif_request_segment_aligned, this way we can send more segments in a
>> request.
>>
>> The proposed implementation sets the maximum number of indirect grefs
>> (frames filled with blkif_request_segment_aligned) to 256 in the
>> backend and 64 in the frontend. The value in the frontend has been
>> chosen experimentally, and the backend value has been set to a sane
>> value that allows expanding the maximum number of indirect descriptors
>> in the frontend if needed.
> 
> So we are still using a similar format of the form:
> 
> <gref, first_sec, last_sect, pad>, etc.
> 
> Why not utilize a layout that fits with the bio sg? That way
> we might not even have to do the bio_alloc call and instead can
> set up a bio (and bio-list) with the appropriate offsets/list?
> 
> Meaning that the format of the indirect descriptors is:
> 
> <gref, offset, next_index, pad>
> 
> We already know what the first_sec and last_sect are - they
> are basically: sector_number +  nr_segments * (whatever the sector size is) + offset

This will of course be suitable for Linux, but what about other OSes? I
know they support the traditional first_sect, last_sect (because it's
already implemented), but I don't know how much work it will be for them
to adopt this. If we have to make such a change I will first have to
check that other frontends/backends can handle it easily as well; I
wouldn't like to simplify this for Linux by making it more difficult to
implement in other OSes...


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants
  2013-03-04 20:10   ` Konrad Rzeszutek Wilk
@ 2013-03-05 18:10     ` Roger Pau Monné
  2013-03-05 21:49       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-05 18:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 04/03/13 21:10, Konrad Rzeszutek Wilk wrote:
> On Thu, Feb 28, 2013 at 11:28:49AM +0100, Roger Pau Monne wrote:
>> This mechanism allows blkback to change the number of grants
>> persistently mapped at run time.
>>
>> The algorithm uses a simple LRU mechanism that removes (if needed) the
>> persistent grants that have not been used since the last LRU run, or
>> if all grants have been used it removes the first grants in the list
>> (that are not in use).
>>
>> The algorithm has several parameters that can be tuned by the user
>> from sysfs:
>>
>>  * max_persistent_grants: maximum number of grants that will be
>>    persistently mapped.
>>  * lru_interval: minimum interval (in ms) at which the LRU should be
>>    run
>>  * lru_num_clean: number of persistent grants to remove when executing
>>    the LRU.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Cc: xen-devel@lists.xen.org
>> ---
>>  drivers/block/xen-blkback/blkback.c |  207 +++++++++++++++++++++++++++--------
>>  drivers/block/xen-blkback/common.h  |    4 +
>>  drivers/block/xen-blkback/xenbus.c  |    1 +
>>  3 files changed, 166 insertions(+), 46 deletions(-)
> 
> You also should add a Documentation/sysfs/ABI/stable/sysfs-bus-xen-backend

OK

>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index 415a0c7..c14b736 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -63,6 +63,44 @@ static int xen_blkif_reqs = 64;
>>  module_param_named(reqs, xen_blkif_reqs, int, 0);
>>  MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
>>
>> +/*
>> + * Maximum number of grants to map persistently in blkback. For maximum
>> + * performance this should be the total numbers of grants that can be used
>> + * to fill the ring, but since this might become too high, specially with
>> + * the use of indirect descriptors, we set it to a value that provides good
>> + * performance without using too much memory.
>> + *
>> + * When the list of persistent grants is full we clean it using a LRU
>> + * algorithm.
>> + */
>> +
>> +static int xen_blkif_max_pgrants = 352;
> 
> And a little blurb saying why 352.

Yes, this is (as you probably already realized) RING_SIZE *
BLKIF_MAX_SEGMENTS_PER_REQUEST, i.e. 32 * 11 = 352.

> 
>> +module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
>> +MODULE_PARM_DESC(max_persistent_grants,
>> +                 "Maximum number of grants to map persistently");
>> +
>> +/*
>> + * The LRU mechanism to clean the lists of persistent grants needs to
>> + * be executed periodically. The time interval between consecutive executions
>> + * of the purge mechanism is set in ms.
>> + */
>> +
>> +static int xen_blkif_lru_interval = 100;
> 
> So every second? What is the benefit of having the user modify this? Would
> it better if there was an watermark system in xen-blkfront to automatically
> figure this out? (This could be a TODO of course)

Every 100ms, so every 0.1 seconds. This can have an impact on
performance as implemented right now (if we move the purge to a separate
thread, it's not going to have such an impact), but anyway I feel we can
let the user tune it.

>> +module_param_named(lru_interval, xen_blkif_lru_interval, int, 0644);
>> +MODULE_PARM_DESC(lru_interval,
>> +"Execution interval (in ms) of the LRU mechanism to clean the list of persistent grants");
>> +
>> +/*
>> + * When the persistent grants list is full we will remove unused grants
>> + * from the list. The number of grants to be removed at each LRU execution
>> + * can be set dynamically.
>> + */
>> +
>> +static int xen_blkif_lru_num_clean = BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
>> +MODULE_PARM_DESC(lru_num_clean,
>> +"Number of persistent grants to unmap when the list is full");
> 
> Again, what does that mean to the system admin? Why would they need to modify
> the contents of that value?
> 

Well, if you set the maximum number of grants to 1024 you might want to
increase this to 64, while if you set the maximum number of grants to 10
you may wish to set this to 1, so I think it is indeed relevant for
system admins.

> Now if this is a debug related one for developing, then this could all be
> done in DebugFS.
> 
>> +
>>  /* Run-time switchable: /sys/module/blkback/parameters/ */
>>  static unsigned int log_stats;
>>  module_param(log_stats, int, 0644);
>> @@ -81,7 +119,7 @@ struct pending_req {
>>       unsigned short          operation;
>>       int                     status;
>>       struct list_head        free_list;
>> -     DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> +     struct persistent_gnt   *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>  };
>>
>>  #define BLKBACK_INVALID_HANDLE (~0)
>> @@ -102,36 +140,6 @@ struct xen_blkbk {
>>  static struct xen_blkbk *blkbk;
>>
>>  /*
>> - * Maximum number of grant pages that can be mapped in blkback.
>> - * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
>> - * pages that blkback will persistently map.
>> - * Currently, this is:
>> - * RING_SIZE = 32 (for all known ring types)
>> - * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
>> - * sizeof(struct persistent_gnt) = 48
>> - * So the maximum memory used to store the grants is:
>> - * 32 * 11 * 48 = 16896 bytes
>> - */
>> -static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
>> -{
>> -     switch (protocol) {
>> -     case BLKIF_PROTOCOL_NATIVE:
>> -             return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
>> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -     case BLKIF_PROTOCOL_X86_32:
>> -             return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
>> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -     case BLKIF_PROTOCOL_X86_64:
>> -             return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
>> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -     default:
>> -             BUG();
>> -     }
>> -     return 0;
>> -}
>> -
>> -
>> -/*
>>   * Little helpful macro to figure out the index and virtual address of the
>>   * pending_pages[..]. For each 'pending_req' we have have up to
>>   * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
>> @@ -251,6 +259,76 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
>>       BUG_ON(num != 0);
>>  }
>>
>> +static int purge_persistent_gnt(struct rb_root *root, int num)
>> +{
>> +     struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +     struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +     struct persistent_gnt *persistent_gnt;
>> +     struct rb_node *n;
>> +     int ret, segs_to_unmap = 0;
>> +     int requested_num = num;
>> +     int preserve_used = 1;
> 
> Boolean? And perhaps 'scan_dirty' ?

Sure

> 
> 
>> +
>> +     pr_debug("Requested the purge of %d persistent grants\n", num);
>> +
>> +purge_list:
> 
> This could be written a bit differently to also run outside the xen_blkif_schedule
> (so a new thread). This would require using the lock mechanism and converting
> this big loop to two smaller loops:
>  1) - one quick that holds the lock - to take the items of the list,
>  2) second one to do the grant_set_unmap_op operations and all the heavy
>     free_xenballooned_pages call.

Yes, I could add a list_head to persistent_gnt, so we can take them out
of the red-black tree and queue them in a list to be processed (unmap +
free) after we have looped through the list, without holding the lock.
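Roughly, that two-phase shape looks like this (a stand-alone sketch with
a toy list and a pthread mutex, nothing blkback-specific; the real code
would walk the red-black tree and call gnttab_unmap_refs and
free_xenballooned_pages in the second phase):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct gnt {
	int in_use;
	struct gnt *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct gnt *persistent;	/* stands in for the rb-tree */

static void purge(void)
{
	struct gnt *victims = NULL, **pp, *g;

	/* Phase 1: short critical section, just unlink the candidates. */
	pthread_mutex_lock(&lock);
	for (pp = &persistent; (g = *pp) != NULL; ) {
		if (!g->in_use) {
			*pp = g->next;
			g->next = victims;
			victims = g;
		} else {
			pp = &g->next;
		}
	}
	pthread_mutex_unlock(&lock);

	/* Phase 2: the expensive part (unmap + free) runs unlocked. */
	while ((g = victims) != NULL) {
		victims = g->next;
		free(g);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 5; i++) {
		struct gnt *g = calloc(1, sizeof(*g));
		g->in_use = i & 1;
		g->next = persistent;
		persistent = g;
	}
	purge();
	for (struct gnt *g = persistent; g; g = g->next)
		printf("kept an in-use grant\n");
	return 0;
}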

> 
> .. As this function ends up (presumably?) causing xen_blkif_schedule to be doing
> this for some time every second. Regardless of how utilized the ring is - so
> if we are 100% busy - we should not need to call this function. But if we do,
> then we end up walking the persistent_gnt twice - once with preserve_used set
> to true, and the other with it set to false.
> 
> We don't really want that - so is there a way for xen_blkif_schedule to
> do a quick determination of this caliber:
> 
> 
>         if (RING_HAS_UNCONSUMED_REQUESTS(x) >= some_value)
>                 wait_up(blkif->purgarator)

It's not possible to tell if all grants will be in use just by looking
at the number of active requests, since all requests might just be using
one segment, and thus the list of persistent grants could be purged
without problems. We could keep a count of the number of active grants
at each time and use that to check if we can kick the purge or not.

if (grants_in_use > (persistent_gnt_c - num_purge))
	wait(...)

> And the thread would just sit there until kicked in action?

And when a request frees some grants it could be kicked back to action.

> 
> 
>> +     foreach_grant_safe(persistent_gnt, n, root, node) {
>> +             BUG_ON(persistent_gnt->handle ==
>> +                     BLKBACK_INVALID_HANDLE);
>> +
>> +             if (persistent_gnt->flags & PERSISTENT_GNT_ACTIVE)
>> +                     continue;
>> +             if (preserve_used &&
>> +                 (persistent_gnt->flags & PERSISTENT_GNT_USED))
> 
> Is that similar to DIRTY on pagetables?

Well, not exactly. This is a very simple LRU scheme: I just mark the
grant as "USED", so when we execute the LRU we look for grants not
marked as "USED", and after cleaning unused grants we clear the "USED"
bit from all remaining persistent grants. Rudimentary, but it still
provides a good hit ratio.
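In other words it behaves like a one-bit clock algorithm. A stand-alone
sketch of that policy, detached from the blkback data structures (names
and sizes are made up):

#include <stdio.h>
#include <stdbool.h>

#define NGNTS 8

struct gnt {
	bool present;	/* still on the persistent list               */
	bool active;	/* mapped into an in-flight request right now */
	bool used;	/* touched since the last purge run           */
};

static struct gnt gnts[NGNTS];

static int purge(int want)
{
	int removed = 0, pass, i;

	/* Pass 0 only evicts entries not used since the last run;
	 * pass 1 evicts anything that is not currently active. */
	for (pass = 0; pass < 2 && removed < want; pass++)
		for (i = 0; i < NGNTS && removed < want; i++) {
			if (!gnts[i].present || gnts[i].active)
				continue;
			if (pass == 0 && gnts[i].used)
				continue;
			gnts[i].present = false;
			removed++;
		}

	/* Forget the history so the next interval starts clean. */
	for (i = 0; i < NGNTS; i++)
		gnts[i].used = false;
	return removed;
}

int main(void)
{
	int i;

	for (i = 0; i < NGNTS; i++)
		gnts[i].present = true;
	gnts[1].used = true;	/* completion marks the grant as used */
	gnts[3].used = true;
	gnts[5].active = true;	/* still referenced by a request      */
	printf("purged %d of 4 requested\n", purge(4));
	return 0;
}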

> 
>> +                     continue;
>> +
>> +             gnttab_set_unmap_op(&unmap[segs_to_unmap],
>> +                     (unsigned long) pfn_to_kaddr(page_to_pfn(
>> +                             persistent_gnt->page)),
>> +                     GNTMAP_host_map,
>> +                     persistent_gnt->handle);
>> +
>> +             pages[segs_to_unmap] = persistent_gnt->page;
>> +
>> +             if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> +                     ret = gnttab_unmap_refs(unmap, NULL, pages,
>> +                             segs_to_unmap);
>> +                     BUG_ON(ret);
>> +                     free_xenballooned_pages(segs_to_unmap, pages);
>> +                     segs_to_unmap = 0;
>> +             }
>> +
>> +             rb_erase(&persistent_gnt->node, root);
>> +             kfree(persistent_gnt);
>> +             if (--num == 0)
>> +                     goto finished;
>> +     }
>> +     /*
>> +      * If we get here it means we also need to start cleaning
>> +      * grants that were used since last purge in order to cope
>> +      * with the requested num
>> +      */
>> +     if (preserve_used) {
>> +             pr_debug("Still missing %d purged frames\n", num);
>> +             preserve_used = 0;
>> +             goto purge_list;
>> +     }
>> +finished:
>> +     if (segs_to_unmap > 0) {
>> +             ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
>> +             BUG_ON(ret);
>> +             free_xenballooned_pages(segs_to_unmap, pages);
>> +     }
>> +     /* Finally remove the "used" flag from all the persistent grants */
>> +     foreach_grant_safe(persistent_gnt, n, root, node) {
>> +             BUG_ON(persistent_gnt->handle ==
>> +                     BLKBACK_INVALID_HANDLE);
>> +             persistent_gnt->flags &= ~PERSISTENT_GNT_USED;
>> +     }
>> +     pr_debug("Purged %d/%d\n", (requested_num - num), requested_num);
>> +     return (requested_num - num);
>> +}
>> +
>>  /*
>>   * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>>   */
>> @@ -397,6 +475,8 @@ int xen_blkif_schedule(void *arg)
>>  {
>>       struct xen_blkif *blkif = arg;
>>       struct xen_vbd *vbd = &blkif->vbd;
>> +     int rq_purge, purged;
>> +     unsigned long timeout;
>>
>>       xen_blkif_get(blkif);
>>
>> @@ -406,13 +486,21 @@ int xen_blkif_schedule(void *arg)
>>               if (unlikely(vbd->size != vbd_sz(vbd)))
>>                       xen_vbd_resize(blkif);
>>
>> -             wait_event_interruptible(
>> +             timeout = msecs_to_jiffies(xen_blkif_lru_interval);
>> +
>> +             timeout = wait_event_interruptible_timeout(
>>                       blkif->wq,
>> -                     blkif->waiting_reqs || kthread_should_stop());
>> -             wait_event_interruptible(
>> +                     blkif->waiting_reqs || kthread_should_stop(),
>> +                     timeout);
>> +             if (timeout == 0)
>> +                     goto purge_gnt_list;
>> +             timeout = wait_event_interruptible_timeout(
>>                       blkbk->pending_free_wq,
>>                       !list_empty(&blkbk->pending_free) ||
>> -                     kthread_should_stop());
>> +                     kthread_should_stop(),
>> +                     timeout);
>> +             if (timeout == 0)
>> +                     goto purge_gnt_list;
>>
>>               blkif->waiting_reqs = 0;
>>               smp_mb(); /* clear flag *before* checking for work */
>> @@ -420,6 +508,32 @@ int xen_blkif_schedule(void *arg)
>>               if (do_block_io_op(blkif))
>>                       blkif->waiting_reqs = 1;
>>
>> +purge_gnt_list:
>> +             if (blkif->vbd.feature_gnt_persistent &&
>> +                 time_after(jiffies, blkif->next_lru)) {
>> +                     /* Clean the list of persistent grants */
>> +                     if (blkif->persistent_gnt_c > xen_blkif_max_pgrants ||
>> +                         (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
>> +                          blkif->vbd.overflow_max_grants)) {
>> +                             rq_purge = blkif->persistent_gnt_c -
>> +                                        xen_blkif_max_pgrants +
>> +                                        xen_blkif_lru_num_clean;
> 
> You can make this more than 80 lines.

OK, good to know :)

>> +                             rq_purge = rq_purge > blkif->persistent_gnt_c ?
>> +                                        blkif->persistent_gnt_c : rq_purge;
>> +                             purged = purge_persistent_gnt(
>> +                                       &blkif->persistent_gnts, rq_purge);
>> +                             if (purged != rq_purge)
>> +                                     pr_debug(DRV_PFX " unable to meet persistent grants purge requirements for device %#x, domain %u, requested %d done %d\n",
>> +                                              blkif->domid,
>> +                                              blkif->vbd.handle,
>> +                                              rq_purge, purged);
>> +                             blkif->persistent_gnt_c -= purged;
>> +                             blkif->vbd.overflow_max_grants = 0;
>> +                     }
>> +                     blkif->next_lru = jiffies +
>> +                             msecs_to_jiffies(xen_blkif_lru_interval);
>> +             }
>> +
>>               if (log_stats && time_after(jiffies, blkif->st_print))
>>                       print_stats(blkif);
>>       }
>> @@ -453,13 +567,18 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>  {
>>       struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>       struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +     struct persistent_gnt *persistent_gnt;
>>       unsigned int i, invcount = 0;
>>       grant_handle_t handle;
>>       int ret;
>>
>>       for (i = 0; i < req->nr_pages; i++) {
>> -             if (!test_bit(i, req->unmap_seg))
>> +             if (req->persistent_gnts[i] != NULL) {
>> +                     persistent_gnt = req->persistent_gnts[i];
>> +                     persistent_gnt->flags |= PERSISTENT_GNT_USED;
>> +                     persistent_gnt->flags &= ~PERSISTENT_GNT_ACTIVE;
>>                       continue;
>> +             }
>>               handle = pending_handle(req, i);
>>               if (handle == BLKBACK_INVALID_HANDLE)
>>                       continue;
>> @@ -480,8 +599,8 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                        struct page *pages[])
>>  {
>>       struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> -     struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>       struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +     struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
>>       struct persistent_gnt *persistent_gnt = NULL;
>>       struct xen_blkif *blkif = pending_req->blkif;
>>       phys_addr_t addr = 0;
>> @@ -494,9 +613,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>>
>>       use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>>
>> -     BUG_ON(blkif->persistent_gnt_c >
>> -                max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>> -
>>       /*
>>        * Fill out preq.nr_sects with proper amount of sectors, and setup
>>        * assign map[..] with the PFN of the page in our domain with the
>> @@ -516,9 +632,9 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                        * the grant is already mapped
>>                        */
>>                       new_map = false;
>> +                     persistent_gnt->flags |= PERSISTENT_GNT_ACTIVE;
>>               } else if (use_persistent_gnts &&
>> -                        blkif->persistent_gnt_c <
>> -                        max_mapped_grant_pages(blkif->blk_protocol)) {
>> +                        blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
>>                       /*
>>                        * We are using persistent grants, the grant is
>>                        * not mapped but we have room for it
>> @@ -536,6 +652,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                       }
>>                       persistent_gnt->gnt = req->u.rw.seg[i].gref;
>>                       persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
>> +                     persistent_gnt->flags = PERSISTENT_GNT_ACTIVE;
>>
>>                       pages_to_gnt[segs_to_map] =
>>                               persistent_gnt->page;
>> @@ -547,7 +664,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                       blkif->persistent_gnt_c++;
>>                       pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>>                                persistent_gnt->gnt, blkif->persistent_gnt_c,
>> -                              max_mapped_grant_pages(blkif->blk_protocol));
>> +                              xen_blkif_max_pgrants);
>>               } else {
>>                       /*
>>                        * We are either using persistent grants and
>> @@ -557,7 +674,7 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                       if (use_persistent_gnts &&
>>                               !blkif->vbd.overflow_max_grants) {
>>                               blkif->vbd.overflow_max_grants = 1;
>> -                             pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>> +                             pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>>                                        blkif->domid, blkif->vbd.handle);
>>                       }
>>                       new_map = true;
>> @@ -595,7 +712,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>>        * so that when we access vaddr(pending_req,i) it has the contents of
>>        * the page from the other domain.
>>        */
>> -     bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>       for (i = 0, j = 0; i < nseg; i++) {
>>               if (!persistent_gnts[i] ||
>>                   persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
>> @@ -634,7 +750,6 @@ static int xen_blkbk_map(struct blkif_request *req,
>>                               (req->u.rw.seg[i].first_sect << 9);
>>               } else {
>>                       pending_handle(pending_req, i) = map[j].handle;
>> -                     bitmap_set(pending_req->unmap_seg, i, 1);
>>
>>                       if (ret) {
>>                               j++;
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
>> index f338f8a..bd44d75 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -167,11 +167,14 @@ struct xen_vbd {
>>
>>  struct backend_info;
>>
>> +#define PERSISTENT_GNT_ACTIVE        0x1
>> +#define PERSISTENT_GNT_USED          0x2
>>
>>  struct persistent_gnt {
>>       struct page *page;
>>       grant_ref_t gnt;
>>       grant_handle_t handle;
>> +     uint8_t flags;
>>       struct rb_node node;
>>  };
>>
>> @@ -204,6 +207,7 @@ struct xen_blkif {
>>       /* tree to store persistent grants */
>>       struct rb_root          persistent_gnts;
>>       unsigned int            persistent_gnt_c;
>> +     unsigned long           next_lru;
>>
>>       /* statistics */
>>       unsigned long           st_print;
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
>> index 5e237f6..abb399a 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -116,6 +116,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>>       init_completion(&blkif->drain_complete);
>>       atomic_set(&blkif->drain, 0);
>>       blkif->st_print = jiffies;
>> +     blkif->next_lru = jiffies;
>>       init_waitqueue_head(&blkif->waiting_to_free);
>>       blkif->persistent_gnts.rb_node = NULL;
>>
>> --
>> 1.7.7.5 (Apple Git-26)
>>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Xen-devel] [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-05 17:00               ` Roger Pau Monné
@ 2013-03-05 21:45                 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 21:45 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Jan Beulich, xen-devel, linux-kernel

On Tue, Mar 05, 2013 at 06:00:51PM +0100, Roger Pau Monné wrote:
> On 05/03/13 15:16, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 05, 2013 at 08:11:19AM +0000, Jan Beulich wrote:
> >>>>> On 04.03.13 at 21:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >>> <nods> 'op' sounds good. With a comment saying it can do all of the 
> >>> BLKIF_OPS_..
> >>> except the BLKIF_OP_INDIRECT one. Thought one could in theory chain
> >>> it that way for fun.
> >>
> >> In fact I'd like to exclude chaining as well as BLKIF_OP_DISCARD here.
> >> The former should - if useful for anything - be controlled by a
> >> separate feature flag, and the latter is plain pointless to indirect.
> >> And I reckon the same would apply to BLKIF_OP_FLUSH_DISKCACHE
> >> and BLKIF_OP_RESERVED_1 - i.e. it might be better to state that
> >> indirection is only permitted for normal I/O (read/write) ops.
> > 
> > <nods> That makes sense. And also of course the new BLKIF_OP should
> > be documented in the Xen tree as well.
> 
> The only ops that can be done indirectly are _READ, _WRITE and
> _BARRIER/_FLUSH. From the implementation in blkfront it seems like
> _FLUSH/_BARRIER requests can indeed contain segments, but I haven't been
> able to spot any _FLUSH op with segments on it. Can you confirm FLUSH
> requests never contain bios?

Not FLUSH per se. But FUA should be able to provide data along with the
flush semantics. Except we don't support FUA, so this is irrelevant
 - unless in the future we want to introduce FUA or a WRITE with some
extra flags.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-05 17:07     ` Roger Pau Monné
@ 2013-03-05 21:46       ` Konrad Rzeszutek Wilk
  2013-03-08 17:07         ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 21:46 UTC (permalink / raw)
  To: Roger Pau Monné, james.harper; +Cc: linux-kernel, xen-devel

On Tue, Mar 05, 2013 at 06:07:57PM +0100, Roger Pau Monné wrote:
> On 04/03/13 21:41, Konrad Rzeszutek Wilk wrote:
> > On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
> >> Indirect descriptors introduce a new block operation
> >> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> >> in the request. This grant references are filled with arrays of
> >> blkif_request_segment_aligned, this way we can send more segments in a
> >> request.
> >>
> >> The proposed implementation sets the maximum number of indirect grefs
> >> (frames filled with blkif_request_segment_aligned) to 256 in the
> >> backend and 64 in the frontend. The value in the frontend has been
> >> chosen experimentally, and the backend value has been set to a sane
> >> value that allows expanding the maximum number of indirect descriptors
> >> in the frontend if needed.
> > 
> > So we are still using a similar format of the form:
> > 
> > <gref, first_sec, last_sect, pad>, etc.
> > 
> > Why not utilize a layout that fits with the bio sg? That way
> > we might not even have to do the bio_alloc call and instead can
> > setup an bio (and bio-list) with the appropiate offsets/list?
> > 
> > Meaning that the format of the indirect descriptors is:
> > 
> > <gref, offset, next_index, pad>
> > 
> > We already know what the first_sec and last_sect are - they
> > are basically: sector_number +  nr_segments * (whatever the sector size is) + offset
> 
> This will of course be suitable for Linux, but what about other OSes, I
> know they support the traditional first_sec, last_sect (because it's
> already implemented), but I don't know how much work will it be for them
> to adopt this. If we have to do such a change I will have to check first
> that other frontend/backend can handle this easily also, I wouldn't like
> to simplify this for Linux by making it more difficult to implement in
> other OSes...

I would think that most OSes use the same framework. The ones of
notable interest are Windows and BSD. Let's CC James here.

> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants
  2013-03-05 18:10     ` Roger Pau Monné
@ 2013-03-05 21:49       ` Konrad Rzeszutek Wilk
  2013-03-18 17:00         ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 21:49 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: linux-kernel, xen-devel

On Tue, Mar 05, 2013 at 07:10:04PM +0100, Roger Pau Monné wrote:
> On 04/03/13 21:10, Konrad Rzeszutek Wilk wrote:
> > On Thu, Feb 28, 2013 at 11:28:49AM +0100, Roger Pau Monne wrote:
> >> This mechanism allows blkback to change the number of grants
> >> persistently mapped at run time.
> >>
> >> The algorithm uses a simple LRU mechanism that removes (if needed) the
> >> persistent grants that have not been used since the last LRU run, or
> >> if all grants have been used it removes the first grants in the list
> >> (that are not in use).
> >>
> >> The algorithm has several parameters that can be tuned by the user
> >> from sysfs:
> >>
> >>  * max_persistent_grants: maximum number of grants that will be
> >>    persistently mapped.
> >>  * lru_interval: minimum interval (in ms) at which the LRU should be
> >>    run
> >>  * lru_num_clean: number of persistent grants to remove when executing
> >>    the LRU.
> >>
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> Cc: xen-devel@lists.xen.org
> >> ---
> >>  drivers/block/xen-blkback/blkback.c |  207 +++++++++++++++++++++++++++--------
> >>  drivers/block/xen-blkback/common.h  |    4 +
> >>  drivers/block/xen-blkback/xenbus.c  |    1 +
> >>  3 files changed, 166 insertions(+), 46 deletions(-)
> > 
> > You also should add a Documentation/sysfs/ABI/stable/sysfs-bus-xen-backend
> 
> OK
> 
> >>
> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >> index 415a0c7..c14b736 100644
> >> --- a/drivers/block/xen-blkback/blkback.c
> >> +++ b/drivers/block/xen-blkback/blkback.c
> >> @@ -63,6 +63,44 @@ static int xen_blkif_reqs = 64;
> >>  module_param_named(reqs, xen_blkif_reqs, int, 0);
> >>  MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
> >>
> >> +/*
> >> + * Maximum number of grants to map persistently in blkback. For maximum
> >> + * performance this should be the total numbers of grants that can be used
> >> + * to fill the ring, but since this might become too high, specially with
> >> + * the use of indirect descriptors, we set it to a value that provides good
> >> + * performance without using too much memory.
> >> + *
> >> + * When the list of persistent grants is full we clean it using a LRU
> >> + * algorithm.
> >> + */
> >> +
> >> +static int xen_blkif_max_pgrants = 352;
> > 
> > And a little blurb saying why 352.
> 
> Yes, this is (as you probably already realized) RING_SIZE *
> BLKIF_MAX_SEGMENTS_PER_REQUEST
> 
> > 
> >> +module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
> >> +MODULE_PARM_DESC(max_persistent_grants,
> >> +                 "Maximum number of grants to map persistently");
> >> +
> >> +/*
> >> + * The LRU mechanism to clean the lists of persistent grants needs to
> >> + * be executed periodically. The time interval between consecutive executions
> >> + * of the purge mechanism is set in ms.
> >> + */
> >> +
> >> +static int xen_blkif_lru_interval = 100;
> > 
> > So every second? What is the benefit of having the user modify this? Would
> > it better if there was an watermark system in xen-blkfront to automatically
> > figure this out? (This could be a TODO of course)
> 
> Every 100ms, so every 0.1 seconds. This can have an impact on
> performance as implemented right now (if we move the purge to a separate
> thread, it's not going to have such an impact), but anyway I feel we can
> let the user tune it.

Why not automatically tune it in the backend? So it can do this by itself?
> 
> >> +module_param_named(lru_interval, xen_blkif_lru_interval, int, 0644);
> >> +MODULE_PARM_DESC(lru_interval,
> >> +"Execution interval (in ms) of the LRU mechanism to clean the list of persistent grants");
> >> +
> >> +/*
> >> + * When the persistent grants list is full we will remove unused grants
> >> + * from the list. The number of grants to be removed at each LRU execution
> >> + * can be set dynamically.
> >> + */
> >> +
> >> +static int xen_blkif_lru_num_clean = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> +module_param_named(lru_num_clean, xen_blkif_lru_num_clean, int, 0644);
> >> +MODULE_PARM_DESC(lru_num_clean,
> >> +"Number of persistent grants to unmap when the list is full");
> > 
> > Again, what does that mean to the system admin? Why would they need to modify
> > the contents of that value?
> > 
> 
> Well if you set the maximum number of grants to 1024 you might want to
> increase this to 64 maybe, or on the other hand, if you set the maximum
> number of grants to 10, you may wish to set this to 1, so I think it is
> indeed relevant for system admins.

So why not make this automatic? A value blkback can automatically
adjust as there are fewer or more grants in use. This of course does
not have to be part of this patch.
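
For illustration, one possible shape for such a self-adjusting value
(purely hypothetical - the helper does not exist in this series and the
10% figure is made up):

static unsigned int lru_num_clean_auto(void)
{
	/* clean a fixed fraction of the configured maximum, at least one */
	return max_t(unsigned int, 1, xen_blkif_max_pgrants / 10);
}
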
> 
> > Now if this is a debug related one for developing, then this could all be
> > done in DebugFS.
> > 
> >> +
> >>  /* Run-time switchable: /sys/module/blkback/parameters/ */
> >>  static unsigned int log_stats;
> >>  module_param(log_stats, int, 0644);
> >> @@ -81,7 +119,7 @@ struct pending_req {
> >>       unsigned short          operation;
> >>       int                     status;
> >>       struct list_head        free_list;
> >> -     DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> +     struct persistent_gnt   *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >>  };
> >>
> >>  #define BLKBACK_INVALID_HANDLE (~0)
> >> @@ -102,36 +140,6 @@ struct xen_blkbk {
> >>  static struct xen_blkbk *blkbk;
> >>
> >>  /*
> >> - * Maximum number of grant pages that can be mapped in blkback.
> >> - * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> >> - * pages that blkback will persistently map.
> >> - * Currently, this is:
> >> - * RING_SIZE = 32 (for all known ring types)
> >> - * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
> >> - * sizeof(struct persistent_gnt) = 48
> >> - * So the maximum memory used to store the grants is:
> >> - * 32 * 11 * 48 = 16896 bytes
> >> - */
> >> -static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> >> -{
> >> -     switch (protocol) {
> >> -     case BLKIF_PROTOCOL_NATIVE:
> >> -             return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> >> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> -     case BLKIF_PROTOCOL_X86_32:
> >> -             return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> >> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> -     case BLKIF_PROTOCOL_X86_64:
> >> -             return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> >> -                        BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> -     default:
> >> -             BUG();
> >> -     }
> >> -     return 0;
> >> -}
> >> -
> >> -
> >> -/*
> >>   * Little helpful macro to figure out the index and virtual address of the
> >>   * pending_pages[..]. For each 'pending_req' we have have up to
> >>   * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> >> @@ -251,6 +259,76 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
> >>       BUG_ON(num != 0);
> >>  }
> >>
> >> +static int purge_persistent_gnt(struct rb_root *root, int num)
> >> +{
> >> +     struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> +     struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> +     struct persistent_gnt *persistent_gnt;
> >> +     struct rb_node *n;
> >> +     int ret, segs_to_unmap = 0;
> >> +     int requested_num = num;
> >> +     int preserve_used = 1;
> > 
> > Boolean? And perhaps 'scan_dirty' ?
> 
> Sure
> 
> > 
> > 
> >> +
> >> +     pr_debug("Requested the purge of %d persistent grants\n", num);
> >> +
> >> +purge_list:
> > 
> > This could be written a bit differently to also run outside the xen_blkif_schedule
> > (so a new thread). This would require using the lock mechanism and converting
> > this big loop to two smaller loops:
> >  1) - one quick that holds the lock - to take the items of the list,
> >  2) second one to do the grant_set_unmap_op operations and all the heavy
> >     free_xenballooned_pages call.
> 
> Yes, I could add a list_head to persistent_gnt, so we can take them out
> of the red-black tree and queue them in a list to be processed (unmap +
> free) after we have looped thought the list, without holding the lock.
> 
> > 
> > .. As this function ends up (presumarily?) causing xen_blkif_schedule to be doing
> > this for some time every second. Irregardless of how utilized the ring is - so
> > if we are 100% busy - we should not need to call this function. But if we do,
> > then we end up walking the persistent_gnt twice - once with preserve_used set
> > to true, and the other with it set to false.
> > 
> > We don't really want that - so is there a way for xen_blkif_schedule to
> > do a quick determintion of this caliber:
> > 
> > 
> >         if (RING_HAS_UNCONSUMED_REQUESTS(x) >= some_value)
> >                 wait_up(blkif->purgarator)
> 
> It's not possible to tell if all grants will be in use just by looking
> at the number of active requests, since all requests might just be using
> one segment, and thus the list of persistent grants could be purged
> without problems. We could keep a count of the number of active grants
> at each time and use that to check if we can kick the purge or not.
> 
> if (grants_in_use > (persistent_gnt_c - num_purge))
> 	wait(...)

Sure.
> 
> > And the thread would just sit there until kicked in action?
> 
> And when a request frees some grants it could be kicked back to action.

OK.
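
For reference, a rough sketch of how the purge could be split into the
two passes discussed above, so that only the first pass holds a lock
around the red-black tree. The spinlock (pers_gnts_lock), the list_head
inside struct persistent_gnt and the helper name are all hypothetical
here, not code from this series:

static void purge_persistent_gnt_async(struct xen_blkif *blkif, int num)
{
	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
	struct persistent_gnt *gnt, *tmp;
	struct rb_node *n;
	int segs = 0, ret;
	LIST_HEAD(victims);

	/* Pass 1: quick walk under the lock, unlink the victims */
	spin_lock(&blkif->pers_gnts_lock);
	for (n = rb_first(&blkif->persistent_gnts); n && num > 0;) {
		gnt = rb_entry(n, struct persistent_gnt, node);
		n = rb_next(n);
		if (gnt->flags & PERSISTENT_GNT_ACTIVE)
			continue;
		rb_erase(&gnt->node, &blkif->persistent_gnts);
		list_add(&gnt->list, &victims);
		blkif->persistent_gnt_c--;
		num--;
	}
	spin_unlock(&blkif->pers_gnts_lock);

	/* Pass 2: unmap and free in batches, without holding the lock */
	list_for_each_entry_safe(gnt, tmp, &victims, list) {
		gnttab_set_unmap_op(&unmap[segs],
			(unsigned long)pfn_to_kaddr(page_to_pfn(gnt->page)),
			GNTMAP_host_map, gnt->handle);
		pages[segs++] = gnt->page;
		if (segs == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
			ret = gnttab_unmap_refs(unmap, NULL, pages, segs);
			BUG_ON(ret);
			free_xenballooned_pages(segs, pages);
			segs = 0;
		}
		list_del(&gnt->list);
		kfree(gnt);
	}
	if (segs > 0) {
		ret = gnttab_unmap_refs(unmap, NULL, pages, segs);
		BUG_ON(ret);
		free_xenballooned_pages(segs, pages);
	}
}

The thread running this would then only be woken when the condition
sketched above holds, i.e. when there are enough inactive grants to
purge.
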

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-03-05 16:30         ` Roger Pau Monné
@ 2013-03-05 21:53           ` Konrad Rzeszutek Wilk
  2013-03-06  9:17             ` Roger Pau Monné
  0 siblings, 1 reply; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-05 21:53 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: linux-kernel, xen-devel

On Tue, Mar 05, 2013 at 05:30:01PM +0100, Roger Pau Monné wrote:
> On 05/03/13 15:18, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 05, 2013 at 12:04:41PM +0100, Roger Pau Monné wrote:
> >> On 04/03/13 20:39, Konrad Rzeszutek Wilk wrote:
> >>> On Thu, Feb 28, 2013 at 11:28:47AM +0100, Roger Pau Monne wrote:
> >>>> This prevents us from having to call alloc_page while we are preparing
> >>>> the request. Since blkfront was calling alloc_page with a spinlock
> >>>> held we used GFP_ATOMIC, which can fail if we are requesting a lot of
> >>>> pages since it is using the emergency memory pools.
> >>>>
> >>>> Allocating all the pages at init prevents us from having to call
> >>>> alloc_page, thus preventing possible failures.
> >>>>
> >>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >>>> Cc: xen-devel@lists.xen.org
> >>>> ---
> >>>>  drivers/block/xen-blkfront.c |  120 +++++++++++++++++++++++++++--------------
> >>>>  1 files changed, 79 insertions(+), 41 deletions(-)
> >>>>
> >>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >>>> index 2e39eaf..5ba6b87 100644
> >>>> --- a/drivers/block/xen-blkfront.c
> >>>> +++ b/drivers/block/xen-blkfront.c
> >>>> @@ -165,6 +165,69 @@ static int add_id_to_freelist(struct blkfront_info *info,
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> +static int fill_grant_buffer(struct blkfront_info *info, int num)
> >>>> +{
> >>>> +	struct page *granted_page;
> >>>> +	struct grant *gnt_list_entry, *n;
> >>>> +	int i = 0;
> >>>> +
> >>>> +	while(i < num) {
> >>>> +		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
> >>>
> >>> GFP_NORMAL ?
> >>
> >> drivers/block/xen-blkfront.c:175: error: ‘GFP_NORMAL’ undeclared (first
> >> use in this function)
> >>
> >> Did you mean GFP_KERNEL? I think GFP_NOIO is more suitable, it can block
> >> but no IO will be performed.
> > 
> > <sigh> I meant GFP_KERNEL. Sorry about the incorrect advice. The GFP_KERNEL
> > is the more general purpose pool - is there a good reason to use _NOIO?
> > This is after all during initialization when there is no IO using this driver.
> 
> We are already allocating memory using GFP_NOIO during setup
> (setup_blkring and blkif_recover), the only reason I can think could be
> helpful to use _NOIO is if the kernel tries to swap memory pages to the
> disk, but if it has to swap pages to disk at this point we won't
> probably be able to correctly setup blkfront anyway, either using _NOIO
> or _KERNEL.

OK, then NOIO makes sense.
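
For reference, a minimal sketch of what that GFP_NOIO pre-allocation
loop could look like; the struct grant field names (gref, pfn, node)
and the info->grants list used here are assumptions for the sketch,
not necessarily the exact code in the patch:

static int fill_grant_buffer(struct blkfront_info *info, int num)
{
	struct page *granted_page;
	struct grant *gnt_list_entry, *n;
	int i = 0;

	while (i < num) {
		gnt_list_entry = kzalloc(sizeof(struct grant), GFP_NOIO);
		if (!gnt_list_entry)
			goto out_of_memory;

		granted_page = alloc_page(GFP_NOIO);
		if (!granted_page) {
			kfree(gnt_list_entry);
			goto out_of_memory;
		}

		gnt_list_entry->pfn = page_to_pfn(granted_page);
		gnt_list_entry->gref = GRANT_INVALID_REF;
		list_add(&gnt_list_entry->node, &info->grants);
		i++;
	}

	return 0;

out_of_memory:
	list_for_each_entry_safe(gnt_list_entry, n, &info->grants, node) {
		list_del(&gnt_list_entry->node);
		__free_page(pfn_to_page(gnt_list_entry->pfn));
		kfree(gnt_list_entry);
	}
	return -ENOMEM;
}
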
> 
> >>>>  
> >>>>  	/* No more gnttab callback work. */
> >>>>  	gnttab_cancel_free_callback(&info->callback);
> >>>> @@ -1088,6 +1120,12 @@ again:
> >>>>  		goto destroy_blkring;
> >>>>  	}
> >>>>  
> >>>> +	/* Allocate memory for grants */
> >>>> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
> >>>> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >>>> +	if (err)
> >>>> +		goto out;
> >>>
> >>> That looks to be in the wrong function - talk_to_blkback function is
> >>> to talk to the blkback. Not do initialization type operations.
> >>
> >> Yes, I know it's not the best place to place it. It's here mainly
> >> because that's the only function that gets called by both driver
> >> initialization and resume.
> >>
> >> Last patch moves this to a more sensible place.
> > 
> > Lets make it part of this patch from the start. We still have two
> > months of time before the next merge window opens - so we have
> > time to make it nice and clean.
> 
> I'm moving this to blkfront_setup_indirect in a later patch (because
> this function doesn't yet exist at this point), but I can put it in a
> more suitable place in this patch.
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 04/12] xen-blkfront: pre-allocate pages for requests
  2013-03-05 21:53           ` Konrad Rzeszutek Wilk
@ 2013-03-06  9:17             ` Roger Pau Monné
  0 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-06  9:17 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 05/03/13 22:53, Konrad Rzeszutek Wilk wrote:
>>>>>>  
>>>>>>  	/* No more gnttab callback work. */
>>>>>>  	gnttab_cancel_free_callback(&info->callback);
>>>>>> @@ -1088,6 +1120,12 @@ again:
>>>>>>  		goto destroy_blkring;
>>>>>>  	}
>>>>>>  
>>>>>> +	/* Allocate memory for grants */
>>>>>> +	err = fill_grant_buffer(info, BLK_RING_SIZE *
>>>>>> +	                              BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>>>>> +	if (err)
>>>>>> +		goto out;
>>>>>
>>>>> That looks to be in the wrong function - talk_to_blkback function is
>>>>> to talk to the blkback. Not do initialization type operations.
>>>>
>>>> Yes, I know it's not the best place to place it. It's here mainly
>>>> because that's the only function that gets called by both driver
>>>> initialization and resume.
>>>>
>>>> Last patch moves this to a more sensible place.
>>>
>>> Lets make it part of this patch from the start. We still have two
>>> months of time before the next merge window opens - so we have
>>> time to make it nice and clean.
>>
>> I'm moving this to blkfront_setup_indirect in a later patch (because
>> this function doesn't yet exist at this point), but I can put it in a
>> more suitable place in this patch.
>>

I will place it in setup_blkring, which is where we also initialize the
sg array, and which is called by both the init and resume paths.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-05 21:46       ` Konrad Rzeszutek Wilk
@ 2013-03-08 17:07         ` Roger Pau Monné
  2013-03-22  1:10           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-08 17:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: james.harper, linux-kernel, xen-devel

On 05/03/13 22:46, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 05, 2013 at 06:07:57PM +0100, Roger Pau Monné wrote:
>> On 04/03/13 21:41, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
>>>> Indirect descriptors introduce a new block operation
>>>> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
>>>> in the request. This grant references are filled with arrays of
>>>> blkif_request_segment_aligned, this way we can send more segments in a
>>>> request.
>>>>
>>>> The proposed implementation sets the maximum number of indirect grefs
>>>> (frames filled with blkif_request_segment_aligned) to 256 in the
>>>> backend and 64 in the frontend. The value in the frontend has been
>>>> chosen experimentally, and the backend value has been set to a sane
>>>> value that allows expanding the maximum number of indirect descriptors
>>>> in the frontend if needed.
>>>
>>> So we are still using a similar format of the form:
>>>
>>> <gref, first_sec, last_sect, pad>, etc.
>>>
>>> Why not utilize a layout that fits with the bio sg? That way
>>> we might not even have to do the bio_alloc call and instead can
>>> setup an bio (and bio-list) with the appropiate offsets/list?

I think we can already do this without changing the structure of the
segments: we could just allocate a bio big enough to hold all the
segments and queue them up (provided that the underlying storage device
supports bios of this size).

bio = bio_alloc(GFP_KERNEL, nseg);
if (unlikely(bio == NULL))
	goto fail_put_bio;
biolist[nbio++] = bio;
bio->bi_bdev    = preq.bdev;
bio->bi_private = pending_req;
bio->bi_end_io  = end_block_io_op;
bio->bi_sector  = preq.sector_number;

for (i = 0; i < nseg; i++) {
	rc = bio_add_page(bio, pages[i], seg[i].nsec << 9,
		seg[i].buf & ~PAGE_MASK);
	if (rc == 0)
		goto fail_put_bio;
}

This seems to work with Linux blkfront/blkback, and I guess biolist in
blkback will then only ever contain a single bio.
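
As a sketch of the fallback case (not part of the snippet above, and
only my recollection of how the existing per-segment loop in blkback
handles it): when bio_add_page() returns 0 because a device limit is
hit, a new bio can be started instead of failing the request:

	struct bio *bio = NULL;

	for (i = 0; i < nseg; i++) {
		while ((bio == NULL) ||
		       (bio_add_page(bio, pages[i], seg[i].nsec << 9,
				     seg[i].buf & ~PAGE_MASK) == 0)) {
			bio = bio_alloc(GFP_KERNEL, nseg - i);
			if (unlikely(bio == NULL))
				goto fail_put_bio;
			biolist[nbio++] = bio;
			bio->bi_bdev    = preq.bdev;
			bio->bi_private = pending_req;
			bio->bi_end_io  = end_block_io_op;
			bio->bi_sector  = preq.sector_number;
		}
		preq.sector_number += seg[i].nsec;
	}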

>>> Meaning that the format of the indirect descriptors is:
>>>
>>> <gref, offset, next_index, pad>

Don't we need a length parameter? Also, next_index will be current+1,
because we already send the segments sorted (using for_each_sg) in blkfront.

>>>
>>> We already know what the first_sec and last_sect are - they
>>> are basically: sector_number +  nr_segments * (whatever the sector size is) + offset
>>
>> This will of course be suitable for Linux, but what about other OSes, I
>> know they support the traditional first_sec, last_sect (because it's
>> already implemented), but I don't know how much work will it be for them
>> to adopt this. If we have to do such a change I will have to check first
>> that other frontend/backend can handle this easily also, I wouldn't like
>> to simplify this for Linux by making it more difficult to implement in
>> other OSes...
> 
> I would think that most OSes use the same framework. The ones that
> are of notable interest are the Windows and BSD. Lets CC James here

Maybe I'm missing something here, but I don't see a really big benefit
of using this new structure for segments instead of the current one.
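
For illustration only, the two segment layouts being discussed (the
field widths are assumptions, the thread only names the fields; the pad
from the proposal is omitted since the assumed widths already land on 8
bytes):

/* layout used by this RFC inside an indirect descriptor frame */
struct blkif_request_segment_aligned {
	grant_ref_t gref;       /* reference to I/O buffer frame     */
	uint8_t     first_sect; /* first sector in frame to transfer */
	uint8_t     last_sect;  /* last sector in frame to transfer  */
	uint16_t    _pad;       /* pad the entry to 8 bytes          */
};

/* layout suggested in the review */
struct blkif_request_segment_alt {
	grant_ref_t gref;       /* reference to I/O buffer frame       */
	uint16_t    offset;     /* offset of the data inside the frame */
	uint16_t    next_index; /* index of the next segment           */
};
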


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 06/12] xen-blkback: implement LRU mechanism for persistent grants
  2013-03-05 21:49       ` Konrad Rzeszutek Wilk
@ 2013-03-18 17:00         ` Roger Pau Monné
  0 siblings, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-18 17:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 05/03/13 22:49, Konrad Rzeszutek Wilk wrote:
>>> This could be written a bit differently to also run outside the xen_blkif_schedule
>>> (so a new thread). This would require using the lock mechanism and converting
>>> this big loop to two smaller loops:
>>>  1) - one quick that holds the lock - to take the items of the list,
>>>  2) second one to do the grant_set_unmap_op operations and all the heavy
>>>     free_xenballooned_pages call.
>>
>> Yes, I could add a list_head to persistent_gnt, so we can take them out
>> of the red-black tree and queue them in a list to be processed (unmap +
>> free) after we have looped thought the list, without holding the lock.

I've been trying to implement the "purge" on a different kthread, but
I'm not able to get the same performance. Since moving this to a
different thread requires additional contention (spinlocks) around the
red-black tree of persistent grants, I think we should leave it as-is
right now, and consider moving it to a different thread if we can get a
performance benefit.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-02-28 10:28 ` [PATCH RFC 12/12] xen-block: implement indirect descriptors Roger Pau Monne
  2013-02-28 11:19   ` [Xen-devel] " Jan Beulich
  2013-03-04 20:41   ` Konrad Rzeszutek Wilk
@ 2013-03-18 17:06   ` Roger Pau Monné
  2013-03-19 14:38     ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-18 17:06 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: linux-kernel, xen-devel, Konrad Rzeszutek Wilk

On 28/02/13 11:28, Roger Pau Monne wrote:
> Indirect descriptors introduce a new block operation
> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> in the request. This grant references are filled with arrays of
> blkif_request_segment_aligned, this way we can send more segments in a
> request.
> 
> The proposed implementation sets the maximum number of indirect grefs
> (frames filled with blkif_request_segment_aligned) to 256 in the
> backend and 64 in the frontend. The value in the frontend has been
> chosen experimentally, and the backend value has been set to a sane
> value that allows expanding the maximum number of indirect descriptors
> in the frontend if needed.

I've added some additional debugging messages in blkfront, and found out
that the queue in blkfront is not providing requests bigger than 64
segments for read requests, or 128 segments for write requests, although
I set:

blk_queue_max_segments(info->rq, 256);

Is there any other limit I'm missing on the number of segments per
request a queue can provide?
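
One thing that might be worth checking (a guess, not a confirmed
answer): besides the segment count, the block layer also caps each
request by its size in sectors. If I remember right, max_sectors
defaults to at most BLK_DEF_MAX_SECTORS even when max_hw_sectors is
larger, which for 4K pages would land in the same ballpark as the 128
segments seen above. A quick test could be to raise the hardware limit
in blkfront:

	blk_queue_max_segments(info->rq, segs);
	/* allow 'segs' full pages worth of sectors per request */
	blk_queue_max_hw_sectors(info->rq, segs * (PAGE_SIZE / 512));

and, if that clamp is indeed there, bump
/sys/block/<dev>/queue/max_sectors_kb for the test as well.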


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-18 17:06   ` Roger Pau Monné
@ 2013-03-19 14:38     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-19 14:38 UTC (permalink / raw)
  To: Roger Pau Monné, martin.petersen; +Cc: linux-kernel, xen-devel

On Mon, Mar 18, 2013 at 06:06:38PM +0100, Roger Pau Monné wrote:
> On 28/02/13 11:28, Roger Pau Monne wrote:
> > Indirect descriptors introduce a new block operation
> > (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> > in the request. This grant references are filled with arrays of
> > blkif_request_segment_aligned, this way we can send more segments in a
> > request.
> > 
> > The proposed implementation sets the maximum number of indirect grefs
> > (frames filled with blkif_request_segment_aligned) to 256 in the
> > backend and 64 in the frontend. The value in the frontend has been
> > chosen experimentally, and the backend value has been set to a sane
> > value that allows expanding the maximum number of indirect descriptors
> > in the frontend if needed.
> 
> I've added some additional debugging messages in blkfront, and found out
> that the queue in blkfront is not providing request bigger than 64
> segments for read requests, or 128 segments for write requests, although
> I set:
> 
> blk_queue_max_segments(info->rq, 256);
> 
> Is there any other limit I'm missing on the number of segments per
> request a queue can provide?

Martin, any ideas?
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 12/12] xen-block: implement indirect descriptors
  2013-03-08 17:07         ` Roger Pau Monné
@ 2013-03-22  1:10           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 51+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-03-22  1:10 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: james.harper, linux-kernel, xen-devel

On Fri, Mar 08, 2013 at 06:07:08PM +0100, Roger Pau Monné wrote:
> On 05/03/13 22:46, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 05, 2013 at 06:07:57PM +0100, Roger Pau Monné wrote:
> >> On 04/03/13 21:41, Konrad Rzeszutek Wilk wrote:
> >>> On Thu, Feb 28, 2013 at 11:28:55AM +0100, Roger Pau Monne wrote:
> >>>> Indirect descriptors introduce a new block operation
> >>>> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> >>>> in the request. This grant references are filled with arrays of
> >>>> blkif_request_segment_aligned, this way we can send more segments in a
> >>>> request.
> >>>>
> >>>> The proposed implementation sets the maximum number of indirect grefs
> >>>> (frames filled with blkif_request_segment_aligned) to 256 in the
> >>>> backend and 64 in the frontend. The value in the frontend has been
> >>>> chosen experimentally, and the backend value has been set to a sane
> >>>> value that allows expanding the maximum number of indirect descriptors
> >>>> in the frontend if needed.
> >>>
> >>> So we are still using a similar format of the form:
> >>>
> >>> <gref, first_sec, last_sect, pad>, etc.
> >>>
> >>> Why not utilize a layout that fits with the bio sg? That way
> >>> we might not even have to do the bio_alloc call and instead can
> >>> setup an bio (and bio-list) with the appropiate offsets/list?
> 
> I think we can already do this without changing the structure of the
> segments, we could just allocate a bio big enough to hold all the
> segments and queue them up (provided that the underlying storage device
> supports bios of this size).
> 
> bio = bio_alloc(GFP_KERNEL, nseg);
> if (unlikely(bio == NULL))
> 	goto fail_put_bio;
> biolist[nbio++] = bio;
> bio->bi_bdev    = preq.bdev;
> bio->bi_private = pending_req;
> bio->bi_end_io  = end_block_io_op;
> bio->bi_sector  = preq.sector_number;
> 
> for (i = 0; i < nseg; i++) {
> 	rc = bio_add_page(bio, pages[i], seg[i].nsec << 9,
> 		seg[i].buf & ~PAGE_MASK);
> 	if (rc == 0)
> 		goto fail_put_bio;
> }
> 
> This seems to work with Linux blkfront/blkback, and I guess biolist in
> blkback only has one bio all the time.

> 
> >>> Meaning that the format of the indirect descriptors is:
> >>>
> >>> <gref, offset, next_index, pad>
> 
> Don't we need a length parameter? Also, next_index will be current+1,
> because we already send the segments sorted (using for_each_sg) in blkfront.
> 
> >>>
> >>> We already know what the first_sec and last_sect are - they
> >>> are basically: sector_number +  nr_segments * (whatever the sector size is) + offset
> >>
> >> This will of course be suitable for Linux, but what about other OSes, I
> >> know they support the traditional first_sec, last_sect (because it's
> >> already implemented), but I don't know how much work will it be for them
> >> to adopt this. If we have to do such a change I will have to check first
> >> that other frontend/backend can handle this easily also, I wouldn't like
> >> to simplify this for Linux by making it more difficult to implement in
> >> other OSes...
> > 
> > I would think that most OSes use the same framework. The ones that
> > are of notable interest are the Windows and BSD. Lets CC James here
> 
> Maybe I'm missing something here, but I don't see a really big benefit
> of using this new structure for segments instead of the current one.

DIF/DIX requires that the bio layout going into blkfront and the layout
emerging on the other side in the SAS/SCSI/SATA drivers be the same.

That means that when you have, for example, a bio-vec with five pages
linked - the first four each have 512 bytes of data (say in the middle
of the page, so bytes 2048 -> 2560 are occupied and the rest is not).
The total is 2048 bytes of data, and the last page contains 32 bytes
(four CRC checksums, 8 bytes each).

If we coalesce any of the five pages into one, then the backend needs to
reconstruct these five pages when it takes the request out of the ring.

My thought was that with fsect/lsect as they exist now, we will be
tempted to just coalesce the four sectors into one page and make
lsect = fsect + 4.

That however is _not_ what we are doing now - I think. We aim to
recreate the layout exactly as the READ/WRITE requests are sent to
xen-blkfront.
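
Purely as an illustration of the layout described above (the page
pointers and the helper are hypothetical, and this ignores how the
protection information is actually attached to a bio):

static void dix_layout_example(struct page *data_page[4], struct page *pi_page)
{
	struct bio_vec bvecs[5] = {
		/* four pages with 512 data bytes each at offset 2048 */
		{ .bv_page = data_page[0], .bv_len = 512, .bv_offset = 2048 },
		{ .bv_page = data_page[1], .bv_len = 512, .bv_offset = 2048 },
		{ .bv_page = data_page[2], .bv_len = 512, .bv_offset = 2048 },
		{ .bv_page = data_page[3], .bv_len = 512, .bv_offset = 2048 },
		/* one page with 4 * 8 = 32 bytes of CRCs */
		{ .bv_page = pi_page, .bv_len = 32, .bv_offset = 0 },
	};

	(void)bvecs; /* 4 * 512 = 2048 data bytes that must keep this
		      * exact per-page layout end to end */
}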

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings
  2013-03-04 20:22   ` Konrad Rzeszutek Wilk
@ 2013-03-26 17:30     ` Roger Pau Monné
  2013-03-26 17:48     ` Roger Pau Monné
  1 sibling, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-26 17:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 04/03/13 21:22, Konrad Rzeszutek Wilk wrote:
[...]
>> @@ -535,13 +604,17 @@ purge_gnt_list:
>>                               msecs_to_jiffies(xen_blkif_lru_interval);
>>               }
>>
>> +             remove_free_pages(blkif, xen_blkif_max_buffer_pages);
>> +
>>               if (log_stats && time_after(jiffies, blkif->st_print))
>>                       print_stats(blkif);
>>       }
>>
>> +     remove_free_pages(blkif, 0);
> 
> What purpose does that have?

This removes all the pages from the pool before closing down.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RFC 08/12] xen-blkback: use balloon pages for all mappings
  2013-03-04 20:22   ` Konrad Rzeszutek Wilk
  2013-03-26 17:30     ` Roger Pau Monné
@ 2013-03-26 17:48     ` Roger Pau Monné
  1 sibling, 0 replies; 51+ messages in thread
From: Roger Pau Monné @ 2013-03-26 17:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On 04/03/13 21:22, Konrad Rzeszutek Wilk wrote:
>> @@ -194,14 +260,15 @@ static void add_persistent_gnt(struct rb_root *root,
>>               else if (persistent_gnt->gnt > this->gnt)
>>                       new = &((*new)->rb_right);
>>               else {
>> -                     pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
>> -                     BUG();
>> +                     pr_alert_ratelimited(DRV_PFX " trying to add a gref that's already in the tree\n");
>> +                     return -EINVAL;
> 
> That looks like a seperate bug-fix patch? Especially the pr_alert_ratelimited
> part?

Not really; with the way we added granted frames before this patch, it
was never possible to add a persistent grant with the same gref twice.

With the changes introduced in this patch we first map the grants and
then try to make them persistent by adding them to the tree. So it is
possible for a frontend to craft a malicious request that has the same
gref in all segments, and when we try to add them to the tree of
persistent grants we would hit the BUG. That's why we need to ratelimit
the alert (to prevent flooding) and return EINVAL instead of crashing.

^ permalink raw reply	[flat|nested] 51+ messages in thread
