All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages
@ 2011-12-17  3:22 Adin Scannell
  2011-12-17  3:22 ` [PATCH 1/3] Make xen_remap_domain_mfn_range return value meaningful in case of error Adin Scannell
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Adin Scannell @ 2011-12-17  3:22 UTC (permalink / raw)
  To: xen-devel; +Cc: konrad, andres, olaf, JBeulich, adin

I've ported a couple of patches to the Linux pvops kernel that are necessary
for correctly running domains with paging.  In a nutshell: in the case of a
foreign attempt to map a paged-out page, the correct error code will now be
propogated up to libxc, which will already handle it correctly.

This required an implementation of mmap_batch_v2.

I've tested it using a highly-paged domain and everything looks okay (qemu will
receive the appropriate error via the mmap_batch_v2 and retry).

(My apologies if anyone/everyone receives this set of patches more than once,
I've had some trouble with both my connection dying and guilt freaking out
while I'm sending, leaving things in a bit of an unknown state.)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] Make xen_remap_domain_mfn_range return value meaningful in case of error
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
@ 2011-12-17  3:22 ` Adin Scannell
  2011-12-17  3:22 ` [PATCH 2/3] Handle GNTST_eagain in kernel drivers Adin Scannell
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Adin Scannell @ 2011-12-17  3:22 UTC (permalink / raw)
  To: xen-devel; +Cc: olaf, Adin Scannell, andres, JBeulich, konrad, adin

Original patch from Olaf Hering <ohering@novell.com>

This change fixes the xc_map_foreign_bulk interface, which would
otherwise cause SIGBUS when pages are gone because -ENOENT is not
returned as expected by the IOCTL_PRIVCMD_MMAPBATCH_V2 ioctl.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

This is a port to the mainline Linux tree.  Functions were refactored and
renamed.  I believe that this is the only required change to match the
semantics of the original patch.

Signed-off-by: Adin Scannell <adin@scannell.ca>
---
 arch/x86/xen/mmu.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 87f6673..288a7fc 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2350,8 +2350,8 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 		if (err)
 			goto out;
 
-		err = -EFAULT;
-		if (HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid) < 0)
+		err = HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid);
+		if (err < 0)
 			goto out;
 
 		nr -= batch;
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
  2011-12-17  3:22 ` [PATCH 1/3] Make xen_remap_domain_mfn_range return value meaningful in case of error Adin Scannell
@ 2011-12-17  3:22 ` Adin Scannell
  2011-12-17 14:30   ` Konrad Rzeszutek Wilk
  2011-12-17  3:22 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Adin Scannell @ 2011-12-17  3:22 UTC (permalink / raw)
  To: xen-devel; +Cc: olaf, Adin Scannell, andres, JBeulich, konrad, adin

Handle GNTST_eagain status from GNTTABOP_map_grant_ref and
GNTTABOP_copy operations properly to allow usage of xenpaging without
causing crashes or data corruption.

Replace all relevant HYPERVISOR_grant_table_op() calls with a retry
loop. This loop is implemented as a macro to allow different
GNTTABOP_* args. It will sleep up to 33 seconds and wait for the
page to be paged in again.

All ->status checks were updated to use the GNTST_* namespace. All
return values are converted from GNTST_* namespace to 0/-EINVAL, since
all callers did not use the actual return value.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Patrick Colp <pjcolp@cs.ubc.ca>

This is a port to the mainline Linux tree.  This required dropping many backend
drivers which have not yet made it in.  Additionally, several of the referenced
functions have moved and/or been refactored.  I attempted to minimize changes
while keeping the same semantics.

Signed-off-by: Adin Scannell <adin@scannell.ca>

Index: linux/drivers/block/xen-blkback/blkback.c
===================================================================
---
 drivers/block/xen-blkback/blkback.c |    4 ++-
 drivers/xen/grant-table.c           |    7 ++++-
 drivers/xen/xenbus/xenbus_client.c  |    4 +++
 include/xen/grant_table.h           |   39 +++++++++++++++++++++++++++++++++++
 include/xen/interface/grant_table.h |    6 ++++-
 5 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 15ec4db..d3fb290 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -390,7 +390,9 @@ static int xen_blkbk_map(struct blkif_request *req,
 	 * the page from the other domain.
 	 */
 	for (i = 0; i < nseg; i++) {
-		if (unlikely(map[i].status != 0)) {
+		if (unlikely(map[i].status == GNTST_eagain))
+			gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, &map[i]);
+		if (unlikely(map[i].status != GNTST_okay)) {
 			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
 			map[i].handle = BLKBACK_INVALID_HANDLE;
 			ret |= 1;
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index bf1c094..48826c3 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -464,9 +464,12 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 
 	for (i = 0; i < count; i++) {
 		/* Do not add to override if the map failed. */
-		if (map_ops[i].status)
+		if (map_ops[i].status != GNTST_okay && map_ops[i].status != GNTST_eagain)
 			continue;
 
+		if (map_ops[i].status == GNTST_eagain)
+			return -EAGAIN;
+
 		if (map_ops[i].flags & GNTMAP_contains_pte) {
 			pte = (pte_t *) (mfn_to_virt(PFN_DOWN(map_ops[i].host_addr)) +
 				(map_ops[i].host_addr & ~PAGE_MASK));
@@ -565,7 +568,7 @@ static int gnttab_map(unsigned int start_idx, unsigned int end_idx)
 		return -ENOSYS;
 	}
 
-	BUG_ON(rc || setup.status);
+	BUG_ON(rc || (setup.status != GNTST_okay));
 
 	rc = arch_gnttab_map_shared(frames, nr_gframes, gnttab_max_grant_frames(),
 				    &shared);
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index 1906125..d123c78 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -455,6 +455,8 @@ int xenbus_map_ring_valloc(struct xenbus_device *dev, int gnt_ref, void **vaddr)
 	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
 		BUG();
 
+	if (op.status == GNTST_eagain)
+		gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, (&op));
 	if (op.status != GNTST_okay) {
 		free_vm_area(area);
 		xenbus_dev_fatal(dev, op.status,
@@ -499,6 +501,8 @@ int xenbus_map_ring(struct xenbus_device *dev, int gnt_ref,
 	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
 		BUG();
 
+	if (op.status == GNTST_eagain)
+		gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, (&op));
 	if (op.status != GNTST_okay) {
 		xenbus_dev_fatal(dev, op.status,
 				 "mapping in shared page %d from domain %d",
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 11e2dfc..46fac85 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -37,6 +37,8 @@
 #ifndef __ASM_GNTTAB_H__
 #define __ASM_GNTTAB_H__
 
+#include <linux/delay.h>
+
 #include <asm/page.h>
 
 #include <xen/interface/xen.h>
@@ -160,4 +162,41 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
 		      struct page **pages, unsigned int count);
 
+#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)					\
+do {																		\
+	u8 __hc_delay = 1;														\
+	int __ret;																\
+	while (unlikely((__HCarg_p)->status == GNTST_eagain && __hc_delay)) {	\
+		msleep(__hc_delay++);												\
+		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);			\
+		BUG_ON(__ret);														\
+	}																		\
+	if (__hc_delay == 0) {													\
+		printk(KERN_ERR "%s: gnt busy\n", __func__,);						\
+		(__HCarg_p)->status = GNTST_bad_page;								\
+	}																		\
+	if ((__HCarg_p)->status != GNTST_okay)									\
+		printk(KERN_ERR "%s: gnt status %x\n", 								\
+			__func__, (__HCarg_p)->status);									\
+} while(0)
+
+#define gnttab_check_GNTST_eagain_do_while(__HCop, __HCarg_p)			\
+do {																	\
+	u8 __hc_delay = 1;													\
+	int __ret;															\
+	do {																\
+		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);		\
+		BUG_ON(__ret);													\
+		if ((__HCarg_p)->status == GNTST_eagain)						\
+			msleep(__hc_delay++);										\
+	} while ((__HCarg_p)->status == GNTST_eagain && __hc_delay);		\
+	if (__hc_delay == 0) {												\
+		printk(KERN_ERR "%s: gnt busy\n", __func__);					\
+		(__HCarg_p)->status = GNTST_bad_page;							\
+	}																	\
+	if ((__HCarg_p)->status != GNTST_okay)								\
+		printk(KERN_ERR "%s: gnt status %x\n", 							\
+			__func__, (__HCarg_p)->status);								\
+} while(0)
+
 #endif /* __ASM_GNTTAB_H__ */
diff --git a/include/xen/interface/grant_table.h b/include/xen/interface/grant_table.h
index 39e5717..ba04080 100644
--- a/include/xen/interface/grant_table.h
+++ b/include/xen/interface/grant_table.h
@@ -363,6 +363,8 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_query_size);
 #define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
 #define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
 #define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary */
+#define GNTST_address_too_big (-11) /* transfer page address too large.      */
+#define GNTST_eagain          (-12) /* Could not map at the moment. Retry.   */
 
 #define GNTTABOP_error_msgs {                   \
     "okay",                                     \
@@ -375,7 +377,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_query_size);
     "no spare translation slot in the I/O MMU", \
     "permission denied",                        \
     "bad page",                                 \
-    "copy arguments cross page boundary"        \
+    "copy arguments cross page boundary",       \
+    "page address size too large",              \
+    "could not map at the moment, retry"        \
 }
 
 #endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
  2011-12-17  3:22 ` [PATCH 1/3] Make xen_remap_domain_mfn_range return value meaningful in case of error Adin Scannell
  2011-12-17  3:22 ` [PATCH 2/3] Handle GNTST_eagain in kernel drivers Adin Scannell
@ 2011-12-17  3:22 ` Adin Scannell
  2011-12-17 14:40   ` Konrad Rzeszutek Wilk
  2011-12-17  3:49 ` [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Adin Scannell @ 2011-12-17  3:22 UTC (permalink / raw)
  To: xen-devel; +Cc: olaf, Adin Scannell, andres, JBeulich, konrad, adin

This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
tree.  The code structure is significantly different and this patch mirrors the
existing Linux code.

The primary reason for need the V2 interface is to support foreign mappings
(i.e. qemu) of paged-out pages.  The libxc code will already retry mappings
when an ENOENT is returned.  The V2 interface provides a richer error value,
so the user-space code is capable of handling these errors specifically.

Signed-off-by: Adin Scannell <adin@scannell.ca>

Index: linux/drivers/xen/xenfs/privcmd.c
===================================================================
---
 drivers/xen/xenfs/privcmd.c |   90 ++++++++++++++++++++++++++++++++++++++++++-
 include/xen/privcmd.h       |   10 +++++
 2 files changed, 99 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c
index dbd3b16..21cbb5a 100644
--- a/drivers/xen/xenfs/privcmd.c
+++ b/drivers/xen/xenfs/privcmd.c
@@ -70,7 +70,7 @@ static void free_page_list(struct list_head *pages)
  */
 static int gather_array(struct list_head *pagelist,
 			unsigned nelem, size_t size,
-			void __user *data)
+			const void __user *data)
 {
 	unsigned pageidx;
 	void *pagedata;
@@ -245,6 +245,15 @@ struct mmap_batch_state {
 	xen_pfn_t __user *user;
 };
 
+struct mmap_batch_v2_state {
+	domid_t domain;
+	unsigned long va;
+	struct vm_area_struct *vma;
+	int paged_out;
+
+	int __user *err;
+};
+
 static int mmap_batch_fn(void *data, void *state)
 {
 	xen_pfn_t *mfnp = data;
@@ -260,6 +269,20 @@ static int mmap_batch_fn(void *data, void *state)
 	return 0;
 }
 
+static int mmap_batch_v2_fn(void *data, void *state)
+{
+	xen_pfn_t *mfnp = data;
+	struct mmap_batch_v2_state *st = state;
+
+	int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
+				       st->vma->vm_page_prot, st->domain);
+	if ( rc == -ENOENT )
+		st->paged_out++;
+	st->va += PAGE_SIZE;
+
+	return put_user(rc, st->err++);
+}
+
 static int mmap_return_errors(void *data, void *state)
 {
 	xen_pfn_t *mfnp = data;
@@ -332,6 +355,67 @@ out:
 	return ret;
 }
 
+static long privcmd_ioctl_mmap_batch_v2(void __user *udata)
+{
+	int ret;
+	struct privcmd_mmapbatch_v2 m;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long nr_pages;
+	LIST_HEAD(pagelist);
+	struct mmap_batch_v2_state state;
+
+	if (!xen_initial_domain())
+		return -EPERM;
+
+	if (copy_from_user(&m, udata, sizeof(m)))
+		return -EFAULT;
+
+	nr_pages = m.num;
+	if ((m.num <= 0) || (nr_pages > (ULONG_MAX >> PAGE_SHIFT)))
+		return -EINVAL;
+
+	ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t),
+			   m.arr);
+
+	if (ret || list_empty(&pagelist))
+		goto out;
+
+	down_write(&mm->mmap_sem);
+
+	vma = find_vma(mm, m.addr);
+	ret = -EINVAL;
+	/* We allow multiple shots here, because this interface
+	 * is used by libxc and mappings for specific pages will
+	 * be retried when pages are paged-out (ENOENT). */
+	if (!vma ||
+	    vma->vm_ops != &privcmd_vm_ops ||
+	    (m.addr < vma->vm_start) ||
+	    ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	state.domain = m.dom;
+	state.vma = vma;
+	state.va = m.addr;
+	state.err = m.err;
+	state.paged_out = 0;
+
+	up_write(&mm->mmap_sem);
+
+	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+			     &pagelist, mmap_batch_v2_fn, &state);
+
+out:
+	free_page_list(&pagelist);
+
+	if ( (ret == 0) && (state.paged_out > 0) )
+		return -ENOENT;
+        else
+		return ret;
+}
+
 static long privcmd_ioctl(struct file *file,
 			  unsigned int cmd, unsigned long data)
 {
@@ -351,6 +435,10 @@ static long privcmd_ioctl(struct file *file,
 		ret = privcmd_ioctl_mmap_batch(udata);
 		break;
 
+	case IOCTL_PRIVCMD_MMAPBATCH_V2:
+		ret = privcmd_ioctl_mmap_batch_v2(udata);
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
index 17857fb..39b92b1 100644
--- a/include/xen/privcmd.h
+++ b/include/xen/privcmd.h
@@ -62,6 +62,14 @@ struct privcmd_mmapbatch {
 	xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */
 };
 
+struct privcmd_mmapbatch_v2 {
+	int num;          /* number of pages to populate */
+	domid_t dom;      /* target domain */
+	__u64 addr;       /* virtual address */
+	const xen_pfn_t __user *arr; /* array of mfns */
+	int __user *err;  /* array of error codes */
+};
+
 /*
  * @cmd: IOCTL_PRIVCMD_HYPERCALL
  * @arg: &privcmd_hypercall_t
@@ -73,5 +81,7 @@ struct privcmd_mmapbatch {
 	_IOC(_IOC_NONE, 'P', 2, sizeof(struct privcmd_mmap))
 #define IOCTL_PRIVCMD_MMAPBATCH					\
 	_IOC(_IOC_NONE, 'P', 3, sizeof(struct privcmd_mmapbatch))
+#define IOCTL_PRIVCMD_MMAPBATCH_V2				\
+	_IOC(_IOC_NONE, 'P', 4, sizeof(struct privcmd_mmapbatch_v2))
 
 #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
                   ` (2 preceding siblings ...)
  2011-12-17  3:22 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
@ 2011-12-17  3:49 ` Adin Scannell
  2011-12-17 14:16 ` Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Adin Scannell @ 2011-12-17  3:49 UTC (permalink / raw)
  To: Adin Scannell; +Cc: konrad, andres, xen-devel, olaf, JBeulich

> (My apologies if anyone/everyone receives this set of patches more than once,
> I've had some trouble with both my connection dying and guilt freaking out
> while I'm sending, leaving things in a bit of an unknown state.)

In my frustration, things got manual.  This e-mail should be titled
[PATCH 0 of 3].

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
                   ` (3 preceding siblings ...)
  2011-12-17  3:49 ` [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
@ 2011-12-17 14:16 ` Konrad Rzeszutek Wilk
  2011-12-17 14:16 ` Konrad Rzeszutek Wilk
  2012-01-02 16:06 ` Olaf Hering
  6 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 14:16 UTC (permalink / raw)
  To: Adin Scannell; +Cc: andres, xen-devel, olaf, JBeulich, adin

On Fri, Dec 16, 2011 at 10:22:18PM -0500, Adin Scannell wrote:
> I've ported a couple of patches to the Linux pvops kernel that are necessary
> for correctly running domains with paging.  In a nutshell: in the case of a
> foreign attempt to map a paged-out page, the correct error code will now be
> propogated up to libxc, which will already handle it correctly.
> 
> This required an implementation of mmap_batch_v2.
> 
> I've tested it using a highly-paged domain and everything looks okay (qemu will
> receive the appropriate error via the mmap_batch_v2 and retry).

Please re-run these patches against scripts/checkpath.pl

Also do re-base them on my #linux-next branch (git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
#linux-next). I am a bit worried how they are going to work with Daniel's HVM
patches which I was planning to put in my branch shorlty:

http://old-list-archives.xen.org/archives/html/xen-devel/2011-10/msg01487.html

> 
> (My apologies if anyone/everyone receives this set of patches more than once,
> I've had some trouble with both my connection dying and guilt freaking out
> while I'm sending, leaving things in a bit of an unknown state.)

Next time please also send them to konrad.wilk@oracle.com.

Thanks!
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
                   ` (4 preceding siblings ...)
  2011-12-17 14:16 ` Konrad Rzeszutek Wilk
@ 2011-12-17 14:16 ` Konrad Rzeszutek Wilk
  2012-01-02 16:06 ` Olaf Hering
  6 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 14:16 UTC (permalink / raw)
  To: Adin Scannell; +Cc: andres, xen-devel, olaf, JBeulich, adin

On Fri, Dec 16, 2011 at 10:22:18PM -0500, Adin Scannell wrote:
> I've ported a couple of patches to the Linux pvops kernel that are necessary
> for correctly running domains with paging.  In a nutshell: in the case of a
> foreign attempt to map a paged-out page, the correct error code will now be
> propogated up to libxc, which will already handle it correctly.
> 
> This required an implementation of mmap_batch_v2.
> 
> I've tested it using a highly-paged domain and everything looks okay (qemu will
> receive the appropriate error via the mmap_batch_v2 and retry).
> 
> (My apologies if anyone/everyone receives this set of patches more than once,
> I've had some trouble with both my connection dying and guilt freaking out
> while I'm sending, leaving things in a bit of an unknown state.)


Ah, and also please CC LKML mailing list. Thanks.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2011-12-17  3:22 ` [PATCH 2/3] Handle GNTST_eagain in kernel drivers Adin Scannell
@ 2011-12-17 14:30   ` Konrad Rzeszutek Wilk
  2011-12-17 16:53     ` Adin Scannell
  2012-01-02 16:06     ` Olaf Hering
  0 siblings, 2 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 14:30 UTC (permalink / raw)
  To: Adin Scannell; +Cc: andres, xen-devel, olaf, JBeulich, adin

On Fri, Dec 16, 2011 at 10:22:20PM -0500, Adin Scannell wrote:
> Handle GNTST_eagain status from GNTTABOP_map_grant_ref and
> GNTTABOP_copy operations properly to allow usage of xenpaging without
> causing crashes or data corruption.
> 
> Replace all relevant HYPERVISOR_grant_table_op() calls with a retry
> loop. This loop is implemented as a macro to allow different
> GNTTABOP_* args. It will sleep up to 33 seconds and wait for the
> page to be paged in again.
> 
> All ->status checks were updated to use the GNTST_* namespace. All
> return values are converted from GNTST_* namespace to 0/-EINVAL, since
> all callers did not use the actual return value.
> 
> Signed-off-by: Olaf Hering <olaf@aepfle.de>
> Acked-by: Patrick Colp <pjcolp@cs.ubc.ca>
> 
> This is a port to the mainline Linux tree.  This required dropping many backend
> drivers which have not yet made it in.  Additionally, several of the referenced
> functions have moved and/or been refactored.  I attempted to minimize changes
> while keeping the same semantics.
> 
> Signed-off-by: Adin Scannell <adin@scannell.ca>
> 
> Index: linux/drivers/block/xen-blkback/blkback.c
> ===================================================================
> ---
>  drivers/block/xen-blkback/blkback.c |    4 ++-
>  drivers/xen/grant-table.c           |    7 ++++-
>  drivers/xen/xenbus/xenbus_client.c  |    4 +++
>  include/xen/grant_table.h           |   39 +++++++++++++++++++++++++++++++++++
>  include/xen/interface/grant_table.h |    6 ++++-

>  5 files changed, 56 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 15ec4db..d3fb290 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -390,7 +390,9 @@ static int xen_blkbk_map(struct blkif_request *req,
>  	 * the page from the other domain.
>  	 */
>  	for (i = 0; i < nseg; i++) {
> -		if (unlikely(map[i].status != 0)) {
> +		if (unlikely(map[i].status == GNTST_eagain))
> +			gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, &map[i]);
> +		if (unlikely(map[i].status != GNTST_okay)) {
>  			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>  			map[i].handle = BLKBACK_INVALID_HANDLE;
>  			ret |= 1;

> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index bf1c094..48826c3 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -464,9 +464,12 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
>  
>  	for (i = 0; i < count; i++) {
>  		/* Do not add to override if the map failed. */
> -		if (map_ops[i].status)
> +		if (map_ops[i].status != GNTST_okay && map_ops[i].status != GNTST_eagain)
>  			continue;
>  
> +		if (map_ops[i].status == GNTST_eagain)
> +			return -EAGAIN;
> +
>  		if (map_ops[i].flags & GNTMAP_contains_pte) {
>  			pte = (pte_t *) (mfn_to_virt(PFN_DOWN(map_ops[i].host_addr)) +
>  				(map_ops[i].host_addr & ~PAGE_MASK));
> @@ -565,7 +568,7 @@ static int gnttab_map(unsigned int start_idx, unsigned int end_idx)
>  		return -ENOSYS;
>  	}
>  
> -	BUG_ON(rc || setup.status);
> +	BUG_ON(rc || (setup.status != GNTST_okay));
>  
>  	rc = arch_gnttab_map_shared(frames, nr_gframes, gnttab_max_grant_frames(),
>  				    &shared);
> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> index 1906125..d123c78 100644
> --- a/drivers/xen/xenbus/xenbus_client.c
> +++ b/drivers/xen/xenbus/xenbus_client.c
> @@ -455,6 +455,8 @@ int xenbus_map_ring_valloc(struct xenbus_device *dev, int gnt_ref, void **vaddr)
>  	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
>  		BUG();
>  
> +	if (op.status == GNTST_eagain)
> +		gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, (&op));
>  	if (op.status != GNTST_okay) {
>  		free_vm_area(area);
>  		xenbus_dev_fatal(dev, op.status,
> @@ -499,6 +501,8 @@ int xenbus_map_ring(struct xenbus_device *dev, int gnt_ref,
>  	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
>  		BUG();
>  
> +	if (op.status == GNTST_eagain)
> +		gnttab_check_GNTST_eagain_do_while(GNTTABOP_map_grant_ref, (&op));
>  	if (op.status != GNTST_okay) {
>  		xenbus_dev_fatal(dev, op.status,
>  				 "mapping in shared page %d from domain %d",
> diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
> index 11e2dfc..46fac85 100644
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -37,6 +37,8 @@
>  #ifndef __ASM_GNTTAB_H__
>  #define __ASM_GNTTAB_H__
>  
> +#include <linux/delay.h>
> +
>  #include <asm/page.h>
>  
>  #include <xen/interface/xen.h>
> @@ -160,4 +162,41 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
>  int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
>  		      struct page **pages, unsigned int count);
>  
> +#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)					\

So why does this have to be a macro? What is the advantage of that
versus making this a function?

> +do {																		\
> +	u8 __hc_delay = 1;														\
> +	int __ret;																\
> +	while (unlikely((__HCarg_p)->status == GNTST_eagain && __hc_delay)) {	\
> +		msleep(__hc_delay++);												\

Ugh. Not sure what happend to this, but there are tons of '\' at the
end.

So why msleep? Why not go for a proper yield? Call the safe_halt()
function?

> +		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);			\
> +		BUG_ON(__ret);														\
> +	}																		\
> +	if (__hc_delay == 0) {													\

So this would happen if we rolled over __hc_delay, so did this more than
255 times? Presumarily this can happen if the swapper in dom0 crashes..

I would recommend you use 'WARN' here and include tons of details.
This is a pretty serious issues, is it not?

> +		printk(KERN_ERR "%s: gnt busy\n", __func__,);						\
> +		(__HCarg_p)->status = GNTST_bad_page;								\
> +	}																		\
> +	if ((__HCarg_p)->status != GNTST_okay)									\
> +		printk(KERN_ERR "%s: gnt status %x\n", 								\
> +			__func__, (__HCarg_p)->status);									\

Use GNTTABOP_error_msgs. Also should we continue? What is the
impact if we do continue? The times this is hit is if the status
is not GNTS_okay and if it is not GNTS_eagain - so what are the
situations in which this happens and what can the domain do
to recover? Should there be some helpfull message instead of
just "gnt status: X" ?

> +} while(0)
> +
> +#define gnttab_check_GNTST_eagain_do_while(__HCop, __HCarg_p)			\
> +do {																	\
> +	u8 __hc_delay = 1;													\
> +	int __ret;															\
> +	do {																\
> +		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);		\
> +		BUG_ON(__ret);													\
> +		if ((__HCarg_p)->status == GNTST_eagain)						\
> +			msleep(__hc_delay++);										\
> +	} while ((__HCarg_p)->status == GNTST_eagain && __hc_delay);		\
> +	if (__hc_delay == 0) {												\
> +		printk(KERN_ERR "%s: gnt busy\n", __func__);					\
> +		(__HCarg_p)->status = GNTST_bad_page;							\
> +	}																	\
> +	if ((__HCarg_p)->status != GNTST_okay)								\
> +		printk(KERN_ERR "%s: gnt status %x\n", 							\
> +			__func__, (__HCarg_p)->status);								\
> +} while(0)
> +
>  #endif /* __ASM_GNTTAB_H__ */
> diff --git a/include/xen/interface/grant_table.h b/include/xen/interface/grant_table.h
> index 39e5717..ba04080 100644
> --- a/include/xen/interface/grant_table.h
> +++ b/include/xen/interface/grant_table.h
> @@ -363,6 +363,8 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_query_size);
>  #define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
>  #define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
>  #define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary */
> +#define GNTST_address_too_big (-11) /* transfer page address too large.      */
> +#define GNTST_eagain          (-12) /* Could not map at the moment. Retry.   */
>  
>  #define GNTTABOP_error_msgs {                   \
>      "okay",                                     \
> @@ -375,7 +377,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_query_size);
>      "no spare translation slot in the I/O MMU", \
>      "permission denied",                        \
>      "bad page",                                 \
> -    "copy arguments cross page boundary"        \
> +    "copy arguments cross page boundary",       \
> +    "page address size too large",              \
> +    "could not map at the moment, retry"        \
>  }
>  
>  #endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
> -- 
> 1.6.2.5
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-17  3:22 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
@ 2011-12-17 14:40   ` Konrad Rzeszutek Wilk
  2011-12-17 16:51     ` Adin Scannell
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 14:40 UTC (permalink / raw)
  To: Adin Scannell; +Cc: andres, xen-devel, olaf, JBeulich, adin

On Fri, Dec 16, 2011 at 10:22:21PM -0500, Adin Scannell wrote:
> This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
> tree.  The code structure is significantly different and this patch mirrors the
> existing Linux code.
> 
> The primary reason for need the V2 interface is to support foreign mappings
> (i.e. qemu) of paged-out pages.  The libxc code will already retry mappings
> when an ENOENT is returned.  The V2 interface provides a richer error value,
> so the user-space code is capable of handling these errors specifically.

Can you give more details on how to use paged-out pages. Perhaps a
pointer to the xen's docs?

> 
> Signed-off-by: Adin Scannell <adin@scannell.ca>
> 
> Index: linux/drivers/xen/xenfs/privcmd.c
> ===================================================================
> ---
>  drivers/xen/xenfs/privcmd.c |   90 ++++++++++++++++++++++++++++++++++++++++++-

So that file just moved to drivers/xen/privcmd.c

>  include/xen/privcmd.h       |   10 +++++
>  2 files changed, 99 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c
> index dbd3b16..21cbb5a 100644
> --- a/drivers/xen/xenfs/privcmd.c
> +++ b/drivers/xen/xenfs/privcmd.c
> @@ -70,7 +70,7 @@ static void free_page_list(struct list_head *pages)
>   */
>  static int gather_array(struct list_head *pagelist,
>  			unsigned nelem, size_t size,
> -			void __user *data)
> +			const void __user *data)
>  {
>  	unsigned pageidx;
>  	void *pagedata;
> @@ -245,6 +245,15 @@ struct mmap_batch_state {
>  	xen_pfn_t __user *user;
>  };
>  
> +struct mmap_batch_v2_state {
> +	domid_t domain;
> +	unsigned long va;
> +	struct vm_area_struct *vma;
> +	int paged_out;

Should this be unsigned int?
> +
> +	int __user *err;
> +};
> +
>  static int mmap_batch_fn(void *data, void *state)
>  {
>  	xen_pfn_t *mfnp = data;
> @@ -260,6 +269,20 @@ static int mmap_batch_fn(void *data, void *state)
>  	return 0;
>  }
>  
> +static int mmap_batch_v2_fn(void *data, void *state)
> +{
> +	xen_pfn_t *mfnp = data;
> +	struct mmap_batch_v2_state *st = state;
> +
> +	int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
> +				       st->vma->vm_page_prot, st->domain);

You don't want to check that st is not NULL?

> +	if ( rc == -ENOENT )

This is the wrong style. Please fix.

> +		st->paged_out++;

Is it possible that this ends overflowing and hitting 0?

> +	st->va += PAGE_SIZE;
> +
> +	return put_user(rc, st->err++);
> +}
> +
>  static int mmap_return_errors(void *data, void *state)
>  {
>  	xen_pfn_t *mfnp = data;
> @@ -332,6 +355,67 @@ out:
>  	return ret;
>  }
>  
> +static long privcmd_ioctl_mmap_batch_v2(void __user *udata)
> +{
> +	int ret;
> +	struct privcmd_mmapbatch_v2 m;
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long nr_pages;
> +	LIST_HEAD(pagelist);
> +	struct mmap_batch_v2_state state;
> +
> +	if (!xen_initial_domain())
> +		return -EPERM;
> +
> +	if (copy_from_user(&m, udata, sizeof(m)))
> +		return -EFAULT;
> +
> +	nr_pages = m.num;
> +	if ((m.num <= 0) || (nr_pages > (ULONG_MAX >> PAGE_SHIFT)))

Just make it nr_pages instead of m.num.

> +		return -EINVAL;
> +
> +	ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t),


nr_pages.
> +			   m.arr);
> +
> +	if (ret || list_empty(&pagelist))
> +		goto out;
> +
> +	down_write(&mm->mmap_sem);
> +
> +	vma = find_vma(mm, m.addr);
> +	ret = -EINVAL;
> +	/* We allow multiple shots here, because this interface
> +	 * is used by libxc and mappings for specific pages will
> +	 * be retried when pages are paged-out (ENOENT). */
> +	if (!vma ||
> +	    vma->vm_ops != &privcmd_vm_ops ||
> +	    (m.addr < vma->vm_start) ||
> +	    ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
> +		up_write(&mm->mmap_sem);
> +		goto out;
> +	}
> +
> +	state.domain = m.dom;

Should you check the m.dom for incorrect ones? Like -1? or DOMID_IO?

> +	state.vma = vma;
> +	state.va = m.addr;
> +	state.err = m.err;
> +	state.paged_out = 0;
> +
> +	up_write(&mm->mmap_sem);
> +
> +	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> +			     &pagelist, mmap_batch_v2_fn, &state);
> +
> +out:
> +	free_page_list(&pagelist);
> +
> +	if ( (ret == 0) && (state.paged_out > 0) )
> +		return -ENOENT;
> +        else
> +		return ret;
> +}
> +
>  static long privcmd_ioctl(struct file *file,
>  			  unsigned int cmd, unsigned long data)
>  {
> @@ -351,6 +435,10 @@ static long privcmd_ioctl(struct file *file,
>  		ret = privcmd_ioctl_mmap_batch(udata);
>  		break;
>  
> +	case IOCTL_PRIVCMD_MMAPBATCH_V2:
> +		ret = privcmd_ioctl_mmap_batch_v2(udata);
> +		break;
> +
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
> index 17857fb..39b92b1 100644
> --- a/include/xen/privcmd.h
> +++ b/include/xen/privcmd.h
> @@ -62,6 +62,14 @@ struct privcmd_mmapbatch {
>  	xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */
>  };
>  
> +struct privcmd_mmapbatch_v2 {
> +	int num;          /* number of pages to populate */
> +	domid_t dom;      /* target domain */
> +	__u64 addr;       /* virtual address */
> +	const xen_pfn_t __user *arr; /* array of mfns */
> +	int __user *err;  /* array of error codes */
> +};
> +
>  /*
>   * @cmd: IOCTL_PRIVCMD_HYPERCALL
>   * @arg: &privcmd_hypercall_t
> @@ -73,5 +81,7 @@ struct privcmd_mmapbatch {
>  	_IOC(_IOC_NONE, 'P', 2, sizeof(struct privcmd_mmap))
>  #define IOCTL_PRIVCMD_MMAPBATCH					\
>  	_IOC(_IOC_NONE, 'P', 3, sizeof(struct privcmd_mmapbatch))
> +#define IOCTL_PRIVCMD_MMAPBATCH_V2				\
> +	_IOC(_IOC_NONE, 'P', 4, sizeof(struct privcmd_mmapbatch_v2))
>  
>  #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
> -- 
> 1.6.2.5
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-17 14:40   ` Konrad Rzeszutek Wilk
@ 2011-12-17 16:51     ` Adin Scannell
  2011-12-17 21:29       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Adin Scannell @ 2011-12-17 16:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: olaf, xen-devel, Adin Scannell, konrad.wilk, andres, JBeulich

Hi Konrad,

Thanks for the quick turnaround. I'll incorporate your feedback and
resend. Some NOTES are inline below.

On Sat, Dec 17, 2011 at 9:40 AM, Konrad Rzeszutek Wilk
<konrad@darnok.org> wrote:
> On Fri, Dec 16, 2011 at 10:22:21PM -0500, Adin Scannell wrote:
>> This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
>> tree.  The code structure is significantly different and this patch mirrors the
>> existing Linux code.
>>
>> The primary reason for need the V2 interface is to support foreign mappings
>> (i.e. qemu) of paged-out pages.  The libxc code will already retry mappings
>> when an ENOENT is returned.  The V2 interface provides a richer error value,
>> so the user-space code is capable of handling these errors specifically.
>
> Can you give more details on how to use paged-out pages. Perhaps a
> pointer to the xen's docs?

Hrm, in usual fashion the docs are a bit lacking.

Simply: the kernel has to do nothing to support paging (other than the
backend drivers which need to handle the grant EAGAIN case -- other
patch).  The only reason the V2 interface is needed here is to pass
back full error codes from the mmu_update()'s. Xen and libxc have a
mutual understanding that when you receive an ENOENT error code, you
back off and try again.

>>
>> Signed-off-by: Adin Scannell <adin@scannell.ca>
>>
>> Index: linux/drivers/xen/xenfs/privcmd.c
>> ===================================================================
>> ---
>>  drivers/xen/xenfs/privcmd.c |   90 ++++++++++++++++++++++++++++++++++++++++++-
>
> So that file just moved to drivers/xen/privcmd.c

Of course it has :)

>>  include/xen/privcmd.h       |   10 +++++
>>  2 files changed, 99 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c
>> index dbd3b16..21cbb5a 100644
>> --- a/drivers/xen/xenfs/privcmd.c
>> +++ b/drivers/xen/xenfs/privcmd.c
>> @@ -70,7 +70,7 @@ static void free_page_list(struct list_head *pages)
>>   */
>>  static int gather_array(struct list_head *pagelist,
>>                       unsigned nelem, size_t size,
>> -                     void __user *data)
>> +                     const void __user *data)
>>  {
>>       unsigned pageidx;
>>       void *pagedata;
>> @@ -245,6 +245,15 @@ struct mmap_batch_state {
>>       xen_pfn_t __user *user;
>>  };
>>
>> +struct mmap_batch_v2_state {
>> +     domid_t domain;
>> +     unsigned long va;
>> +     struct vm_area_struct *vma;
>> +     int paged_out;
>
> Should this be unsigned int?

Noted below.

>> +
>> +     int __user *err;
>> +};
>> +
>>  static int mmap_batch_fn(void *data, void *state)
>>  {
>>       xen_pfn_t *mfnp = data;
>> @@ -260,6 +269,20 @@ static int mmap_batch_fn(void *data, void *state)
>>       return 0;
>>  }
>>
>> +static int mmap_batch_v2_fn(void *data, void *state)
>> +{
>> +     xen_pfn_t *mfnp = data;
>> +     struct mmap_batch_v2_state *st = state;
>> +
>> +     int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>> +                                    st->vma->vm_page_prot, st->domain);
>
> You don't want to check that st is not NULL?

This could be an assertion as it's only used in the
ioctl_mmap_batch_v2 function where it's guaranteed to pass in a
non-null state.

I'll add an additional patch to the queue that adds these assertions
to both mmap_batch_fn and mmap_batch_fn_v2.

>> +             st->paged_out++;
> Is it possible that this ends overflowing and hitting 0?

I don't think this is an issue practically, as the only way to trigger
this would be to be on a 64bit machine and map an ungodly number of
pages with a single mmap_batch (MAX_INT).  For correctness though, I
can update this variable and the err variable in mmap_batch_state
which suffers from the same problem -- turn them into unsigned long.
This will be included in the additional patch.

>> +                        m.arr);
>> +
>> +     if (ret || list_empty(&pagelist))
>> +             goto out;
>> +
>> +     down_write(&mm->mmap_sem);
>> +
>> +     vma = find_vma(mm, m.addr);
>> +     ret = -EINVAL;
>> +     /* We allow multiple shots here, because this interface
>> +      * is used by libxc and mappings for specific pages will
>> +      * be retried when pages are paged-out (ENOENT). */
>> +     if (!vma ||
>> +         vma->vm_ops != &privcmd_vm_ops ||
>> +         (m.addr < vma->vm_start) ||
>> +         ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
>> +             up_write(&mm->mmap_sem);
>> +             goto out;
>> +     }
>> +
>> +     state.domain = m.dom;
>
> Should you check the m.dom for incorrect ones? Like -1? or DOMID_IO?

I followed the example for mmap_batch, which just tries the call and
returns whatever error Xen gives.  I think that's the right approach
for these.

Cheers!
-Adin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2011-12-17 14:30   ` Konrad Rzeszutek Wilk
@ 2011-12-17 16:53     ` Adin Scannell
  2011-12-17 21:31       ` Konrad Rzeszutek Wilk
  2012-01-02 16:06     ` Olaf Hering
  1 sibling, 1 reply; 20+ messages in thread
From: Adin Scannell @ 2011-12-17 16:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: olaf, xen-devel, Adin Scannell, konrad.wilk, andres, JBeulich

On Sat, Dec 17, 2011 at 9:30 AM, Konrad Rzeszutek Wilk
<konrad@darnok.org> wrote:
>> +#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)                                   >
> So why does this have to be a macro? What is the advantage of that
> versus making this a function?

I just wanted to minimize changes in the patch from the known XCP
version.  I'm personally not a fan of big macros like this.

> So why msleep? Why not go for a proper yield? Call the safe_halt()
> function?

Yes, this is something that should be revisited.

Since it looks like Daniel's HVM patches are going to conflict with
this anyways, I'll revisit this patch independently from the other two
and see what I can do about making it nicer and addressing the other
bits of feedback you've given.

> Use GNTTABOP_error_msgs. Also should we continue? What is the
> impact if we do continue? The times this is hit is if the status
> is not GNTS_okay and if it is not GNTS_eagain - so what are the
> situations in which this happens and what can the domain do
> to recover? Should there be some helpfull message instead of
> just "gnt status: X" ?

In all the cases this macro is called, the caller still needs to check
op.status and handle any errors appropriately.  The point of the macro
is to reasonably get you from eagain => !eagain, whatever the result
may be.  If I turn this into a function, those semantics will still
apply.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-17 16:51     ` Adin Scannell
@ 2011-12-17 21:29       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 21:29 UTC (permalink / raw)
  To: Adin Scannell
  Cc: olaf, xen-devel, Adin Scannell, andres, JBeulich, Konrad Rzeszutek Wilk

On Sat, Dec 17, 2011 at 11:51:11AM -0500, Adin Scannell wrote:
> Hi Konrad,
> 
> Thanks for the quick turnaround. I'll incorporate your feedback and
> resend. Some NOTES are inline below.
> 
> On Sat, Dec 17, 2011 at 9:40 AM, Konrad Rzeszutek Wilk
> <konrad@darnok.org> wrote:
> > On Fri, Dec 16, 2011 at 10:22:21PM -0500, Adin Scannell wrote:
> >> This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
> >> tree.  The code structure is significantly different and this patch mirrors the
> >> existing Linux code.
> >>
> >> The primary reason for need the V2 interface is to support foreign mappings
> >> (i.e. qemu) of paged-out pages.  The libxc code will already retry mappings
> >> when an ENOENT is returned.  The V2 interface provides a richer error value,
> >> so the user-space code is capable of handling these errors specifically.
> >
> > Can you give more details on how to use paged-out pages. Perhaps a
> > pointer to the xen's docs?
> 
> Hrm, in usual fashion the docs are a bit lacking.
> 
> Simply: the kernel has to do nothing to support paging (other than the
> backend drivers which need to handle the grant EAGAIN case -- other
> patch).  The only reason the V2 interface is needed here is to pass

Hm I did not see the netback one? Should that also incorporate this?

> back full error codes from the mmu_update()'s. Xen and libxc have a
> mutual understanding that when you receive an ENOENT error code, you
> back off and try again.

What you described is pretty good. Please do include it in the git
description. Thx

> 
> >>
> >> Signed-off-by: Adin Scannell <adin@scannell.ca>
> >>
> >> Index: linux/drivers/xen/xenfs/privcmd.c
> >> ===================================================================
> >> ---
> >>  drivers/xen/xenfs/privcmd.c |   90 ++++++++++++++++++++++++++++++++++++++++++-
> >
> > So that file just moved to drivers/xen/privcmd.c
> 
> Of course it has :)

Well, we can't have it easy can we! :-)
> 
> >>  include/xen/privcmd.h       |   10 +++++
> >>  2 files changed, 99 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c
> >> index dbd3b16..21cbb5a 100644
> >> --- a/drivers/xen/xenfs/privcmd.c
> >> +++ b/drivers/xen/xenfs/privcmd.c
> >> @@ -70,7 +70,7 @@ static void free_page_list(struct list_head *pages)
> >>   */
> >>  static int gather_array(struct list_head *pagelist,
> >>                       unsigned nelem, size_t size,
> >> -                     void __user *data)
> >> +                     const void __user *data)
> >>  {
> >>       unsigned pageidx;
> >>       void *pagedata;
> >> @@ -245,6 +245,15 @@ struct mmap_batch_state {
> >>       xen_pfn_t __user *user;
> >>  };
> >>
> >> +struct mmap_batch_v2_state {
> >> +     domid_t domain;
> >> +     unsigned long va;
> >> +     struct vm_area_struct *vma;
> >> +     int paged_out;
> >
> > Should this be unsigned int?
> 
> Noted below.
> 
> >> +
> >> +     int __user *err;
> >> +};
> >> +
> >>  static int mmap_batch_fn(void *data, void *state)
> >>  {
> >>       xen_pfn_t *mfnp = data;
> >> @@ -260,6 +269,20 @@ static int mmap_batch_fn(void *data, void *state)
> >>       return 0;
> >>  }
> >>
> >> +static int mmap_batch_v2_fn(void *data, void *state)
> >> +{
> >> +     xen_pfn_t *mfnp = data;
> >> +     struct mmap_batch_v2_state *st = state;
> >> +
> >> +     int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
> >> +                                    st->vma->vm_page_prot, st->domain);
> >
> > You don't want to check that st is not NULL?
> 
> This could be an assertion as it's only used in the
> ioctl_mmap_batch_v2 function where it's guaranteed to pass in a
> non-null state.
> 
> I'll add an additional patch to the queue that adds these assertions
> to both mmap_batch_fn and mmap_batch_fn_v2.
> 
> >> +             st->paged_out++;
> > Is it possible that this ends overflowing and hitting 0?
> 
> I don't think this is an issue practically, as the only way to trigger
> this would be to be on a 64bit machine and map an ungodly number of
> pages with a single mmap_batch (MAX_INT).  For correctness though, I
> can update this variable and the err variable in mmap_batch_state
> which suffers from the same problem -- turn them into unsigned long.
> This will be included in the additional patch.
> >> +                        m.arr);
> >> +
> >> +     if (ret || list_empty(&pagelist))
> >> +             goto out;
> >> +
> >> +     down_write(&mm->mmap_sem);
> >> +
> >> +     vma = find_vma(mm, m.addr);
> >> +     ret = -EINVAL;
> >> +     /* We allow multiple shots here, because this interface
> >> +      * is used by libxc and mappings for specific pages will
> >> +      * be retried when pages are paged-out (ENOENT). */
> >> +     if (!vma ||
> >> +         vma->vm_ops != &privcmd_vm_ops ||
> >> +         (m.addr < vma->vm_start) ||
> >> +         ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
> >> +             up_write(&mm->mmap_sem);
> >> +             goto out;
> >> +     }
> >> +
> >> +     state.domain = m.dom;
> >
> > Should you check the m.dom for incorrect ones? Like -1? or DOMID_IO?
> 
> I followed the example for mmap_batch, which just tries the call and
> returns whatever error Xen gives.  I think that's the right approach
> for these.

OK. I think a another patch to actually check for that in every other
ioclt in this code might be worth it.
> 
> Cheers!
> -Adin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2011-12-17 16:53     ` Adin Scannell
@ 2011-12-17 21:31       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-17 21:31 UTC (permalink / raw)
  To: Adin Scannell
  Cc: olaf, xen-devel, Adin Scannell, andres, JBeulich, Konrad Rzeszutek Wilk

On Sat, Dec 17, 2011 at 11:53:40AM -0500, Adin Scannell wrote:
> On Sat, Dec 17, 2011 at 9:30 AM, Konrad Rzeszutek Wilk
> <konrad@darnok.org> wrote:
> >> +#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)                                   >
> > So why does this have to be a macro? What is the advantage of that
> > versus making this a function?
> 
> I just wanted to minimize changes in the patch from the known XCP
> version.  I'm personally not a fan of big macros like this.
> 
> > So why msleep? Why not go for a proper yield? Call the safe_halt()
> > function?
> 
> Yes, this is something that should be revisited.
> 
> Since it looks like Daniel's HVM patches are going to conflict with
> this anyways, I'll revisit this patch independently from the other two
> and see what I can do about making it nicer and addressing the other
> bits of feedback you've given.

OK. Thanks.
> 
> > Use GNTTABOP_error_msgs. Also should we continue? What is the
> > impact if we do continue? The times this is hit is if the status
> > is not GNTS_okay and if it is not GNTS_eagain - so what are the
> > situations in which this happens and what can the domain do
> > to recover? Should there be some helpfull message instead of
> > just "gnt status: X" ?
> 
> In all the cases this macro is called, the caller still needs to check
> op.status and handle any errors appropriately.  The point of the macro
> is to reasonably get you from eagain => !eagain, whatever the result
> may be.  If I turn this into a function, those semantics will still
> apply.

OK, it sounds as the 'printk' is not really neccessary as it will be the
responsibility of the caller to figure out what to do (which might
be very well doing a printk, or maybe not?)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2011-12-17 14:30   ` Konrad Rzeszutek Wilk
  2011-12-17 16:53     ` Adin Scannell
@ 2012-01-02 16:06     ` Olaf Hering
  2012-01-03 18:19       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 20+ messages in thread
From: Olaf Hering @ 2012-01-02 16:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andres, xen-devel, Adin Scannell, JBeulich, adin

On Sat, Dec 17, Konrad Rzeszutek Wilk wrote:

> > +#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)					\
> 
> So why does this have to be a macro? What is the advantage of that
> versus making this a function?

I dont remember why I turned this into a macro instead of a function.

> > +do {																		\
> > +	u8 __hc_delay = 1;														\
> > +	int __ret;																\
> > +	while (unlikely((__HCarg_p)->status == GNTST_eagain && __hc_delay)) {	\
> > +		msleep(__hc_delay++);												\
> 
> Ugh. Not sure what happend to this, but there are tons of '\' at the
> end.

A multiline macro needs backslashes at the end.

> So why msleep? Why not go for a proper yield? Call the safe_halt()
> function?

It needs some interuptible sleep, whatever is best in this context.

> > +		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);			\
> > +		BUG_ON(__ret);														\
> > +	}																		\
> > +	if (__hc_delay == 0) {													\
> 
> So this would happen if we rolled over __hc_delay, so did this more than
> 255 times? Presumarily this can happen if the swapper in dom0 crashes..

Or if something in the paging paths goes wrong.

> I would recommend you use 'WARN' here and include tons of details.
> This is a pretty serious issues, is it not?

Either the host is really busy and cant page-in quick enough even after
so-many seconds. Or something in the pager/hypervisor does not work
right. In either case, a backtrace wont help much as it does only cause
noise. The printk below prints the function name (I think thats the
reason why it is a macro) to give some hint. 

> > +		printk(KERN_ERR "%s: gnt busy\n", __func__,);						\
> > +		(__HCarg_p)->status = GNTST_bad_page;								\
> > +	}																		\
> > +	if ((__HCarg_p)->status != GNTST_okay)									\
> > +		printk(KERN_ERR "%s: gnt status %x\n", 								\
> > +			__func__, (__HCarg_p)->status);									\
> 
> Use GNTTABOP_error_msgs. Also should we continue? What is the
> impact if we do continue? The times this is hit is if the status
> is not GNTS_okay and if it is not GNTS_eagain - so what are the
> situations in which this happens and what can the domain do
> to recover? Should there be some helpfull message instead of
> just "gnt status: X" ?

The caller has to deal with the various !GNTST_okay states anyway, this
patch wont change that fact.

Olaf

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages
  2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
                   ` (5 preceding siblings ...)
  2011-12-17 14:16 ` Konrad Rzeszutek Wilk
@ 2012-01-02 16:06 ` Olaf Hering
  6 siblings, 0 replies; 20+ messages in thread
From: Olaf Hering @ 2012-01-02 16:06 UTC (permalink / raw)
  To: Adin Scannell; +Cc: konrad, andres, xen-devel, JBeulich, adin

On Fri, Dec 16, Adin Scannell wrote:

> I've ported a couple of patches to the Linux pvops kernel that are necessary
> for correctly running domains with paging.  In a nutshell: in the case of a
> foreign attempt to map a paged-out page, the correct error code will now be
> propogated up to libxc, which will already handle it correctly.

Adin,

thanks for doing this work.

Olaf

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2012-01-02 16:06     ` Olaf Hering
@ 2012-01-03 18:19       ` Konrad Rzeszutek Wilk
  2012-01-03 18:40         ` Olaf Hering
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-01-03 18:19 UTC (permalink / raw)
  To: Olaf Hering; +Cc: andres, xen-devel, Adin Scannell, JBeulich, adin

On Mon, Jan 02, 2012 at 05:06:12PM +0100, Olaf Hering wrote:
> On Sat, Dec 17, Konrad Rzeszutek Wilk wrote:
> 
> > > +#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)					\
> > 
> > So why does this have to be a macro? What is the advantage of that
> > versus making this a function?
> 
> I dont remember why I turned this into a macro instead of a function.
> 
> > > +do {																		\
> > > +	u8 __hc_delay = 1;														\
> > > +	int __ret;																\
> > > +	while (unlikely((__HCarg_p)->status == GNTST_eagain && __hc_delay)) {	\
> > > +		msleep(__hc_delay++);												\
> > 
> > Ugh. Not sure what happend to this, but there are tons of '\' at the
> > end.
> 
> A multiline macro needs backslashes at the end.

Yes. I should have been more specific. The '\' are out of aligment.
> 
> > So why msleep? Why not go for a proper yield? Call the safe_halt()
> > function?
> 
> It needs some interuptible sleep, whatever is best in this context.
> 
> > > +		__ret = HYPERVISOR_grant_table_op(__HCop, (__HCarg_p), 1);			\
> > > +		BUG_ON(__ret);														\
> > > +	}																		\
> > > +	if (__hc_delay == 0) {													\
> > 
> > So this would happen if we rolled over __hc_delay, so did this more than
> > 255 times? Presumarily this can happen if the swapper in dom0 crashes..
> 
> Or if something in the paging paths goes wrong.
> 
> > I would recommend you use 'WARN' here and include tons of details.
> > This is a pretty serious issues, is it not?
> 
> Either the host is really busy and cant page-in quick enough even after
> so-many seconds. Or something in the pager/hypervisor does not work
> right. In either case, a backtrace wont help much as it does only cause
> noise. The printk below prints the function name (I think thats the
> reason why it is a macro) to give some hint. 

OK, we can do this differently. Make a function that does the majority
of this, and one of the arguments is a 'const char *name' and use a
macro that does:

#define gnttab_check_GNTST_eagain_while(__HCop, __HCarg_p)
real_function(__func__, __Hcop, __HCarg_p,...)

or such.

If this problem does occur (the swapper died in dom0) should the
printk at least use printk_ratelimited so that we don't cause too much
noise?
> 
> > > +		printk(KERN_ERR "%s: gnt busy\n", __func__,);						\
> > > +		(__HCarg_p)->status = GNTST_bad_page;								\
> > > +	}																		\
> > > +	if ((__HCarg_p)->status != GNTST_okay)									\
> > > +		printk(KERN_ERR "%s: gnt status %x\n", 								\
> > > +			__func__, (__HCarg_p)->status);									\
> > 
> > Use GNTTABOP_error_msgs. Also should we continue? What is the
> > impact if we do continue? The times this is hit is if the status
> > is not GNTS_okay and if it is not GNTS_eagain - so what are the
> > situations in which this happens and what can the domain do
> > to recover? Should there be some helpfull message instead of
> > just "gnt status: X" ?
> 
> The caller has to deal with the various !GNTST_okay states anyway, this
> patch wont change that fact.

Ok, so then we don't really need the printk right? As the caller
would presumarily do the right thing and also print the error?

> 
> Olaf
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2012-01-03 18:19       ` Konrad Rzeszutek Wilk
@ 2012-01-03 18:40         ` Olaf Hering
  2012-01-03 18:48           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Olaf Hering @ 2012-01-03 18:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: andres, xen-devel, Adin Scannell, JBeulich, adin

On Tue, Jan 03, Konrad Rzeszutek Wilk wrote:

> If this problem does occur (the swapper died in dom0) should the
> printk at least use printk_ratelimited so that we don't cause too much
> noise?

I remember there was no flood because the guest was stuck anyway. But
see below.

> > The caller has to deal with the various !GNTST_okay states anyway, this
> > patch wont change that fact.
> 
> Ok, so then we don't really need the printk right? As the caller
> would presumarily do the right thing and also print the error?

I think its more a debug thing, so that I knew something bad happend.
And at that time it was just helpful to get me some understanding of the
code flow. Since now that part of the paging code is reasonable
debugged, the printk is not really needed anymore.
Instead the code who uses these new functionality should have proper
error handling and print reasonable diagnostic messages.

Olaf

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] Handle GNTST_eagain in kernel drivers
  2012-01-03 18:40         ` Olaf Hering
@ 2012-01-03 18:48           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-01-03 18:48 UTC (permalink / raw)
  To: Olaf Hering
  Cc: xen-devel, Adin Scannell, andres, JBeulich, Konrad Rzeszutek Wilk, adin

On Tue, Jan 03, 2012 at 07:40:45PM +0100, Olaf Hering wrote:
> On Tue, Jan 03, Konrad Rzeszutek Wilk wrote:
> 
> > If this problem does occur (the swapper died in dom0) should the
> > printk at least use printk_ratelimited so that we don't cause too much
> > noise?
> 
> I remember there was no flood because the guest was stuck anyway. But
> see below.
> 
> > > The caller has to deal with the various !GNTST_okay states anyway, this
> > > patch wont change that fact.
> > 
> > Ok, so then we don't really need the printk right? As the caller
> > would presumarily do the right thing and also print the error?
> 
> I think its more a debug thing, so that I knew something bad happend.
> And at that time it was just helpful to get me some understanding of the
> code flow. Since now that part of the paging code is reasonable
> debugged, the printk is not really needed anymore.

OK.
> Instead the code who uses these new functionality should have proper
> error handling and print reasonable diagnostic messages.

Excellent. So lets remove that printk(KERN_ERR).

Thanks!
> 
> Olaf
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-20  6:36 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
@ 2011-12-20 18:11   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-12-20 18:11 UTC (permalink / raw)
  To: Adin Scannell
  Cc: xen-devel, konrad, andres, adin, olaf, JBeulich, linux-kernel

On Tue, Dec 20, 2011 at 01:36:53AM -0500, Adin Scannell wrote:
> This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
> tree.  The code structure is significantly different and this patch mirrors the
> existing Linux code.
> 
> An important reason to add the V2 interface is to support foreign mappings
> (i.e.  qemu-dm) of paged-out pages.  The kernel generally has to do nothing
> beyond implementing this ioctl in order to provide this support.  The V2
> interface is needed only to pass back full error codes from the mmu_update()'s.
> Xen and libxc have a mutual understanding that when you receive an ENOENT error
> code, you back off and try again. The libxc code will already retry mappings
> when an ENOENT is returned.

Can you say what is the difference between V1 and V2? It looks to be just
that V2 adds an array of return values and not overloading the array of
MFNs with the "top nibble for set on err".

> 
> The only exception to the above case is backend drivers using grant operations.
> To support paging, these drivers must appropriately retry grant operations when
> they receive an EAGAIN error code.  This is implemented in a separate patch.

patches. You need one for netback and one for blkback b/c they go to different
maintainers.

> 
> Signed-off-by: Adin Scannell <adin@scannell.ca>
> ---
>  drivers/xen/privcmd.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/xen/privcmd.h |   10 +++++
>  2 files changed, 100 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 161681f..dd77d5c 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -251,6 +251,15 @@ struct mmap_batch_state {
>  	xen_pfn_t __user *user;
>  };
>  
> +struct mmap_batch_v2_state {
> +	domid_t domain;
> +	unsigned long va;
> +	struct vm_area_struct *vma;
> +	unsigned long paged_out;
> +
> +	int __user *err;

The more I look at this along with the previous structure (struct mmap_batch_state)
they more they look like they could be squashed together.

The two differences are the 'paged_out' and *user. You could sqush them
together and use an union to carry those two. 

And then also make the mmap_batch_fn and mmap_batch_v2_fn be almost the
same and unify those two (in a seperate patch of course).

> +};
> +
>  static int mmap_batch_fn(void *data, void *state)
>  {
>  	xen_pfn_t *mfnp = data;
> @@ -268,6 +277,22 @@ static int mmap_batch_fn(void *data, void *state)
>  	return 0;
>  }
>  
> +static int mmap_batch_v2_fn(void *data, void *state)
> +{
> +	xen_pfn_t *mfnp = data;
> +	struct mmap_batch_v2_state *st = state;
> +
> +	BUG_ON(st == NULL || st->vma == NULL);
> +
> +	int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp,
> +				       1, st->vma->vm_page_prot, st->domain);
> +	if (rc == -ENOENT)
> +		st->paged_out++;
> +	st->va += PAGE_SIZE;
> +
> +	return put_user(rc, st->err++);
> +}
> +
>  static int mmap_return_errors(void *data, void *state)
>  {
>  	xen_pfn_t *mfnp = data;
> @@ -340,6 +365,67 @@ out:
>  	return ret;
>  }
>  
> +static long privcmd_ioctl_mmap_batch_v2(void __user *udata)
> +{
> +	int ret;
> +	struct privcmd_mmapbatch_v2 m;
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long nr_pages;
> +	LIST_HEAD(pagelist);
> +	struct mmap_batch_v2_state state;
> +
> +	if (!xen_initial_domain())
> +		return -EPERM;
> +
> +	if (copy_from_user(&m, udata, sizeof(m)))
> +		return -EFAULT;
> +
> +	nr_pages = m.num;
> +	if ((m.num <= 0) || (nr_pages > (ULONG_MAX >> PAGE_SHIFT)))
> +		return -EINVAL;
> +
> +	ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t),
> +			   m.arr);
> +
> +	if (ret || list_empty(&pagelist))
> +		goto out;
> +
> +	down_write(&mm->mmap_sem);
> +
> +	vma = find_vma(mm, m.addr);
> +	ret = -EINVAL;
> +	/* We allow multiple shots here, because this interface
> +	 * is used by libxc and mappings for specific pages will
> +	 * be retried when pages are paged-out (ENOENT). */
> +	if (!vma ||
> +	    vma->vm_ops != &privcmd_vm_ops ||
> +	    (m.addr < vma->vm_start) ||
> +	    ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
> +		up_write(&mm->mmap_sem);
> +		goto out;
> +	}
> +
> +	state.domain = m.dom;
> +	state.vma = vma;
> +	state.va = m.addr;
> +	state.err = m.err;
> +	state.paged_out = 0;
> +
> +	up_write(&mm->mmap_sem);
> +
> +	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> +			     &pagelist, mmap_batch_v2_fn, &state);
> +
> +out:
> +	free_page_list(&pagelist);
> +
> +	if ((ret == 0) && (state.paged_out > 0))
> +		return -ENOENT;
> +	else
> +		return ret;
> +}
> +

This looks so similar to the the v1 version. Can you extract the generic
parts of privcmd_ioctl_mmap_batch out to a function and just use the
_v1 and _v2 for .. well the things that they differ with?

>  static long privcmd_ioctl(struct file *file,
>  			  unsigned int cmd, unsigned long data)
>  {
> @@ -359,6 +445,10 @@ static long privcmd_ioctl(struct file *file,
>  		ret = privcmd_ioctl_mmap_batch(udata);
>  		break;
>  
> +	case IOCTL_PRIVCMD_MMAPBATCH_V2:
> +		ret = privcmd_ioctl_mmap_batch_v2(udata);
> +		break;
> +
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
> index 17857fb..39b92b1 100644
> --- a/include/xen/privcmd.h
> +++ b/include/xen/privcmd.h
> @@ -62,6 +62,14 @@ struct privcmd_mmapbatch {
>  	xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */
>  };
>  
> +struct privcmd_mmapbatch_v2 {
> +	int num;          /* number of pages to populate */
> +	domid_t dom;      /* target domain */
> +	__u64 addr;       /* virtual address */
> +	const xen_pfn_t __user *arr; /* array of mfns */
> +	int __user *err;  /* array of error codes */
> +};
> +
>  /*
>   * @cmd: IOCTL_PRIVCMD_HYPERCALL
>   * @arg: &privcmd_hypercall_t
> @@ -73,5 +81,7 @@ struct privcmd_mmapbatch {
>  	_IOC(_IOC_NONE, 'P', 2, sizeof(struct privcmd_mmap))
>  #define IOCTL_PRIVCMD_MMAPBATCH					\
>  	_IOC(_IOC_NONE, 'P', 3, sizeof(struct privcmd_mmapbatch))
> +#define IOCTL_PRIVCMD_MMAPBATCH_V2				\
> +	_IOC(_IOC_NONE, 'P', 4, sizeof(struct privcmd_mmapbatch_v2))
>  
>  #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
> -- 
> 1.6.2.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen
  2011-12-20  6:36 [PATCH 0/3] Add kernel bits necessary to suport Xen paging Adin Scannell
@ 2011-12-20  6:36 ` Adin Scannell
  2011-12-20 18:11   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 20+ messages in thread
From: Adin Scannell @ 2011-12-20  6:36 UTC (permalink / raw)
  To: xen-devel
  Cc: konrad, andres, adin, olaf, JBeulich, linux-kernel, Adin Scannell

This wasn't ported from any patch, but was rewritten based on the XCP 2.6.32
tree.  The code structure is significantly different and this patch mirrors the
existing Linux code.

An important reason to add the V2 interface is to support foreign mappings
(i.e.  qemu-dm) of paged-out pages.  The kernel generally has to do nothing
beyond implementing this ioctl in order to provide this support.  The V2
interface is needed only to pass back full error codes from the mmu_update()'s.
Xen and libxc have a mutual understanding that when you receive an ENOENT error
code, you back off and try again. The libxc code will already retry mappings
when an ENOENT is returned.

The only exception to the above case is backend drivers using grant operations.
To support paging, these drivers must appropriately retry grant operations when
they receive an EAGAIN error code.  This is implemented in a separate patch.

Signed-off-by: Adin Scannell <adin@scannell.ca>
---
 drivers/xen/privcmd.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/xen/privcmd.h |   10 +++++
 2 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 161681f..dd77d5c 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -251,6 +251,15 @@ struct mmap_batch_state {
 	xen_pfn_t __user *user;
 };
 
+struct mmap_batch_v2_state {
+	domid_t domain;
+	unsigned long va;
+	struct vm_area_struct *vma;
+	unsigned long paged_out;
+
+	int __user *err;
+};
+
 static int mmap_batch_fn(void *data, void *state)
 {
 	xen_pfn_t *mfnp = data;
@@ -268,6 +277,22 @@ static int mmap_batch_fn(void *data, void *state)
 	return 0;
 }
 
+static int mmap_batch_v2_fn(void *data, void *state)
+{
+	xen_pfn_t *mfnp = data;
+	struct mmap_batch_v2_state *st = state;
+
+	BUG_ON(st == NULL || st->vma == NULL);
+
+	int rc = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp,
+				       1, st->vma->vm_page_prot, st->domain);
+	if (rc == -ENOENT)
+		st->paged_out++;
+	st->va += PAGE_SIZE;
+
+	return put_user(rc, st->err++);
+}
+
 static int mmap_return_errors(void *data, void *state)
 {
 	xen_pfn_t *mfnp = data;
@@ -340,6 +365,67 @@ out:
 	return ret;
 }
 
+static long privcmd_ioctl_mmap_batch_v2(void __user *udata)
+{
+	int ret;
+	struct privcmd_mmapbatch_v2 m;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long nr_pages;
+	LIST_HEAD(pagelist);
+	struct mmap_batch_v2_state state;
+
+	if (!xen_initial_domain())
+		return -EPERM;
+
+	if (copy_from_user(&m, udata, sizeof(m)))
+		return -EFAULT;
+
+	nr_pages = m.num;
+	if ((m.num <= 0) || (nr_pages > (ULONG_MAX >> PAGE_SHIFT)))
+		return -EINVAL;
+
+	ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t),
+			   m.arr);
+
+	if (ret || list_empty(&pagelist))
+		goto out;
+
+	down_write(&mm->mmap_sem);
+
+	vma = find_vma(mm, m.addr);
+	ret = -EINVAL;
+	/* We allow multiple shots here, because this interface
+	 * is used by libxc and mappings for specific pages will
+	 * be retried when pages are paged-out (ENOENT). */
+	if (!vma ||
+	    vma->vm_ops != &privcmd_vm_ops ||
+	    (m.addr < vma->vm_start) ||
+	    ((m.addr + (nr_pages << PAGE_SHIFT)) > vma->vm_end)) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	state.domain = m.dom;
+	state.vma = vma;
+	state.va = m.addr;
+	state.err = m.err;
+	state.paged_out = 0;
+
+	up_write(&mm->mmap_sem);
+
+	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+			     &pagelist, mmap_batch_v2_fn, &state);
+
+out:
+	free_page_list(&pagelist);
+
+	if ((ret == 0) && (state.paged_out > 0))
+		return -ENOENT;
+	else
+		return ret;
+}
+
 static long privcmd_ioctl(struct file *file,
 			  unsigned int cmd, unsigned long data)
 {
@@ -359,6 +445,10 @@ static long privcmd_ioctl(struct file *file,
 		ret = privcmd_ioctl_mmap_batch(udata);
 		break;
 
+	case IOCTL_PRIVCMD_MMAPBATCH_V2:
+		ret = privcmd_ioctl_mmap_batch_v2(udata);
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
index 17857fb..39b92b1 100644
--- a/include/xen/privcmd.h
+++ b/include/xen/privcmd.h
@@ -62,6 +62,14 @@ struct privcmd_mmapbatch {
 	xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */
 };
 
+struct privcmd_mmapbatch_v2 {
+	int num;          /* number of pages to populate */
+	domid_t dom;      /* target domain */
+	__u64 addr;       /* virtual address */
+	const xen_pfn_t __user *arr; /* array of mfns */
+	int __user *err;  /* array of error codes */
+};
+
 /*
  * @cmd: IOCTL_PRIVCMD_HYPERCALL
  * @arg: &privcmd_hypercall_t
@@ -73,5 +81,7 @@ struct privcmd_mmapbatch {
 	_IOC(_IOC_NONE, 'P', 2, sizeof(struct privcmd_mmap))
 #define IOCTL_PRIVCMD_MMAPBATCH					\
 	_IOC(_IOC_NONE, 'P', 3, sizeof(struct privcmd_mmapbatch))
+#define IOCTL_PRIVCMD_MMAPBATCH_V2				\
+	_IOC(_IOC_NONE, 'P', 4, sizeof(struct privcmd_mmapbatch_v2))
 
 #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-03 18:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-17  3:22 [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
2011-12-17  3:22 ` [PATCH 1/3] Make xen_remap_domain_mfn_range return value meaningful in case of error Adin Scannell
2011-12-17  3:22 ` [PATCH 2/3] Handle GNTST_eagain in kernel drivers Adin Scannell
2011-12-17 14:30   ` Konrad Rzeszutek Wilk
2011-12-17 16:53     ` Adin Scannell
2011-12-17 21:31       ` Konrad Rzeszutek Wilk
2012-01-02 16:06     ` Olaf Hering
2012-01-03 18:19       ` Konrad Rzeszutek Wilk
2012-01-03 18:40         ` Olaf Hering
2012-01-03 18:48           ` Konrad Rzeszutek Wilk
2011-12-17  3:22 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
2011-12-17 14:40   ` Konrad Rzeszutek Wilk
2011-12-17 16:51     ` Adin Scannell
2011-12-17 21:29       ` Konrad Rzeszutek Wilk
2011-12-17  3:49 ` [PATCH] Add necessary bits to pvops Linux for mapping paged-out pages Adin Scannell
2011-12-17 14:16 ` Konrad Rzeszutek Wilk
2011-12-17 14:16 ` Konrad Rzeszutek Wilk
2012-01-02 16:06 ` Olaf Hering
2011-12-20  6:36 [PATCH 0/3] Add kernel bits necessary to suport Xen paging Adin Scannell
2011-12-20  6:36 ` [PATCH 3/3] Port of mmap_batch_v2 to support paging in Xen Adin Scannell
2011-12-20 18:11   ` Konrad Rzeszutek Wilk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.