* [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
@ 2012-12-19 18:53 Mats Petersson
  2012-12-20 10:40 ` Ian Campbell
From: Mats Petersson @ 2012-12-19 18:53 UTC (permalink / raw)
  To: konrad, xen-devel, Ian.Campbell
  Cc: mats, david.vrabel, JBeulich, Mats Petersson

This patch makes IOCTL_PRIVCMD_MMAPBATCH_V2 (and the older V1 version)
map multiple pages at a time (typically one page's worth of MFNs per
call) rather than a single page at a time, even though the MFNs are not
consecutive. The main change is to pass a pointer to an array of MFNs
rather than a single MFN. To support per-page error reporting, we also
pass an err_ptr array.

Using a small test program that maps guest memory into Dom0 (mapping
the same first "Num Pages" pages, repeated "Iterations" times):
Iterations    Num Pages    Time 3.7rc4  Time With this patch
5000          4096         76.107       37.027
10000         2048         75.703       37.177
20000         1024         75.893       37.247

Mapping guest memory into Dom0 now takes about half the original
time.

When migrating a guest, this gives around a 15-40% improvement in
overall live-migration time (bigger guests show bigger improvements)
and around a 30% improvement in "guest downtime".

Signed-off-by: Mats Petersson <mats.petersson@citrix.com>

---

V5 of the patch contains fixes for the ARM architecture, adding the
same API on that side (which would otherwise have caused a build
break). The ARM variant could still do with a little performance
improvement.

Some minor formatting changes, and a merge to the latest 3.8 branch.
No functional changes for x86.
---
 arch/arm/xen/enlighten.c |  113 +++++++++++++++++++++++++++++++-------
 arch/x86/xen/mmu.c       |  137 ++++++++++++++++++++++++++++++++++++++++------
 drivers/xen/privcmd.c    |   54 ++++++++++++++----
 include/xen/xen-ops.h    |    6 ++
 4 files changed, 264 insertions(+), 46 deletions(-)

diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 7a32976..967dea9 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -67,19 +67,19 @@ static int map_foreign_page(unsigned long lpfn, unsigned long fgmfn,
 	if (rc) {
 		pr_warn("Failed to map pfn to mfn rc:%d pfn:%lx mfn:%lx\n",
 			rc, lpfn, fgmfn);
-		return 1;
+		return rc;
 	}
 	return 0;
 }
 
 struct remap_data {
-	xen_pfn_t fgmfn; /* foreign domain's gmfn */
+	xen_pfn_t *fgmfn; /* foreign domain's gmfn */
 	pgprot_t prot;
 	domid_t  domid;
 	struct vm_area_struct *vma;
 	int index;
 	struct page **pages;
-	struct xen_remap_mfn_info *info;
+	int *err_ptr;
 };
 
 static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
@@ -89,37 +89,112 @@ static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
 	struct page *page = info->pages[info->index++];
 	unsigned long pfn = page_to_pfn(page);
 	pte_t pte = pfn_pte(pfn, info->prot);
-
-	if (map_foreign_page(pfn, info->fgmfn, info->domid))
-		return -EFAULT;
+	int err;
+	// TODO: We should really batch these updates.
+	err = map_foreign_page(pfn, *info->fgmfn, info->domid);
+	*info->err_ptr++ = err;
 	set_pte_at(info->vma->vm_mm, addr, ptep, pte);
+	info->fgmfn++;
 
-	return 0;
+	return err;
 }
 
-int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
-			       unsigned long addr,
-			       xen_pfn_t mfn, int nr,
-			       pgprot_t prot, unsigned domid,
-			       struct page **pages)
+/* do_remap_mfn() - helper function to map foreign pages
+ * @vma:     the vma for the pages to be mapped into
+ * @addr:    the address at which to map the pages
+ * @mfn:     pointer to array of MFNs to map
+ * @nr:      the number of entries in the MFN array
+ * @prot:    page protection mask
+ * @domid:   id of the domain that we are mapping from
+ * @pages:   page information.
+ *
+ * This function takes an array of mfns and maps nr pages from that into
+ * this kernel's memory. The owner of the pages is defined by domid. Where the
+ * pages are mapped is determined by addr, and vma is used for "accounting" of
+ * the pages.
+ *
+ * Return value is zero for success, negative for failure.
+ */
+static int do_remap_mfn(struct vm_area_struct *vma,
+			unsigned long addr,
+			xen_pfn_t *mfn, int nr, int *err_ptr,
+			pgprot_t prot, unsigned domid,
+			struct page **pages)
 {
 	int err;
 	struct remap_data data;
+	unsigned long range = nr << PAGE_SHIFT;
 
-	/* TBD: Batching, current sole caller only does page at a time */
-	if (nr > 1)
-		return -EINVAL;
+	/* Kept here to make sure the code doesn't break
+	   x86 PVOPS */
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
 
 	data.fgmfn = mfn;
-	data.prot = prot;
+	data.prot  = prot;
 	data.domid = domid;
-	data.vma = vma;
-	data.index = 0;
+	data.vma   = vma;
 	data.pages = pages;
-	err = apply_to_page_range(vma->vm_mm, addr, nr << PAGE_SHIFT,
+	data.index = 0;
+	data.err_ptr = err_ptr;
+
+	err = apply_to_page_range(vma->vm_mm, addr, range,
 				  remap_pte_fn, &data);
+	/* Warning: We probably do need to care about what error we
+	   get here. However, currently remap_pte_fn is only likely
+	   to return EFAULT or some other "things are very bad"
+	   error code, which the rest of the calling code won't be
+	   able to fix up. So we just exit with the error we got.
+	*/
 	return err;
 }
+
+/* xen_remap_domain_mfn_array() - Used to map an array of foreign pages
+ * @vma:     the vma for the pages to be mapped into
+ * @addr:    the address at which to map the pages
+ * @mfn:     pointer to array of MFNs to map
+ * @nr:      the number of entries in the MFN array
+ * @err_ptr: pointer to array of integers, one per MFN, for an error
+ *           value for each page. The err_ptr must not be NULL.
+ * @prot:    page protection mask
+ * @domid:   id of the domain that we are mapping from
+ * @pages:   page information
+ *
+ * This function takes an array of mfns and maps nr pages from that into this
+ * kernel's memory. The owner of the pages is defined by domid. Where the pages
+ * are mapped is determined by addr, and vma is used for "accounting" of the
+ * pages. The err_ptr array is filled in on any page that is not successfully
+ * mapped in.
+ *
+ * Return value is zero for success, negative ERRNO value for failure.
+ * Note that the error value -ENOENT is considered a "retry", so when this
+ * error code is seen, another call should be made with the list of pages that
+ * are marked as -ENOENT in the err_ptr array.
+ */
+int xen_remap_domain_mfn_array(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       xen_pfn_t *mfn, int nr,
+			       int *err_ptr, pgprot_t prot,
+			       unsigned domid,
+			       struct page **pages)
+{
+	/* We BUG_ON because it's a programmer error to pass a NULL err_ptr,
+	 * and the consequence (the wrong memory silently mapped in) would
+	 * otherwise be very hard to track down later.
+	 */
+	BUG_ON(err_ptr == NULL);
+	return do_remap_mfn(vma, addr, mfn, nr, err_ptr, prot, domid, pages);
+}
+EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_array);
+
+/* Not used in ARM. Use xen_remap_domain_mfn_array(). */
+int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       xen_pfn_t mfn, int nr,
+			       pgprot_t prot, unsigned domid,
+			       struct page **pages)
+{
+	return -ENOSYS;
+}
 EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
 
 int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 01de35c..323a2ab 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2477,7 +2477,8 @@ void __init xen_hvm_init_mmu_ops(void)
 #define REMAP_BATCH_SIZE 16
 
 struct remap_data {
-	unsigned long mfn;
+	xen_pfn_t *mfn;
+	bool contiguous;
 	pgprot_t prot;
 	struct mmu_update *mmu_update;
 };
@@ -2486,7 +2487,13 @@ static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
 				 unsigned long addr, void *data)
 {
 	struct remap_data *rmd = data;
-	pte_t pte = pte_mkspecial(pfn_pte(rmd->mfn++, rmd->prot));
+	pte_t pte = pte_mkspecial(pfn_pte(*rmd->mfn, rmd->prot));
+	/* If we have a contiguous range, just update the mfn itself,
+	   else advance the pointer to the next mfn. */
+	if (rmd->contiguous)
+		(*rmd->mfn)++;
+	else
+		rmd->mfn++;
 
 	rmd->mmu_update->ptr = virt_to_machine(ptep).maddr;
 	rmd->mmu_update->val = pte_val_ma(pte);
@@ -2495,18 +2502,34 @@ static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
 	return 0;
 }
 
-int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
-			       unsigned long addr,
-			       xen_pfn_t mfn, int nr,
-			       pgprot_t prot, unsigned domid,
-			       struct page **pages)
-
-{
+/* do_remap_mfn() - helper function to map foreign pages
+ * @vma:     the vma for the pages to be mapped into
+ * @addr:    the address at which to map the pages
+ * @mfn:     pointer to array of MFNs to map
+ * @nr:      the number of entries in the MFN array
+ * @err_ptr: pointer to array of per-page error codes, or NULL
+ * @prot:    page protection mask
+ * @domid:   id of the domain that we are mapping from
+ *
+ * This function takes an array of mfns and maps nr pages from that into
+ * this kernel's memory. The owner of the pages is defined by domid. Where the
+ * pages are mapped is determined by addr, and vma is used for "accounting" of
+ * the pages.
+ *
+ * Return value is zero for success, negative for failure.
+ *
+ * Note that err_ptr is used to indicate whether *mfn is a list (err_ptr
+ * non-NULL) or the first mfn of a contiguous range (err_ptr NULL). */
+static int do_remap_mfn(struct vm_area_struct *vma,
+			unsigned long addr,
+			xen_pfn_t *mfn, int nr,
+			int *err_ptr, pgprot_t prot,
+			unsigned domid)
+{
+	int err, last_err = 0;
 	struct remap_data rmd;
 	struct mmu_update mmu_update[REMAP_BATCH_SIZE];
-	int batch;
 	unsigned long range;
-	int err = 0;
 
 	if (xen_feature(XENFEAT_auto_translated_physmap))
 		return -EINVAL;
@@ -2517,9 +2540,15 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 
 	rmd.mfn = mfn;
 	rmd.prot = prot;
+	/* We use err_ptr to indicate whether we are doing a contiguous
+	 * mapping or a discontiguous one. */
+	rmd.contiguous = !err_ptr;
 
 	while (nr) {
-		batch = min(REMAP_BATCH_SIZE, nr);
+		int index = 0;
+		int done = 0;
+		int batch = min(REMAP_BATCH_SIZE, nr);
+		int batch_left = batch;
 		range = (unsigned long)batch << PAGE_SHIFT;
 
 		rmd.mmu_update = mmu_update;
@@ -2528,23 +2557,98 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 		if (err)
 			goto out;
 
-		err = HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid);
-		if (err < 0)
-			goto out;
+		/* We record the error for each page that gives an error, but
+		 * continue mapping until the whole set is done */
+		do {
+			err = HYPERVISOR_mmu_update(&mmu_update[index],
+						    batch_left, &done, domid);
+			if (err < 0) {
+				if (!err_ptr)
+					goto out;
+				/* increment done so we skip the error item */
+				done++;
+				last_err = err_ptr[index] = err;
+			}
+			batch_left -= done;
+			index += done;
+		} while (batch_left);
 
 		nr -= batch;
 		addr += range;
+		if (err_ptr)
+			err_ptr += batch;
 	}
 
-	err = 0;
+	err = last_err;
 out:
 
 	xen_flush_tlb_all();
 
 	return err;
 }
+
+/* xen_remap_domain_mfn_range() - Used to map foreign pages
+ * @vma:   the vma for the pages to be mapped into
+ * @addr:  the address at which to map the pages
+ * @mfn:   the first MFN to map
+ * @nr:    the number of consecutive mfns to map
+ * @prot:  page protection mask
+ * @domid: id of the domain that we are mapping from
+ *
+ * This function takes an mfn and maps nr consecutive pages from it into
+ * this kernel's memory. The owner of the pages is defined by domid. Where
+ * the pages are mapped is determined by addr, and vma is used for
+ * "accounting" of the pages.
+ *
+ * Return value is zero for success, negative ERRNO value for failure.
+ */
+int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       xen_pfn_t mfn, int nr,
+			       pgprot_t prot, unsigned domid,
+			       struct page **pages)
+{
+	return do_remap_mfn(vma, addr, &mfn, nr, NULL, prot, domid);
+}
 EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
 
+/* xen_remap_domain_mfn_array() - Used to map an array of foreign pages
+ * @vma:     the vma for the pages to be mapped into
+ * @addr:    the address at which to map the pages
+ * @mfn:     pointer to array of MFNs to map
+ * @nr:      the number of entries in the MFN array
+ * @err_ptr: pointer to array of integers, one per MFN, for an error
+ *           value for each page. The err_ptr must not be NULL.
+ * @prot:    page protection mask
+ * @domid:   id of the domain that we are mapping from
+ *
+ * This function takes an array of mfns and maps nr pages from that into this
+ * kernel's memory. The owner of the pages is defined by domid. Where the pages
+ * are mapped is determined by addr, and vma is used for "accounting" of the
+ * pages. The err_ptr array is filled in on any page that is not successfully
+ * mapped in.
+ *
+ * Return value is zero for success, negative ERRNO value for failure.
+ * Note that the error value -ENOENT is considered a "retry", so when this
+ * error code is seen, another call should be made with the list of pages that
+ * are marked as -ENOENT in the err_ptr array.
+ */
+int xen_remap_domain_mfn_array(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       xen_pfn_t *mfn, int nr,
+			       int *err_ptr, pgprot_t prot,
+			       unsigned domid, struct page **pages)
+{
+	/* We BUG_ON because it's a programmer error to pass a NULL err_ptr,
+	 * and the consequence (the wrong memory silently mapped in) would
+	 * otherwise be very hard to track down later.
+	 */
+	BUG_ON(err_ptr == NULL);
+	return do_remap_mfn(vma, addr, mfn, nr, err_ptr, prot, domid);
+}
+EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_array);
+
+
 /* Returns: 0 success */
 int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
 			       int numpgs, struct page **pages)
@@ -2555,3 +2659,4 @@ int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 EXPORT_SYMBOL_GPL(xen_unmap_domain_mfn_range);
+
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 0bbbccb..8f86a44 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -154,6 +154,40 @@ static int traverse_pages(unsigned nelem, size_t size,
 	return ret;
 }
 
+/*
+ * Similar to traverse_pages, but use each page as a "block" of
+ * data to be processed as one unit.
+ */
+static int traverse_pages_block(unsigned nelem, size_t size,
+				struct list_head *pos,
+				int (*fn)(void *data, int nr, void *state),
+				void *state)
+{
+	void *pagedata;
+	unsigned pageidx;
+	int ret = 0;
+
+	BUG_ON(size > PAGE_SIZE);
+
+	pageidx = PAGE_SIZE;
+
+	while (nelem) {
+		int nr = (PAGE_SIZE/size);
+		struct page *page;
+		if (nr > nelem)
+			nr = nelem;
+		pos = pos->next;
+		page = list_entry(pos, struct page, lru);
+		pagedata = page_address(page);
+		ret = (*fn)(pagedata, nr, state);
+		if (ret)
+			break;
+		nelem -= nr;
+	}
+
+	return ret;
+}
+
 struct mmap_mfn_state {
 	unsigned long va;
 	struct vm_area_struct *vma;
@@ -258,7 +292,7 @@ struct mmap_batch_state {
 	 *      0 for no errors
 	 *      1 if at least one error has happened (and no
 	 *          -ENOENT errors have happened)
-	 *      -ENOENT if at least 1 -ENOENT has happened.
+	 *      -ENOENT if at least one -ENOENT has happened.
 	 */
 	int global_error;
 	/* An array for individual errors */
@@ -271,7 +305,7 @@ struct mmap_batch_state {
 /* auto translated dom0 note: if domU being created is PV, then mfn is
  * mfn(addr on bus). If it's auto xlated, then mfn is pfn (input to HAP).
  */
-static int mmap_batch_fn(void *data, void *state)
+static int mmap_batch_fn(void *data, int nr, void *state)
 {
 	xen_pfn_t *mfnp = data;
 	struct mmap_batch_state *st = state;
@@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
 	if (xen_feature(XENFEAT_auto_translated_physmap))
 		cur_page = pages[st->index++];
 
-	ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
-					 st->vma->vm_page_prot, st->domain,
-					 &cur_page);
-
-	/* Store error code for second pass. */
-	*(st->err++) = ret;
+	BUG_ON(nr < 0);
+	ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
+					 st->err, st->vma->vm_page_prot,
+					 st->domain, &cur_page);
 
 	/* And see if it affects the global_error. */
 	if (ret < 0) {
@@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
 				st->global_error = 1;
 		}
 	}
-	st->va += PAGE_SIZE;
+	st->va += PAGE_SIZE * nr;
 
 	return 0;
 }
@@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 	state.err           = err_array;
 
 	/* mmap_batch_fn guarantees ret == 0 */
-	BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
-			     &pagelist, mmap_batch_fn, &state));
+	BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
+				    &pagelist, mmap_batch_fn, &state));
 
 	up_write(&mm->mmap_sem);
 
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index d6fe062..d8c57cc 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -30,6 +30,12 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 			       xen_pfn_t mfn, int nr,
 			       pgprot_t prot, unsigned domid,
 			       struct page **pages);
+int xen_remap_domain_mfn_array(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       xen_pfn_t *mfn, int nr,
+			       int *err_ptr, pgprot_t prot,
+			       unsigned domid,
+			       struct page **pages);
 int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
 			       int numpgs, struct page **pages);
 
-- 
1.7.9.5

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2012-12-19 18:53 [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2 Mats Petersson
@ 2012-12-20 10:40 ` Ian Campbell
  2012-12-20 11:00   ` Mats Petersson
  2012-12-20 11:26   ` Ian Campbell
From: Ian Campbell @ 2012-12-20 10:40 UTC (permalink / raw)
  To: Mats Petersson
  Cc: xen-devel, mats, Andres Lagar-Cavilla, JBeulich, konrad, David Vrabel

I've added Andres since I think this accumulation of ENOENT in err_ptr
is a paging related thing, or at least I think he understands this
code ;-)

> @@ -89,37 +89,112 @@ static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
>         struct page *page = info->pages[info->index++];
>         unsigned long pfn = page_to_pfn(page);
>         pte_t pte = pfn_pte(pfn, info->prot);
> -
> -       if (map_foreign_page(pfn, info->fgmfn, info->domid))
> -               return -EFAULT;
> +       int err;
> +       // TODO: We should really batch these updates.
> +       err = map_foreign_page(pfn, *info->fgmfn, info->domid);
> +       *info->err_ptr++ = err;
>         set_pte_at(info->vma->vm_mm, addr, ptep, pte);
> +       info->fgmfn++;
> 
> -       return 0;
> +       return err;

This will cause apply_to_page_range to stop walking and return an error.
AIUI the intention of the err_ptr array is to accumulate the individual
success/error for the entire range. The caller can then take care of the
successes/failures and ENOENTs as appropriate (in particular it doesn't
want to abort a batch because of an ENOENT, it wants to do as much as
possible).

On x86 (when err_ptr != NULL) you accumulate all of the errors from
HYPERVISOR_mmu_update rather than aborting on the first one, and this
seems correct to me.
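
(As an illustration only, not code from this thread: the non-aborting
variant being asked for would record each page's result and keep
walking, roughly like this.)

static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
			void *data)
{
	struct remap_data *info = data;
	struct page *page = info->pages[info->index++];
	unsigned long pfn = page_to_pfn(page);
	pte_t pte = pfn_pte(pfn, info->prot);
	int err;

	/* TODO: We should really batch these updates. */
	err = map_foreign_page(pfn, *info->fgmfn++, info->domid);
	*info->err_ptr++ = err;		/* record the per-page result */
	set_pte_at(info->vma->vm_mm, addr, ptep, pte);

	/* Always return 0 so apply_to_page_range() keeps walking; the
	 * caller inspects the err_ptr array for per-page failures. */
	return 0;
}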

>  int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 01de35c..323a2ab 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> [...]

> @@ -2528,23 +2557,98 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
>                 if (err)
>                         goto out;
> 
> -               err = HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid);
> -               if (err < 0)
> -                       goto out;
> +               /* We record the error for each page that gives an error, but
> +                * continue mapping until the whole set is done */
> +               do {
> +                       err = HYPERVISOR_mmu_update(&mmu_update[index],
> +                                                   batch_left, &done, domid);
> +                       if (err < 0) {
> +                               if (!err_ptr)
> +                                       goto out;
> +                               /* increment done so we skip the error item */
> +                               done++;
> +                               last_err = err_ptr[index] = err;
> +                       }
> +                       batch_left -= done;
> +                       index += done;
> +               } while (batch_left);
> 
>                 nr -= batch;
>                 addr += range;
> +               if (err_ptr)
> +                       err_ptr += batch;
>         }
> 
> -       err = 0;
> +       err = last_err;

This means that if you have 100 failures followed by one success you
return success overall. Is that intentional? That doesn't seem right.

>  out:
> 
>         xen_flush_tlb_all();
> 
>         return err;
>  }
[...]
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 0bbbccb..8f86a44 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
[...]
> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
>         if (xen_feature(XENFEAT_auto_translated_physmap))
>                 cur_page = pages[st->index++];
> 
> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
> -                                        st->vma->vm_page_prot, st->domain,
> -                                        &cur_page);
> -
> -       /* Store error code for second pass. */
> -       *(st->err++) = ret;
> +       BUG_ON(nr < 0);
> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
> +                                        st->err, st->vma->vm_page_prot,
> +                                        st->domain, &cur_page);
> 
>         /* And see if it affects the global_error. */

The code which follows needs adjustment to cope with the fact that we
now batch. I think it needs to walk st->err and set global_error as
appropriate. This is related to my comment about the return value of
xen_remap_domain_mfn_range too.

I think rather than trying to fabricate some sort of meaningful error
code for an entire batch xen_remap_domain_mfn_range should just return
an indication about whether there were any errors or not and leave it to
the caller to figure out the overall result by looking at the err array.

Perhaps return either the number of errors or the number of successes
(which turns the following if into either (ret) or (ret < nr)
respectively).

>         if (ret < 0) {
> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
>                                 st->global_error = 1;
>                 }
>         }
> -       st->va += PAGE_SIZE;
> +       st->va += PAGE_SIZE * nr;
> 
>         return 0;
>  }
> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>         state.err           = err_array;
> 
>         /* mmap_batch_fn guarantees ret == 0 */
> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
> -                            &pagelist, mmap_batch_fn, &state));
> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> +                                   &pagelist, mmap_batch_fn, &state));

Can we make traverse_pages and _block common by passing a block size
parameter?
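
(Purely as a sketch of that suggestion, and an assumption about what
the unified helper might look like: a single traverse function taking
a block-size parameter, where blksz == 1 reproduces the old
per-element behaviour and a large blksz hands out whole-page blocks.)

static int traverse_pages_block(unsigned nelem, size_t size, unsigned blksz,
				struct list_head *pos,
				int (*fn)(void *data, int nr, void *state),
				void *state)
{
	void *pagedata = NULL;
	unsigned pageidx = PAGE_SIZE;	/* force a fetch of the first page */
	int ret = 0;

	BUG_ON(size > PAGE_SIZE);

	while (nelem) {
		unsigned nr;

		if (pageidx + size > PAGE_SIZE) {
			struct page *page;

			pos = pos->next;
			page = list_entry(pos, struct page, lru);
			pagedata = page_address(page);
			pageidx = 0;
		}
		/* Hand the callback at most one block, bounded by what
		 * remains in this page and in the whole list. */
		nr = min(blksz, (unsigned)((PAGE_SIZE - pageidx) / size));
		if (nr > nelem)
			nr = nelem;

		ret = (*fn)(pagedata + pageidx, nr, state);
		if (ret)
			break;

		pageidx += nr * size;
		nelem -= nr;
	}

	return ret;
}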

Ian.

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2012-12-20 10:40 ` Ian Campbell
@ 2012-12-20 11:00   ` Mats Petersson
  2012-12-20 11:32     ` Ian Campbell
  2012-12-20 11:26   ` Ian Campbell
From: Mats Petersson @ 2012-12-20 11:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xen-devel, mats, Andres Lagar-Cavilla, JBeulich, konrad, David Vrabel

On 20/12/12 10:40, Ian Campbell wrote:
> I've added Andres since I think this accumulation of ENOENT in err_ptr
> is a paging related thing, or at least I think he understands this
> code ;-)
>
>> @@ -89,37 +89,112 @@ static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
>>          struct page *page = info->pages[info->index++];
>>          unsigned long pfn = page_to_pfn(page);
>>          pte_t pte = pfn_pte(pfn, info->prot);
>> -
>> -       if (map_foreign_page(pfn, info->fgmfn, info->domid))
>> -               return -EFAULT;
>> +       int err;
>> +       // TODO: We should really batch these updates.
>> +       err = map_foreign_page(pfn, *info->fgmfn, info->domid);
>> +       *info->err_ptr++ = err;
>>          set_pte_at(info->vma->vm_mm, addr, ptep, pte);
>> +       info->fgmfn++;
>>
>> -       return 0;
>> +       return err;
> This will cause apply_to_page_range to stop walking and return an error.
> AIUI the intention of the err_ptr array is to accumulate the individual
> success/error for the entire range. The caller can then take care of the
> successes/failures and ENOENTs as appropriate (in particular it doesn't
> want to abort a batch because of an ENOENT, it wants to do as much as
> possible)
>
> On x86 (when err_ptr != NULL) you accumulate all of the errors from
> HYPERVISOR_mmu_update rather than aborting on the first one  and this
> seems correct to me.
Ok, will rework that bit.
>
>>   int xen_unmap_domain_mfn_range(struct vm_area_struct *vma,
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index 01de35c..323a2ab 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> [...]
>> @@ -2528,23 +2557,98 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
>>                  if (err)
>>                          goto out;
>>
>> -               err = HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid);
>> -               if (err < 0)
>> -                       goto out;
>> +               /* We record the error for each page that gives an error, but
>> +                * continue mapping until the whole set is done */
>> +               do {
>> +                       err = HYPERVISOR_mmu_update(&mmu_update[index],
>> +                                                   batch_left, &done, domid);
>> +                       if (err < 0) {
>> +                               if (!err_ptr)
>> +                                       goto out;
>> +                               /* increment done so we skip the error item */
>> +                               done++;
>> +                               last_err = err_ptr[index] = err;
>> +                       }
>> +                       batch_left -= done;
>> +                       index += done;
>> +               } while (batch_left);
>>
>>                  nr -= batch;
>>                  addr += range;
>> +               if (err_ptr)
>> +                       err_ptr += batch;
>>          }
>>
>> -       err = 0;
>> +       err = last_err;
> This means that if you have 100 failures followed by one success you
> return success overall. Is that intentional? That doesn't seem right.
As far as I see, it doesn't mean that. last_err is only set at the 
beginning of the call (to zero) and if there is an error.

>
>>   out:
>>
>>          xen_flush_tlb_all();
>>
>>          return err;
>>   }
> [...]
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 0bbbccb..8f86a44 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
> [...]
>> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
>>          if (xen_feature(XENFEAT_auto_translated_physmap))
>>                  cur_page = pages[st->index++];
>>
>> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>> -                                        st->vma->vm_page_prot, st->domain,
>> -                                        &cur_page);
>> -
>> -       /* Store error code for second pass. */
>> -       *(st->err++) = ret;
>> +       BUG_ON(nr < 0);
>> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
>> +                                        st->err, st->vma->vm_page_prot,
>> +                                        st->domain, &cur_page);
>>
>>          /* And see if it affects the global_error. */
> The code which follows needs adjustment to cope with the fact that we
> now batch. I think it needs to walk st->err and set global_error as
> appropriate. This is related to my comment about the return value of
> xen_remap_domain_mfn_range too.
The return value, as commented above, is "either 0 or the last error we 
saw". There should be no need to walk the st->err, as we know if there 
was some error or not.
>
> I think rather than trying to fabricate some sort of meaningful error
> code for an entire batch xen_remap_domain_mfn_range should just return
> an indication about whether there were any errors or not and leave it to
> the caller to figure out the overall result by looking at the err array.
>
> Perhaps return either the number of errors or the number of successes
> (which turns the following if into either (ret) or (ret < nr)
> respectively).
I'm trying to not change how the code above expects things to work. 
Whilst it would be lovely to rewrite the entire code dealing with 
mapping memory, I don't think that's within the scope of my current 
project. And if I don't wish to rewrite all of libxc's memory management 
code, I don't want to alter what values are returned or when. The 
current code follows what WAS happening before my changes - which isn't 
exactly the most fantastic thing, and I think there may actually be bugs 
in there, such as:
     if (ret < 0) {
         if (ret == -ENOENT)
             st->global_error = -ENOENT;
         else {
             /* Record that at least one error has happened. */
             if (st->global_error == 0)
                 st->global_error = 1;
         }
     }
if we enter this once with -EFAULT, and then after that with -ENOENT,
global_error will say -ENOENT. I think knowing that we got an EFAULT is
of "higher importance" than an ENOENT, but that's how the old code was
working, and I'm not sure I should change it at this point.

>>          if (ret < 0) {
>> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
>>                                  st->global_error = 1;
>>                  }
>>          }
>> -       st->va += PAGE_SIZE;
>> +       st->va += PAGE_SIZE * nr;
>>
>>          return 0;
>>   }
>> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>>          state.err           = err_array;
>>
>>          /* mmap_batch_fn guarantees ret == 0 */
>> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
>> -                            &pagelist, mmap_batch_fn, &state));
>> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>> +                                   &pagelist, mmap_batch_fn, &state));
> Can we make traverse_pages and _block common by passing a block size
> parameter?
Yes of course. Is there much benefit from that? I understand that it's 
less code, but it also makes the original traverse_pages more complex. 
Not convinced it helps much - it's quite a small function, so not much 
extra code. Additionally, all of the callback functions will have to 
deal with an extra parameter (that is probably ignored in all but one 
place).

--
Mats
>
> Ian.
>

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2012-12-20 10:40 ` Ian Campbell
  2012-12-20 11:00   ` Mats Petersson
@ 2012-12-20 11:26   ` Ian Campbell
From: Ian Campbell @ 2012-12-20 11:26 UTC (permalink / raw)
  To: Mats Petersson
  Cc: xen-devel, mats, Andres Lagar-Cavilla, JBeulich, konrad, David Vrabel

On Thu, 2012-12-20 at 10:40 +0000, Ian Campbell wrote:
> I've added Andres since I think this accumulation of ENOENT in err_ptr
> is a paging related thing, or at least I think he understands this
> code ;-)
> 
> > @@ -89,37 +89,112 @@ static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
> >         struct page *page = info->pages[info->index++];
> >         unsigned long pfn = page_to_pfn(page);
> >         pte_t pte = pfn_pte(pfn, info->prot);
> > -
> > -       if (map_foreign_page(pfn, info->fgmfn, info->domid))
> > -               return -EFAULT;
> > +       int err;
> > +       // TODO: We should really batch these updates.

This just made me realise that we need a way for
XENMEM_add_to_physmap_range to return errors for each slot. I'll look
into that in the new year.

Ian.

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2012-12-20 11:00   ` Mats Petersson
@ 2012-12-20 11:32     ` Ian Campbell
  2013-01-02 15:47       ` Andres Lagar-Cavilla
From: Ian Campbell @ 2012-12-20 11:32 UTC (permalink / raw)
  To: Mats Petersson
  Cc: xen-devel, mats, Andres Lagar-Cavilla, JBeulich, konrad, David Vrabel

On Thu, 2012-12-20 at 11:00 +0000, Mats Petersson wrote:

> >> -       err = 0;
> >> +       err = last_err;
> > This means that if you have 100 failures followed by one success you
> > return success overall. Is that intentional? That doesn't seem right.
> As far as I see, it doesn't mean that. last_err is only set at the 
> beginning of the call (to zero) and if there is an error.

Ah, yes I missed that (but it still isn't right, see below).

> >> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> >> index 0bbbccb..8f86a44 100644
> >> --- a/drivers/xen/privcmd.c
> >> +++ b/drivers/xen/privcmd.c
> > [...]
> >> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
> >>          if (xen_feature(XENFEAT_auto_translated_physmap))
> >>                  cur_page = pages[st->index++];
> >>
> >> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
> >> -                                        st->vma->vm_page_prot, st->domain,
> >> -                                        &cur_page);
> >> -
> >> -       /* Store error code for second pass. */
> >> -       *(st->err++) = ret;
> >> +       BUG_ON(nr < 0);
> >> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
> >> +                                        st->err, st->vma->vm_page_prot,
> >> +                                        st->domain, &cur_page);
> >>
> >>          /* And see if it affects the global_error. */
> > The code which follows needs adjustment to cope with the fact that we
> > now batch. I think it needs to walk st->err and set global_error as
> > appropriate. This is related to my comment about the return value of
> > xen_remap_domain_mfn_range too.
> The return value, as commented above, is "either 0 or the last error we 
> saw". There should be no need to walk the st->err, as we know if there 
> was some error or not.

The code which follows tries to latch having seen ENOENT in preference
to other errors, so you don't need 0 or the last error; you need either
0, ENOENT if one occurred somewhere in the batch, or an error != ENOENT.
(Or maybe it's vice versa; either way, the last error isn't necessarily
what you need.)

> > I think rather than trying to fabricate some sort of meaningful error
> > code for an entire batch xen_remap_domain_mfn_range should just return
> > an indication about whether there were any errors or not and leave it to
> > the caller to figure out the overall result by looking at the err array.
> >
> > Perhaps return either the number of errors or the number of successes
> > (which turns the following if into either (ret) or (ret < nr)
> > respectively).
> I'm trying to not change how the code above expects things to work. 
> Whilst it would be lovely to rewrite the entire code dealing with 
> mapping memory, I don't think that's within the scope of my current 
> project. And if I don't wish to rewrite all of libxc's memory management 

I'm not talking about changing the libxc or ioctl interface, only about
the internal-to-Linux interface between privcmd and mmu.c. In fact I'm
only actually asking you to change the return value of the new function
you are adding to the API.

> code, I don't want to alter what values are returned or when. The 
> current code follows what WAS happening before my changes - which isn't 
> exactly the most fantastic thing, and I think there may actually be bugs 
> in there, such as:
>      if (ret < 0) {
>          if (ret == -ENOENT)
>              st->global_error = -ENOENT;
>          else {
>              /* Record that at least one error has happened. */
>              if (st->global_error == 0)
>                  st->global_error = 1;
>          }
>      }
> if we enter this once with -EFAULT, and then after that with -ENOENT, 
> global_error will say -ENOENT. I think knowing that we got an EFAULT is 
> "higher importance" than ENOENT, but that's how the old code was 
> working, and I'm not sure I should change it at this point.

But I think you are changing it, by this behaviour of returning the
last error in the batch, which causes this code to behave differently.
That's what I'm asking you to avoid!

> 
> >>          if (ret < 0) {
> >> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
> >>                                  st->global_error = 1;
> >>                  }
> >>          }
> >> -       st->va += PAGE_SIZE;
> >> +       st->va += PAGE_SIZE * nr;
> >>
> >>          return 0;
> >>   }
> >> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
> >>          state.err           = err_array;
> >>
> >>          /* mmap_batch_fn guarantees ret == 0 */
> >> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
> >> -                            &pagelist, mmap_batch_fn, &state));
> >> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> >> +                                   &pagelist, mmap_batch_fn, &state));
> > Can we make traverse_pages and _block common by passing a block size
> > parameter?
> Yes of course. Is there much benefit from that? I understand that it's 
> less code, but it also makes the original traverse_pages more complex. 
> Not convinced it helps much - it's quite a small function, so not much 
> extra code. Additionally, all of the callback functions will have to 
> deal with an extra parameter (that is probably ignored in all but one 
> place).

It depends on what the combined version ends up looking like but in
general I'm in favour of one more generic function over several cloned
and hacked versions of the same thing.

Ian.

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2012-12-20 11:32     ` Ian Campbell
@ 2013-01-02 15:47       ` Andres Lagar-Cavilla
  2013-01-02 16:45         ` Mats Petersson
From: Andres Lagar-Cavilla @ 2013-01-02 15:47 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xen-devel, JBeulich, Mats Petersson, David Vrabel, mats, konrad,
	Andres Lagar-Cavilla

Hi,
On Dec 20, 2012, at 6:32 AM, Ian Campbell <ian.campbell@citrix.com> wrote:

> On Thu, 2012-12-20 at 11:00 +0000, Mats Petersson wrote:
> 
>>>> -       err = 0;
>>>> +       err = last_err;
>>> This means that if you have 100 failures followed by one success you
>>> return success overall. Is that intentional? That doesn't seem right.
>> As far as I see, it doesn't mean that. last_err is only set at the 
>> beginning of the call (to zero) and if there is an error.
> 
> Ah, yes I missed that (but it still isn't right, see below).
> 
>>>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>>>> index 0bbbccb..8f86a44 100644
>>>> --- a/drivers/xen/privcmd.c
>>>> +++ b/drivers/xen/privcmd.c
>>> [...]
>>>> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
>>>>         if (xen_feature(XENFEAT_auto_translated_physmap))
>>>>                 cur_page = pages[st->index++];
>>>> 
>>>> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>>>> -                                        st->vma->vm_page_prot, st->domain,
>>>> -                                        &cur_page);
>>>> -
>>>> -       /* Store error code for second pass. */
>>>> -       *(st->err++) = ret;
>>>> +       BUG_ON(nr < 0);
>>>> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
>>>> +                                        st->err, st->vma->vm_page_prot,
>>>> +                                        st->domain, &cur_page);
>>>> 
>>>>         /* And see if it affects the global_error. */
>>> The code which follows needs adjustment to cope with the fact that we
>>> now batch. I think it needs to walk st->err and set global_error as
>>> appropriate. This is related to my comment about the return value of
>>> xen_remap_domain_mfn_range too.
>> The return value, as commented above, is "either 0 or the last error we 
>> saw". There should be no need to walk the st->err, as we know if there 
>> was some error or not.
> 
> The code which follows tries to latch having seen ENOENT in preference
> to other errors, so you don't need 0 or the last error, you need either
> 0, ENOENT if one occurred somewhere in the batch or an error!=ENOENT.
> (or maybe its vice versa, either way the last error isn't necessarily
> what you need).

Yeah the error reporting interface sucks. All I can say in my defense is that it was that way when I got here.

FWIW EFAULT *takes* precedence over ENOENT.
/* If we have not had any EFAULT-like global errors then set the global
 * error to -ENOENT if necessary. */
   if ((ret == 0) && (state.global_error == -ENOENT))
         ret = -ENOENT;

Otherwise, any individual mapping that fails differently from ENOENT does not affect the global error. That is, in the absence of global errors like EFAULT, and in the absence of per-mapping ENOENT return codes, per-mapping errors like EINVAL do not affect the ioctl return code, which remains zero.

IIUC Ian has been pointing out that your code breaks this last statement.
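
(To make the preserved semantics concrete, a sketch of the per-batch
fold Ian asked for; this is an illustration under the assumptions
above, not code from the patch.)

/* Fold one batch's per-page results into st->global_error, keeping
 * the existing precedence: -ENOENT latches over the generic "some
 * error happened" marker, which is only set if nothing else has
 * been recorded yet. */
static void mmap_batch_fold_errors(struct mmap_batch_state *st, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		int err = st->err[i];

		if (err == 0)
			continue;
		if (err == -ENOENT)
			st->global_error = -ENOENT;
		else if (st->global_error == 0)
			st->global_error = 1;
	}
}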

Thanks
Andres
> 
>>> I think rather than trying to fabricate some sort of meaningful error
>>> code for an entire batch xen_remap_domain_mfn_range should just return
>>> an indication about whether there were any errors or not and leave it to
>>> the caller to figure out the overall result by looking at the err array.
>>> 
>>> Perhaps return either the number of errors or the number of successes
>>> (which turns the following if into either (ret) or (ret < nr)
>>> respectively).
>> I'm trying to not change how the code above expects things to work. 
>> Whilst it would be lovely to rewrite the entire code dealing with 
>> mapping memory, I don't think that's within the scope of my current 
>> project. And if I don't wish to rewrite all of libxc's memory management 
> 
> I'm not talking about changing the libxc or ioctl interface, only about
> the internal-to-Linux interface between privcmd and mmu.c. In fact I'm
> only actually asking you to change the return value of the new function
> you are adding to the API.
> 
>> code, I don't want to alter what values are returned or when. The 
>> current code follows what WAS happening before my changes - which isn't 
>> exactly the most fantastic thing, and I think there may actually be bugs 
>> in there, such as:
>>     if (ret < 0) {
>>         if (ret == -ENOENT)
>>             st->global_error = -ENOENT;
>>         else {
>>             /* Record that at least one error has happened. */
>>             if (st->global_error == 0)
>>                 st->global_error = 1;
>>         }
>>     }
>> if we enter this once with -EFAULT, and then after that with -ENOENT, 
>> global_error will say -ENOENT. I think knowing that we got an EFAULT is 
>> "higher importance" than ENOENT, but that's how the old code was 
>> working, and I'm not sure I should change it at this point.
> 
> But I think you are changing it, by this behaviour or returning the last
> error in the batch which causes this code to behave differently. That's
> what I'm asking you to avoid!
> 
>> 
>>>>         if (ret < 0) {
>>>> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
>>>>                                 st->global_error = 1;
>>>>                 }
>>>>         }
>>>> -       st->va += PAGE_SIZE;
>>>> +       st->va += PAGE_SIZE * nr;
>>>> 
>>>>         return 0;
>>>>  }
>>>> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>>>>         state.err           = err_array;
>>>> 
>>>>         /* mmap_batch_fn guarantees ret == 0 */
>>>> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
>>>> -                            &pagelist, mmap_batch_fn, &state));
>>>> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>>>> +                                   &pagelist, mmap_batch_fn, &state));
>>> Can we make traverse_pages and _block common by passing a block size
>>> parameter?
>> Yes of course. Is there much benefit from that? I understand that it's 
>> less code, but it also makes the original traverse_pages more complex. 
>> Not convinced it helps much - it's quite a small function, so not much 
>> extra code. Additionally, all of the callback functions will have to 
>> deal with an extra parameter (that is probably ignored in all but one 
>> place).
> 
> It depends on what the combined version ends up looking like but in
> general I'm in favour of one more generic function over several cloned
> and hacked versions of the same thing.
> 
> Ian.
> 

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2013-01-02 15:47       ` Andres Lagar-Cavilla
@ 2013-01-02 16:45         ` Mats Petersson
  2013-01-02 16:52           ` Andres Lagar-Cavilla
From: Mats Petersson @ 2013-01-02 16:45 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: xen-devel, Ian Campbell, JBeulich, David Vrabel, mats, konrad,
	Andres Lagar-Cavilla

On 02/01/13 15:47, Andres Lagar-Cavilla wrote:
> Hi,
> On Dec 20, 2012, at 6:32 AM, Ian Campbell <ian.campbell@citrix.com> wrote:
>
>> On Thu, 2012-12-20 at 11:00 +0000, Mats Petersson wrote:
>>
>>>>> -       err = 0;
>>>>> +       err = last_err;
>>>> This means that if you have 100 failures followed by one success you
>>>> return success overall. Is that intentional? That doesn't seem right.
>>> As far as I see, it doesn't mean that. last_err is only set at the
>>> beginning of the call (to zero) and if there is an error.
>> Ah, yes I missed that (but it still isn't right, see below).
>>
>>>>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>>>>> index 0bbbccb..8f86a44 100644
>>>>> --- a/drivers/xen/privcmd.c
>>>>> +++ b/drivers/xen/privcmd.c
>>>> [...]
>>>>> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
>>>>>          if (xen_feature(XENFEAT_auto_translated_physmap))
>>>>>                  cur_page = pages[st->index++];
>>>>>
>>>>> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>>>>> -                                        st->vma->vm_page_prot, st->domain,
>>>>> -                                        &cur_page);
>>>>> -
>>>>> -       /* Store error code for second pass. */
>>>>> -       *(st->err++) = ret;
>>>>> +       BUG_ON(nr < 0);
>>>>> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
>>>>> +                                        st->err, st->vma->vm_page_prot,
>>>>> +                                        st->domain, &cur_page);
>>>>>
>>>>>          /* And see if it affects the global_error. */
>>>> The code which follows needs adjustment to cope with the fact that we
>>>> now batch. I think it needs to walk st->err and set global_error as
>>>> appropriate. This is related to my comment about the return value of
>>>> xen_remap_domain_mfn_range too.
>>> The return value, as commented above, is "either 0 or the last error we
>>> saw". There should be no need to walk the st->err, as we know if there
>>> was some error or not.
>> The code which follows tries to latch having seen ENOENT in preference
>> to other errors, so you don't need 0 or the last error, you need either
>> 0, ENOENT if one occurred somewhere in the batch or an error!=ENOENT.
>> (or maybe its vice versa, either way the last error isn't necessarily
>> what you need).
> Yeah the error reporting interface sucks. All I can say in my defense is that it was that way when I got here.
>
> FWIW EFAULT *takes* precedence over ENOENT.
> /* If we have not had any EFAULT-like global errors then set the global
>   * error to -ENOENT if necessary. */
>     if ((ret == 0) && (state.global_error == -ENOENT))
>           ret = -ENOENT;
>
> Otherwise, any individual mapping that fails differently from ENOENT does not affect the global error. That is, in absence of global errors like EFAULT, and in absence of per-mapping ENOENT return codes, per mapping errors like EINVAL do not affect the ioctl return code which remains zero.
>
> IIUC Ian has been pointing that your code breaks this last statement.
Sorry, are you saying the correct behaviour is to return 0 if we have 
EINVAL or some other error [aside from EFAULT and ENOENT]? Or are you 
saying that my code is broken such that it does this?

I made some fixes that I believe "do the right thing" before Christmas,
but since I knew most people wouldn't be available, I didn't post them.
I will try to get round to it before the end of the week.

--
Mats
>
> Thanks
> Andres
>>>> I think rather than trying to fabricate some sort of meaningful error
>>>> code for an entire batch xen_remap_domain_mfn_range should just return
>>>> an indication about whether there were any errors or not and leave it to
>>>> the caller to figure out the overall result by looking at the err array.
>>>>
>>>> Perhaps return either the number of errors or the number of successes
>>>> (which turns the following if into either (ret) or (ret < nr)
>>>> respectively).
>>> I'm trying to not change how the code above expects things to work.
>>> Whilst it would be lovely to rewrite the entire code dealing with
>>> mapping memory, I don't think that's within the scope of my current
>>> project. And if I don't wish to rewrite all of libxc's memory management
>> I'm not talking about changing the libxc or ioctl interface, only about
>> the internal-to-Linux interface between privcmd and mmu.c. In fact I'm
>> only actually asking you to change the return value of the new function
>> you are adding to the API.
>>
>>> code, I don't want to alter what values are returned or when. The
>>> current code follows what WAS happening before my changes - which isn't
>>> exactly the most fantastic thing, and I think there may actually be bugs
>>> in there, such as:
>>>      if (ret < 0) {
>>>          if (ret == -ENOENT)
>>>              st->global_error = -ENOENT;
>>>          else {
>>>              /* Record that at least one error has happened. */
>>>              if (st->global_error == 0)
>>>                  st->global_error = 1;
>>>          }
>>>      }
>>> if we enter this once with -EFAULT, and then after that with -ENOENT,
>>> global_error will say -ENOENT. I think knowing that we got an EFAULT is
>>> "higher importance" than ENOENT, but that's how the old code was
>>> working, and I'm not sure I should change it at this point.
>> But I think you are changing it, by this behaviour or returning the last
>> error in the batch which causes this code to behave differently. That's
>> what I'm asking you to avoid!
>>
>>>>>          if (ret < 0) {
>>>>> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
>>>>>                                  st->global_error = 1;
>>>>>                  }
>>>>>          }
>>>>> -       st->va += PAGE_SIZE;
>>>>> +       st->va += PAGE_SIZE * nr;
>>>>>
>>>>>          return 0;
>>>>>   }
>>>>> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>>>>>          state.err           = err_array;
>>>>>
>>>>>          /* mmap_batch_fn guarantees ret == 0 */
>>>>> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
>>>>> -                            &pagelist, mmap_batch_fn, &state));
>>>>> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>>>>> +                                   &pagelist, mmap_batch_fn, &state));
>>>> Can we make traverse_pages and _block common by passing a block size
>>>> parameter?
>>> Yes of course. Is there much benefit from that? I understand that it's
>>> less code, but it also makes the original traverse_pages more complex.
>>> Not convinced it helps much - it's quite a small function, so not much
>>> extra code. Additionally, all of the callback functions will have to
>>> deal with an extra parameter (that is probably ignored in all but one
>>> place).
>> It depends on what the combined version ends up looking like but in
>> general I'm in favour of one more generic function over several cloned
>> and hacked versions of the same thing.
>>
>> Ian.
>>

* Re: [PATCH V5] xen/privcmd.c: improve performance of MMAPBATCH_V2.
  2013-01-02 16:45         ` Mats Petersson
@ 2013-01-02 16:52           ` Andres Lagar-Cavilla
From: Andres Lagar-Cavilla @ 2013-01-02 16:52 UTC (permalink / raw)
  To: Mats Petersson
  Cc: xen-devel, Ian Campbell, JBeulich, David Vrabel, mats, konrad,
	Andres Lagar-Cavilla

On Jan 2, 2013, at 11:45 AM, Mats Petersson <mats.petersson@citrix.com> wrote:

> On 02/01/13 15:47, Andres Lagar-Cavilla wrote:
>> Hi,
>> On Dec 20, 2012, at 6:32 AM, Ian Campbell <ian.campbell@citrix.com> wrote:
>> 
>>> On Thu, 2012-12-20 at 11:00 +0000, Mats Petersson wrote:
>>> 
>>>>>> -       err = 0;
>>>>>> +       err = last_err;
>>>>> This means that if you have 100 failures followed by one success you
>>>>> return success overall. Is that intentional? That doesn't seem right.
>>>> As far as I see, it doesn't mean that. last_err is only set at the
>>>> beginning of the call (to zero) and if there is an error.
>>> Ah, yes I missed that (but it still isn't right, see below).
>>> 
>>>>>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>>>>>> index 0bbbccb..8f86a44 100644
>>>>>> --- a/drivers/xen/privcmd.c
>>>>>> +++ b/drivers/xen/privcmd.c
>>>>> [...]
>>>>>> @@ -283,12 +317,10 @@ static int mmap_batch_fn(void *data, void *state)
>>>>>>         if (xen_feature(XENFEAT_auto_translated_physmap))
>>>>>>                 cur_page = pages[st->index++];
>>>>>> 
>>>>>> -       ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>>>>>> -                                        st->vma->vm_page_prot, st->domain,
>>>>>> -                                        &cur_page);
>>>>>> -
>>>>>> -       /* Store error code for second pass. */
>>>>>> -       *(st->err++) = ret;
>>>>>> +       BUG_ON(nr < 0);
>>>>>> +       ret = xen_remap_domain_mfn_array(st->vma, st->va & PAGE_MASK, mfnp, nr,
>>>>>> +                                        st->err, st->vma->vm_page_prot,
>>>>>> +                                        st->domain, &cur_page);
>>>>>> 
>>>>>>         /* And see if it affects the global_error. */
>>>>> The code which follows needs adjustment to cope with the fact that we
>>>>> now batch. I think it needs to walk st->err and set global_error as
>>>>> appropriate. This is related to my comment about the return value of
>>>>> xen_remap_domain_mfn_range too.
>>>> The return value, as commented above, is "either 0 or the last error we
>>>> saw". There should be no need to walk the st->err, as we know if there
>>>> was some error or not.
>>> The code which follows tries to latch having seen ENOENT in preference
>>> to other errors, so you don't need 0 or the last error, you need either
>>> 0, ENOENT if one occurred somewhere in the batch or an error!=ENOENT.
>>> (or maybe its vice versa, either way the last error isn't necessarily
>>> what you need).
>> Yeah the error reporting interface sucks. All I can say in my defense is that it was that way when I got here.
>> 
>> FWIW EFAULT *takes* precedence over ENOENT.
>> /* If we have not had any EFAULT-like global errors then set the global
>>  * error to -ENOENT if necessary. */
>>    if ((ret == 0) && (state.global_error == -ENOENT))
>>          ret = -ENOENT;
>> 
>> Otherwise, any individual mapping that fails differently from ENOENT does not affect the global error. That is, in absence of global errors like EFAULT, and in absence of per-mapping ENOENT return codes, per mapping errors like EINVAL do not affect the ioctl return code which remains zero.
>> 
>> IIUC Ian has been pointing that your code breaks this last statement.
> Sorry, are you saying the correct behaviour is to return 0 if we have EINVAL or some other error [aside from EFAULT and ENOENT]?
Yes. Note that before we get to the actual mapping work we may fail with ENOMEM, EINVAL, etc. (but your patch does not affect that).

> Or are you saying that my code is broken such that it does this?
I didn't grok the patch as far as being sure whether it does it or not. I'll look at your next version.

Thanks
Andres

> 
> I did some fixes, that I believe "does the right thing", before Christmas, but since I knew most people wouldn't be available, I didn't post it. I will try to get round to it before the end of the week.
> 
> --
> Mats
>> 
>> Thanks
>> Andres
>>>>> I think rather than trying to fabricate some sort of meaningful error
>>>>> code for an entire batch xen_remap_domain_mfn_range should just return
>>>>> an indication about whether there were any errors or not and leave it to
>>>>> the caller to figure out the overall result by looking at the err array.
>>>>> 
>>>>> Perhaps return either the number of errors or the number of successes
>>>>> (which turns the following if into either (ret) or (ret < nr)
>>>>> respectively).
>>>> I'm trying to not change how the code above expects things to work.
>>>> Whilst it would be lovely to rewrite the entire code dealing with
>>>> mapping memory, I don't think that's within the scope of my current
>>>> project. And if I don't wish to rewrite all of libxc's memory management
>>> I'm not talking about changing the libxc or ioctl interface, only about
>>> the internal-to-Linux interface between privcmd and mmu.c. In fact I'm
>>> only actually asking you to change the return value of the new function
>>> you are adding to the API.
>>> 
>>>> code, I don't want to alter what values are returned or when. The
>>>> current code follows what WAS happening before my changes - which isn't
>>>> exactly the most fantastic thing, and I think there may actually be bugs
>>>> in there, such as:
>>>>     if (ret < 0) {
>>>>         if (ret == -ENOENT)
>>>>             st->global_error = -ENOENT;
>>>>         else {
>>>>             /* Record that at least one error has happened. */
>>>>             if (st->global_error == 0)
>>>>                 st->global_error = 1;
>>>>         }
>>>>     }
>>>> if we enter this once with -EFAULT, and then after that with -ENOENT,
>>>> global_error will say -ENOENT. I think knowing that we got an EFAULT is
>>>> "higher importance" than ENOENT, but that's how the old code was
>>>> working, and I'm not sure I should change it at this point.
>>> But I think you are changing it, by this behaviour or returning the last
>>> error in the batch which causes this code to behave differently. That's
>>> what I'm asking you to avoid!
>>> 
>>>>>>         if (ret < 0) {
>>>>>> @@ -300,7 +332,7 @@ static int mmap_batch_fn(void *data, void *state)
>>>>>>                                 st->global_error = 1;
>>>>>>                 }
>>>>>>         }
>>>>>> -       st->va += PAGE_SIZE;
>>>>>> +       st->va += PAGE_SIZE * nr;
>>>>>> 
>>>>>>         return 0;
>>>>>>  }
>>>>>> @@ -430,8 +462,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>>>>>>         state.err           = err_array;
>>>>>> 
>>>>>>         /* mmap_batch_fn guarantees ret == 0 */
>>>>>> -       BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
>>>>>> -                            &pagelist, mmap_batch_fn, &state));
>>>>>> +       BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>>>>>> +                                   &pagelist, mmap_batch_fn, &state));
>>>>> Can we make traverse_pages and _block common by passing a block size
>>>>> parameter?
>>>> Yes of course. Is there much benefit from that? I understand that it's
>>>> less code, but it also makes the original traverse_pages more complex.
>>>> Not convinced it helps much - it's quite a small function, so not much
>>>> extra code. Additionally, all of the callback functions will have to
>>>> deal with an extra parameter (that is probably ignored in all but one
>>>> place).
>>> It depends on what the combined version ends up looking like but in
>>> general I'm in favour of one more generic function over several cloned
>>> and hacked versions of the same thing.
>>> 
>>> Ian.
>>> 
> 
