[PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
@ 2008-05-23 14:20 Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 1 of 4] mm: add a pte_rmw transaction abstraction Jeremy Fitzhardinge
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, xen-devel, Thomas Gleixner, Hugh Dickins, Zachary Amsden,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

Hi all,

This little series adds a new transaction-like abstraction for doing
RMW updates to a pte, hooks it into paravirt_ops, and then makes use
of it in Xen.

The basic problem is that mprotect is very slow under Xen (up to 50x
slower than native), primarily because of the

	ptent = ptep_get_and_clear(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	/* ... */
	set_pte_at(mm, addr, pte, ptent);

sequence in mm/mprotect.c:change_pte_range().

This is bad for Xen for two reasons:

  1: ptep_get_and_clear() ends up being a xchg on the pte.  Since the
     pte page is read-only (as it must be, because Xen needs to
     control all pte updates), this traps into Xen, which then
     emulates the instruction.  Trapping into the instruction emulator
     is inherently fairly expensive.  And,

  2: because ptep_get_and_clear has atomic-fetch-and-update semantics,
     it's impossible to implement in a way which can be batched to amortize
     the cost of faulting into the hypervisor.

This series adds the pte_rmw_start() and pte_rmw_commit() operations,
which change this sequence to:

	ptent = pte_rmw_start(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	/* ... */
	pte_rmw_commit(mm, addr, pte, ptent);

Which looks very familiar.  And, indeed, when compiled without
CONFIG_PARAVIRT (or on a non-x86 architecture), it will end up doing
precisely the same thing as before.

However, the effective semantics are a bit different.  pte_rmw_start()
means "I'm reading this pte with the intention of updating it; please
don't lose any hardware pte changes in the meantime".  And
pte_rmw_commit() means "Here's a new value for the pte, but make sure
you don't lose any hardware changes".

The default implementation achieves these semantics by making
pte_rmw_start() set the pte to non-present, which prevents any async
hardware changes to the pte.  The pte_rmw_commit() can then just write
the new value into place without having to worry about preserving any
changes, because it knows there are none.

Xen implements pte_rmw_start() as a simple read of the pte.  This
leaves the pte unchanged in memory, and the hardware may make
asynchronous changes to it.  It implements pte_rmw_commit() using a
hypercall which preserves the state of the Access/Dirty bits to update
the pte.  This allows the whole change_pte_range() loop to be run
without any synchronous unbatched traps into the hypervisor.  With
this change in place, an mprotect microbenchmark goes from being 50x
worse than native to around 7x, which is acceptible.

I believe that other virtualization systems, whether they use direct
paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can
make use of this interface to improve the performance.

Unfortunately (or fortunately) there aren't very many other areas of
the kernel which can really take advantage of this.  There's only a
couple of other instances of ptep_get_and_clear() in mm/, and they're
being used in a similar way; but I don't think they're very
performance critical (though zap_pte_range might be interesting).

In general, mprotect is rarely a performance bottleneck.  But some
debugging libraries (such as electric fence) and garbage collectors
can be very heavy users of mprotect, and this change could materially
benefit them.

Thanks,
	J

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1 of 4] mm: add a pte_rmw transaction abstraction
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
@ 2008-05-23 14:20 ` Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 2 of 4] paravirt: add hooks for pte_rmw_start/commit Jeremy Fitzhardinge
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, xen-devel, Thomas Gleixner, Hugh Dickins, Zachary Amsden,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

This patch adds an API for doing read-modify-write updates to a pte
which may race against hardware updates to the pte.  After reading the
pte, the hardware may asynchonously set the accessed or dirty bits on
a pte, which would be lost when writing back the modified pte value.

The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory.  This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state.  When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.

When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor.  To mitigate the cost of these updates, we tend to batch
them.

However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable.  This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:

pte_rmw_start() returns the current pte value, and puts the pte entry
  into a state where either the hardware will not update the pte, or if
  it does, the updates will be preserved on commit.

pte_rmw_commit() writes back the updated pte, makes sure that any
  hardware updates made since pte_rmw_start() are preserved.

pte_rmw_start() and _commit() must be exactly paired, and used while
holding the appropriate pte lock.  They do not protect against other
software updates of the pte in any way.

The current implementations of pte_rmw_start and _commit are functionally
unchanged from before: pte_rmw_start() uses ptep_get_and_clear() fetch
the pte and zero the entry, preventing any hardware updates.
pte_rmw_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.

The only current user of this interface is mprotect

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 include/asm-generic/pgtable.h |   51 ++++++++++++++++++++++++++++++++++++++++-
 mm/mprotect.c                 |   10 +++-----
 2 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -197,6 +197,55 @@
 }
 #endif /* CONFIG_MMU */

+static inline pte_t __pte_rmw_start(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep)
+{
+	/* Get the current pte state, but zero it out to make it
+	   non-present, preventing the hardware from asynchronously
+	   updating it. */
+	return ptep_get_and_clear(mm, addr, ptep);
+}
+
+static inline void __pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep, pte_t pte)
+{
+	/* The pte is non-present, so there's no hardware state to
+	   preserve. */
+	set_pte_at(mm, addr, ptep, pte);
+}
+
+#ifndef __HAVE_ARCH_PTE_RMW_TRANSACTION
+/*
+ * Start a pte read-modify-write transaction, which protects against
+ * asynchronous hardware modifications to the pte.  The intention is
+ * not to prevent the hardware from making pte updates, but to prevent
+ * any updates it may make from being lost.
+ *
+ * This does not protect against other software modifications of the
+ * pte; the appropriate pte lock must be held over the transation.
+ *
+ * Note that this interface is intended to be batchable, meaning that
+ * pte_rmw_commit may not actually update the pte, but merely queue
+ * the update to be done at some later time.  The update must be
+ * actually committed before the pte lock is released, however.
+ */
+static inline pte_t pte_rmw_start(struct mm_struct *mm, unsigned long addr,
+				  pte_t *ptep)
+{
+	return __pte_rmw_start(mm, addr, ptep);
+}
+
+/*
+ * Commit an update to a pte, leaving any hardware-controlled bits in
+ * the PTE unmodified.
+ */
+static inline void pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
+				  pte_t *ptep, pte_t pte)
+{
+	__pte_rmw_commit(mm, addr, ptep, pte);
+}
+#endif
+
 /*
  * A facility to provide lazy MMU batching.  This allows PTE updates and
  * page invalidations to be delayed until a call to leave lazy MMU mode
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -47,19 +47,17 @@
 		if (pte_present(oldpte)) {
 			pte_t ptent;

-			/* Avoid an SMP race with hardware updated dirty/clean
-			 * bits by wiping the pte and then setting the new pte
-			 * into place.
-			 */
-			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_rmw_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
+
 			/*
 			 * Avoid taking write faults for pages we know to be
 			 * dirty.
 			 */
 			if (dirty_accountable && pte_dirty(ptent))
 				ptent = pte_mkwrite(ptent);
-			set_pte_at(mm, addr, pte, ptent);
+
+			pte_rmw_commit(mm, addr, pte, ptent);
 #ifdef CONFIG_MIGRATION
 		} else if (!pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2 of 4] paravirt: add hooks for pte_rmw_start/commit
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 1 of 4] mm: add a pte_rmw transaction abstraction Jeremy Fitzhardinge
@ 2008-05-23 14:20 ` Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 3 of 4] xen: implement pte_rmw_start/commit Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, xen-devel, Thomas Gleixner, Hugh Dickins, Zachary Amsden,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

This patch adds paravirt-ops hooks in pv_mmu_ops for pte_rmw_start and
pte_rmw_commit.  This allows the hypervisor-specific backends to
implement these in some more efficient way.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/paravirt.c |    3 +++
 arch/x86/xen/enlighten.c   |    3 +++
 include/asm-x86/paravirt.h |   26 ++++++++++++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -383,6 +383,9 @@
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
 
+	.pte_rmw_start = __pte_rmw_start,
+	.pte_rmw_commit = __pte_rmw_commit,
+
 #ifdef CONFIG_HIGHPTE
 	.kmap_atomic_pte = kmap_atomic,
 #endif
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1103,6 +1103,9 @@
 	.set_pte_at = xen_set_pte_at,
 	.set_pmd = xen_set_pmd,
 
+	.pte_rmw_start = __pte_rmw_start,
+	.pte_rmw_commit = __pte_rmw_commit,
+
 	.pte_val = xen_pte_val,
 	.pte_flags = native_pte_val,
 	.pgd_val = xen_pgd_val,
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -238,6 +238,11 @@
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+
+	pte_t (*pte_rmw_start)(struct mm_struct *mm, unsigned long addr,
+			       pte_t *ptep);
+	void (*pte_rmw_commit)(struct mm_struct *mm, unsigned long addr,
+			       pte_t *ptep, pte_t pte);
 
 	pteval_t (*pte_val)(pte_t);
 	pteval_t (*pte_flags)(pte_t);
@@ -1040,6 +1045,27 @@
 	return ret;
 }
 
+#define  __HAVE_ARCH_PTE_RMW_TRANSACTION
+static inline pte_t pte_rmw_start(struct mm_struct *mm, unsigned long addr,
+				  pte_t *ptep)
+{
+	pteval_t ret;
+
+	ret = PVOP_CALL3(pteval_t, pv_mmu_ops.pte_rmw_start, mm, addr, ptep);
+
+	return (pte_t) { .pte = ret };
+}
+
+static inline void pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+	if (sizeof(pteval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.pte_rmw_commit(mm, addr, ptep, pte);
+	else
+		PVOP_VCALL4(pv_mmu_ops.pte_rmw_commit, mm, addr, ptep, pte.pte);
+}
+
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
 	if (sizeof(pteval_t) > sizeof(long))



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3 of 4] xen: implement pte_rmw_start/commit
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 1 of 4] mm: add a pte_rmw transaction abstraction Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 2 of 4] paravirt: add hooks for pte_rmw_start/commit Jeremy Fitzhardinge
@ 2008-05-23 14:20 ` Jeremy Fitzhardinge
  2008-05-23 14:20 ` [PATCH 4 of 4] xen: add mechanism to extend existing multicalls Jeremy Fitzhardinge
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, xen-devel, Thomas Gleixner, Hugh Dickins, Zachary Amsden,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

Xen has a pte update function which will update a pte while preserving
its accessed and dirty bits.  This means that pte_rmw_start() can be
implemented as a simple read of the pte value.  The hardware may
update the pte in the meantime, but pte_rmw_commit() updates it while
preserving any changes that may have happened in the meantime.

The updates in pte_rmw_commit() are batched if we're currently in lazy
mmu mode.

The mmu_update hypercall can take a batch of updates to perform, but
this code doesn't make particular use of that feature, in favour of
using generic multicall batching to get them all into the hypervisor.

The net effect of this is that each mprotect pte update turns from two
expensive trap-and-emulate faults into they hypervisor into a single
hypercall whose cost is amortized in a batched multicall.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/enlighten.c         |   13 ++++++++++---
 arch/x86/xen/mmu.c               |   21 +++++++++++++++++++++
 arch/x86/xen/mmu.h               |    4 ++++
 include/xen/interface/features.h |    3 +++
 include/xen/interface/xen.h      |    9 +++++++--
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -140,7 +140,9 @@
 {
 	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
 	       pv_info.name);
-	printk(KERN_INFO "Hypervisor signature: %s\n", xen_start_info->magic);
+	printk(KERN_INFO "Hypervisor signature: %s%s\n",
+	       xen_start_info->magic,
+	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
 
 static void xen_cpuid(unsigned int *ax, unsigned int *bx,
@@ -1208,6 +1210,8 @@
 
 	BUG_ON(memcmp(xen_start_info->magic, "xen-3", 5) != 0);
 
+	xen_setup_features();
+
 	/* Install Xen paravirt ops */
 	pv_info = xen_info;
 	pv_init_ops = xen_init_ops;
@@ -1217,13 +1221,16 @@
 	pv_apic_ops = xen_apic_ops;
 	pv_mmu_ops = xen_mmu_ops;
 
+	if (xen_feature(XENFEAT_mmu_pt_update_preserve_ad)) {
+		pv_mmu_ops.pte_rmw_start = xen_pte_rmw_start;
+		pv_mmu_ops.pte_rmw_commit = xen_pte_rmw_commit;
+	}
+
 	machine_ops = xen_machine_ops;
 
 #ifdef CONFIG_SMP
 	smp_ops = xen_smp_ops;
 #endif
-
-	xen_setup_features();
 
 	/* Get mfn list */
 	if (!xen_feature(XENFEAT_auto_translated_physmap))
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -300,6 +300,27 @@
 out:
 	if (mm == &init_mm)
 		preempt_enable();
+}
+
+pte_t xen_pte_rmw_start(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	/* Just return the pte as-is.  We preserve the bits on commit */
+	return *ptep;
+}
+
+void xen_pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	struct multicall_space mcs;
+	struct mmu_update *u;
+
+	mcs = xen_mc_entry(sizeof(*u));
+	u = mcs.args;
+	u->ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD;
+	u->val = pte_val_ma(pte);
+	MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
 
 pteval_t xen_pte_val(pte_t pte)
diff --git a/arch/x86/xen/mmu.h b/arch/x86/xen/mmu.h
--- a/arch/x86/xen/mmu.h
+++ b/arch/x86/xen/mmu.h
@@ -52,4 +52,8 @@
 void xen_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 void xen_pmd_clear(pmd_t *pmdp);
 
+pte_t xen_pte_rmw_start(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+void  xen_pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep, pte_t pte);
+
 #endif	/* _XEN_MMU_H */
diff --git a/include/xen/interface/features.h b/include/xen/interface/features.h
--- a/include/xen/interface/features.h
+++ b/include/xen/interface/features.h
@@ -38,6 +38,9 @@
  */
 #define XENFEAT_pae_pgdir_above_4gb        4
 
+/* x86: Does this Xen host support the MMU_PT_UPDATE_PRESERVE_AD hypercall? */
+#define XENFEAT_mmu_pt_update_preserve_ad  5
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -113,9 +113,14 @@
  * ptr[:2]  -- Machine address within the frame whose mapping to modify.
  *             The frame must belong to the FD, if one is specified.
  * val      -- Value to write into the mapping entry.
+ *
+ * ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD:
+ * As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed
+ * with those in @val.
  */
-#define MMU_NORMAL_PT_UPDATE     0 /* checked '*ptr = val'. ptr is MA.       */
-#define MMU_MACHPHYS_UPDATE      1 /* ptr = MA of frame to modify entry for  */
+#define MMU_NORMAL_PT_UPDATE      0 /* checked '*ptr = val'. ptr is MA.       */
+#define MMU_MACHPHYS_UPDATE       1 /* ptr = MA of frame to modify entry for  */
+#define MMU_PT_UPDATE_PRESERVE_AD 2 /* atomically: *ptr = val | (*ptr&(A|D)) */
 
 /*
  * MMU EXTENDED OPERATIONS



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4 of 4] xen: add mechanism to extend existing multicalls
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2008-05-23 14:20 ` [PATCH 3 of 4] xen: implement pte_rmw_start/commit Jeremy Fitzhardinge
@ 2008-05-23 14:20 ` Jeremy Fitzhardinge
  2008-05-23 18:27 ` [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Zachary Amsden
  2008-05-23 18:57 ` Linus Torvalds
  5 siblings, 0 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, xen-devel, Thomas Gleixner, Hugh Dickins, Zachary Amsden,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

Some Xen hypercalls accept an array of operations to work on.  In
general this is because its more efficient for the hypercall to the
work all at once rather than as separate hypercalls (even batched as a
multicall).

This patch adds a mechanism (xen_mc_extend_args()) to allocate more
argument space to the last-issued multicall, in order to extend its
argument list.

The user of this mechanism is xen/mmu.c, which uses it to extend the
args array of mmu_update.  This is particularly valuable when doing
the update for a large mprotect, which goes via pte_rmw_commit(), but
it also manages to batch updates to pgd/pmds as well.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/mmu.c        |   56 ++++++++++++++++++++++++++++-----------------
 arch/x86/xen/multicalls.c |   40 +++++++++++++++++++++++++++-----
 arch/x86/xen/multicalls.h |   12 +++++++++
 3 files changed, 81 insertions(+), 27 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -220,19 +220,35 @@
 		BUG();
 }
 
-
-void xen_set_pmd(pmd_t *ptr, pmd_t val)
+static void extend_mmu_update(const struct mmu_update *update)
 {
 	struct multicall_space mcs;
 	struct mmu_update *u;
 
+	mcs = xen_mc_extend_args(__HYPERVISOR_mmu_update, sizeof(*u));
+
+	if (mcs.mc != NULL)
+		mcs.mc->args[1]++;
+	else {
+		mcs = __xen_mc_entry(sizeof(*u));
+		MULTI_mmu_update(mcs.mc, mcs.args, 1, NULL, DOMID_SELF);
+	}
+
+	u = mcs.args;
+	*u = *update;
+}
+
+void xen_set_pmd(pmd_t *ptr, pmd_t val)
+{
+	struct mmu_update u;
+
 	preempt_disable();
 
-	mcs = xen_mc_entry(sizeof(*u));
-	u = mcs.args;
-	u->ptr = virt_to_machine(ptr).maddr;
-	u->val = pmd_val_ma(val);
-	MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+	xen_mc_batch();
+
+	u.ptr = virt_to_machine(ptr).maddr;
+	u.val = pmd_val_ma(val);
+	extend_mmu_update(&u);
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 
@@ -311,14 +327,13 @@
 void xen_pte_rmw_commit(struct mm_struct *mm, unsigned long addr,
 			pte_t *ptep, pte_t pte)
 {
-	struct multicall_space mcs;
-	struct mmu_update *u;
+	struct mmu_update u;
 
-	mcs = xen_mc_entry(sizeof(*u));
-	u = mcs.args;
-	u->ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD;
-	u->val = pte_val_ma(pte);
-	MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+	xen_mc_batch();
+
+	u.ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD;
+	u.val = pte_val_ma(pte);
+	extend_mmu_update(&u);
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
@@ -369,16 +384,15 @@
 
 void xen_set_pud(pud_t *ptr, pud_t val)
 {
-	struct multicall_space mcs;
-	struct mmu_update *u;
+	struct mmu_update u;
 
 	preempt_disable();
 
-	mcs = xen_mc_entry(sizeof(*u));
-	u = mcs.args;
-	u->ptr = virt_to_machine(ptr).maddr;
-	u->val = pud_val_ma(val);
-	MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF);
+	xen_mc_batch();
+
+	u.ptr = virt_to_machine(ptr).maddr;
+	u.val = pud_val_ma(val);
+	extend_mmu_update(&u);
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 
diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -29,14 +29,14 @@
 #define MC_DEBUG	1
 
 #define MC_BATCH	32
-#define MC_ARGS		(MC_BATCH * 16 / sizeof(u64))
+#define MC_ARGS		(MC_BATCH * 16)
 
 struct mc_buffer {
 	struct multicall_entry entries[MC_BATCH];
 #if MC_DEBUG
 	struct multicall_entry debug[MC_BATCH];
 #endif
-	u64 args[MC_ARGS];
+	unsigned char args[MC_ARGS];
 	struct callback {
 		void (*fn)(void *);
 		void *data;
@@ -107,20 +107,48 @@
 {
 	struct mc_buffer *b = &__get_cpu_var(mc_buffer);
 	struct multicall_space ret;
-	unsigned argspace = (args + sizeof(u64) - 1) / sizeof(u64);
+	unsigned argidx = roundup(b->argidx, sizeof(u64));
 
 	BUG_ON(preemptible());
-	BUG_ON(argspace > MC_ARGS);
+	BUG_ON(b->argidx > MC_ARGS);
 
 	if (b->mcidx == MC_BATCH ||
-	    (b->argidx + argspace) > MC_ARGS)
+	    (argidx + args) > MC_ARGS) {
 		xen_mc_flush();
+		argidx = roundup(b->argidx, sizeof(u64));
+	}
 
 	ret.mc = &b->entries[b->mcidx];
 	b->mcidx++;
+	ret.args = &b->args[argidx];
+	b->argidx = argidx + args;
+
+	BUG_ON(b->argidx > MC_ARGS);
+	return ret;
+}
+
+struct multicall_space xen_mc_extend_args(unsigned long op, size_t size)
+{
+	struct mc_buffer *b = &__get_cpu_var(mc_buffer);
+	struct multicall_space ret = { NULL, NULL };
+
+	BUG_ON(preemptible());
+	BUG_ON(b->argidx > MC_ARGS);
+
+	if (b->mcidx == 0)
+		return ret;
+
+	if (b->entries[b->mcidx - 1].op != op)
+		return ret;
+
+	if ((b->argidx + size) > MC_ARGS)
+		return ret;
+
+	ret.mc = &b->entries[b->mcidx - 1];
 	ret.args = &b->args[b->argidx];
-	b->argidx += argspace;
+	b->argidx += size;
 
+	BUG_ON(b->argidx > MC_ARGS);
 	return ret;
 }
 
diff --git a/arch/x86/xen/multicalls.h b/arch/x86/xen/multicalls.h
--- a/arch/x86/xen/multicalls.h
+++ b/arch/x86/xen/multicalls.h
@@ -45,4 +45,16 @@
 /* Set up a callback to be called when the current batch is flushed */
 void xen_mc_callback(void (*fn)(void *), void *data);
 
+/*
+ * Try to extend the arguments of the previous multicall command.  The
+ * previous command's op must match.  If it does, then it attempts to
+ * extend the argument space allocated to the multicall entry by
+ * arg_size bytes.
+ *
+ * The returned multicall_space will return with mc pointing to the
+ * command on success, or NULL on failure, and args pointing to the
+ * newly allocated space.
+ */
+struct multicall_space xen_mc_extend_args(unsigned long op, size_t arg_size);
+
 #endif /* _XEN_MULTICALLS_H */



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
                   ` (3 preceding siblings ...)
  2008-05-23 14:20 ` [PATCH 4 of 4] xen: add mechanism to extend existing multicalls Jeremy Fitzhardinge
@ 2008-05-23 18:27 ` Zachary Amsden
  2008-05-23 20:32   ` Jeremy Fitzhardinge
  2008-05-23 18:57 ` Linus Torvalds
  5 siblings, 1 reply; 14+ messages in thread
From: Zachary Amsden @ 2008-05-23 18:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds


On Fri, 2008-05-23 at 15:20 +0100, Jeremy Fitzhardinge wrote:
> Hi all,
> 
> This little series adds a new transaction-like abstraction for doing
> RMW updates to a pte, hooks it into paravirt_ops, and then makes use
> of it in Xen.
> 
> The basic problem is that mprotect is very slow under Xen (up to 50x
> slower than native), primarily because of the
> 
> 	ptent = ptep_get_and_clear(mm, addr, pte);
> 	ptent = pte_modify(ptent, newprot);
> 	/* ... */
> 	set_pte_at(mm, addr, pte, ptent);
> 
> sequence in mm/mprotect.c:change_pte_range().
> 
> This is bad for Xen for two reasons:
> 
>   1: ptep_get_and_clear() ends up being a xchg on the pte.  Since the
>      pte page is read-only (as it must be, because Xen needs to
>      control all pte updates), this traps into Xen, which then
>      emulates the instruction.  Trapping into the instruction emulator
>      is inherently fairly expensive.  And,
> 
>   2: because ptep_get_and_clear has atomic-fetch-and-update semantics,
>      it's impossible to implement in a way which can be batched to amortize
>      the cost of faulting into the hypervisor.
> 
> This series adds the pte_rmw_start() and pte_rmw_commit() operations,
> which change this sequence to:
> 
> 	ptent = pte_rmw_start(mm, addr, pte);
> 	ptent = pte_modify(ptent, newprot);
> 	/* ... */
> 	pte_rmw_commit(mm, addr, pte, ptent);
> 
> Which looks very familiar.  And, indeed, when compiled without
> CONFIG_PARAVIRT (or on a non-x86 architecture), it will end up doing
> precisely the same thing as before.
> 
> However, the effective semantics are a bit different.  pte_rmw_start()
> means "I'm reading this pte with the intention of updating it; please
> don't lose any hardware pte changes in the meantime".  And
> pte_rmw_commit() means "Here's a new value for the pte, but make sure
> you don't lose any hardware changes".
> 
> The default implementation achieves these semantics by making
> pte_rmw_start() set the pte to non-present, which prevents any async
> hardware changes to the pte.  The pte_rmw_commit() can then just write
> the new value into place without having to worry about preserving any
> changes, because it knows there are none.

This all sounds fine.

> Xen implements pte_rmw_start() as a simple read of the pte.  This
> leaves the pte unchanged in memory, and the hardware may make
> asynchronous changes to it.  It implements pte_rmw_commit() using a
> hypercall which preserves the state of the Access/Dirty bits to update
> the pte.  This allows the whole change_pte_range() loop to be run
> without any synchronous unbatched traps into the hypervisor.  With
> this change in place, an mprotect microbenchmark goes from being 50x
> worse than native to around 7x, which is acceptible.

I'm a bit skeptical you can get such a semantic to work without a very
heavyweight method in the hypervisor.  How do you guarantee no other CPU
is fizzling the A/D bits in the page table (it can be done by hardware
with direct page tables), unless you use some kind of IPI?  Is this why
it is still 7x?

Still, a 7x gain from asynchronous batching is very nice.  I wonder if
that means the average mprotect size in your benchmark is 7 pages.

> I believe that other virtualization systems, whether they use direct
> paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can
> make use of this interface to improve the performance.

On VMI, we don't trap the xchg of the pte, thus we don't have any
bottleneck here to begin with.  Nit, wiggle, shadow pagetables are a
good thing.

Zach


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
                   ` (4 preceding siblings ...)
  2008-05-23 18:27 ` [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Zachary Amsden
@ 2008-05-23 18:57 ` Linus Torvalds
  2008-05-23 20:42   ` Jeremy Fitzhardinge
  5 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-05-23 18:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	Zachary Amsden, kvm-devel, Virtualization Mailing List,
	Rusty Russell, Peter Zijlstra



On Fri, 23 May 2008, Jeremy Fitzhardinge wrote:
> 
> This series adds the pte_rmw_start() and pte_rmw_commit() operations,
> which change this sequence to:
> 
> 	ptent = pte_rmw_start(mm, addr, pte);
> 	ptent = pte_modify(ptent, newprot);
> 	/* ... */
> 	pte_rmw_commit(mm, addr, pte, ptent);

Can you please rename these.

It's not a general "read-modify-write" operation on the PTE, and this 
*only* works for changing protection details. In particular, you cannot 
use pte_rmw_start/commit to change the actual page. So it's very much 
about just protection bits.

It should probably also be called ptep_xyz(), since it takes a pte 
pointer, not a pte.

So maybe calling it "ptent = ptep_modify_prot_start(..)" ... 
"ptep_modify_prot_commit(..)" or something.

			Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 18:27 ` [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Zachary Amsden
@ 2008-05-23 20:32   ` Jeremy Fitzhardinge
  2008-05-23 23:25     ` Zachary Amsden
  0 siblings, 1 reply; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 20:32 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

Zachary Amsden wrote:
> I'm a bit skeptical you can get such a semantic to work without a very
> heavyweight method in the hypervisor.  How do you guarantee no other CPU
> is fizzling the A/D bits in the page table (it can be done by hardware
> with direct page tables), unless you use some kind of IPI?  Is this why
> it is still 7x?
>   

No, you just use cmpxchg.  It's pretty lightweight really.  Xen holds a 
lock internally to stop other cpus from updating the pte in software, so 
the only source of modification is the hardware itself; the cmpxchg loop 
is guaranteed to terminate because the A/D bits can only transition from 
0->1.

I haven't really gone into depth as to exactly where the 7x number comes 
from.  I could increase the batch size (currently max of 32 pte 
updates/hypercall), and some of it is plain overhead from the in-kernel 
infrastructure.  A simpler and more hackish approach which basically 
pastes the Xen hypercall directly into the mprotect loop gets the 
overhead down to about 5.5x.

> Still, a 7x gain from asynchronous batching is very nice.  I wonder if
> that means the average mprotect size in your benchmark is 7 pages.
>   

Yeah, it's around 7x.  The batching pays off even for single page 
mprotects, because the trap and emulate of xchg is so expensive.

>> I believe that other virtualization systems, whether they use direct
>> paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can
>> make use of this interface to improve the performance.
>>     
>
> On VMI, we don't trap the xchg of the pte, thus we don't have any
> bottleneck here to begin with.

If you're doing code rewriting then I guess you can effectively do the 
same trick at that point.  If not, then presumably you take a fault for 
the first pte updated in the mprotect and then sync the shadow up when 
the tlb flush happens; batching that trap and the tlb flush would give 
you some benefit for small mprotects.

    J

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 18:57 ` Linus Torvalds
@ 2008-05-23 20:42   ` Jeremy Fitzhardinge
  2008-05-24 17:25     ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-23 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	Zachary Amsden, kvm-devel, Virtualization Mailing List,
	Rusty Russell, Peter Zijlstra

Linus Torvalds wrote:
> On Fri, 23 May 2008, Jeremy Fitzhardinge wrote:
>   
>> This series adds the pte_rmw_start() and pte_rmw_commit() operations,
>> which change this sequence to:
>>
>> 	ptent = pte_rmw_start(mm, addr, pte);
>> 	ptent = pte_modify(ptent, newprot);
>> 	/* ... */
>> 	pte_rmw_commit(mm, addr, pte, ptent);
>>     
>
> Can you please rename these.
>
> It's not a general "read-modify-write" operation on the PTE, and this 
> *only* works for changing protection details. In particular, you cannot 
> use pte_rmw_start/commit to change the actual page. So it's very much 
> about just protection bits.
>   

Well, you could use it to update the page and protection bits while 
preserving the AD bits, but I don't think that's very useful.

> It should probably also be called ptep_xyz(), since it takes a pte 
> pointer, not a pte.
>
> So maybe calling it "ptent = ptep_modify_prot_start(..)" ... 
> "ptep_modify_prot_commit(..)" or something.
>   

Yep, sounds fine to me.

    J


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 20:32   ` Jeremy Fitzhardinge
@ 2008-05-23 23:25     ` Zachary Amsden
  2008-05-31  0:13       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 14+ messages in thread
From: Zachary Amsden @ 2008-05-23 23:25 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

On Fri, 2008-05-23 at 21:32 +0100, Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
> > I'm a bit skeptical you can get such a semantic to work without a very
> > heavyweight method in the hypervisor.  How do you guarantee no other CPU
> > is fizzling the A/D bits in the page table (it can be done by hardware
> > with direct page tables), unless you use some kind of IPI?  Is this why
> > it is still 7x?
> >   
> 
> No, you just use cmpxchg.  It's pretty lightweight really.  Xen holds a 
> lock internally to stop other cpus from updating the pte in software, so 
> the only source of modification is the hardware itself; the cmpxchg loop 
> is guaranteed to terminate because the A/D bits can only transition from 
> 0->1.

Ah yes, you're not worried about invalidations.  You can actually do better using a lock; xor combination, which will allow you to flip any of the protection bits without looping (you are guaranteed on Linux not to have concurrent updates by the guest holding the pagetable lock).  It might fail for other guests though, and I'm not sure its any cheaper on modern processors (in fact, it wouldn't surprise me if Intel optimized cmpxchg so it was cheaper).

> >> I believe that other virtualization systems, whether they use direct
> >> paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can
> >> make use of this interface to improve the performance.
> >>     
> >
> > On VMI, we don't trap the xchg of the pte, thus we don't have any
> > bottleneck here to begin with.
> 
> If you're doing code rewriting then I guess you can effectively do the 
> same trick at that point.  If not, then presumably you take a fault for 
> the first pte updated in the mprotect and then sync the shadow up when 
> the tlb flush happens; batching that trap and the tlb flush would give 
> you some benefit for small mprotects.

We don't fault.  We write directly to the primary page tables, and clear
the pte just like native.  We just issue all mprotect updates in the
queue, and flush the queue when leaving lazy mmu mode.  You can't wait
for the TLB flush, you must flush the updates before releasing the
pagetable lock, or you could get misordered updates in an SMP system.

A/D bits are propagated from shadow to primary by getting page faults on
an access that would set an A/D bit in hardware; if we get a page fault
for what would be an A/D bit update in the window where the primary PTE
has been cleared, we convert it to a guest fault (just as native
hardware would).  Linux is already prepared to handle these spurious
faults by revalidating the mapping.

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 20:42   ` Jeremy Fitzhardinge
@ 2008-05-24 17:25     ` Linus Torvalds
  2008-05-24 20:44       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-05-24 17:25 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	Zachary Amsden, kvm-devel, Virtualization Mailing List,
	Rusty Russell, Peter Zijlstra

On Fri, 23 May 2008, Jeremy Fitzhardinge wrote:
> 
> Well, you could use it to update the page and protection bits while preserving
> the AD bits, but I don't think that's very useful.

It's not that it's not "very useful" - it's that it would be TOTALLY 
WRONG.

If you change the page and the AD bits can change randomly while you do 
so, that means that the AD bits are now _undefined_. Which of the two 
pages did it happen to? The old one? The new one? Nobody can know. So 
you'd effectively have dirty bits that could be associated with the wrong 
physical page, which means that some page may be dirty, but the kernel 
would have it marked clean.

That would be beyond bad. That would be a *serious* VM bug, with very 
subtle data corruption issues.

So no, you cannot, and absolutely must not use this to update the actual 
page mapping. Ever. It is only valid for the protection bits.

		Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-24 17:25     ` Linus Torvalds
@ 2008-05-24 20:44       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-24 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	Zachary Amsden, kvm-devel, Virtualization Mailing List,
	Rusty Russell, Peter Zijlstra

Linus Torvalds wrote:
> On Fri, 23 May 2008, Jeremy Fitzhardinge wrote:
>   
>> Well, you could use it to update the page and protection bits while preserving
>> the AD bits, but I don't think that's very useful.
>>     
>
> It's not that it's not "very useful" - it's that it would be TOTALLY 
> WRONG.
>   

Sorry, I was using ironic understatement.  I meant "...while preserving 
the AD bits, but I think that would be totally fucking stupid".

    J

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-23 23:25     ` Zachary Amsden
@ 2008-05-31  0:13       ` Jeremy Fitzhardinge
  2008-06-02 20:09         ` Zachary Amsden
  0 siblings, 1 reply; 14+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-31  0:13 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

Zachary Amsden wrote:
> We don't fault.  We write directly to the primary page tables, and clear
> the pte just like native.  We just issue all mprotect updates in the
> queue, and flush the queue when leaving lazy mmu mode.  You can't wait
> for the TLB flush, you must flush the updates before releasing the
> pagetable lock, or you could get misordered updates in an SMP system.
>   

How do you track which ptes need shadow updates?  Do you walk the entire 
pagetable on tlb flush?  Or just rebuild the shadow from scratch on demand?

    J

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
  2008-05-31  0:13       ` Jeremy Fitzhardinge
@ 2008-06-02 20:09         ` Zachary Amsden
  0 siblings, 0 replies; 14+ messages in thread
From: Zachary Amsden @ 2008-06-02 20:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, LKML, xen-devel, Thomas Gleixner, Hugh Dickins,
	kvm-devel, Virtualization Mailing List, Rusty Russell,
	Peter Zijlstra, Linus Torvalds

On Sat, 2008-05-31 at 01:13 +0100, Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
> > We don't fault.  We write directly to the primary page tables, and clear
> > the pte just like native.  We just issue all mprotect updates in the
> > queue, and flush the queue when leaving lazy mmu mode.  You can't wait
> > for the TLB flush, you must flush the updates before releasing the
> > pagetable lock, or you could get misordered updates in an SMP system.
> >   
> 
> How do you track which ptes need shadow updates?  Do you walk the entire 
> pagetable on tlb flush?  Or just rebuild the shadow from scratch on demand?

No, we queue the updates as well as writing the primaries, we just flush
the queue before dropping the PT lock for one trip to the hypervisor.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-06-02 20:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-23 14:20 [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Jeremy Fitzhardinge
2008-05-23 14:20 ` [PATCH 1 of 4] mm: add a pte_rmw transaction abstraction Jeremy Fitzhardinge
2008-05-23 14:20 ` [PATCH 2 of 4] paravirt: add hooks for pte_rmw_start/commit Jeremy Fitzhardinge
2008-05-23 14:20 ` [PATCH 3 of 4] xen: implement pte_rmw_start/commit Jeremy Fitzhardinge
2008-05-23 14:20 ` [PATCH 4 of 4] xen: add mechanism to extend existing multicalls Jeremy Fitzhardinge
2008-05-23 18:27 ` [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction Zachary Amsden
2008-05-23 20:32   ` Jeremy Fitzhardinge
2008-05-23 23:25     ` Zachary Amsden
2008-05-31  0:13       ` Jeremy Fitzhardinge
2008-06-02 20:09         ` Zachary Amsden
2008-05-23 18:57 ` Linus Torvalds
2008-05-23 20:42   ` Jeremy Fitzhardinge
2008-05-24 17:25     ` Linus Torvalds
2008-05-24 20:44       ` Jeremy Fitzhardinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).