* lock in vhpet
@ 2012-04-17  3:26 Zhang, Yang Z
  2012-04-17  7:27 ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-17  3:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser

Hi Keir

I noticed that changeset 15289 introduced locking to the platform timers, and you mentioned that it is only there for correctness. Are there actual issues that were fixed by this patch? If not, I wonder why we need those locks. I think it should be the OS's responsibility to serialise these accesses, not the hypervisor's. Am I right?
I don't know whether all of those locks are necessary, but at least the lock for vhpet, especially the lock on the read path, is not required.

best regards
yang


* Re: lock in vhpet
  2012-04-17  3:26 lock in vhpet Zhang, Yang Z
@ 2012-04-17  7:27 ` Keir Fraser
  2012-04-18  0:52   ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-17  7:27 UTC (permalink / raw)
  To: Zhang, Yang Z, xen-devel

On 17/04/2012 04:26, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:

> Hi keir
> 
> I noticed that changeset 15289 introduced locking to the platform timers, and
> you mentioned that it is only there for correctness. Are there actual issues
> that were fixed by this patch? If not, I wonder why we need those locks.

Yes, issues were fixed by the patch. That's why I bothered to implement it.
However I think the observed issues were with protecting the mechanisms in
vpt.c, and the other locking at least partially may be overly cautious.

> I think it should be the OS's responsibility to serialise these accesses,
> not the hypervisor's. Am I right?

It depends. Where an access is an apparently-atomic memory-mapped access,
but implemented as a sequence of operations in the hypervisor, the
hypervisor might need to maintain atomicity through locking.
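
To illustrate the hazard with a stand-alone toy program (plain C with pthreads, not Xen code; the field names only loosely mirror xen/arch/x86/hvm/hpet.c): the guest issues what looks like a single atomic 64-bit read of the main counter, but the emulator derives it from an enable flag plus either a frozen value (halted) or an offset (running), so a reader that skips the lock while another vCPU toggles HPET_CFG can mix the old flag with the new fields.

/*
 * Toy model, not Xen code: an unlocked reader races a locked writer that
 * toggles the enable bit, and can observe the "counter" going backwards,
 * which never happens when every access takes the lock.
 * Build with: cc -O2 -pthread torn_read.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static struct {
    _Atomic int      enabled;     /* models hpet_enabled(h)             */
    _Atomic uint64_t mc64;        /* frozen counter value while halted  */
    _Atomic int64_t  mc_offset;   /* counter offset while running       */
    pthread_mutex_t  lock;        /* only the config writer takes it    */
} h = { .enabled = 1, .lock = PTHREAD_MUTEX_INITIALIZER };

static _Atomic uint64_t now;      /* stands in for guest_time_hpet(h)   */

static uint64_t read_maincounter(void)   /* unlocked: a sequence of loads */
{
    return h.enabled ? now + h.mc_offset : h.mc64;
}

static void *toggle_cfg(void *arg)       /* models guest writes to HPET_CFG */
{
    (void)arg;
    for (int i = 0; i < 5000000; i++) {
        pthread_mutex_lock(&h.lock);
        if (h.enabled) {                 /* halt: freeze the counter      */
            h.mc64 = now + h.mc_offset;
            h.enabled = 0;
        } else {                         /* enable: recompute the offset  */
            h.mc_offset = (int64_t)(h.mc64 - now);
            h.enabled = 1;
        }
        pthread_mutex_unlock(&h.lock);
        now += 10;                       /* "time" keeps advancing        */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    uint64_t prev = 0;

    pthread_create(&t, NULL, toggle_cfg, NULL);
    for (int i = 0; i < 5000000; i++) {
        uint64_t v = read_maincounter();
        if (v < prev)                    /* the real counter never decreases */
            printf("torn read: %llu after %llu\n",
                   (unsigned long long)v, (unsigned long long)prev);
        prev = v;
    }
    pthread_join(t, NULL);
    return 0;
}

Any "torn read" line reports the counter apparently running backwards, something no sequence of locked reads could ever produce.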

> I don't know whether all those locks are necessary, but at least the lock for
> vhpet, especially the reading lock, is not required.

This is definitely not true, for example locking is required around calls to
create_periodic_time(), to serialise them. So in general the locking I
added, even in vhpet.c is required. If you have a specific hot path you are
looking to optimise, and especially if you have numbers to back that up,
then we can consider specific localised optimisations to avoid locking where
we can reason it is not needed.

 -- Keir

> best regards
> yang
> 


* Re: lock in vhpet
  2012-04-17  7:27 ` Keir Fraser
@ 2012-04-18  0:52   ` Zhang, Yang Z
  2012-04-18  7:13     ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-18  0:52 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com] On Behalf Of Keir Fraser
> 
> > Hi keir
> >
> > I noticed that changeset 15289 introduced locking to the platform
> > timers, and you mentioned that it is only there for correctness. Are
> > there actual issues that were fixed by this patch? If not, I wonder
> > why we need those locks.
> 
> Yes, issues were fixed by the patch. That's why I bothered to implement it.
> However I think the observed issues were with protecting the mechanisms in
> vpt.c, and the other locking at least partially may be overly cautious.
> 
> > I think it should be the OS's responsibility to serialise these
> > accesses, not the hypervisor's. Am I right?
> 
> It depends. Where an access is an apparently-atomic memory-mapped access,
> but implemented as a sequence of operations in the hypervisor, the hypervisor
> might need to maintain atomicity through locking.

But if there is already a lock inside the guest for those accesses, and there is no other code path (like a timer callback function) in the hypervisor that accesses the shared data, then we don't need to take a lock in the hypervisor.

> > I don't know whether all those locks are necessary, but at least the
> > lock for vhpet, especially the reading lock, is not required.
> 
> This is definitely not true, for example locking is required around calls to
> create_periodic_time(), to serialise them. So in general the locking I added,
> even in vhpet.c is required. If you have a specific hot path you are looking to
> optimise, and especially if you have numbers to back that up, then we can
> consider specific localised optimisations to avoid locking where we can reason
> it is not needed.
As I mentioned, if the guest can ensure that only one CPU accesses the HPET at a time, then the access is already serialized.
Yes, Win8 boots very slowly with more than 16 vCPUs, and this is due to the big lock taken when reading the HPET. Also, the xentrace data shows lots of VM exits, mainly from the PAUSE instruction on the other CPUs. So if the guest already uses a lock to protect HPET accesses, why should the hypervisor do the same thing?

yang


* Re: lock in vhpet
  2012-04-18  0:52   ` Zhang, Yang Z
@ 2012-04-18  7:13     ` Keir Fraser
  2012-04-18  7:55       ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-18  7:13 UTC (permalink / raw)
  To: Zhang, Yang Z, xen-devel

On 18/04/2012 01:52, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:

>> It depends. Where an access is an apparently-atomic memory-mapped access,
>> but implemented as a sequence of operations in the hypervisor, the hypervisor
>> might need to maintain atomicity through locking.
> 
> But if there is already a lock inside the guest for those accesses, and there
> is no other code path (like a timer callback function) in the hypervisor that
> accesses the shared data, then we don't need to take a lock in the hypervisor.

If there is a memory-mapped register access which is atomic on bare metal,
the guest may not bother with locking. We have to maintain that apparent
atomicity ourselves by implementing serialisation in the hypervisor.

>>> I don't know whether all those locks are necessary, but at least the
>>> lock for vhpet, especially the reading lock, is not required.
>> 
>> This is definitely not true, for example locking is required around calls to
>> create_periodic_time(), to serialise them. So in general the locking I added,
>> even in vhpet.c is required. If you have a specific hot path you are looking
>> to
>> optimise, and especially if you have numbers to back that up, then we can
>> consider specific localised optimisations to avoid locking where we can
>> reason
>> it is not needed.
> As I mentioned, if the guest can ensure that only one CPU accesses the HPET
> at a time, then the access is already serialized.
> Yes, Win8 boots very slowly with more than 16 vCPUs, and this is due to the
> big lock taken when reading the HPET. Also, the xentrace data shows lots of
> VM exits, mainly from the PAUSE instruction on the other CPUs. So if the
> guest already uses a lock to protect HPET accesses, why should the hypervisor
> do the same thing?

If the HPET accesses are atomic on bare metal, we have to maintain that,
even if some guests have extra locking themselves. Also, in some cases Xen
needs locking to correctly maintain its own internal state regardless of
what an (untrusted) guest might do. So we cannot just get rid of the vhpet
lock everywhere. It's definitely not correct to do so. Optimising the hpet
read path however, sounds okay. I agree the lock may not be needed on that
specific path.

 -- Keir


* Re: lock in vhpet
  2012-04-18  7:13     ` Keir Fraser
@ 2012-04-18  7:55       ` Zhang, Yang Z
  2012-04-18  8:29         ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-18  7:55 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Sent: Wednesday, April 18, 2012 3:13 PM
> To: Zhang, Yang Z; xen-devel@lists.xensource.com
> Subject: Re: lock in vhpet
> 
> On 18/04/2012 01:52, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
> 
> >> It depends. Where an access is an apparently-atomic memory-mapped
> >> access, but implemented as a sequence of operations in the
> >> hypervisor, the hypervisor might need to maintain atomicity through locking.
> >
> > But if there is already a lock inside the guest for those accesses, and
> > there is no other code path (like a timer callback function) in the
> > hypervisor that accesses the shared data, then we don't need to take a
> > lock in the hypervisor.
> 
> If there is a memory-mapped register access which is atomic on bare metal,
> the guest may not bother with locking. We have to maintain that apparent
> atomicity ourselves by implementing serialisation in the hypervisor.
> 
> >>> I don't know whether all those locks are necessary, but at least the
> >>> lock for vhpet, especially the reading lock, is not required.
> >>
> >> This is definitely not true, for example locking is required around
> >> calls to create_periodic_time(), to serialise them. So in general the
> >> locking I added, even in vhpet.c is required. If you have a specific
> >> hot path you are looking to optimise, and especially if you have
> >> numbers to back that up, then we can consider specific localised
> >> optimisations to avoid locking where we can reason it is not needed.
> > As I mentioned, if the guest can ensure that only one CPU accesses the
> > HPET at a time, then the access is already serialized.
> > Yes, Win8 boots very slowly with more than 16 vCPUs, and this is due to
> > the big lock taken when reading the HPET. Also, the xentrace data shows
> > lots of VM exits, mainly from the PAUSE instruction on the other CPUs.
> > So if the guest already uses a lock to protect HPET accesses, why should
> > the hypervisor do the same thing?
> 
> If the HPET accesses are atomic on bare metal, we have to maintain that, even
> if some guests have extra locking themselves. Also, in some cases Xen needs
> locking to correctly maintain its own internal state regardless of what an
> (untrusted) guest might do. So we cannot just get rid of the vhpet lock
> everywhere. It's definitely not correct to do so. Optimising the hpet read path
> however, sounds okay. I agree the lock may not be needed on that specific
> path.

You are right.
For this case, since the main access to the HPET is reading the main counter, I think an rwlock is a better choice.
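
To make the rwlock suggestion concrete, here is a rough user-space sketch (a pthread rwlock stands in for the hypervisor's rwlock primitives; this is not the vhpet code): main-counter reads take the lock shared, so readers no longer serialise against one another, while configuration writes still take it exclusively.

#include <pthread.h>
#include <stdint.h>

static struct {
    pthread_rwlock_t lock;
    int enabled;
    uint64_t mc64;         /* frozen counter value while halted */
    int64_t  mc_offset;    /* counter offset while running      */
} h = { .lock = PTHREAD_RWLOCK_INITIALIZER, .enabled = 1 };

static uint64_t fake_time;                  /* stands in for guest_time_hpet() */

uint64_t sketch_read_maincounter(void)
{
    uint64_t val;

    pthread_rwlock_rdlock(&h.lock);         /* shared with other readers */
    val = h.enabled ? fake_time + h.mc_offset : h.mc64;
    pthread_rwlock_unlock(&h.lock);
    return val;
}

void sketch_write_cfg(int enable)
{
    pthread_rwlock_wrlock(&h.lock);         /* excludes every reader */
    if (enable && !h.enabled)
        h.mc_offset = (int64_t)(h.mc64 - fake_time);   /* resume counting    */
    else if (!enable && h.enabled)
        h.mc64 = fake_time + h.mc_offset;              /* freeze the counter */
    h.enabled = enable;
    pthread_rwlock_unlock(&h.lock);
}

Note that even a shared acquisition writes the lock word, so the lock's cacheline still bounces between vCPUs; a fully lock-free read path, as in the patch Keir posts below, avoids that cost as well.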

yang


* Re: lock in vhpet
  2012-04-18  7:55       ` Zhang, Yang Z
@ 2012-04-18  8:29         ` Keir Fraser
  2012-04-18  9:14           ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-18  8:29 UTC (permalink / raw)
  To: Zhang, Yang Z, xen-devel

On 18/04/2012 08:55, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:

>> If the HPET accesses are atomic on bare metal, we have to maintain that, even
>> if some guests have extra locking themselves. Also, in some cases Xen needs
>> locking to correctly maintain its own internal state regardless of what an
>> (untrusted) guest might do. So we cannot just get rid of the vhpet lock
>> everywhere. It's definitely not correct to do so. Optimising the hpet read
>> path
>> however, sounds okay. I agree the lock may not be needed on that specific
>> path.
> 
> You are right.
> For this case, since the main access to the HPET is reading the main counter,
> I think an rwlock is a better choice.

I'll see if I can make a patch...

 -- Keir


* Re: lock in vhpet
  2012-04-18  8:29         ` Keir Fraser
@ 2012-04-18  9:14           ` Keir Fraser
  2012-04-18  9:30             ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-18  9:14 UTC (permalink / raw)
  To: Zhang, Yang Z, xen-devel

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]

On 18/04/2012 09:29, "Keir Fraser" <keir@xen.org> wrote:

> On 18/04/2012 08:55, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
> 
>>> If the HPET accesses are atomic on bare metal, we have to maintain that,
>>> even
>>> if some guests have extra locking themselves. Also, in some cases Xen needs
>>> locking to correctly maintain its own internal state regardless of what an
>>> (untrusted) guest might do. So we cannot just get rid of the vhpet lock
>>> everywhere. It's definitely not correct to do so. Optimising the hpet read
>>> path
>>> however, sounds okay. I agree the lock may not be needed on that specific
>>> path.
>> 
>> You are right.
>> For this case, since the main access of hpet is to read the main counter, so
>> I
>> think the rwlock is a better choice.
> 
> I'll see if I can make a patch...

Please try the attached patch (build tested only).

 -- Keir

>  -- Keir
> 
> 


[-- Attachment #2: 00-hpet-lockfree --]
[-- Type: application/octet-stream, Size: 5154 bytes --]

diff -r cf129a80e47e xen/arch/x86/hvm/hpet.c
--- a/xen/arch/x86/hvm/hpet.c	Tue Apr 17 15:37:05 2012 +0200
+++ b/xen/arch/x86/hvm/hpet.c	Wed Apr 18 10:13:07 2012 +0100
@@ -73,14 +73,36 @@
     ((timer_config(h, n) & HPET_TN_INT_ROUTE_CAP_MASK) \
         >> HPET_TN_INT_ROUTE_CAP_SHIFT)
 
-static inline uint64_t hpet_read_maincounter(HPETState *h)
+/*
+ * hpet_{read,set}_maincounter():
+ *  Atomically get/set h->mc_config to allow safe lock-free read access to
+ *  the HPET main counter. mc_config[0] is 1 when the counter is enabled.
+ *  When the counter is disabled, mc_config[63:1] is the counter value.
+ *  When the counter is enabled, mc_config[63:1] is a signed counter offset.
+ */
+static uint64_t hpet_read_maincounter(HPETState *h)
 {
+    int64_t mc_config = read_atomic(&h->mc_config);
+    int64_t counter = mc_config >> 1;
+    bool_t enabled = mc_config & 1;
+
+    if ( enabled )
+        counter += guest_time_hpet(h);
+
+    return (uint64_t)counter;
+}
+
+static void hpet_set_maincounter(HPETState *h)
+{
+    int64_t mc_config;
+
     ASSERT(spin_is_locked(&h->lock));
 
-    if ( hpet_enabled(h) )
-        return guest_time_hpet(h) + h->mc_offset;
-    else 
-        return h->hpet.mc64;
+    mc_config = (hpet_enabled(h)
+                 ? (((h->hpet.mc64 - guest_time_hpet(h)) << 1) | 1ull)
+                 : (h->hpet.mc64 << 1));
+
+    write_atomic(&h->mc_config, mc_config);
 }
 
 static uint64_t hpet_get_comparator(HPETState *h, unsigned int tn)
@@ -107,7 +129,8 @@ static uint64_t hpet_get_comparator(HPET
     h->hpet.timers[tn].cmp = comparator;
     return comparator;
 }
-static inline uint64_t hpet_read64(HPETState *h, unsigned long addr)
+
+static uint64_t __hpet_read64(HPETState *h, unsigned long addr)
 {
     addr &= ~7;
 
@@ -138,6 +161,23 @@ static inline uint64_t hpet_read64(HPETS
     return 0;
 }
 
+static uint64_t hpet_read64(HPETState *h, unsigned long addr)
+{
+    /* Allow lock-free access to main counter. */
+    bool_t lock = ((addr & ~7) != HPET_COUNTER);
+    uint64_t val;
+
+    if ( lock )
+        spin_lock(&h->lock);
+
+    val = __hpet_read64(h, addr);
+
+    if ( lock )
+        spin_unlock(&h->lock);
+
+    return val;
+}
+
 static inline int hpet_check_access_length(
     unsigned long addr, unsigned long len)
 {
@@ -172,16 +212,12 @@ static int hpet_read(
         goto out;
     }
 
-    spin_lock(&h->lock);
-
     val = hpet_read64(h, addr);
 
     result = val;
     if ( length != 8 )
         result = (val >> ((addr & 7) * 8)) & ((1ULL << (length * 8)) - 1);
 
-    spin_unlock(&h->lock);
-
  out:
     *pval = result;
     return X86EMUL_OKAY;
@@ -291,7 +327,7 @@ static int hpet_write(
 
     spin_lock(&h->lock);
 
-    old_val = hpet_read64(h, addr);
+    old_val = __hpet_read64(h, addr);
     new_val = val;
     if ( length != 8 )
         new_val = hpet_fixup_reg(
@@ -306,7 +342,7 @@ static int hpet_write(
         if ( !(old_val & HPET_CFG_ENABLE) && (new_val & HPET_CFG_ENABLE) )
         {
             /* Enable main counter and interrupt generation. */
-            h->mc_offset = h->hpet.mc64 - guest_time_hpet(h);
+            hpet_set_maincounter(h);
             for ( i = 0; i < HPET_TIMER_NUM; i++ )
             {
                 h->hpet.comparator64[i] =
@@ -320,7 +356,7 @@ static int hpet_write(
         else if ( (old_val & HPET_CFG_ENABLE) && !(new_val & HPET_CFG_ENABLE) )
         {
             /* Halt main counter and disable interrupt generation. */
-            h->hpet.mc64 = h->mc_offset + guest_time_hpet(h);
+            hpet_set_maincounter(h);
             for ( i = 0; i < HPET_TIMER_NUM; i++ )
                 if ( timer_enabled(h, i) )
                     set_stop_timer(i);
@@ -476,7 +512,7 @@ static int hpet_save(struct domain *d, h
     spin_lock(&hp->lock);
 
     /* Write the proper value into the main counter */
-    hp->hpet.mc64 = hp->mc_offset + guest_time_hpet(hp);
+    hp->hpet.mc64 = hpet_read_maincounter(hp);
 
     /* Save the HPET registers */
     rc = _hvm_init_entry(h, HVM_SAVE_CODE(HPET), 0, HVM_SAVE_LENGTH(HPET));
@@ -554,8 +590,7 @@ static int hpet_load(struct domain *d, h
     }
 #undef C
     
-    /* Recalculate the offset between the main counter and guest time */
-    hp->mc_offset = hp->hpet.mc64 - guest_time_hpet(hp);
+    hpet_set_maincounter(hp);
 
     /* restart all timers */
 
diff -r cf129a80e47e xen/include/asm-x86/hvm/vpt.h
--- a/xen/include/asm-x86/hvm/vpt.h	Tue Apr 17 15:37:05 2012 +0200
+++ b/xen/include/asm-x86/hvm/vpt.h	Wed Apr 18 10:13:07 2012 +0100
@@ -94,7 +94,13 @@ typedef struct HPETState {
     uint64_t stime_freq;
     uint64_t hpet_to_ns_scale; /* hpet ticks to ns (multiplied by 2^10) */
     uint64_t hpet_to_ns_limit; /* max hpet ticks convertable to ns      */
-    uint64_t mc_offset;
+    /*
+     * mc_config: Allows safe lock-free read access to the main counter
+     *  bit 0: set if timer is enabled
+     *  1-63: main counter offset from system time (if timer enabled)
+     *  1-63: main counter value (if timer disabled)
+     */
+    int64_t mc_config;
     struct periodic_time pt[HPET_TIMER_NUM];
     spinlock_t lock;
 } HPETState;



* Re: lock in vhpet
  2012-04-18  9:14           ` Keir Fraser
@ 2012-04-18  9:30             ` Keir Fraser
  2012-04-19  5:19               ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-18  9:30 UTC (permalink / raw)
  To: Zhang, Yang Z, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On 18/04/2012 10:14, "Keir Fraser" <keir@xen.org> wrote:

> On 18/04/2012 09:29, "Keir Fraser" <keir@xen.org> wrote:
> 
>> On 18/04/2012 08:55, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
>> 
>>>> If the HPET accesses are atomic on bare metal, we have to maintain that,
>>>> even
>>>> if some guests have extra locking themselves. Also, in some cases Xen needs
>>>> locking to correctly maintain its own internal state regardless of what an
>>>> (untrusted) guest might do. So we cannot just get rid of the vhpet lock
>>>> everywhere. It's definitely not correct to do so. Optimising the hpet read
>>>> path
>>>> however, sounds okay. I agree the lock may not be needed on that specific
>>>> path.
>>> 
>>> You are right.
>>> For this case, since the main access of hpet is to read the main counter, so
>>> I
>>> think the rwlock is a better choice.
>> 
>> I'll see if I can make a patch...
> 
> Please try the attached patch (build tested only).

Actually try this updated one. :-)

>  -- Keir
> 
>>  -- Keir
>> 
>> 
> 


[-- Attachment #2: 00-hpet-lockfree-v2 --]
[-- Type: application/octet-stream, Size: 5589 bytes --]

diff -r cf129a80e47e xen/arch/x86/hvm/hpet.c
--- a/xen/arch/x86/hvm/hpet.c	Tue Apr 17 15:37:05 2012 +0200
+++ b/xen/arch/x86/hvm/hpet.c	Wed Apr 18 10:29:28 2012 +0100
@@ -73,14 +73,36 @@
     ((timer_config(h, n) & HPET_TN_INT_ROUTE_CAP_MASK) \
         >> HPET_TN_INT_ROUTE_CAP_SHIFT)
 
-static inline uint64_t hpet_read_maincounter(HPETState *h)
+/*
+ * hpet_{read,set}_maincounter():
+ *  Atomically get/set h->mc_config to allow safe lock-free read access to
+ *  the HPET main counter. mc_config[0] is 1 when the counter is enabled.
+ *  When the counter is disabled, mc_config[63:1] is the counter value.
+ *  When the counter is enabled, mc_config[63:1] is a signed counter offset.
+ */
+static uint64_t hpet_read_maincounter(HPETState *h)
 {
+    int64_t mc_config = read_atomic(&h->mc_config);
+    int64_t counter = mc_config >> 1;
+    bool_t enabled = mc_config & 1;
+
+    if ( enabled )
+        counter += guest_time_hpet(h);
+
+    return (uint64_t)counter;
+}
+
+static void hpet_set_maincounter(HPETState *h, uint64_t mc)
+{
+    int64_t mc_config;
+
     ASSERT(spin_is_locked(&h->lock));
 
-    if ( hpet_enabled(h) )
-        return guest_time_hpet(h) + h->mc_offset;
-    else 
-        return h->hpet.mc64;
+    mc_config = (hpet_enabled(h)
+                 ? (((mc - guest_time_hpet(h)) << 1) | 1ull)
+                 : (mc << 1));
+
+    write_atomic(&h->mc_config, mc_config);
 }
 
 static uint64_t hpet_get_comparator(HPETState *h, unsigned int tn)
@@ -107,7 +129,8 @@ static uint64_t hpet_get_comparator(HPET
     h->hpet.timers[tn].cmp = comparator;
     return comparator;
 }
-static inline uint64_t hpet_read64(HPETState *h, unsigned long addr)
+
+static uint64_t __hpet_read64(HPETState *h, unsigned long addr)
 {
     addr &= ~7;
 
@@ -138,6 +161,23 @@ static inline uint64_t hpet_read64(HPETS
     return 0;
 }
 
+static uint64_t hpet_read64(HPETState *h, unsigned long addr)
+{
+    /* Allow lock-free access to main counter. */
+    bool_t lock = ((addr & ~7) != HPET_COUNTER);
+    uint64_t val;
+
+    if ( lock )
+        spin_lock(&h->lock);
+
+    val = __hpet_read64(h, addr);
+
+    if ( lock )
+        spin_unlock(&h->lock);
+
+    return val;
+}
+
 static inline int hpet_check_access_length(
     unsigned long addr, unsigned long len)
 {
@@ -172,16 +212,12 @@ static int hpet_read(
         goto out;
     }
 
-    spin_lock(&h->lock);
-
     val = hpet_read64(h, addr);
 
     result = val;
     if ( length != 8 )
         result = (val >> ((addr & 7) * 8)) & ((1ULL << (length * 8)) - 1);
 
-    spin_unlock(&h->lock);
-
  out:
     *pval = result;
     return X86EMUL_OKAY;
@@ -291,7 +327,7 @@ static int hpet_write(
 
     spin_lock(&h->lock);
 
-    old_val = hpet_read64(h, addr);
+    old_val = __hpet_read64(h, addr);
     new_val = val;
     if ( length != 8 )
         new_val = hpet_fixup_reg(
@@ -300,13 +336,15 @@ static int hpet_write(
 
     switch ( addr & ~7 )
     {
-    case HPET_CFG:
+    case HPET_CFG: {
+        uint64_t mc = hpet_read_maincounter(h);
+
         h->hpet.config = hpet_fixup_reg(new_val, old_val, 0x3);
 
         if ( !(old_val & HPET_CFG_ENABLE) && (new_val & HPET_CFG_ENABLE) )
         {
             /* Enable main counter and interrupt generation. */
-            h->mc_offset = h->hpet.mc64 - guest_time_hpet(h);
+            hpet_set_maincounter(h, mc);
             for ( i = 0; i < HPET_TIMER_NUM; i++ )
             {
                 h->hpet.comparator64[i] =
@@ -320,15 +358,16 @@ static int hpet_write(
         else if ( (old_val & HPET_CFG_ENABLE) && !(new_val & HPET_CFG_ENABLE) )
         {
             /* Halt main counter and disable interrupt generation. */
-            h->hpet.mc64 = h->mc_offset + guest_time_hpet(h);
+            hpet_set_maincounter(h, mc);
             for ( i = 0; i < HPET_TIMER_NUM; i++ )
                 if ( timer_enabled(h, i) )
                     set_stop_timer(i);
         }
         break;
+    }
 
     case HPET_COUNTER:
-        h->hpet.mc64 = new_val;
+        hpet_set_maincounter(h, new_val);
         if ( hpet_enabled(h) )
         {
             gdprintk(XENLOG_WARNING, 
@@ -475,8 +514,7 @@ static int hpet_save(struct domain *d, h
 
     spin_lock(&hp->lock);
 
-    /* Write the proper value into the main counter */
-    hp->hpet.mc64 = hp->mc_offset + guest_time_hpet(hp);
+    hp->hpet.mc64 = hpet_read_maincounter(hp);
 
     /* Save the HPET registers */
     rc = _hvm_init_entry(h, HVM_SAVE_CODE(HPET), 0, HVM_SAVE_LENGTH(HPET));
@@ -554,8 +592,7 @@ static int hpet_load(struct domain *d, h
     }
 #undef C
     
-    /* Recalculate the offset between the main counter and guest time */
-    hp->mc_offset = hp->hpet.mc64 - guest_time_hpet(hp);
+    hpet_set_maincounter(hp, hp->hpet.mc64);
 
     /* restart all timers */
 
diff -r cf129a80e47e xen/include/asm-x86/hvm/vpt.h
--- a/xen/include/asm-x86/hvm/vpt.h	Tue Apr 17 15:37:05 2012 +0200
+++ b/xen/include/asm-x86/hvm/vpt.h	Wed Apr 18 10:29:28 2012 +0100
@@ -94,7 +94,13 @@ typedef struct HPETState {
     uint64_t stime_freq;
     uint64_t hpet_to_ns_scale; /* hpet ticks to ns (multiplied by 2^10) */
     uint64_t hpet_to_ns_limit; /* max hpet ticks convertable to ns      */
-    uint64_t mc_offset;
+    /*
+     * mc_config: Allows safe lock-free read access to the main counter
+     *  bit 0: set if timer is enabled
+     *  1-63: main counter offset from system time (if timer enabled)
+     *  1-63: main counter value (if timer disabled)
+     */
+    int64_t mc_config;
     struct periodic_time pt[HPET_TIMER_NUM];
     spinlock_t lock;
 } HPETState;



* Re: lock in vhpet
  2012-04-18  9:30             ` Keir Fraser
@ 2012-04-19  5:19               ` Zhang, Yang Z
  2012-04-19  8:27                 ` Tim Deegan
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-19  5:19 UTC (permalink / raw)
  To: Keir Fraser, xen-devel

There is no problem with this patch; it works well. But it does not fix the Win8 issue, so it seems there are some other issues with hpet. I will look into it.
Thanks for your quick patch.

best regards
yang


> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com] On Behalf Of Keir Fraser
> Sent: Wednesday, April 18, 2012 5:31 PM
> To: Zhang, Yang Z; xen-devel@lists.xensource.com
> Subject: Re: lock in vhpet
> 
> On 18/04/2012 10:14, "Keir Fraser" <keir@xen.org> wrote:
> 
> > On 18/04/2012 09:29, "Keir Fraser" <keir@xen.org> wrote:
> >
> >> On 18/04/2012 08:55, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
> >>
> >>>> If the HPET accesses are atomic on bare metal, we have to maintain
> >>>> that, even if some guests have extra locking themselves. Also, in
> >>>> some cases Xen needs locking to correctly maintain its own internal
> >>>> state regardless of what an
> >>>> (untrusted) guest might do. So we cannot just get rid of the vhpet
> >>>> lock everywhere. It's definitely not correct to do so. Optimising
> >>>> the hpet read path however, sounds okay. I agree the lock may not
> >>>> be needed on that specific path.
> >>>
> >>> You are right.
> >>> For this case, since the main access of hpet is to read the main
> >>> counter, so I think the rwlock is a better choice.
> >>
> >> I'll see if I can make a patch...
> >
> > Please try the attached patch (build tested only).
> 
> Actually try this updated one. :-)
> 
> >  -- Keir
> >
> >>  -- Keir
> >>
> >>
> >


* Re: lock in vhpet
  2012-04-19  5:19               ` Zhang, Yang Z
@ 2012-04-19  8:27                 ` Tim Deegan
  2012-04-19  8:47                   ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Tim Deegan @ 2012-04-19  8:27 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser

At 05:19 +0000 on 19 Apr (1334812779), Zhang, Yang Z wrote:
> There is no problem with this patch; it works well. But it does not
> fix the Win8 issue, so it seems there are some other issues with hpet.
> I will look into it.  Thanks for your quick patch.

The lock in hvm_get_guest_time() will still be serializing the hpet
reads.  But since it needs to update a shared variable, that will need to
haul cachelines around anyway. 

Tim.


* Re: lock in vhpet
  2012-04-19  8:27                 ` Tim Deegan
@ 2012-04-19  8:47                   ` Keir Fraser
  2012-04-23  7:36                     ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Keir Fraser @ 2012-04-19  8:47 UTC (permalink / raw)
  To: Tim Deegan, Zhang, Yang Z; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

On 19/04/2012 09:27, "Tim Deegan" <tim@xen.org> wrote:

> At 05:19 +0000 on 19 Apr (1334812779), Zhang, Yang Z wrote:
>> There is no problem with this patch; it works well. But it does not
>> fix the Win8 issue, so it seems there are some other issues with hpet.
>> I will look into it.  Thanks for your quick patch.
> 
> The lock in hvm_get_guest_time() will still be serializing the hpet
> reads.  But since it needs to update a shared variable, that will need to
> haul cachelines around anyway.

Yes, that's true. You could try the attached hacky patch out of interest, to
see what that lock is costing you in your scenario. But if we want
consistent monotonically-increasing guest time, I suspect we can't get rid
of the lock, so that's going to limit our scalability unavoidably. Shame.
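
For comparison, a sketch (not the attached patch) of the lock-free alternative: publish last_guest_time with a compare-and-swap retry loop, which keeps the value monotonic without making readers spin on a lock. This is stand-alone C11 with toy names standing in for pl->last_guest_time, get_s_time() and the stime offsets; note that every reader still writes the shared word, so the cacheline traffic Tim describes above does not go away.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_guest_time;    /* models pl->last_guest_time */

static uint64_t raw_time(void)              /* toy clock, monotone enough */
{
    static _Atomic uint64_t t;
    return t += 3;
}

uint64_t sketch_get_guest_time(int64_t vcpu_stime_offset)
{
    uint64_t old = atomic_load(&last_guest_time);

    for ( ; ; )
    {
        uint64_t now = raw_time();
        /* Never publish a smaller value; bump by one if time stood still. */
        uint64_t new = ((int64_t)(now - old) > 0) ? now : old + 1;

        /* On failure 'old' is refreshed with the current value; retry. */
        if ( atomic_compare_exchange_weak(&last_guest_time, &old, new) )
            return new + vcpu_stime_offset;
    }
}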

 -- Keir

> Tim.
> 


[-- Attachment #2: 00-lockfree-hvm-time --]
[-- Type: application/octet-stream, Size: 634 bytes --]

diff -r 7c777cb8f705 xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Wed Apr 18 16:49:55 2012 +0100
+++ b/xen/arch/x86/hvm/vpt.c	Thu Apr 19 09:45:43 2012 +0100
@@ -43,14 +43,7 @@ u64 hvm_get_guest_time(struct vcpu *v)
     /* Called from device models shared with PV guests. Be careful. */
     ASSERT(is_hvm_vcpu(v));
 
-    spin_lock(&pl->pl_time_lock);
     now = get_s_time() + pl->stime_offset;
-    if ( (int64_t)(now - pl->last_guest_time) > 0 )
-        pl->last_guest_time = now;
-    else
-        now = ++pl->last_guest_time;
-    spin_unlock(&pl->pl_time_lock);
-
     return now + v->arch.hvm_vcpu.stime_offset;
 }
 



* Re: lock in vhpet
  2012-04-19  8:47                   ` Keir Fraser
@ 2012-04-23  7:36                     ` Zhang, Yang Z
  2012-04-23  7:43                       ` Jan Beulich
  2012-04-23  9:14                       ` Tim Deegan
  0 siblings, 2 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-23  7:36 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel, andres

The p2m lock in __get_gfn_type_access() is the culprit. Here is the profiling data over a 10-second window:

(XEN) p2m_lock 1 lock:
(XEN)   lock:      190733(00000000:14CE5726), block:       67159(00000007:6AAA53F3)

This data was collected while a Win8 guest (16 vCPUs) was idle: the 16 vCPUs spent a total of 30 seconds blocked on this lock during the 10-second profiling window, i.e. roughly 30 / (16 * 10), about 18% of CPU cycles, spent waiting for the p2m lock. And that is for an idle guest; the impact is more serious when running a workload inside the guest.
I noticed that this change was added by cs 24770; before it, we didn't take the p2m lock in __get_gfn_type_access(). So is this lock really necessary?

best regards
yang


> -----Original Message-----
> From: Keir Fraser [mailto:keir.xen@gmail.com] On Behalf Of Keir Fraser
> Sent: Thursday, April 19, 2012 4:47 PM
> To: Tim Deegan; Zhang, Yang Z
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] lock in vhpet
> 
> On 19/04/2012 09:27, "Tim Deegan" <tim@xen.org> wrote:
> 
> > At 05:19 +0000 on 19 Apr (1334812779), Zhang, Yang Z wrote:
> >> There have no problem with this patch, it works well. But it cannot
> >> fix the win8 issue. It seems there has some other issues with hpet. I
> >> will look into it.  Thanks for your quick patch.
> >
> > The lock in hvm_get_guest_time() will still be serializing the hpet
> > reads.  But since it needs to update a shared variable, that will need
> > to haul cachelines around anyway.
> 
> Yes, that's true. You could try the attached hacky patch out of interest, to see
> what that lock is costing you in your scenario. But if we want consistent
> monotonically-increasing guest time, I suspect we can't get rid of the lock, so
> that's going to limit our scalability unavoidably. Shame.
> 
>  -- Keir
> 
> > Tim.
> >


* Re: lock in vhpet
  2012-04-23  7:36                     ` Zhang, Yang Z
@ 2012-04-23  7:43                       ` Jan Beulich
  2012-04-23  8:15                         ` Zhang, Yang Z
  2012-04-23  9:14                       ` Tim Deegan
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2012-04-23  7:43 UTC (permalink / raw)
  To: Yang Z Zhang, Keir Fraser, Tim Deegan; +Cc: xen-devel, andres

>>> On 23.04.12 at 09:36, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
> The p2m lock in __get_gfn_type_access() is the culprit. Here is the profiling 
> data with 10 seconds:
> 
> (XEN) p2m_lock 1 lock:
> (XEN)   lock:      190733(00000000:14CE5726), block:       
> 67159(00000007:6AAA53F3)
> 
> This data was collected while a Win8 guest (16 vCPUs) was idle: the 16 vCPUs
> spent a total of 30 seconds blocked on this lock during the 10-second
> profiling window, i.e. roughly 30 / (16 * 10), about 18% of CPU cycles,
> spent waiting for the p2m lock. And that is for an idle guest; the impact
> is more serious when running a workload inside the guest.
> I noticed that this change was added by cs 24770; before it, we didn't take
> the p2m lock in __get_gfn_type_access(). So is this lock really necessary?

Or shouldn't such a lock frequently taken on a read path be an rwlock
instead?

Jan


* Re: lock in vhpet
  2012-04-23  7:43                       ` Jan Beulich
@ 2012-04-23  8:15                         ` Zhang, Yang Z
  2012-04-23  8:22                           ` Keir Fraser
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-23  8:15 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, Tim Deegan; +Cc: xen-devel, andres

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, April 23, 2012 3:43 PM
> To: Zhang, Yang Z; Keir Fraser; Tim Deegan
> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] lock in vhpet
> 
> >>> On 23.04.12 at 09:36, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:
> > The p2m lock in __get_gfn_type_access() is the culprit. Here is the
> > profiling data with 10 seconds:
> >
> > (XEN) p2m_lock 1 lock:
> > (XEN)   lock:      190733(00000000:14CE5726), block:
> > 67159(00000007:6AAA53F3)
> >
> > Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
> > blocked
> > 30 seconds with 10 sec's profiling. It means 18% of cpu cycle is
> > waiting for the p2m lock. And those data only for idle guest. The
> > impaction is more seriously when run some workload inside guest.
> > I noticed that this change was adding by cs 24770. And before it, we
> > don't require the p2m lock in _get_gfn_type_access. So is this lock
> > really necessary?
> 
> Or shouldn't such a lock frequently taken on a read path be an rwlock instead?
> 
Right. Using rwlock would make more sense. 

best regards
yang


* Re: lock in vhpet
  2012-04-23  8:15                         ` Zhang, Yang Z
@ 2012-04-23  8:22                           ` Keir Fraser
  0 siblings, 0 replies; 45+ messages in thread
From: Keir Fraser @ 2012-04-23  8:22 UTC (permalink / raw)
  To: Zhang, Yang Z, Jan Beulich, Tim Deegan; +Cc: xen-devel, Andres Lagar-Cavilla

On 23/04/2012 09:15, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote:

>>> Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
>>> blocked
>>> 30 seconds with 10 sec's profiling. It means 18% of cpu cycle is
>>> waiting for the p2m lock. And those data only for idle guest. The
>>> impaction is more seriously when run some workload inside guest.
>>> I noticed that this change was adding by cs 24770. And before it, we
>>> don't require the p2m lock in _get_gfn_type_access. So is this lock
>>> really necessary?
>> 
>> Or shouldn't such a lock frequently taken on a read path be an rwlock
>> instead?
>> 
> Right. Using rwlock would make more sense.

Interested to see if it would improve performance. My guess would be no.

 -- Keir


* Re: lock in vhpet
  2012-04-23  7:36                     ` Zhang, Yang Z
  2012-04-23  7:43                       ` Jan Beulich
@ 2012-04-23  9:14                       ` Tim Deegan
  2012-04-23 15:26                         ` Andres Lagar-Cavilla
  2012-04-23 17:18                         ` Andres Lagar-Cavilla
  1 sibling, 2 replies; 45+ messages in thread
From: Tim Deegan @ 2012-04-23  9:14 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser, andres

At 07:36 +0000 on 23 Apr (1335166577), Zhang, Yang Z wrote:
> The p2m lock in __get_gfn_type_access() is the culprit. Here is the profiling data with 10 seconds:
> 
> (XEN) p2m_lock 1 lock:
> (XEN)   lock:      190733(00000000:14CE5726), block:       67159(00000007:6AAA53F3)
> 
> This data was collected while a Win8 guest (16 vCPUs) was idle: the 16
> vCPUs spent a total of 30 seconds blocked on this lock during the 10-second
> profiling window, i.e. roughly 30 / (16 * 10), about 18% of CPU cycles,
> spent waiting for the p2m lock. And that is for an idle guest; the impact
> is more serious when running a workload inside the guest.  I noticed that
> this change was added by cs 24770; before it, we didn't take the p2m lock
> in __get_gfn_type_access(). So is this lock really necessary?

Ugh; that certainly is a regression.  We used to be lock-free on p2m
lookups and losing that will be bad for perf in lots of ways.  IIRC the
original aim was to use fine-grained per-page locks for this -- there
should be no need to hold a per-domain lock during a normal read.
Andres, what happened to that code?

Making it an rwlock would be tricky as this interface doesn't
differentiate readers from writers.  For the common case (no
sharing/paging/mem-access) it ought to be a win since there is hardly
any writing.  But it would be better to make this particular lock/unlock
go away.

Tim.


* Re: lock in vhpet
  2012-04-23  9:14                       ` Tim Deegan
@ 2012-04-23 15:26                         ` Andres Lagar-Cavilla
  2012-04-24  9:15                           ` Tim Deegan
  2012-04-23 17:18                         ` Andres Lagar-Cavilla
  1 sibling, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-23 15:26 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

> At 07:36 +0000 on 23 Apr (1335166577), Zhang, Yang Z wrote:
>> The p2m lock in __get_gfn_type_access() is the culprit. Here is the
>> profiling data with 10 seconds:
>>
>> (XEN) p2m_lock 1 lock:
>> (XEN)   lock:      190733(00000000:14CE5726), block:
>> 67159(00000007:6AAA53F3)
>>
>> Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
>> blocked 30 seconds with 10 sec's profiling. It means 18% of cpu cycle
>> is waiting for the p2m lock. And those data only for idle guest. The
>> impaction is more seriously when run some workload inside guest.  I
>> noticed that this change was adding by cs 24770. And before it, we
>> don't require the p2m lock in _get_gfn_type_access. So is this lock
>> really necessary?
>
> Ugh; that certainly is a regression.  We used to be lock-free on p2m
> lookups and losing that will be bad for perf in lots of ways.  IIRC the
> original aim was to use fine-grained per-page locks for this -- there
> should be no need to hold a per-domain lock during a normal read.
> Andres, what happened to that code?

The fine-grained p2m locking code is stashed somewhere and untested.
Obviously not meant for 4.2. I don't think it'll be useful here: all vcpus
are hitting the same gfn for the hpet mmio address.

Thanks for the report Yang, it would be great to understand at which call
sites the p2m lock is contended, so we can try to alleviate contention at
the right place.

Looking closely at the code, it would seem the only get_gfn in
hvmemul_do_io is called on ram_gpa == 0 (?!). Shouldn't ram_gpa underlie
the target mmio address for emulation?

Notwithstanding the above, we've optimized p2m access on hvmemul_do_io to
have as brief a critical section as possible: get_gfn, get_page, put_gfn,
work, put_page later.

So contention might arise from (bogusly) doing get_gfn(0) by every vcpu
(when it should instead be get_gfn(hpet_gfn)). The only way to alleviate
that contention is to use get_gfn_query_unlocked, and that will be unsafe
against paging or sharing. I am very skeptical this is causing the
slowdown you see, since you're not using paging or sharing. The critical
section protected by the p2m lock is literally one cmpxchg.

The other source of contention might come from hvmemul_rep_movs, which
holds the p2m lock for the duration of the mmio operation. I can optimize
that one using the get_gfn/get_page/put_gfn pattern mentioned above.

The third source of contention might be the __hvm_copy to/from the target
address of the hpet value read/written. That one can be slightly optimized
by doing the memcpy outside of the scope of the p2m lock.

>
> Making it an rwlock would be tricky as this interface doesn't
> differentiate readers from writers.  For the common case (no
> sharing/paging/mem-access) it ought to be a win since there is hardly
> any writing.  But it would be better to make this particular lock/unlock
> go away.
>

We had a discussion about rwlocks way back when. It's impossible (or
nearly so) to tell when an access might upgrade to write privileges.
Deadlock has a high likelihood.

Andres

> Tim.
>


* Re: lock in vhpet
  2012-04-23  9:14                       ` Tim Deegan
  2012-04-23 15:26                         ` Andres Lagar-Cavilla
@ 2012-04-23 17:18                         ` Andres Lagar-Cavilla
  2012-04-24  8:58                           ` Zhang, Yang Z
  1 sibling, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-23 17:18 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

> At 07:36 +0000 on 23 Apr (1335166577), Zhang, Yang Z wrote:
>> The p2m lock in __get_gfn_type_access() is the culprit. Here is the
>> profiling data with 10 seconds:
>>
>> (XEN) p2m_lock 1 lock:
>> (XEN)   lock:      190733(00000000:14CE5726), block:
>> 67159(00000007:6AAA53F3)
>>
>> Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
>> blocked 30 seconds with 10 sec's profiling. It means 18% of cpu cycle
>> is waiting for the p2m lock. And those data only for idle guest. The
>> impaction is more seriously when run some workload inside guest.  I
>> noticed that this change was adding by cs 24770. And before it, we
>> don't require the p2m lock in _get_gfn_type_access. So is this lock
>> really necessary?
>
> Ugh; that certainly is a regression.  We used to be lock-free on p2m
> lookups and losing that will be bad for perf in lots of ways.  IIRC the
> original aim was to use fine-grained per-page locks for this -- there
> should be no need to hold a per-domain lock during a normal read.
> Andres, what happened to that code?

Sigh, scratch a lot of nonsense I just spewed on the "hpet gfn". No actual
page backing that one, no concerns.

It is still the case that ram_gpa is zero in many cases going into
hvmemul_do_io. There really is no point in doing a get_page(get_gfn(0)).
How about the following (there's more email after this patch):
# HG changeset patch
# Parent ccc64793187f7d9c926343a1cd4ae822a4364bd1
x86/emulation: No need to get_gfn on zero ram_gpa.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r ccc64793187f xen/arch/x86/hvm/emulate.c
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -60,33 +60,37 @@ static int hvmemul_do_io(
     ioreq_t *p = get_ioreq(curr);
     unsigned long ram_gfn = paddr_to_pfn(ram_gpa);
     p2m_type_t p2mt;
-    mfn_t ram_mfn;
+    mfn_t ram_mfn = _mfn(INVALID_MFN);
     int rc;

-    /* Check for paged out page */
-    ram_mfn = get_gfn_unshare(curr->domain, ram_gfn, &p2mt);
-    if ( p2m_is_paging(p2mt) )
-    {
-        put_gfn(curr->domain, ram_gfn);
-        p2m_mem_paging_populate(curr->domain, ram_gfn);
-        return X86EMUL_RETRY;
-    }
-    if ( p2m_is_shared(p2mt) )
-    {
-        put_gfn(curr->domain, ram_gfn);
-        return X86EMUL_RETRY;
-    }
-
-    /* Maintain a ref on the mfn to ensure liveness. Put the gfn
-     * to avoid potential deadlock wrt event channel lock, later. */
-    if ( mfn_valid(mfn_x(ram_mfn)) )
-        if ( !get_page(mfn_to_page(mfn_x(ram_mfn)),
-             curr->domain) )
+    /* Many callers pass a stub ram_gpa address of zero. */
+    if ( ram_gfn != 0 )
+    {
+        /* Check for paged out page */
+        ram_mfn = get_gfn_unshare(curr->domain, ram_gfn, &p2mt);
+        if ( p2m_is_paging(p2mt) )
         {
-            put_gfn(curr->domain, ram_gfn);
+            put_gfn(curr->domain, ram_gfn);
+            p2m_mem_paging_populate(curr->domain, ram_gfn);
             return X86EMUL_RETRY;
         }
-    put_gfn(curr->domain, ram_gfn);
+        if ( p2m_is_shared(p2mt) )
+        {
+            put_gfn(curr->domain, ram_gfn);
+            return X86EMUL_RETRY;
+        }
+
+        /* Maintain a ref on the mfn to ensure liveness. Put the gfn
+         * to avoid potential deadlock wrt event channel lock, later. */
+        if ( mfn_valid(mfn_x(ram_mfn)) )
+            if ( !get_page(mfn_to_page(mfn_x(ram_mfn)),
+                 curr->domain) )
+            {
+                put_gfn(curr->domain, ram_gfn);
+                return X86EMUL_RETRY;
+            }
+        put_gfn(curr->domain, ram_gfn);
+    }

     /*
      * Weird-sized accesses have undefined behaviour: we discard writes


If contention is coming in from the emul_rep_movs path, then that might be
taken care of with the following:
# HG changeset patch
# Parent 18b694840961cb6e3934628b23902a7979f00bac
x86/emulate: Relieve contention of p2m lock in emulation of rep movs.

get_two_gfns is used to query the source and target physical addresses of the
emulated rep movs. It is not necessary to hold onto the two gfn's for the
duration of the emulation: further calls down the line will do the appropriate
unsharing or paging in, and unwind correctly on failure.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 18b694840961 xen/arch/x86/hvm/emulate.c
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -714,25 +714,23 @@ static int hvmemul_rep_movs(
     if ( rc != X86EMUL_OKAY )
         return rc;

+    /* The query on the gfns is to establish the need for mmio. In the two mmio
+     * cases, a proper get_gfn for the "other" gfn will be enacted, with paging in
+     * or unsharing if necessary. In the memmove case, the gfn might change given
+     * the bytes mov'ed, and, again, proper get_gfn's will be enacted in
+     * __hvm_copy. */
     get_two_gfns(current->domain, sgpa >> PAGE_SHIFT, &sp2mt, NULL, NULL,
                  current->domain, dgpa >> PAGE_SHIFT, &dp2mt, NULL, NULL,
                  P2M_ALLOC, &tg);
-
+    put_two_gfns(&tg);
+
     if ( !p2m_is_ram(sp2mt) && !p2m_is_grant(sp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+        return hvmemul_do_mmio(
             sgpa, reps, bytes_per_rep, dgpa, IOREQ_READ, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }

     if ( !p2m_is_ram(dp2mt) && !p2m_is_grant(dp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+        return hvmemul_do_mmio(
             dgpa, reps, bytes_per_rep, sgpa, IOREQ_WRITE, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }

     /* RAM-to-RAM copy: emulate as equivalent of memmove(dgpa, sgpa, bytes). */
     bytes = *reps * bytes_per_rep;
@@ -747,10 +745,7 @@ static int hvmemul_rep_movs(
      * can be emulated by a source-to-buffer-to-destination block copy.
      */
     if ( ((dgpa + bytes_per_rep) > sgpa) && (dgpa < (sgpa + bytes)) )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }

     /* Adjust destination address for reverse copy. */
     if ( df )
@@ -759,10 +754,7 @@ static int hvmemul_rep_movs(
     /* Allocate temporary buffer. Fall back to slow emulation if this fails. */
     buf = xmalloc_bytes(bytes);
     if ( buf == NULL )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }

     /*
      * We do a modicum of checking here, just for paranoia's sake and to
@@ -773,7 +765,6 @@ static int hvmemul_rep_movs(
         rc = hvm_copy_to_guest_phys(dgpa, buf, bytes);

     xfree(buf);
-    put_two_gfns(&tg);

     if ( rc == HVMCOPY_gfn_paged_out )
         return X86EMUL_RETRY;


Let me know if any of this helps
Thanks,
Andres

>
> Making it an rwlock would be tricky as this interface doesn't
> differentiate readers from writers.  For the common case (no
> sharing/paging/mem-access) it ought to be a win since there is hardly
> any writing.  But it would be better to make this particular lock/unlock
> go away.
>
> Tim.
>


* Re: lock in vhpet
  2012-04-23 17:18                         ` Andres Lagar-Cavilla
@ 2012-04-24  8:58                           ` Zhang, Yang Z
  2012-04-24  9:16                             ` Tim Deegan
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-24  8:58 UTC (permalink / raw)
  To: andres, Tim Deegan; +Cc: xen-devel, Keir Fraser

> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Tuesday, April 24, 2012 1:19 AM
> 
> Let me know if any of this helps
No, it does not work.

best regards
yang


* Re: lock in vhpet
  2012-04-23 15:26                         ` Andres Lagar-Cavilla
@ 2012-04-24  9:15                           ` Tim Deegan
  2012-04-24 13:28                             ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 45+ messages in thread
From: Tim Deegan @ 2012-04-24  9:15 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

At 08:26 -0700 on 23 Apr (1335169568), Andres Lagar-Cavilla wrote:
> > At 07:36 +0000 on 23 Apr (1335166577), Zhang, Yang Z wrote:
> >> The p2m lock in __get_gfn_type_access() is the culprit. Here is the
> >> profiling data with 10 seconds:
> >>
> >> (XEN) p2m_lock 1 lock:
> >> (XEN)   lock:      190733(00000000:14CE5726), block:
> >> 67159(00000007:6AAA53F3)
> >>
> >> Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
> >> blocked 30 seconds with 10 sec's profiling. It means 18% of cpu cycle
> >> is waiting for the p2m lock. And those data only for idle guest. The
> >> impaction is more seriously when run some workload inside guest.  I
> >> noticed that this change was adding by cs 24770. And before it, we
> >> don't require the p2m lock in _get_gfn_type_access. So is this lock
> >> really necessary?
> >
> > Ugh; that certainly is a regression.  We used to be lock-free on p2m
> > lookups and losing that will be bad for perf in lots of ways.  IIRC the
> > original aim was to use fine-grained per-page locks for this -- there
> > should be no need to hold a per-domain lock during a normal read.
> > Andres, what happened to that code?
> 
> The fine-grained p2m locking code is stashed somewhere and untested.
> Obviously not meant for 4.2. I don't think it'll be useful here: all vcpus
> are hitting the same gfn for the hpet mmio address.

We'll have to do _something_ for 4.2 if it's introducing an 18% CPU
overhead in an otherwise idle VM.

In any case I think this means I probably shouldn't take the patch that
turns on this locking for shadow VMs.  They do a lot more p2m lookups. 

> The other source of contention might come from hvmemul_rep_movs, which
> holds the p2m lock for the duration of the mmio operation. I can optimize
> that one using the get_gfn/get_page/put_gfn pattern mentioned above.

But wouldn't that be unsafe?  What if the p2m changes during the
operation?  Or, conversely, could you replace all uses of the lock in
p2m lookups with get_page() on the result and still get what you need?

Tim.


* Re: lock in vhpet
  2012-04-24  8:58                           ` Zhang, Yang Z
@ 2012-04-24  9:16                             ` Tim Deegan
  2012-04-25  0:27                               ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Tim Deegan @ 2012-04-24  9:16 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser, andres

At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
> > -----Original Message-----
> > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> > Sent: Tuesday, April 24, 2012 1:19 AM
> > 
> > Let me know if any of this helps
> No, it does not work.

Do you mean that it doesn't help with the CPU overhead, or that it's
broken in some other way?

Tim.


* Re: lock in vhpet
  2012-04-24  9:15                           ` Tim Deegan
@ 2012-04-24 13:28                             ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-24 13:28 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

> At 08:26 -0700 on 23 Apr (1335169568), Andres Lagar-Cavilla wrote:
>> > At 07:36 +0000 on 23 Apr (1335166577), Zhang, Yang Z wrote:
>> >> The p2m lock in __get_gfn_type_access() is the culprit. Here is the
>> >> profiling data with 10 seconds:
>> >>
>> >> (XEN) p2m_lock 1 lock:
>> >> (XEN)   lock:      190733(00000000:14CE5726), block:
>> >> 67159(00000007:6AAA53F3)
>> >>
>> >> Those data is collected when win8 guest(16 vcpus) is idle. 16 VCPUs
>> >> blocked 30 seconds with 10 sec's profiling. It means 18% of cpu cycle
>> >> is waiting for the p2m lock. And those data only for idle guest. The
>> >> impaction is more seriously when run some workload inside guest.  I
>> >> noticed that this change was adding by cs 24770. And before it, we
>> >> don't require the p2m lock in _get_gfn_type_access. So is this lock
>> >> really necessary?
>> >
>> > Ugh; that certainly is a regression.  We used to be lock-free on p2m
>> > lookups and losing that will be bad for perf in lots of ways.  IIRC
>> the
>> > original aim was to use fine-grained per-page locks for this -- there
>> > should be no need to hold a per-domain lock during a normal read.
>> > Andres, what happened to that code?
>>
>> The fine-grained p2m locking code is stashed somewhere and untested.
>> Obviously not meant for 4.2. I don't think it'll be useful here: all
>> vcpus
>> are hitting the same gfn for the hpet mmio address.
>
> We'll have to do _something_ for 4.2 if it's introducing an 18% CPU
> overhead in an otherwise idle VM.

An hpet mmio read or write hits get_gfn twice: once for gfn zero at the
beginning of hvmemul_do_io, and once during the hvm copy. The patch I sent to
Yang yesterday removes the bogus first get_gfn. The second one still has to
perform a locked query.

Yang, is there any possibility to get more information here? The ability
to identify the call site that contends for the p2m lock would be crucial.
Even something as crude as sampling vcpu stack traces by hitting 'd'
repeatedly on the serial line, and seeing what sticks out frequently.

>
> In any case I think this means I probably shouldn't take the patch that
> turns on this locking for shadow VMs.  They do a lot more p2m lookups.
>
>> The other source of contention might come from hvmemul_rep_movs, which
>> holds the p2m lock for the duration of the mmio operation. I can
>> optimize
>> that one using the get_gfn/get_page/put_gfn pattern mentioned above.
>
> But wouldn't that be unsafe?  What if the p2m changes during the
> operation?  Or, conversely, could you replace all uses of the lock in
> p2m lookups with get_page() on the result and still get what you need?

I've been thinking of wrapping the pattern into a handy p2m accessor. I
see this pattern repeating itself for hvm_copy, tss/segment entry, page
table walking, nested hvm, etc, in which the consumer wants to map the
result of the translation. By the time you finish with a get_gfn_unshare,
most possible p2m transformations will have been taken care of (PoD-alloc,
page in, unshare). By taking a reference to the underlying page, paging
out is prevented, and then the vcpu can safely let go of the p2m lock.
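
Roughly what I have in mind, as an untested sketch (the wrapper name is
made up; the primitives are the existing ones, and the shared/grant/
paged-out special cases are omitted):

/* Translate gfn -> page, pin the page, then drop the p2m lock before
 * returning, so the caller holds a page reference instead of the lock. */
static inline struct page_info *get_gfn_and_pin(
    struct domain *d, unsigned long gfn, p2m_type_t *t)
{
    struct page_info *page = NULL;
    /* Takes the p2m lock; resolves PoD alloc, paging-in and unsharing. */
    mfn_t mfn = get_gfn_unshare(d, gfn, t);

    if ( mfn_valid(mfn_x(mfn)) )
    {
        page = mfn_to_page(mfn_x(mfn));
        /* The page reference prevents paging-out once the lock is gone. */
        if ( !get_page(page, d) )
            page = NULL;
    }
    put_gfn(d, gfn);
    return page;
}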

Conceivably guest_physmap_remove_page and guest_physmap_add_entry can
still come in and change the p2m entry.

Andres

>
> Tim.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-24  9:16                             ` Tim Deegan
@ 2012-04-25  0:27                               ` Zhang, Yang Z
  2012-04-25  1:40                                 ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-25  0:27 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, andres

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Tuesday, April 24, 2012 5:17 PM
> To: Zhang, Yang Z
> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir Fraser
> Subject: Re: [Xen-devel] lock in vhpet
> 
> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
> > > -----Original Message-----
> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> > > Sent: Tuesday, April 24, 2012 1:19 AM
> > >
> > > Let me know if any of this helps
> > No, it not works.
> 
> Do you mean that it doesn't help with the CPU overhead, or that it's broken in
> some other way?
> 
It cannot help with the CPU overhead

best regards
yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  0:27                               ` Zhang, Yang Z
@ 2012-04-25  1:40                                 ` Andres Lagar-Cavilla
  2012-04-25  1:48                                   ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-25  1:40 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Keir Fraser, xen-devel, Tim Deegan

>> -----Original Message-----
>> From: Tim Deegan [mailto:tim@xen.org]
>> Sent: Tuesday, April 24, 2012 5:17 PM
>> To: Zhang, Yang Z
>> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir Fraser
>> Subject: Re: [Xen-devel] lock in vhpet
>>
>> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
>> > > -----Original Message-----
>> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> > > Sent: Tuesday, April 24, 2012 1:19 AM
>> > >
>> > > Let me know if any of this helps
>> > No, it not works.
>>
>> Do you mean that it doesn't help with the CPU overhead, or that it's
>> broken in
>> some other way?
>>
> It cannot help with the CPU overhead

Yang, is there any further information you can provide? A rough idea of
where vcpus are spending time spinning for the p2m lock would be
tremendously useful.

Thanks
Andres

>
> best regards
> yang
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  1:40                                 ` Andres Lagar-Cavilla
@ 2012-04-25  1:48                                   ` Zhang, Yang Z
  2012-04-25  2:31                                     ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-25  1:48 UTC (permalink / raw)
  To: andres; +Cc: Keir Fraser, xen-devel, Tim Deegan


> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Wednesday, April 25, 2012 9:40 AM
> To: Zhang, Yang Z
> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> Subject: RE: [Xen-devel] lock in vhpet
> 
> >> -----Original Message-----
> >> From: Tim Deegan [mailto:tim@xen.org]
> >> Sent: Tuesday, April 24, 2012 5:17 PM
> >> To: Zhang, Yang Z
> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir
> >> Fraser
> >> Subject: Re: [Xen-devel] lock in vhpet
> >>
> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
> >> > > -----Original Message-----
> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
> >> > >
> >> > > Let me know if any of this helps
> >> > No, it not works.
> >>
> >> Do you mean that it doesn't help with the CPU overhead, or that it's
> >> broken in some other way?
> >>
> > It cannot help with the CPU overhead
> 
> Yang, is there any further information you can provide? A rough idea of where
> vcpus are spending time spinning for the p2m lock would be tremendously
> useful.
> 
I am doing further investigation and hope to get more useful information. 
But actually, the first cs that introduced this issue is 24770. When win8 boots with hpet enabled, it uses hpet as the time source, so there are lots of hpet accesses and EPT violations. In the EPT violation handler, it calls get_gfn_type_access to get the mfn. Cs 24770 introduces the gfn_lock for p2m lookups, and that is when the issue appears. After I removed the gfn_lock, the issue went away. But in the latest xen, even if I remove this lock, it still shows high cpu utilization.

yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  1:48                                   ` Zhang, Yang Z
@ 2012-04-25  2:31                                     ` Andres Lagar-Cavilla
  2012-04-25  2:36                                       ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-25  2:31 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Keir Fraser, xen-devel, Tim Deegan

>
>> -----Original Message-----
>> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> Sent: Wednesday, April 25, 2012 9:40 AM
>> To: Zhang, Yang Z
>> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> Subject: RE: [Xen-devel] lock in vhpet
>>
>> >> -----Original Message-----
>> >> From: Tim Deegan [mailto:tim@xen.org]
>> >> Sent: Tuesday, April 24, 2012 5:17 PM
>> >> To: Zhang, Yang Z
>> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir
>> >> Fraser
>> >> Subject: Re: [Xen-devel] lock in vhpet
>> >>
>> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
>> >> > > -----Original Message-----
>> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
>> >> > >
>> >> > > Let me know if any of this helps
>> >> > No, it not works.
>> >>
>> >> Do you mean that it doesn't help with the CPU overhead, or that it's
>> >> broken in some other way?
>> >>
>> > It cannot help with the CPU overhead
>>
>> Yang, is there any further information you can provide? A rough idea of
>> where
>> vcpus are spending time spinning for the p2m lock would be tremendously
>> useful.
>>
> I am doing the further investigation. Hope can get more useful
> information.

Thanks, looking forward to that.

> But actually, the first cs introduced this issue is 24770. When win8
> booting and if hpet is enabled, it will use hpet as the time source and
> there have lots of hpet access and EPT violation. In EPT violation
> handler, it call get_gfn_type_access to get the mfn. The cs 24770
> introduces the gfn_lock for p2m lookups, and then the issue happens. After
> I removed the gfn_lock, the issue goes. But in latest xen, even I remove
> this lock, it still shows high cpu utilization.
>

It would seem then that even the briefest lock-protected critical section
would cause this? In the mmio case, the p2m lock taken in the hap fault
handler is held during the actual lookup, and for a couple of branch
instructions afterwards.

In latest Xen, with lock removed for get_gfn, on which lock is time spent?

Thanks,
Andres

> yang
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  2:31                                     ` Andres Lagar-Cavilla
@ 2012-04-25  2:36                                       ` Zhang, Yang Z
  2012-04-25  2:42                                         ` Andres Lagar-Cavilla
  2012-04-26 21:25                                         ` Tim Deegan
  0 siblings, 2 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-25  2:36 UTC (permalink / raw)
  To: andres; +Cc: Keir Fraser, xen-devel, Tim Deegan

> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Wednesday, April 25, 2012 10:31 AM
> To: Zhang, Yang Z
> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> Subject: RE: [Xen-devel] lock in vhpet
> 
> >
> >> -----Original Message-----
> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> Sent: Wednesday, April 25, 2012 9:40 AM
> >> To: Zhang, Yang Z
> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> >> Subject: RE: [Xen-devel] lock in vhpet
> >>
> >> >> -----Original Message-----
> >> >> From: Tim Deegan [mailto:tim@xen.org]
> >> >> Sent: Tuesday, April 24, 2012 5:17 PM
> >> >> To: Zhang, Yang Z
> >> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir
> >> >> Fraser
> >> >> Subject: Re: [Xen-devel] lock in vhpet
> >> >>
> >> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
> >> >> > > -----Original Message-----
> >> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
> >> >> > >
> >> >> > > Let me know if any of this helps
> >> >> > No, it not works.
> >> >>
> >> >> Do you mean that it doesn't help with the CPU overhead, or that
> >> >> it's broken in some other way?
> >> >>
> >> > It cannot help with the CPU overhead
> >>
> >> Yang, is there any further information you can provide? A rough idea
> >> of where vcpus are spending time spinning for the p2m lock would be
> >> tremendously useful.
> >>
> > I am doing the further investigation. Hope can get more useful
> > information.
> 
> Thanks, looking forward to that.
> 
> > But actually, the first cs introduced this issue is 24770. When win8
> > booting and if hpet is enabled, it will use hpet as the time source
> > and there have lots of hpet access and EPT violation. In EPT violation
> > handler, it call get_gfn_type_access to get the mfn. The cs 24770
> > introduces the gfn_lock for p2m lookups, and then the issue happens.
> > After I removed the gfn_lock, the issue goes. But in latest xen, even
> > I remove this lock, it still shows high cpu utilization.
> >
> 
> It would seem then that even the briefest lock-protected critical section would
> cause this? In the mmio case, the p2m lock taken in the hap fault handler is
> held during the actual lookup, and for a couple of branch instructions
> afterwards.
> 
> In latest Xen, with lock removed for get_gfn, on which lock is time spent?
Still the p2m_lock.

yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  2:36                                       ` Zhang, Yang Z
@ 2012-04-25  2:42                                         ` Andres Lagar-Cavilla
  2012-04-25  3:12                                           ` Zhang, Yang Z
  2012-04-26 21:25                                         ` Tim Deegan
  1 sibling, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-25  2:42 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Keir Fraser, xen-devel, Tim Deegan

>> -----Original Message-----
>> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> Sent: Wednesday, April 25, 2012 10:31 AM
>> To: Zhang, Yang Z
>> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> Subject: RE: [Xen-devel] lock in vhpet
>>
>> >
>> >> -----Original Message-----
>> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> Sent: Wednesday, April 25, 2012 9:40 AM
>> >> To: Zhang, Yang Z
>> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> >> Subject: RE: [Xen-devel] lock in vhpet
>> >>
>> >> >> -----Original Message-----
>> >> >> From: Tim Deegan [mailto:tim@xen.org]
>> >> >> Sent: Tuesday, April 24, 2012 5:17 PM
>> >> >> To: Zhang, Yang Z
>> >> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com; Keir
>> >> >> Fraser
>> >> >> Subject: Re: [Xen-devel] lock in vhpet
>> >> >>
>> >> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
>> >> >> > > -----Original Message-----
>> >> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
>> >> >> > >
>> >> >> > > Let me know if any of this helps
>> >> >> > No, it not works.
>> >> >>
>> >> >> Do you mean that it doesn't help with the CPU overhead, or that
>> >> >> it's broken in some other way?
>> >> >>
>> >> > It cannot help with the CPU overhead
>> >>
>> >> Yang, is there any further information you can provide? A rough idea
>> >> of where vcpus are spending time spinning for the p2m lock would be
>> >> tremendously useful.
>> >>
>> > I am doing the further investigation. Hope can get more useful
>> > information.
>>
>> Thanks, looking forward to that.
>>
>> > But actually, the first cs introduced this issue is 24770. When win8
>> > booting and if hpet is enabled, it will use hpet as the time source
>> > and there have lots of hpet access and EPT violation. In EPT violation
>> > handler, it call get_gfn_type_access to get the mfn. The cs 24770
>> > introduces the gfn_lock for p2m lookups, and then the issue happens.
>> > After I removed the gfn_lock, the issue goes. But in latest xen, even
>> > I remove this lock, it still shows high cpu utilization.
>> >
>>
>> It would seem then that even the briefest lock-protected critical
>> section would
>> cause this? In the mmio case, the p2m lock taken in the hap fault
>> handler is
>> held during the actual lookup, and for a couple of branch instructions
>> afterwards.
>>
>> In latest Xen, with lock removed for get_gfn, on which lock is time
>> spent?
> Still the p2m_lock.

How are you removing the lock from get_gfn?

The p2m lock is taken on a few specific code paths outside of get_gfn
(change type of an entry, add a new p2m entry, setup and teardown), and
I'm surprised any of those code paths is being used by the hpet mmio
handler.

Andres

>
> yang
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  2:42                                         ` Andres Lagar-Cavilla
@ 2012-04-25  3:12                                           ` Zhang, Yang Z
  2012-04-25  3:34                                             ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-25  3:12 UTC (permalink / raw)
  To: andres; +Cc: Keir Fraser, xen-devel, Tim Deegan

> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Wednesday, April 25, 2012 10:42 AM
> To: Zhang, Yang Z
> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> Subject: RE: [Xen-devel] lock in vhpet
> 
> >> -----Original Message-----
> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> Sent: Wednesday, April 25, 2012 10:31 AM
> >> To: Zhang, Yang Z
> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> >> Subject: RE: [Xen-devel] lock in vhpet
> >>
> >> >
> >> >> -----Original Message-----
> >> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> >> Sent: Wednesday, April 25, 2012 9:40 AM
> >> >> To: Zhang, Yang Z
> >> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
> >> >> Subject: RE: [Xen-devel] lock in vhpet
> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Tim Deegan [mailto:tim@xen.org]
> >> >> >> Sent: Tuesday, April 24, 2012 5:17 PM
> >> >> >> To: Zhang, Yang Z
> >> >> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com;
> >> >> >> Keir Fraser
> >> >> >> Subject: Re: [Xen-devel] lock in vhpet
> >> >> >>
> >> >> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
> >> >> >> > > -----Original Message-----
> >> >> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> >> >> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
> >> >> >> > >
> >> >> >> > > Let me know if any of this helps
> >> >> >> > No, it not works.
> >> >> >>
> >> >> >> Do you mean that it doesn't help with the CPU overhead, or that
> >> >> >> it's broken in some other way?
> >> >> >>
> >> >> > It cannot help with the CPU overhead
> >> >>
> >> >> Yang, is there any further information you can provide? A rough
> >> >> idea of where vcpus are spending time spinning for the p2m lock
> >> >> would be tremendously useful.
> >> >>
> >> > I am doing the further investigation. Hope can get more useful
> >> > information.
> >>
> >> Thanks, looking forward to that.
> >>
> >> > But actually, the first cs introduced this issue is 24770. When
> >> > win8 booting and if hpet is enabled, it will use hpet as the time
> >> > source and there have lots of hpet access and EPT violation. In EPT
> >> > violation handler, it call get_gfn_type_access to get the mfn. The
> >> > cs 24770 introduces the gfn_lock for p2m lookups, and then the issue
> happens.
> >> > After I removed the gfn_lock, the issue goes. But in latest xen,
> >> > even I remove this lock, it still shows high cpu utilization.
> >> >
> >>
> >> It would seem then that even the briefest lock-protected critical
> >> section would cause this? In the mmio case, the p2m lock taken in the
> >> hap fault handler is held during the actual lookup, and for a couple
> >> of branch instructions afterwards.
> >>
> >> In latest Xen, with lock removed for get_gfn, on which lock is time
> >> spent?
> > Still the p2m_lock.
> 
> How are you removing the lock from get_gfn?
>
> The p2m lock is taken on a few specific code paths outside of get_gfn (change
> type of an entry, add a new p2m entry, setup and teardown), and I'm surprised
> any of those code paths is being used by the hpet mmio handler.

Sorry, what I said may not be accurate. In the latest xen, I use a workaround to skip calling get_gfn_type_access in hvm_hap_nested_page_fault(), so the p2m_lock still exists. 
Now I find that the p2m_lock contention is coming from __hvm_copy. The mmio handler has some code paths that call it (hvm_fetch_from_guest_virt_nofault(), hvm_copy_from_guest_virt()). When there are lots of mmio accesses, the contention is very obvious.
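
For reference, the copy loop in __hvm_copy takes and drops the gfn lock
once per page copied. Stripped to its essentials (error, paging and
read-only cases omitted; this is just a sketch of the locking pattern,
not the actual code):

static void hvm_copy_loop_sketch(struct vcpu *curr, paddr_t addr,
                                 char *buf, int todo)
{
    while ( todo > 0 )
    {
        unsigned long gfn = addr >> PAGE_SHIFT;
        int count = min_t(int, PAGE_SIZE - (addr & ~PAGE_MASK), todo);
        p2m_type_t p2mt;
        unsigned long mfn;
        char *p;

        /* Takes the gfn/p2m lock for this gfn... */
        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));

        p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
        memcpy(buf, p, count);              /* copy from the guest page */
        unmap_domain_page(p);

        addr += count;
        buf  += count;
        todo -= count;

        /* ...and drops it again before moving on to the next page. */
        put_gfn(curr->domain, gfn);
    }
}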

yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  3:12                                           ` Zhang, Yang Z
@ 2012-04-25  3:34                                             ` Andres Lagar-Cavilla
  2012-04-25  5:18                                               ` Zhang, Yang Z
  2012-04-25  8:07                                               ` Jan Beulich
  0 siblings, 2 replies; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-25  3:34 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Keir Fraser, xen-devel, Tim Deegan

>> -----Original Message-----
>> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> Sent: Wednesday, April 25, 2012 10:42 AM
>> To: Zhang, Yang Z
>> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> Subject: RE: [Xen-devel] lock in vhpet
>>
>> >> -----Original Message-----
>> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> Sent: Wednesday, April 25, 2012 10:31 AM
>> >> To: Zhang, Yang Z
>> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> >> Subject: RE: [Xen-devel] lock in vhpet
>> >>
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> >> Sent: Wednesday, April 25, 2012 9:40 AM
>> >> >> To: Zhang, Yang Z
>> >> >> Cc: Tim Deegan; xen-devel@lists.xensource.com; Keir Fraser
>> >> >> Subject: RE: [Xen-devel] lock in vhpet
>> >> >>
>> >> >> >> -----Original Message-----
>> >> >> >> From: Tim Deegan [mailto:tim@xen.org]
>> >> >> >> Sent: Tuesday, April 24, 2012 5:17 PM
>> >> >> >> To: Zhang, Yang Z
>> >> >> >> Cc: andres@lagarcavilla.org; xen-devel@lists.xensource.com;
>> >> >> >> Keir Fraser
>> >> >> >> Subject: Re: [Xen-devel] lock in vhpet
>> >> >> >>
>> >> >> >> At 08:58 +0000 on 24 Apr (1335257909), Zhang, Yang Z wrote:
>> >> >> >> > > -----Original Message-----
>> >> >> >> > > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
>> >> >> >> > > Sent: Tuesday, April 24, 2012 1:19 AM
>> >> >> >> > >
>> >> >> >> > > Let me know if any of this helps
>> >> >> >> > No, it not works.
>> >> >> >>
>> >> >> >> Do you mean that it doesn't help with the CPU overhead, or that
>> >> >> >> it's broken in some other way?
>> >> >> >>
>> >> >> > It cannot help with the CPU overhead
>> >> >>
>> >> >> Yang, is there any further information you can provide? A rough
>> >> >> idea of where vcpus are spending time spinning for the p2m lock
>> >> >> would be tremendously useful.
>> >> >>
>> >> > I am doing the further investigation. Hope can get more useful
>> >> > information.
>> >>
>> >> Thanks, looking forward to that.
>> >>
>> >> > But actually, the first cs introduced this issue is 24770. When
>> >> > win8 booting and if hpet is enabled, it will use hpet as the time
>> >> > source and there have lots of hpet access and EPT violation. In EPT
>> >> > violation handler, it call get_gfn_type_access to get the mfn. The
>> >> > cs 24770 introduces the gfn_lock for p2m lookups, and then the
>> issue
>> happens.
>> >> > After I removed the gfn_lock, the issue goes. But in latest xen,
>> >> > even I remove this lock, it still shows high cpu utilization.
>> >> >
>> >>
>> >> It would seem then that even the briefest lock-protected critical
>> >> section would cause this? In the mmio case, the p2m lock taken in the
>> >> hap fault handler is held during the actual lookup, and for a couple
>> >> of branch instructions afterwards.
>> >>
>> >> In latest Xen, with lock removed for get_gfn, on which lock is time
>> >> spent?
>> > Still the p2m_lock.
>>
>> How are you removing the lock from get_gfn?
>>
>> The p2m lock is taken on a few specific code paths outside of get_gfn
>> (change
>> type of an entry, add a new p2m entry, setup and teardown), and I'm
>> surprised
>> any of those code paths is being used by the hpet mmio handler.
>
> Sorry, what I said maybe not accurate. In latest xen, I use a workaround
> way to skip calling get_gfn_type_access in hvm_hap_nested_page_fault(). So
> the p2m_lock is still existing.
> Now, I found the contention of p2m_lock is coming from __hvm_copy. In mmio
> handler, it has some code paths to call
> it(hvm_fetch_from_guest_virt_nofault(),  hvm_copy_from_guest_virt()). When
> lots of mmio access happened, the contention is very obviously.

Thanks. Can you please try this:
http://lists.xen.org/archives/html/xen-devel/2012-04/msg01861.html

in conjunction with the patch below?
Andres

diff -r 7a7443e80b99 xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2383,6 +2383,8 @@ static enum hvm_copy_result __hvm_copy(

     while ( todo > 0 )
     {
+        struct page_info *pg;
+
         count = min_t(int, PAGE_SIZE - (addr & ~PAGE_MASK), todo);

         if ( flags & HVMCOPY_virt )
@@ -2427,7 +2429,11 @@ static enum hvm_copy_result __hvm_copy(
             put_gfn(curr->domain, gfn);
             return HVMCOPY_bad_gfn_to_mfn;
         }
+
         ASSERT(mfn_valid(mfn));
+        pg = mfn_to_page(mfn);
+        ASSERT(get_page(pg, curr->domain));
+        put_gfn(curr->domain, gfn);

         p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);

@@ -2457,7 +2463,7 @@ static enum hvm_copy_result __hvm_copy(
         addr += count;
         buf  += count;
         todo -= count;
-        put_gfn(curr->domain, gfn);
+        put_page(pg);
     }

     return HVMCOPY_okay;

>
> yang
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  3:34                                             ` Andres Lagar-Cavilla
@ 2012-04-25  5:18                                               ` Zhang, Yang Z
  2012-04-25  8:07                                               ` Jan Beulich
  1 sibling, 0 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-25  5:18 UTC (permalink / raw)
  To: andres; +Cc: Keir Fraser, xen-devel, Tim Deegan

> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Wednesday, April 25, 2012 11:34 AM
> 
> Thanks. Can you please try this:
> http://lists.xen.org/archives/html/xen-devel/2012-04/msg01861.html
> 
> in conjunction with the patch below?
> Andres
> 
> diff -r 7a7443e80b99 xen/arch/x86/hvm/hvm.c
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2383,6 +2383,8 @@ static enum hvm_copy_result __hvm_copy(
> 
>      while ( todo > 0 )
>      {
> +        struct page_info *pg;
> +
>          count = min_t(int, PAGE_SIZE - (addr & ~PAGE_MASK), todo);
> 
>          if ( flags & HVMCOPY_virt )
> @@ -2427,7 +2429,11 @@ static enum hvm_copy_result __hvm_copy(
>              put_gfn(curr->domain, gfn);
>              return HVMCOPY_bad_gfn_to_mfn;
>          }
> +
>          ASSERT(mfn_valid(mfn));
> +        pg = mfn_to_page(mfn);
> +        ASSERT(get_page(pg, curr->domain));
> +        put_gfn(curr->domain, gfn);
> 
>          p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
> 
> @@ -2457,7 +2463,7 @@ static enum hvm_copy_result __hvm_copy(
>          addr += count;
>          buf  += count;
>          todo -= count;
> -        put_gfn(curr->domain, gfn);
> +        put_page(pg);
>      }
> 
>      return HVMCOPY_okay;
No, it doesn't work. On the contrary, the contention is even fiercer than before: 
Here is the p2m_lock contention over 10 seconds with 16 vcpus:
lock:      560583(00000000:83362735), block:      321453(00000009:364CA49B)

yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  3:34                                             ` Andres Lagar-Cavilla
  2012-04-25  5:18                                               ` Zhang, Yang Z
@ 2012-04-25  8:07                                               ` Jan Beulich
  1 sibling, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2012-04-25  8:07 UTC (permalink / raw)
  To: andres; +Cc: Yang Z Zhang, Tim Deegan, xen-devel, Keir Fraser

>>> On 25.04.12 at 05:34, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2383,6 +2383,8 @@ static enum hvm_copy_result __hvm_copy(
> 
>      while ( todo > 0 )
>      {
> +        struct page_info *pg;
> +
>          count = min_t(int, PAGE_SIZE - (addr & ~PAGE_MASK), todo);
> 
>          if ( flags & HVMCOPY_virt )
> @@ -2427,7 +2429,11 @@ static enum hvm_copy_result __hvm_copy(
>              put_gfn(curr->domain, gfn);
>              return HVMCOPY_bad_gfn_to_mfn;
>          }
> +
>          ASSERT(mfn_valid(mfn));
> +        pg = mfn_to_page(mfn);
> +        ASSERT(get_page(pg, curr->domain));

You really shouldn't ever put expressions with (side) effects inside
an ASSERT(), not even for debugging patches that you want others
to apply (you're of course free to shoot yourself in the foot ;-) ), as
it just won't work in non-debug builds.
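
In a debugging patch something along these lines would be needed instead
(just a sketch to illustrate the point, in the context of the hunk above;
the error handling is guessed):

        pg = mfn_to_page(mfn);
        if ( !get_page(pg, curr->domain) )   /* runs in all builds */
        {
            put_gfn(curr->domain, gfn);
            return HVMCOPY_bad_gfn_to_mfn;
        }
        put_gfn(curr->domain, gfn);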

Jan

> +        put_gfn(curr->domain, gfn);
> 
>          p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
> 
> @@ -2457,7 +2463,7 @@ static enum hvm_copy_result __hvm_copy(
>          addr += count;
>          buf  += count;
>          todo -= count;
> -        put_gfn(curr->domain, gfn);
> +        put_page(pg);
>      }
> 
>      return HVMCOPY_okay;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-25  2:36                                       ` Zhang, Yang Z
  2012-04-25  2:42                                         ` Andres Lagar-Cavilla
@ 2012-04-26 21:25                                         ` Tim Deegan
  2012-04-27  0:46                                           ` Zhang, Yang Z
                                                             ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Tim Deegan @ 2012-04-26 21:25 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser, andres

[-- Attachment #1: Type: text/plain, Size: 2388 bytes --]

At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> > > But actually, the first cs introduced this issue is 24770. When win8
> > > booting and if hpet is enabled, it will use hpet as the time source
> > > and there have lots of hpet access and EPT violation. In EPT violation
> > > handler, it call get_gfn_type_access to get the mfn. The cs 24770
> > > introduces the gfn_lock for p2m lookups, and then the issue happens.
> > > After I removed the gfn_lock, the issue goes. But in latest xen, even
> > > I remove this lock, it still shows high cpu utilization.
> > 
> > It would seem then that even the briefest lock-protected critical section would
> > cause this? In the mmio case, the p2m lock taken in the hap fault handler is
> > held during the actual lookup, and for a couple of branch instructions
> > afterwards.
> > 
> > In latest Xen, with lock removed for get_gfn, on which lock is time spent?
> Still the p2m_lock.

Can you please try the attached patch?  I think you'll need this one
plus the ones that take the locks out of the hpet code. 

This patch makes the p2m lock into an rwlock and adjusts a number of the
paths that don't update the p2m so they only take the read lock.  It's a
bit rough, but I can boot a 16-way win7 guest with it.

N.B. Since rwlocks don't show up in the existing lock profiling, please
don't use the lock-profiling numbers to see whether it's helping!

Andres, this is basically the big-hammer version of your "take a
pagecount" changes, plus the change you made to hvmemul_rep_movs().
If this works I intend to follow it up with a patch to make some of the
read-modify-write paths avoid taking the lock (by using a
compare-exchange operation so they only take the lock on a write).  If
that succeeds I might drop put_gfn() altogether. 

But first it will need a lot of tidying up.  Noticeably missing:
 - SVM code equivalents to the vmx.c changes
 - grant-table operations still use the lock, because frankly I 
   could not follow the current code, and it's quite late in the evening.
I also have a long list of uglinesses in the mm code that I found while
writing this lot. 

Keir, I have no objection to later replacing this with something better
than an rwlock. :)  Or with making a NUMA-friendly rwlock
implementation, since I really expect this to be heavily read-mostly
when paging/sharing/pod are not enabled.

Cheers,

Tim.

[-- Attachment #2: get-page-from-gfn --]
[-- Type: text/plain, Size: 75998 bytes --]

# HG changeset patch
# Parent 107285938c50f82667bd4d014820b439a077c22c

diff -r 107285938c50 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/domain.c	Thu Apr 26 22:00:25 2012 +0100
@@ -716,7 +716,7 @@ int arch_set_info_guest(
 {
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
-    unsigned long cr3_pfn = INVALID_MFN;
+    struct page_info *cr3_page;
     unsigned long flags, cr4;
     unsigned int i;
     int rc = 0, compat;
@@ -925,46 +925,45 @@ int arch_set_info_guest(
     if ( !compat )
     {
         cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
-        cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+        cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-        if ( !mfn_valid(cr3_pfn) ||
-             (paging_mode_refcounts(d)
-              ? !get_page(mfn_to_page(cr3_pfn), d)
-              : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                   PGT_base_page_table)) )
+        if ( !cr3_page )
         {
-            put_gfn(d, cr3_gfn);
+            destroy_gdt(v);
+            return -EINVAL;
+        }
+        if ( !paging_mode_refcounts(d)
+             && !get_page_type(cr3_page, PGT_base_page_table) )
+        {
+            put_page(cr3_page);
             destroy_gdt(v);
             return -EINVAL;
         }
 
-        v->arch.guest_table = pagetable_from_pfn(cr3_pfn);
-        put_gfn(d, cr3_gfn);
+        v->arch.guest_table = pagetable_from_page(cr3_page);
 #ifdef __x86_64__
         if ( c.nat->ctrlreg[1] )
         {
             cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
-            cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+            cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-            if ( !mfn_valid(cr3_pfn) ||
-                 (paging_mode_refcounts(d)
-                  ? !get_page(mfn_to_page(cr3_pfn), d)
-                  : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                       PGT_base_page_table)) )
+            if ( !cr3_page ||
+                 (!paging_mode_refcounts(d)
+                  && !get_page_type(cr3_page, PGT_base_page_table)) )
             {
-                cr3_pfn = pagetable_get_pfn(v->arch.guest_table);
+                if (cr3_page)
+                    put_page(cr3_page);
+                cr3_page = pagetable_get_page(v->arch.guest_table);
                 v->arch.guest_table = pagetable_null();
                 if ( paging_mode_refcounts(d) )
-                    put_page(mfn_to_page(cr3_pfn));
+                    put_page(cr3_page);
                 else
-                    put_page_and_type(mfn_to_page(cr3_pfn));
-                put_gfn(d, cr3_gfn); 
+                    put_page_and_type(cr3_page);
                 destroy_gdt(v);
                 return -EINVAL;
             }
 
-            v->arch.guest_table_user = pagetable_from_pfn(cr3_pfn);
-            put_gfn(d, cr3_gfn); 
+            v->arch.guest_table_user = pagetable_from_page(cr3_page);
         }
         else if ( !(flags & VGCF_in_kernel) )
         {
@@ -977,23 +976,25 @@ int arch_set_info_guest(
         l4_pgentry_t *l4tab;
 
         cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]);
-        cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+        cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-        if ( !mfn_valid(cr3_pfn) ||
-             (paging_mode_refcounts(d)
-              ? !get_page(mfn_to_page(cr3_pfn), d)
-              : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                   PGT_l3_page_table)) )
+        if ( !cr3_page)
         {
-            put_gfn(d, cr3_gfn); 
+            destroy_gdt(v);
+            return -EINVAL;
+        }
+
+        if (!paging_mode_refcounts(d)
+            && !get_page_and_type(cr3_page, d, PGT_l3_page_table) )
+        {
+            put_page(cr3_page);
             destroy_gdt(v);
             return -EINVAL;
         }
 
         l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
-        *l4tab = l4e_from_pfn(
-            cr3_pfn, _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
-        put_gfn(d, cr3_gfn); 
+        *l4tab = l4e_from_pfn(page_to_mfn(cr3_page),
+            _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
 #endif
     }
 
@@ -1064,7 +1065,7 @@ map_vcpu_info(struct vcpu *v, unsigned l
     struct domain *d = v->domain;
     void *mapping;
     vcpu_info_t *new_info;
-    unsigned long mfn;
+    struct page_info *page;
     int i;
 
     if ( offset > (PAGE_SIZE - sizeof(vcpu_info_t)) )
@@ -1077,19 +1078,20 @@ map_vcpu_info(struct vcpu *v, unsigned l
     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
         return -EINVAL;
 
-    mfn = get_gfn_untyped(d, gfn);
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
+    if ( !page )
+        return -EINVAL;
+
+    if ( !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gfn); 
+        put_page(page);
         return -EINVAL;
     }
 
-    mapping = map_domain_page_global(mfn);
+    mapping = __map_domain_page_global(page);
     if ( mapping == NULL )
     {
-        put_page_and_type(mfn_to_page(mfn));
-        put_gfn(d, gfn); 
+        put_page_and_type(page);
         return -ENOMEM;
     }
 
@@ -1106,7 +1108,7 @@ map_vcpu_info(struct vcpu *v, unsigned l
     }
 
     v->vcpu_info = new_info;
-    v->arch.pv_vcpu.vcpu_info_mfn = mfn;
+    v->arch.pv_vcpu.vcpu_info_mfn = page_to_mfn(page);
 
     /* Set new vcpu_info pointer /before/ setting pending flags. */
     wmb();
@@ -1119,7 +1121,6 @@ map_vcpu_info(struct vcpu *v, unsigned l
     for ( i = 0; i < BITS_PER_EVTCHN_WORD(d); i++ )
         set_bit(i, &vcpu_info(v, evtchn_pending_sel));
 
-    put_gfn(d, gfn); 
     return 0;
 }
 
diff -r 107285938c50 xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/domctl.c	Thu Apr 26 22:00:25 2012 +0100
@@ -202,16 +202,16 @@ long arch_do_domctl(
 
                 for ( j = 0; j < k; j++ )
                 {
-                    unsigned long type = 0, mfn = get_gfn_untyped(d, arr[j]);
+                    unsigned long type = 0;
 
-                    page = mfn_to_page(mfn);
+                    page = get_page_from_gfn(d, arr[j], NULL, P2M_ALLOC);
 
-                    if ( unlikely(!mfn_valid(mfn)) ||
-                         unlikely(is_xen_heap_mfn(mfn)) )
+                    if ( unlikely(!page) ||
+                         unlikely(is_xen_heap_page(page)) )
                         type = XEN_DOMCTL_PFINFO_XTAB;
                     else if ( xsm_getpageframeinfo(page) != 0 )
                         ;
-                    else if ( likely(get_page(page, d)) )
+                    else
                     {
                         switch( page->u.inuse.type_info & PGT_type_mask )
                         {
@@ -231,13 +231,10 @@ long arch_do_domctl(
 
                         if ( page->u.inuse.type_info & PGT_pinned )
                             type |= XEN_DOMCTL_PFINFO_LPINTAB;
+                    }
 
+                    if ( page )
                         put_page(page);
-                    }
-                    else
-                        type = XEN_DOMCTL_PFINFO_XTAB;
-
-                    put_gfn(d, arr[j]);
                     arr[j] = type;
                 }
 
@@ -304,21 +301,21 @@ long arch_do_domctl(
             {      
                 struct page_info *page;
                 unsigned long gfn = arr32[j];
-                unsigned long mfn = get_gfn_untyped(d, gfn);
 
-                page = mfn_to_page(mfn);
+                page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
 
                 if ( domctl->cmd == XEN_DOMCTL_getpageframeinfo3)
                     arr32[j] = 0;
 
-                if ( unlikely(!mfn_valid(mfn)) ||
-                     unlikely(is_xen_heap_mfn(mfn)) )
+                if ( unlikely(!page) ||
+                     unlikely(is_xen_heap_page(page)) )
                     arr32[j] |= XEN_DOMCTL_PFINFO_XTAB;
                 else if ( xsm_getpageframeinfo(page) != 0 )
                 {
-                    put_gfn(d, gfn); 
+                    put_page(page);
                     continue;
-                } else if ( likely(get_page(page, d)) )
+                }
+                else
                 {
                     unsigned long type = 0;
 
@@ -341,12 +338,10 @@ long arch_do_domctl(
                     if ( page->u.inuse.type_info & PGT_pinned )
                         type |= XEN_DOMCTL_PFINFO_LPINTAB;
                     arr32[j] |= type;
+                }
+
+                if ( page )
                     put_page(page);
-                }
-                else
-                    arr32[j] |= XEN_DOMCTL_PFINFO_XTAB;
-
-                put_gfn(d, gfn); 
             }
 
             if ( copy_to_guest_offset(domctl->u.getpageframeinfo2.array,
@@ -419,7 +414,7 @@ long arch_do_domctl(
     {
         struct domain *d = rcu_lock_domain_by_id(domctl->domain);
         unsigned long gmfn = domctl->u.hypercall_init.gmfn;
-        unsigned long mfn;
+        struct page_info *page;
         void *hypercall_page;
 
         ret = -ESRCH;
@@ -433,26 +428,25 @@ long arch_do_domctl(
             break;
         }
 
-        mfn = get_gfn_untyped(d, gmfn);
+        page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
 
         ret = -EACCES;
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+        if ( !page || !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(d, gmfn); 
+            if ( page )
+                put_page(page);
             rcu_unlock_domain(d);
             break;
         }
 
         ret = 0;
 
-        hypercall_page = map_domain_page(mfn);
+        hypercall_page = __map_domain_page(page);
         hypercall_page_initialise(d, hypercall_page);
         unmap_domain_page(hypercall_page);
 
-        put_page_and_type(mfn_to_page(mfn));
+        put_page_and_type(page);
 
-        put_gfn(d, gmfn); 
         rcu_unlock_domain(d);
     }
     break;
diff -r 107285938c50 xen/arch/x86/hvm/emulate.c
--- a/xen/arch/x86/hvm/emulate.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/emulate.c	Thu Apr 26 22:00:25 2012 +0100
@@ -60,34 +60,25 @@ static int hvmemul_do_io(
     ioreq_t *p = get_ioreq(curr);
     unsigned long ram_gfn = paddr_to_pfn(ram_gpa);
     p2m_type_t p2mt;
-    mfn_t ram_mfn;
+    struct page_info *ram_page;
     int rc;
 
     /* Check for paged out page */
-    ram_mfn = get_gfn_unshare(curr->domain, ram_gfn, &p2mt);
+    ram_page = get_page_from_gfn(curr->domain, ram_gfn, &p2mt, P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(curr->domain, ram_gfn); 
+        if ( ram_page )
+            put_page(ram_page);
         p2m_mem_paging_populate(curr->domain, ram_gfn);
         return X86EMUL_RETRY;
     }
     if ( p2m_is_shared(p2mt) )
     {
-        put_gfn(curr->domain, ram_gfn); 
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_RETRY;
     }
 
-    /* Maintain a ref on the mfn to ensure liveness. Put the gfn
-     * to avoid potential deadlock wrt event channel lock, later. */
-    if ( mfn_valid(mfn_x(ram_mfn)) )
-        if ( !get_page(mfn_to_page(mfn_x(ram_mfn)),
-             curr->domain) )
-        {
-            put_gfn(curr->domain, ram_gfn);
-            return X86EMUL_RETRY;
-        }
-    put_gfn(curr->domain, ram_gfn);
-
     /*
      * Weird-sized accesses have undefined behaviour: we discard writes
      * and read all-ones.
@@ -98,8 +89,8 @@ static int hvmemul_do_io(
         ASSERT(p_data != NULL); /* cannot happen with a REP prefix */
         if ( dir == IOREQ_READ )
             memset(p_data, ~0, size);
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -120,8 +111,8 @@ static int hvmemul_do_io(
             unsigned int bytes = vio->mmio_large_write_bytes;
             if ( (addr >= pa) && ((addr + size) <= (pa + bytes)) )
             {
-                if ( mfn_valid(mfn_x(ram_mfn)) )
-                    put_page(mfn_to_page(mfn_x(ram_mfn)));
+                if ( ram_page )
+                    put_page(ram_page);
                 return X86EMUL_OKAY;
             }
         }
@@ -133,8 +124,8 @@ static int hvmemul_do_io(
             {
                 memcpy(p_data, &vio->mmio_large_read[addr - pa],
                        size);
-                if ( mfn_valid(mfn_x(ram_mfn)) )
-                    put_page(mfn_to_page(mfn_x(ram_mfn)));
+                if ( ram_page )
+                    put_page(ram_page);
                 return X86EMUL_OKAY;
             }
         }
@@ -148,8 +139,8 @@ static int hvmemul_do_io(
         vio->io_state = HVMIO_none;
         if ( p_data == NULL )
         {
-            if ( mfn_valid(mfn_x(ram_mfn)) )
-                put_page(mfn_to_page(mfn_x(ram_mfn)));
+            if ( ram_page )
+                put_page(ram_page);
             return X86EMUL_UNHANDLEABLE;
         }
         goto finish_access;
@@ -159,13 +150,13 @@ static int hvmemul_do_io(
              (addr == (vio->mmio_large_write_pa +
                        vio->mmio_large_write_bytes)) )
         {
-            if ( mfn_valid(mfn_x(ram_mfn)) )
-                put_page(mfn_to_page(mfn_x(ram_mfn)));
+            if ( ram_page )
+                put_page(ram_page);
             return X86EMUL_RETRY;
         }
     default:
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -173,8 +164,8 @@ static int hvmemul_do_io(
     {
         gdprintk(XENLOG_WARNING, "WARNING: io already pending (%d)?\n",
                  p->state);
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -226,8 +217,8 @@ static int hvmemul_do_io(
 
     if ( rc != X86EMUL_OKAY )
     {
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return rc;
     }
 
@@ -263,8 +254,8 @@ static int hvmemul_do_io(
         }
     }
 
-    if ( mfn_valid(mfn_x(ram_mfn)) )
-        put_page(mfn_to_page(mfn_x(ram_mfn)));
+    if ( ram_page )
+        put_page(ram_page);
     return X86EMUL_OKAY;
 }
 
@@ -686,7 +677,6 @@ static int hvmemul_rep_movs(
     p2m_type_t sp2mt, dp2mt;
     int rc, df = !!(ctxt->regs->eflags & X86_EFLAGS_DF);
     char *buf;
-    struct two_gfns tg;
 
     rc = hvmemul_virtual_to_linear(
         src_seg, src_offset, bytes_per_rep, reps, hvm_access_read,
@@ -714,25 +704,17 @@ static int hvmemul_rep_movs(
     if ( rc != X86EMUL_OKAY )
         return rc;
 
-    get_two_gfns(current->domain, sgpa >> PAGE_SHIFT, &sp2mt, NULL, NULL,
-                 current->domain, dgpa >> PAGE_SHIFT, &dp2mt, NULL, NULL,
-                 P2M_ALLOC, &tg);
+    /* Check for MMIO ops */
+    (void) get_gfn_query_unlocked(current->domain, sgpa >> PAGE_SHIFT, &sp2mt);
+    (void) get_gfn_query_unlocked(current->domain, dgpa >> PAGE_SHIFT, &dp2mt);
 
-    if ( !p2m_is_ram(sp2mt) && !p2m_is_grant(sp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+    if ( sp2mt == p2m_mmio_dm )
+        return hvmemul_do_mmio(
             sgpa, reps, bytes_per_rep, dgpa, IOREQ_READ, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }
 
-    if ( !p2m_is_ram(dp2mt) && !p2m_is_grant(dp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+    if ( dp2mt == p2m_mmio_dm )
+        return hvmemul_do_mmio(
             dgpa, reps, bytes_per_rep, sgpa, IOREQ_WRITE, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }
 
     /* RAM-to-RAM copy: emulate as equivalent of memmove(dgpa, sgpa, bytes). */
     bytes = *reps * bytes_per_rep;
@@ -747,10 +729,7 @@ static int hvmemul_rep_movs(
      * can be emulated by a source-to-buffer-to-destination block copy.
      */
     if ( ((dgpa + bytes_per_rep) > sgpa) && (dgpa < (sgpa + bytes)) )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }
 
     /* Adjust destination address for reverse copy. */
     if ( df )
@@ -759,10 +738,7 @@ static int hvmemul_rep_movs(
     /* Allocate temporary buffer. Fall back to slow emulation if this fails. */
     buf = xmalloc_bytes(bytes);
     if ( buf == NULL )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }
 
     /*
      * We do a modicum of checking here, just for paranoia's sake and to
@@ -773,7 +749,6 @@ static int hvmemul_rep_movs(
         rc = hvm_copy_to_guest_phys(dgpa, buf, bytes);
 
     xfree(buf);
-    put_two_gfns(&tg);
 
     if ( rc == HVMCOPY_gfn_paged_out )
         return X86EMUL_RETRY;
diff -r 107285938c50 xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/hvm.c	Thu Apr 26 22:00:25 2012 +0100
@@ -395,48 +395,41 @@ int prepare_ring_for_helper(
 {
     struct page_info *page;
     p2m_type_t p2mt;
-    unsigned long mfn;
     void *va;
 
-    mfn = mfn_x(get_gfn_unshare(d, gmfn, &p2mt));
-    if ( !p2m_is_ram(p2mt) )
-    {
-        put_gfn(d, gmfn);
-        return -EINVAL;
-    }
+    page = get_page_from_gfn(d, gmfn, &p2mt, P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(d, gmfn);
+        if ( page )
+            put_page(page);
         p2m_mem_paging_populate(d, gmfn);
         return -ENOENT;
     }
     if ( p2m_is_shared(p2mt) )
     {
-        put_gfn(d, gmfn);
+        if ( page )
+            put_page(page);
         return -ENOENT;
     }
-    ASSERT(mfn_valid(mfn));
-
-    page = mfn_to_page(mfn);
-    if ( !get_page_and_type(page, d, PGT_writable_page) )
+    if ( !page )
+        return -EINVAL;
+
+    if ( !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn);
+        put_page(page);
         return -EINVAL;
     }
 
-    va = map_domain_page_global(mfn);
+    va = __map_domain_page_global(page);
     if ( va == NULL )
     {
         put_page_and_type(page);
-        put_gfn(d, gmfn);
         return -ENOMEM;
     }
 
     *_va = va;
     *_page = page;
 
-    put_gfn(d, gmfn);
-
     return 0;
 }
 
@@ -1607,8 +1600,8 @@ int hvm_mov_from_cr(unsigned int cr, uns
 int hvm_set_cr0(unsigned long value)
 {
     struct vcpu *v = current;
-    p2m_type_t p2mt;
-    unsigned long gfn, mfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
+    unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
+    struct page_info *page;
 
     HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR0 value = %lx", value);
 
@@ -1647,23 +1640,20 @@ int hvm_set_cr0(unsigned long value)
         {
             /* The guest CR3 must be pointing to the guest physical. */
             gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
-            mfn = mfn_x(get_gfn(v->domain, gfn, &p2mt));
-            if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) ||
-                 !get_page(mfn_to_page(mfn), v->domain))
+            page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+            if ( !page )
             {
-                put_gfn(v->domain, gfn);
-                gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx (mfn=%lx)\n",
-                         v->arch.hvm_vcpu.guest_cr[3], mfn);
+                gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
+                         v->arch.hvm_vcpu.guest_cr[3]);
                 domain_crash(v->domain);
                 return X86EMUL_UNHANDLEABLE;
             }
 
             /* Now arch.guest_table points to machine physical. */
-            v->arch.guest_table = pagetable_from_pfn(mfn);
+            v->arch.guest_table = pagetable_from_page(page);
 
             HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx, mfn = %lx",
-                        v->arch.hvm_vcpu.guest_cr[3], mfn);
-            put_gfn(v->domain, gfn);
+                        v->arch.hvm_vcpu.guest_cr[3], page_to_mfn(page));
         }
     }
     else if ( !(value & X86_CR0_PG) && (old_value & X86_CR0_PG) )
@@ -1738,26 +1728,21 @@ int hvm_set_cr0(unsigned long value)
 
 int hvm_set_cr3(unsigned long value)
 {
-    unsigned long mfn;
-    p2m_type_t p2mt;
     struct vcpu *v = current;
+    struct page_info *page;
 
     if ( hvm_paging_enabled(v) && !paging_mode_hap(v->domain) &&
          (value != v->arch.hvm_vcpu.guest_cr[3]) )
     {
         /* Shadow-mode CR3 change. Check PDBR and update refcounts. */
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "CR3 value = %lx", value);
-        mfn = mfn_x(get_gfn(v->domain, value >> PAGE_SHIFT, &p2mt));
-        if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) ||
-             !get_page(mfn_to_page(mfn), v->domain) )
-        {
-              put_gfn(v->domain, value >> PAGE_SHIFT);
-              goto bad_cr3;
-        }
+        page = get_page_from_gfn(v->domain, value >> PAGE_SHIFT,
+                                 NULL, P2M_ALLOC);
+        if ( !page )
+            goto bad_cr3;
 
         put_page(pagetable_get_page(v->arch.guest_table));
-        v->arch.guest_table = pagetable_from_pfn(mfn);
-        put_gfn(v->domain, value >> PAGE_SHIFT);
+        v->arch.guest_table = pagetable_from_page(page);
 
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx", value);
     }
@@ -1914,46 +1899,29 @@ int hvm_virtual_to_linear_addr(
 static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable)
 {
     void *map;
-    unsigned long mfn;
     p2m_type_t p2mt;
-    struct page_info *pg;
+    struct page_info *page;
     struct domain *d = current->domain;
-    int rc;
-
-    mfn = mfn_x(writable
-                ? get_gfn_unshare(d, gfn, &p2mt)
-                : get_gfn(d, gfn, &p2mt));
-    if ( (p2m_is_shared(p2mt) && writable) || !p2m_is_ram(p2mt) )
+
+    page = get_page_from_gfn(d, gfn, &p2mt,
+                             writable ? P2M_UNSHARE : P2M_ALLOC);
+    if ( (p2m_is_shared(p2mt) && writable) || !page )
     {
-        put_gfn(d, gfn);
+        if ( page )
+            put_page(page);
         return NULL;
     }
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(d, gfn);
+        put_page(page);
         p2m_mem_paging_populate(d, gfn);
         return NULL;
     }
 
-    ASSERT(mfn_valid(mfn));
-
     if ( writable )
-        paging_mark_dirty(d, mfn);
-
-    /* Get a ref on the page, considering that it could be shared */
-    pg = mfn_to_page(mfn);
-    rc = get_page(pg, d);
-    if ( !rc && !writable )
-        /* Page could be shared */
-        rc = get_page(pg, dom_cow);
-    if ( !rc )
-    {
-        put_gfn(d, gfn);
-        return NULL;
-    }
-
-    map = map_domain_page(mfn);
-    put_gfn(d, gfn);
+        paging_mark_dirty(d, page_to_mfn(page));
+
+    map = __map_domain_page(page);
     return map;
 }
 
@@ -2358,7 +2326,8 @@ static enum hvm_copy_result __hvm_copy(
     void *buf, paddr_t addr, int size, unsigned int flags, uint32_t pfec)
 {
     struct vcpu *curr = current;
-    unsigned long gfn, mfn;
+    unsigned long gfn;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
@@ -2402,32 +2371,33 @@ static enum hvm_copy_result __hvm_copy(
             gfn = addr >> PAGE_SHIFT;
         }
 
-        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
+        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
 
         if ( p2m_is_paging(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             p2m_mem_paging_populate(curr->domain, gfn);
             return HVMCOPY_gfn_paged_out;
         }
         if ( p2m_is_shared(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_gfn_shared;
         }
         if ( p2m_is_grant(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_unhandleable;
         }
-        if ( !p2m_is_ram(p2mt) )
+        if ( !page )
         {
-            put_gfn(curr->domain, gfn);
             return HVMCOPY_bad_gfn_to_mfn;
         }
-        ASSERT(mfn_valid(mfn));
-
-        p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
+
+        p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( flags & HVMCOPY_to_guest )
         {
@@ -2437,12 +2407,12 @@ static enum hvm_copy_result __hvm_copy(
                 if ( xchg(&lastpage, gfn) != gfn )
                     gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
                              " memory page. gfn=%#lx, mfn=%#lx\n",
-                             gfn, mfn);
+                             gfn, page_to_mfn(page));
             }
             else
             {
                 memcpy(p, buf, count);
-                paging_mark_dirty(curr->domain, mfn);
+                paging_mark_dirty(curr->domain, page_to_mfn(page));
             }
         }
         else
@@ -2455,7 +2425,7 @@ static enum hvm_copy_result __hvm_copy(
         addr += count;
         buf  += count;
         todo -= count;
-        put_gfn(curr->domain, gfn);
+        put_page(page);
     }
 
     return HVMCOPY_okay;
@@ -2464,7 +2434,8 @@ static enum hvm_copy_result __hvm_copy(
 static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
 {
     struct vcpu *curr = current;
-    unsigned long gfn, mfn;
+    unsigned long gfn;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
@@ -2500,32 +2471,35 @@ static enum hvm_copy_result __hvm_clear(
             return HVMCOPY_bad_gva_to_gfn;
         }
 
-        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
+        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
 
         if ( p2m_is_paging(p2mt) )
         {
+            if ( page )
+                put_page(page);
             p2m_mem_paging_populate(curr->domain, gfn);
-            put_gfn(curr->domain, gfn);
             return HVMCOPY_gfn_paged_out;
         }
         if ( p2m_is_shared(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_gfn_shared;
         }
         if ( p2m_is_grant(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_unhandleable;
         }
-        if ( !p2m_is_ram(p2mt) )
+        if ( !page )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_bad_gfn_to_mfn;
         }
-        ASSERT(mfn_valid(mfn));
-
-        p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
+
+        p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( p2mt == p2m_ram_ro )
         {
@@ -2533,19 +2507,19 @@ static enum hvm_copy_result __hvm_clear(
             if ( xchg(&lastpage, gfn) != gfn )
                 gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
                         " memory page. gfn=%#lx, mfn=%#lx\n",
-                        gfn, mfn);
+                         gfn, page_to_mfn(page));
         }
         else
         {
             memset(p, 0x00, count);
-            paging_mark_dirty(curr->domain, mfn);
+            paging_mark_dirty(curr->domain, page_to_mfn(page));
         }
 
         unmap_domain_page(p);
 
         addr += count;
         todo -= count;
-        put_gfn(curr->domain, gfn);
+        put_page(page);
     }
 
     return HVMCOPY_okay;
@@ -4000,35 +3974,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
 
         for ( pfn = a.first_pfn; pfn < a.first_pfn + a.nr; pfn++ )
         {
-            p2m_type_t t;
-            mfn_t mfn = get_gfn_unshare(d, pfn, &t);
-            if ( p2m_is_paging(t) )
+            struct page_info *page;
+            page = get_page_from_gfn(d, pfn, NULL, P2M_UNSHARE);
+            if ( page )
             {
-                put_gfn(d, pfn);
-                p2m_mem_paging_populate(d, pfn);
-                rc = -EINVAL;
-                goto param_fail3;
-            }
-            if( p2m_is_shared(t) )
-            {
-                /* If it insists on not unsharing itself, crash the domain 
-                 * rather than crashing the host down in mark dirty */
-                gdprintk(XENLOG_WARNING,
-                         "shared pfn 0x%lx modified?\n", pfn);
-                domain_crash(d);
-                put_gfn(d, pfn);
-                rc = -EINVAL;
-                goto param_fail3;
-            }
-            
-            if ( mfn_x(mfn) != INVALID_MFN )
-            {
-                paging_mark_dirty(d, mfn_x(mfn));
+                paging_mark_dirty(d, page_to_mfn(page));
                 /* These are most probably not page tables any more */
                 /* don't take a long time and don't die either */
-                sh_remove_shadows(d->vcpu[0], mfn, 1, 0);
+                sh_remove_shadows(d->vcpu[0], _mfn(page_to_mfn(page)), 1, 0);
+                put_page(page);
             }
-            put_gfn(d, pfn);
         }
 
     param_fail3:
diff -r 107285938c50 xen/arch/x86/hvm/stdvga.c
--- a/xen/arch/x86/hvm/stdvga.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/stdvga.c	Thu Apr 26 22:00:25 2012 +0100
@@ -482,7 +482,8 @@ static int mmio_move(struct hvm_hw_stdvg
                 if ( hvm_copy_to_guest_phys(data, &tmp, p->size) !=
                      HVMCOPY_okay )
                 {
-                    (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
+                    struct page_info *dp = get_page_from_gfn(
+                            d, data >> PAGE_SHIFT, &p2mt, P2M_ALLOC);
                     /*
                      * The only case we handle is vga_mem <-> vga_mem.
                      * Anything else disables caching and leaves it to qemu-dm.
@@ -490,11 +491,12 @@ static int mmio_move(struct hvm_hw_stdvg
                     if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                          ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
                     {
-                        put_gfn(d, data >> PAGE_SHIFT);
+                        if ( dp )
+                            put_page(dp);
                         return 0;
                     }
+                    ASSERT(!dp);
                     stdvga_mem_write(data, tmp, p->size);
-                    put_gfn(d, data >> PAGE_SHIFT);
                 }
                 data += sign * p->size;
                 addr += sign * p->size;
@@ -508,15 +510,16 @@ static int mmio_move(struct hvm_hw_stdvg
                 if ( hvm_copy_from_guest_phys(&tmp, data, p->size) !=
                      HVMCOPY_okay )
                 {
-                    (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
+                    struct page_info *dp = get_page_from_gfn(
+                        d, data >> PAGE_SHIFT, &p2mt, P2M_ALLOC);
                     if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                          ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
                     {
-                        put_gfn(d, data >> PAGE_SHIFT);
+                        if ( dp )
+                            put_page(dp);
                         return 0;
                     }
                     tmp = stdvga_mem_read(data, p->size);
-                    put_gfn(d, data >> PAGE_SHIFT);
                 }
                 stdvga_mem_write(addr, tmp, p->size);
                 data += sign * p->size;
diff -r 107285938c50 xen/arch/x86/hvm/viridian.c
--- a/xen/arch/x86/hvm/viridian.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/viridian.c	Thu Apr 26 22:00:25 2012 +0100
@@ -134,18 +134,19 @@ void dump_apic_assist(struct vcpu *v)
 static void enable_hypercall_page(struct domain *d)
 {
     unsigned long gmfn = d->arch.hvm_domain.viridian.hypercall_gpa.fields.pfn;
-    unsigned long mfn = get_gfn_untyped(d, gmfn);
+    struct page_info *page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
     uint8_t *p;
 
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    if ( !page || !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn); 
-        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn);
+        if ( page )
+            put_page(page);
+        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn,
+                 page_to_mfn(page));
         return;
     }
 
-    p = map_domain_page(mfn);
+    p = __map_domain_page(page);
 
     /*
      * We set the bit 31 in %eax (reserved field in the Viridian hypercall
@@ -162,15 +163,14 @@ static void enable_hypercall_page(struct
 
     unmap_domain_page(p);
 
-    put_page_and_type(mfn_to_page(mfn));
-    put_gfn(d, gmfn); 
+    put_page_and_type(page);
 }
 
 void initialize_apic_assist(struct vcpu *v)
 {
     struct domain *d = v->domain;
     unsigned long gmfn = v->arch.hvm_vcpu.viridian.apic_assist.fields.pfn;
-    unsigned long mfn = get_gfn_untyped(d, gmfn);
+    struct page_info *page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
     uint8_t *p;
 
     /*
@@ -183,22 +183,22 @@ void initialize_apic_assist(struct vcpu 
      * details of how Windows uses the page.
      */
 
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    if ( !page || !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn); 
-        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn);
+        if ( page )
+            put_page(page);
+        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn,
+                 page_to_mfn(page));
         return;
     }
 
-    p = map_domain_page(mfn);
+    p = __map_domain_page(page);
 
     *(u32 *)p = 0;
 
     unmap_domain_page(p);
 
-    put_page_and_type(mfn_to_page(mfn));
-    put_gfn(d, gmfn); 
+    put_page_and_type(page);
 }
 
 int wrmsr_viridian_regs(uint32_t idx, uint64_t val)
diff -r 107285938c50 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 26 22:00:25 2012 +0100
@@ -480,17 +480,16 @@ static void vmx_vmcs_save(struct vcpu *v
 static int vmx_restore_cr0_cr3(
     struct vcpu *v, unsigned long cr0, unsigned long cr3)
 {
-    unsigned long mfn = 0;
-    p2m_type_t p2mt;
+    struct page_info *page = NULL;
 
     if ( paging_mode_shadow(v->domain) )
     {
         if ( cr0 & X86_CR0_PG )
         {
-            mfn = mfn_x(get_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt));
-            if ( !p2m_is_ram(p2mt) || !get_page(mfn_to_page(mfn), v->domain) )
+            page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT,
+                                     NULL, P2M_ALLOC);
+            if ( !page )
             {
-                put_gfn(v->domain, cr3 >> PAGE_SHIFT);
                 gdprintk(XENLOG_ERR, "Invalid CR3 value=0x%lx\n", cr3);
                 return -EINVAL;
             }
@@ -499,9 +498,8 @@ static int vmx_restore_cr0_cr3(
         if ( hvm_paging_enabled(v) )
             put_page(pagetable_get_page(v->arch.guest_table));
 
-        v->arch.guest_table = pagetable_from_pfn(mfn);
-        if ( cr0 & X86_CR0_PG )
-            put_gfn(v->domain, cr3 >> PAGE_SHIFT);
+        v->arch.guest_table =
+            page ? pagetable_from_page(page) : pagetable_null();
     }
 
     v->arch.hvm_vcpu.guest_cr[0] = cr0 | X86_CR0_ET;
@@ -1026,8 +1024,9 @@ static void vmx_set_interrupt_shadow(str
 
 static void vmx_load_pdptrs(struct vcpu *v)
 {
-    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3], mfn;
+    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
     uint64_t *guest_pdptrs;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
 
@@ -1038,24 +1037,19 @@ static void vmx_load_pdptrs(struct vcpu 
     if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
         goto crash;
 
-    mfn = mfn_x(get_gfn_unshare(v->domain, cr3 >> PAGE_SHIFT, &p2mt));
-    if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) || 
-         /* If we didn't succeed in unsharing, get_page will fail
-          * (page still belongs to dom_cow) */
-         !get_page(mfn_to_page(mfn), v->domain) )
+    page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);
+    if ( !page )
     {
         /* Ideally you don't want to crash but rather go into a wait 
          * queue, but this is the wrong place. We're holding at least
          * the paging lock */
         gdprintk(XENLOG_ERR,
-                 "Bad cr3 on load pdptrs gfn %lx mfn %lx type %d\n",
-                 cr3 >> PAGE_SHIFT, mfn, (int) p2mt);
-        put_gfn(v->domain, cr3 >> PAGE_SHIFT);
+                 "Bad cr3 on load pdptrs gfn %lx type %d\n",
+                 cr3 >> PAGE_SHIFT, (int) p2mt);
         goto crash;
     }
-    put_gfn(v->domain, cr3 >> PAGE_SHIFT);
-
-    p = map_domain_page(mfn);
+
+    p = __map_domain_page(page);
 
     guest_pdptrs = (uint64_t *)(p + (cr3 & ~PAGE_MASK));
 
@@ -1081,7 +1075,7 @@ static void vmx_load_pdptrs(struct vcpu 
     vmx_vmcs_exit(v);
 
     unmap_domain_page(p);
-    put_page(mfn_to_page(mfn));
+    put_page(page);
     return;
 
  crash:
diff -r 107285938c50 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm.c	Thu Apr 26 22:00:25 2012 +0100
@@ -651,7 +651,8 @@ int map_ldt_shadow_page(unsigned int off
 {
     struct vcpu *v = current;
     struct domain *d = v->domain;
-    unsigned long gmfn, mfn;
+    unsigned long gmfn;
+    struct page_info *page;
     l1_pgentry_t l1e, nl1e;
     unsigned long gva = v->arch.pv_vcpu.ldt_base + (off << PAGE_SHIFT);
     int okay;
@@ -663,28 +664,24 @@ int map_ldt_shadow_page(unsigned int off
         return 0;
 
     gmfn = l1e_get_pfn(l1e);
-    mfn = get_gfn_untyped(d, gmfn);
-    if ( unlikely(!mfn_valid(mfn)) )
+    page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
+    if ( unlikely(!page) )
+        return 0;
+
+    okay = get_page_type(page, PGT_seg_desc_page);
+    if ( unlikely(!okay) )
     {
-        put_gfn(d, gmfn); 
+        put_page(page);
         return 0;
     }
 
-    okay = get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page);
-    if ( unlikely(!okay) )
-    {
-        put_gfn(d, gmfn); 
-        return 0;
-    }
-
-    nl1e = l1e_from_pfn(mfn, l1e_get_flags(l1e) | _PAGE_RW);
+    nl1e = l1e_from_pfn(page_to_mfn(page), l1e_get_flags(l1e) | _PAGE_RW);
 
     spin_lock(&v->arch.pv_vcpu.shadow_ldt_lock);
     l1e_write(&v->arch.perdomain_ptes[off + 16], nl1e);
     v->arch.pv_vcpu.shadow_ldt_mapcnt++;
     spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock);
 
-    put_gfn(d, gmfn); 
     return 1;
 }
 
@@ -1819,7 +1816,6 @@ static int mod_l1_entry(l1_pgentry_t *pl
 {
     l1_pgentry_t ol1e;
     struct domain *pt_dom = pt_vcpu->domain;
-    p2m_type_t p2mt;
     int rc = 0;
 
     if ( unlikely(__copy_from_user(&ol1e, pl1e, sizeof(ol1e)) != 0) )
@@ -1835,22 +1831,21 @@ static int mod_l1_entry(l1_pgentry_t *pl
     if ( l1e_get_flags(nl1e) & _PAGE_PRESENT )
     {
         /* Translate foreign guest addresses. */
-        unsigned long mfn, gfn;
-        gfn = l1e_get_pfn(nl1e);
-        mfn = mfn_x(get_gfn(pg_dom, gfn, &p2mt));
-        if ( !p2m_is_ram(p2mt) || unlikely(mfn == INVALID_MFN) )
+        struct page_info *page = NULL;
+        if ( paging_mode_translate(pg_dom) )
         {
-            put_gfn(pg_dom, gfn);
-            return -EINVAL;
+            page = get_page_from_gfn(pg_dom, l1e_get_pfn(nl1e), NULL, P2M_ALLOC);
+            if ( !page )
+                return -EINVAL;
+            nl1e = l1e_from_pfn(page_to_mfn(page), l1e_get_flags(nl1e));
         }
-        ASSERT((mfn & ~(PADDR_MASK >> PAGE_SHIFT)) == 0);
-        nl1e = l1e_from_pfn(mfn, l1e_get_flags(nl1e));
 
         if ( unlikely(l1e_get_flags(nl1e) & l1_disallow_mask(pt_dom)) )
         {
             MEM_LOG("Bad L1 flags %x",
                     l1e_get_flags(nl1e) & l1_disallow_mask(pt_dom));
-            put_gfn(pg_dom, gfn);
+            if ( page )
+                put_page(page);
             return -EINVAL;
         }
 
@@ -1860,15 +1855,21 @@ static int mod_l1_entry(l1_pgentry_t *pl
             adjust_guest_l1e(nl1e, pt_dom);
             if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                               preserve_ad) )
+            {
+                if ( page )
+                    put_page(page);
                 return 0;
-            put_gfn(pg_dom, gfn);
+            }
+            if ( page )
+                put_page(page);
             return -EBUSY;
         }
 
         switch ( rc = get_page_from_l1e(nl1e, pt_dom, pg_dom) )
         {
         default:
-            put_gfn(pg_dom, gfn);
+            if ( page )
+                put_page(page);
             return rc;
         case 0:
             break;
@@ -1876,7 +1877,9 @@ static int mod_l1_entry(l1_pgentry_t *pl
             l1e_remove_flags(nl1e, _PAGE_RW);
             break;
         }
-        
+        if ( page )
+            put_page(page);
+
         adjust_guest_l1e(nl1e, pt_dom);
         if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                                     preserve_ad)) )
@@ -1884,7 +1887,6 @@ static int mod_l1_entry(l1_pgentry_t *pl
             ol1e = nl1e;
             rc = -EBUSY;
         }
-        put_gfn(pg_dom, gfn);
     }
     else if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                                      preserve_ad)) )
@@ -3042,7 +3044,6 @@ int do_mmuext_op(
             type = PGT_l4_page_table;
 
         pin_page: {
-            unsigned long mfn;
             struct page_info *page;
 
             /* Ignore pinning of invalid paging levels. */
@@ -3052,25 +3053,28 @@ int do_mmuext_op(
             if ( paging_mode_refcounts(pg_owner) )
                 break;
 
-            mfn = get_gfn_untyped(pg_owner, op.arg1.mfn);
-            rc = get_page_and_type_from_pagenr(mfn, type, pg_owner, 0, 1);
+            page = get_page_from_gfn(pg_owner, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!page) )
+            {
+                rc = -EINVAL;
+                break;
+            }
+
+            rc = get_page_type_preemptible(page, type);
             okay = !rc;
             if ( unlikely(!okay) )
             {
                 if ( rc == -EINTR )
                     rc = -EAGAIN;
                 else if ( rc != -EAGAIN )
-                    MEM_LOG("Error while pinning mfn %lx", mfn);
-                put_gfn(pg_owner, op.arg1.mfn);
+                    MEM_LOG("Error while pinning mfn %lx", page_to_mfn(page));
+                put_page(page);
                 break;
             }
 
-            page = mfn_to_page(mfn);
-
             if ( (rc = xsm_memory_pin_page(d, page)) != 0 )
             {
                 put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
                 okay = 0;
                 break;
             }
@@ -3078,16 +3082,15 @@ int do_mmuext_op(
             if ( unlikely(test_and_set_bit(_PGT_pinned,
                                            &page->u.inuse.type_info)) )
             {
-                MEM_LOG("Mfn %lx already pinned", mfn);
+                MEM_LOG("Mfn %lx already pinned", page_to_mfn(page));
                 put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
                 okay = 0;
                 break;
             }
 
             /* A page is dirtied when its pin status is set. */
-            paging_mark_dirty(pg_owner, mfn);
-           
+            paging_mark_dirty(pg_owner, page_to_mfn(page));
+
             /* We can race domain destruction (domain_relinquish_resources). */
             if ( unlikely(pg_owner != d) )
             {
@@ -3099,35 +3102,29 @@ int do_mmuext_op(
                 spin_unlock(&pg_owner->page_alloc_lock);
                 if ( drop_ref )
                     put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
             }
 
             break;
         }
 
         case MMUEXT_UNPIN_TABLE: {
-            unsigned long mfn;
             struct page_info *page;
 
             if ( paging_mode_refcounts(pg_owner) )
                 break;
 
-            mfn = get_gfn_untyped(pg_owner, op.arg1.mfn);
-            if ( unlikely(!(okay = get_page_from_pagenr(mfn, pg_owner))) )
+            page = get_page_from_gfn(pg_owner, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!page) )
             {
-                put_gfn(pg_owner, op.arg1.mfn);
-                MEM_LOG("Mfn %lx bad domain", mfn);
+                MEM_LOG("Mfn %lx bad domain", op.arg1.mfn);
                 break;
             }
 
-            page = mfn_to_page(mfn);
-
             if ( !test_and_clear_bit(_PGT_pinned, &page->u.inuse.type_info) )
             {
                 okay = 0;
                 put_page(page);
-                put_gfn(pg_owner, op.arg1.mfn);
-                MEM_LOG("Mfn %lx not pinned", mfn);
+                MEM_LOG("Mfn %lx not pinned", op.arg1.mfn);
                 break;
             }
 
@@ -3135,40 +3132,43 @@ int do_mmuext_op(
             put_page(page);
 
             /* A page is dirtied when its pin status is cleared. */
-            paging_mark_dirty(pg_owner, mfn);
-
-            put_gfn(pg_owner, op.arg1.mfn);
+            paging_mark_dirty(pg_owner, page_to_mfn(page));
+
             break;
         }
 
         case MMUEXT_NEW_BASEPTR:
-            okay = new_guest_cr3(get_gfn_untyped(d, op.arg1.mfn));
-            put_gfn(d, op.arg1.mfn);
+            okay = (!paging_mode_translate(d)
+                    && new_guest_cr3(op.arg1.mfn));
             break;
+
         
 #ifdef __x86_64__
         case MMUEXT_NEW_USER_BASEPTR: {
-            unsigned long old_mfn, mfn;
-
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            if ( mfn != 0 )
+            unsigned long old_mfn;
+
+            if ( paging_mode_translate(current->domain) )
+            {
+                okay = 0;
+                break;
+            }
+
+            if ( op.arg1.mfn != 0 )
             {
                 if ( paging_mode_refcounts(d) )
-                    okay = get_page_from_pagenr(mfn, d);
+                    okay = get_page_from_pagenr(op.arg1.mfn, d);
                 else
                     okay = !get_page_and_type_from_pagenr(
-                        mfn, PGT_root_page_table, d, 0, 0);
+                        op.arg1.mfn, PGT_root_page_table, d, 0, 0);
                 if ( unlikely(!okay) )
                 {
-                    put_gfn(d, op.arg1.mfn);
-                    MEM_LOG("Error while installing new mfn %lx", mfn);
+                    MEM_LOG("Error while installing new mfn %lx", op.arg1.mfn);
                     break;
                 }
             }
 
             old_mfn = pagetable_get_pfn(curr->arch.guest_table_user);
-            curr->arch.guest_table_user = pagetable_from_pfn(mfn);
-            put_gfn(d, op.arg1.mfn);
+            curr->arch.guest_table_user = pagetable_from_pfn(op.arg1.mfn);
 
             if ( old_mfn != 0 )
             {
@@ -3283,28 +3283,26 @@ int do_mmuext_op(
         }
 
         case MMUEXT_CLEAR_PAGE: {
-            unsigned long mfn;
+            struct page_info *page;
             unsigned char *ptr;
 
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            okay = !get_page_and_type_from_pagenr(
-                mfn, PGT_writable_page, d, 0, 0);
-            if ( unlikely(!okay) )
+            page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( !page || !get_page_type(page, PGT_writable_page) )
             {
-                put_gfn(d, op.arg1.mfn);
-                MEM_LOG("Error while clearing mfn %lx", mfn);
+                if ( page )
+                    put_page(page);
+                MEM_LOG("Error while clearing mfn %lx", op.arg1.mfn);
                 break;
             }
 
             /* A page is dirtied when it's being cleared. */
-            paging_mark_dirty(d, mfn);
-
-            ptr = fixmap_domain_page(mfn);
+            paging_mark_dirty(d, page_to_mfn(page));
+
+            ptr = fixmap_domain_page(page_to_mfn(page));
             clear_page(ptr);
             fixunmap_domain_page(ptr);
 
-            put_page_and_type(mfn_to_page(mfn));
-            put_gfn(d, op.arg1.mfn);
+            put_page_and_type(page);
             break;
         }
 
@@ -3312,42 +3310,38 @@ int do_mmuext_op(
         {
             const unsigned char *src;
             unsigned char *dst;
-            unsigned long src_mfn, mfn;
-
-            src_mfn = get_gfn_untyped(d, op.arg2.src_mfn);
-            okay = get_page_from_pagenr(src_mfn, d);
+            struct page_info *src_page, *dst_page;
+
+            src_page = get_page_from_gfn(d, op.arg2.src_mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!src_page) )
+            {
+                okay = 0;
+                MEM_LOG("Error while copying from mfn %lx", op.arg2.src_mfn);
+                break;
+            }
+
+            dst_page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
+            okay = (dst_page && get_page_type(dst_page, PGT_writable_page));
             if ( unlikely(!okay) )
             {
-                put_gfn(d, op.arg2.src_mfn);
-                MEM_LOG("Error while copying from mfn %lx", src_mfn);
+                put_page(src_page);
+                if ( dst_page )
+                    put_page(dst_page);
+                MEM_LOG("Error while copying to mfn %lx", op.arg1.mfn);
                 break;
             }
 
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            okay = !get_page_and_type_from_pagenr(
-                mfn, PGT_writable_page, d, 0, 0);
-            if ( unlikely(!okay) )
-            {
-                put_gfn(d, op.arg1.mfn);
-                put_page(mfn_to_page(src_mfn));
-                put_gfn(d, op.arg2.src_mfn);
-                MEM_LOG("Error while copying to mfn %lx", mfn);
-                break;
-            }
-
             /* A page is dirtied when it's being copied to. */
-            paging_mark_dirty(d, mfn);
-
-            src = map_domain_page(src_mfn);
-            dst = fixmap_domain_page(mfn);
+            paging_mark_dirty(d, page_to_mfn(dst_page));
+
+            src = __map_domain_page(src_page);
+            dst = fixmap_domain_page(page_to_mfn(dst_page));
             copy_page(dst, src);
             fixunmap_domain_page(dst);
             unmap_domain_page(src);
 
-            put_page_and_type(mfn_to_page(mfn));
-            put_gfn(d, op.arg1.mfn);
-            put_page(mfn_to_page(src_mfn));
-            put_gfn(d, op.arg2.src_mfn);
+            put_page_and_type(dst_page);
+            put_page(src_page);
             break;
         }
 
@@ -3538,29 +3532,26 @@ int do_mmu_update(
 
             req.ptr -= cmd;
             gmfn = req.ptr >> PAGE_SHIFT;
-            mfn = mfn_x(get_gfn(pt_owner, gmfn, &p2mt));
-            if ( !p2m_is_valid(p2mt) )
-                mfn = INVALID_MFN;
+            page = get_page_from_gfn(pt_owner, gmfn, &p2mt, P2M_ALLOC);
 
             if ( p2m_is_paged(p2mt) )
             {
-                put_gfn(pt_owner, gmfn);
+                ASSERT(!page);
                 p2m_mem_paging_populate(pg_owner, gmfn);
                 rc = -ENOENT;
                 break;
             }
 
-            if ( unlikely(!get_page_from_pagenr(mfn, pt_owner)) )
+            if ( unlikely(!page) )
             {
                 MEM_LOG("Could not get page for normal update");
-                put_gfn(pt_owner, gmfn);
                 break;
             }
 
+            mfn = page_to_mfn(page);
             va = map_domain_page_with_cache(mfn, &mapcache);
             va = (void *)((unsigned long)va +
                           (unsigned long)(req.ptr & ~PAGE_MASK));
-            page = mfn_to_page(mfn);
 
             if ( page_lock(page) )
             {
@@ -3569,22 +3560,23 @@ int do_mmu_update(
                 case PGT_l1_page_table:
                 {
                     l1_pgentry_t l1e = l1e_from_intpte(req.val);
-                    p2m_type_t l1e_p2mt;
-                    unsigned long l1egfn = l1e_get_pfn(l1e), l1emfn;
-    
-                    l1emfn = mfn_x(get_gfn(pg_owner, l1egfn, &l1e_p2mt));
+                    p2m_type_t l1e_p2mt = p2m_ram_rw;
+                    struct page_info *target = NULL;
+
+                    if ( paging_mode_translate(pg_owner) )
+                        target = get_page_from_gfn(pg_owner, l1e_get_pfn(l1e),
+                                                   &l1e_p2mt, P2M_ALLOC);
 
                     if ( p2m_is_paged(l1e_p2mt) )
                     {
-                        put_gfn(pg_owner, l1egfn);
+                        if ( target )
+                            put_page(target);
                         p2m_mem_paging_populate(pg_owner, l1e_get_pfn(l1e));
                         rc = -ENOENT;
                         break;
                     }
-                    else if ( p2m_ram_paging_in == l1e_p2mt && 
-                                !mfn_valid(l1emfn) )
+                    else if ( p2m_ram_paging_in == l1e_p2mt && !target )
                     {
-                        put_gfn(pg_owner, l1egfn);
                         rc = -ENOENT;
                         break;
                     }
@@ -3601,7 +3593,8 @@ int do_mmu_update(
                             rc = mem_sharing_unshare_page(pg_owner, gfn, 0); 
                             if ( rc )
                             {
-                                put_gfn(pg_owner, l1egfn);
+                                if ( target )
+                                    put_page(target);
                                 /* Notify helper, don't care about errors, will not
                                  * sleep on wq, since we're a foreign domain. */
                                 (void)mem_sharing_notify_enomem(pg_owner, gfn, 0);
@@ -3614,112 +3607,22 @@ int do_mmu_update(
                     rc = mod_l1_entry(va, l1e, mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v,
                                       pg_owner);
-                    put_gfn(pg_owner, l1egfn);
+                    if ( target )
+                        put_page(target);
                 }
                 break;
                 case PGT_l2_page_table:
-                {
-                    l2_pgentry_t l2e = l2e_from_intpte(req.val);
-                    p2m_type_t l2e_p2mt;
-                    unsigned long l2egfn = l2e_get_pfn(l2e), l2emfn;
-
-                    l2emfn = mfn_x(get_gfn(pg_owner, l2egfn, &l2e_p2mt));
-
-                    if ( p2m_is_paged(l2e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        p2m_mem_paging_populate(pg_owner, l2egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l2e_p2mt && 
-                                !mfn_valid(l2emfn) )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l2e_p2mt )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-
-                    rc = mod_l2_entry(va, l2e, mfn,
+                    rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    put_gfn(pg_owner, l2egfn);
-                }
-                break;
+                    break;
                 case PGT_l3_page_table:
-                {
-                    l3_pgentry_t l3e = l3e_from_intpte(req.val);
-                    p2m_type_t l3e_p2mt;
-                    unsigned long l3egfn = l3e_get_pfn(l3e), l3emfn;
-
-                    l3emfn = mfn_x(get_gfn(pg_owner, l3egfn, &l3e_p2mt));
-
-                    if ( p2m_is_paged(l3e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        p2m_mem_paging_populate(pg_owner, l3egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l3e_p2mt && 
-                                !mfn_valid(l3emfn) )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l3e_p2mt )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-                    rc = mod_l3_entry(va, l3e, mfn,
+                    rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
-                    put_gfn(pg_owner, l3egfn);
-                }
-                break;
+                    break;
 #if CONFIG_PAGING_LEVELS >= 4
                 case PGT_l4_page_table:
-                {
-                    l4_pgentry_t l4e = l4e_from_intpte(req.val);
-                    p2m_type_t l4e_p2mt;
-                    unsigned long l4egfn = l4e_get_pfn(l4e), l4emfn;
-
-                    l4emfn = mfn_x(get_gfn(pg_owner, l4egfn, &l4e_p2mt));
-
-                    if ( p2m_is_paged(l4e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        p2m_mem_paging_populate(pg_owner, l4egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l4e_p2mt && 
-                                !mfn_valid(l4emfn) )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l4e_p2mt )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-                    rc = mod_l4_entry(va, l4e, mfn,
+                    rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
-                    put_gfn(pg_owner, l4egfn);
-                }
                 break;
 #endif
                 case PGT_writable_page:
@@ -3742,7 +3645,6 @@ int do_mmu_update(
 
             unmap_domain_page_with_cache(va, &mapcache);
             put_page(page);
-            put_gfn(pt_owner, gmfn);
         }
         break;
 
diff -r 107285938c50 xen/arch/x86/mm/guest_walk.c
--- a/xen/arch/x86/mm/guest_walk.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/guest_walk.c	Thu Apr 26 22:00:25 2012 +0100
@@ -94,39 +94,37 @@ static inline void *map_domain_gfn(struc
                                    p2m_type_t *p2mt,
                                    uint32_t *rc) 
 {
-    p2m_access_t p2ma;
+    struct page_info *page;
     void *map;
 
     /* Translate the gfn, unsharing if shared */
-    *mfn = get_gfn_type_access(p2m, gfn_x(gfn), p2mt, &p2ma, 
-                               P2M_ALLOC | P2M_UNSHARE, NULL);
+    page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), p2mt, NULL,
+                                  P2M_ALLOC | P2M_UNSHARE);
     if ( p2m_is_paging(*p2mt) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
-        __put_gfn(p2m, gfn_x(gfn));
+        if ( page )
+            put_page(page);
         p2m_mem_paging_populate(p2m->domain, gfn_x(gfn));
         *rc = _PAGE_PAGED;
         return NULL;
     }
     if ( p2m_is_shared(*p2mt) )
     {
-        __put_gfn(p2m, gfn_x(gfn));
+        if ( page )
+            put_page(page);
         *rc = _PAGE_SHARED;
         return NULL;
     }
-    if ( !p2m_is_ram(*p2mt) ) 
+    if ( !page )
     {
-        __put_gfn(p2m, gfn_x(gfn));
         *rc |= _PAGE_PRESENT;
         return NULL;
     }
+    *mfn = _mfn(page_to_mfn(page));
     ASSERT(mfn_valid(mfn_x(*mfn)));
-    
-    /* Get an extra ref to the page to ensure liveness of the map.
-     * Then we can safely put gfn */
-    page_get_owner_and_reference(mfn_to_page(mfn_x(*mfn)));
+
     map = map_domain_page(mfn_x(*mfn));
-    __put_gfn(p2m, gfn_x(gfn));
     return map;
 }
 
diff -r 107285938c50 xen/arch/x86/mm/hap/guest_walk.c
--- a/xen/arch/x86/mm/hap/guest_walk.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/hap/guest_walk.c	Thu Apr 26 22:00:25 2012 +0100
@@ -54,34 +54,36 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
     mfn_t top_mfn;
     void *top_map;
     p2m_type_t p2mt;
-    p2m_access_t p2ma;
     walk_t gw;
     unsigned long top_gfn;
+    struct page_info *top_page;
 
     /* Get the top-level table's MFN */
     top_gfn = cr3 >> PAGE_SHIFT;
-    top_mfn = get_gfn_type_access(p2m, top_gfn, &p2mt, &p2ma, 
-                                  P2M_ALLOC | P2M_UNSHARE, NULL);
+    top_page = get_page_from_gfn_p2m(p2m->domain, p2m, top_gfn,
+                                     &p2mt, NULL, P2M_ALLOC | P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
         pfec[0] = PFEC_page_paged;
-        __put_gfn(p2m, top_gfn);
+        if ( top_page )
+            put_page(top_page);
         p2m_mem_paging_populate(p2m->domain, cr3 >> PAGE_SHIFT);
         return INVALID_GFN;
     }
     if ( p2m_is_shared(p2mt) )
     {
         pfec[0] = PFEC_page_shared;
-        __put_gfn(p2m, top_gfn);
+        if ( top_page )
+            put_page(top_page);
         return INVALID_GFN;
     }
-    if ( !p2m_is_ram(p2mt) )
+    if ( !top_page )
     {
         pfec[0] &= ~PFEC_page_present;
-        __put_gfn(p2m, top_gfn);
         return INVALID_GFN;
     }
+    top_mfn = _mfn(page_to_mfn(top_page));
 
     /* Map the top-level table and call the tree-walker */
     ASSERT(mfn_valid(mfn_x(top_mfn)));
@@ -91,31 +94,30 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
 #endif
     missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
     unmap_domain_page(top_map);
-    __put_gfn(p2m, top_gfn);
+    put_page(top_page);
 
     /* Interpret the answer */
     if ( missing == 0 )
     {
         gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
-        (void)get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma,
-                                  P2M_ALLOC | P2M_UNSHARE, NULL); 
+        struct page_info *page;
+        page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
+                                     NULL, P2M_ALLOC | P2M_UNSHARE);
+        if ( page )
+            put_page(page);
         if ( p2m_is_paging(p2mt) )
         {
             ASSERT(!p2m_is_nestedp2m(p2m));
             pfec[0] = PFEC_page_paged;
-            __put_gfn(p2m, gfn_x(gfn));
             p2m_mem_paging_populate(p2m->domain, gfn_x(gfn));
             return INVALID_GFN;
         }
         if ( p2m_is_shared(p2mt) )
         {
             pfec[0] = PFEC_page_shared;
-            __put_gfn(p2m, gfn_x(gfn));
             return INVALID_GFN;
         }
 
-        __put_gfn(p2m, gfn_x(gfn));
-
         if ( page_order )
             *page_order = guest_walk_to_page_order(&gw);
 
diff -r 107285938c50 xen/arch/x86/mm/mm-locks.h
--- a/xen/arch/x86/mm/mm-locks.h	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/mm-locks.h	Thu Apr 26 22:00:25 2012 +0100
@@ -166,13 +166,39 @@ declare_mm_lock(nestedp2m)
  * and later mutate it.
  */
 
-declare_mm_lock(p2m)
-#define p2m_lock(p)           mm_lock_recursive(p2m, &(p)->lock)
-#define gfn_lock(p,g,o)       mm_lock_recursive(p2m, &(p)->lock)
-#define p2m_unlock(p)         mm_unlock(&(p)->lock)
-#define gfn_unlock(p,g,o)     mm_unlock(&(p)->lock)
-#define p2m_locked_by_me(p)   mm_locked_by_me(&(p)->lock)
-#define gfn_locked_by_me(p,g) mm_locked_by_me(&(p)->lock)
+/* The P2M lock has become an rwlock, purely so we can implement
+ * get_page_from_gfn.  The mess below is a ghastly hack to make a
+ * recursive rwlock.  If it works I'll come back and fix up the
+ * order-constraints magic. */
+
+static inline void p2m_lock(struct p2m_domain *p)
+{
+    if ( p->wcpu != current->processor )
+    {
+        write_lock(&p->lock);
+        p->wcpu = current->processor;
+        ASSERT(p->wcount == 0);
+    }
+    p->wcount++;
+}
+
+static inline void p2m_unlock(struct p2m_domain *p)
+{
+    ASSERT(p->wcpu == current->processor);
+    if (--(p->wcount) == 0)
+    {
+        p->wcpu = -1;
+        write_unlock(&p->lock);
+    }
+}
+
+#define gfn_lock(p,g,o)       p2m_lock(p)
+#define gfn_unlock(p,g,o)     p2m_unlock(p)
+#define p2m_read_lock(p)      read_lock(&(p)->lock)
+#define p2m_read_unlock(p)    read_unlock(&(p)->lock)
+#define p2m_locked_by_me(p)   ((p)->wcpu == current->processor)
+#define gfn_locked_by_me(p,g) p2m_locked_by_me(p)
+
 
 /* Sharing per page lock
  *
@@ -203,8 +229,8 @@ declare_mm_order_constraint(per_page_sha
  * counts, page lists, sweep parameters. */
 
 declare_mm_lock(pod)
-#define pod_lock(p)           mm_lock(pod, &(p)->pod.lock)
-#define pod_unlock(p)         mm_unlock(&(p)->pod.lock)
+#define pod_lock(p) do { p2m_lock(p); mm_lock(pod, &(p)->pod.lock); } while (0)
+#define pod_unlock(p) do { mm_unlock(&(p)->pod.lock); p2m_unlock(p);} while (0)
 #define pod_locked_by_me(p)   mm_locked_by_me(&(p)->pod.lock)
 
 /* Page alloc lock (per-domain)
diff -r 107285938c50 xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/p2m.c	Thu Apr 26 22:00:25 2012 +0100
@@ -71,7 +71,9 @@ boolean_param("hap_2mb", opt_hap_2mb);
 /* Init the datastructures for later use by the p2m code */
 static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
 {
-    mm_lock_init(&p2m->lock);
+    rwlock_init(&p2m->lock);
+    p2m->wcount = 0;
+    p2m->wcpu = -1;
     mm_lock_init(&p2m->pod.lock);
     INIT_LIST_HEAD(&p2m->np2m_list);
     INIT_PAGE_LIST_HEAD(&p2m->pages);
@@ -207,6 +209,61 @@ void __put_gfn(struct p2m_domain *p2m, u
     gfn_unlock(p2m, gfn, 0);
 }
 
+/* Atomically look up a GFN and take a reference count on the backing page. */
+struct page_info *get_page_from_gfn_p2m(
+    struct domain *d, struct p2m_domain *p2m, unsigned long gfn,
+    p2m_type_t *t, p2m_access_t *a, p2m_query_t q)
+{
+    struct page_info *page = NULL;
+    p2m_access_t _a;
+    p2m_type_t _t;
+    mfn_t mfn;
+
+    /* Allow t or a to be NULL */
+    t = t ?: &_t;
+    a = a ?: &_a;
+
+    if ( likely(!p2m_locked_by_me(p2m)) )
+    {
+        /* Fast path: look up and get out */
+        p2m_read_lock(p2m);
+        mfn = __get_gfn_type_access(p2m, gfn, t, a, 0, NULL, 0);
+        if ( (p2m_is_ram(*t) || p2m_is_grant(*t))
+             && mfn_valid(mfn)
+             && !((q & P2M_UNSHARE) && p2m_is_shared(*t)) )
+        {
+            page = mfn_to_page(mfn);
+            if ( !get_page(page, d)
+                 /* Page could be shared */
+                 && !get_page(page, dom_cow) )
+                page = NULL;
+        }
+        p2m_read_unlock(p2m);
+
+        if ( page )
+            return page;
+
+        /* Error path: not a suitable GFN at all */
+        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_magic(*t) )
+            return NULL;
+    }
+
+    /* Slow path: take the write lock and do fixups */
+    p2m_lock(p2m);
+    mfn = get_gfn_type_access(p2m, gfn, t, a, q, NULL);
+    if ( p2m_is_ram(*t) && mfn_valid(mfn) )
+    {
+        page = mfn_to_page(mfn);
+        if ( !get_page(page, d) )
+            page = NULL;
+    }
+    put_gfn(d, gfn);
+    p2m_unlock(p2m);
+
+    return page;
+}
+
+
 int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, 
                   unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma)
 {
diff -r 107285938c50 xen/arch/x86/physdev.c
--- a/xen/arch/x86/physdev.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/physdev.c	Thu Apr 26 22:00:25 2012 +0100
@@ -306,26 +306,27 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
     case PHYSDEVOP_pirq_eoi_gmfn_v1: {
         struct physdev_pirq_eoi_gmfn info;
         unsigned long mfn;
+        struct page_info *page;
 
         ret = -EFAULT;
         if ( copy_from_guest(&info, arg, 1) != 0 )
             break;
 
         ret = -EINVAL;
-        mfn = get_gfn_untyped(current->domain, info.gmfn);
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), v->domain,
-                                PGT_writable_page) )
+        page = get_page_from_gfn(current->domain, info.gmfn, NULL, P2M_ALLOC);
+        if ( !page )
+            break;
+        if ( !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(current->domain, info.gmfn);
+            put_page(page);
             break;
         }
+        mfn = page_to_mfn(page);
 
         if ( cmpxchg(&v->domain->arch.pv_domain.pirq_eoi_map_mfn,
                      0, mfn) != 0 )
         {
             put_page_and_type(mfn_to_page(mfn));
-            put_gfn(current->domain, info.gmfn);
             ret = -EBUSY;
             break;
         }
@@ -335,14 +336,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         {
             v->domain->arch.pv_domain.pirq_eoi_map_mfn = 0;
             put_page_and_type(mfn_to_page(mfn));
-            put_gfn(current->domain, info.gmfn);
             ret = -ENOSPC;
             break;
         }
         if ( cmd == PHYSDEVOP_pirq_eoi_gmfn_v1 )
             v->domain->arch.pv_domain.auto_unmask = 1;
 
-        put_gfn(current->domain, info.gmfn);
         ret = 0;
         break;
     }
diff -r 107285938c50 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/traps.c	Thu Apr 26 22:00:25 2012 +0100
@@ -662,9 +662,9 @@ int wrmsr_hypervisor_regs(uint32_t idx, 
     case 0:
     {
         void *hypercall_page;
-        unsigned long mfn;
         unsigned long gmfn = val >> 12;
         unsigned int idx  = val & 0xfff;
+        struct page_info *page;
 
         if ( idx > 0 )
         {
@@ -674,24 +674,23 @@ int wrmsr_hypervisor_regs(uint32_t idx, 
             return 0;
         }
 
-        mfn = get_gfn_untyped(d, gmfn);
-
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+        page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
+
+        if ( !page || !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(d, gmfn);
+            if ( page )
+                put_page(page);
             gdprintk(XENLOG_WARNING,
                      "Bad GMFN %lx (MFN %lx) to MSR %08x\n",
-                     gmfn, mfn, base + idx);
+                     gmfn, page_to_mfn(page), base + idx);
             return 0;
         }
 
-        hypercall_page = map_domain_page(mfn);
+        hypercall_page = __map_domain_page(page);
         hypercall_page_initialise(d, hypercall_page);
         unmap_domain_page(hypercall_page);
 
-        put_page_and_type(mfn_to_page(mfn));
-        put_gfn(d, gmfn);
+        put_page_and_type(page);
         break;
     }
 
@@ -2374,7 +2373,8 @@ static int emulate_privileged_op(struct 
             break;
 
         case 3: {/* Write CR3 */
-            unsigned long mfn, gfn;
+            unsigned long gfn;
+            struct page_info *page;
             domain_lock(v->domain);
             if ( !is_pv_32on64_vcpu(v) )
             {
@@ -2384,9 +2384,10 @@ static int emulate_privileged_op(struct 
                 gfn = compat_cr3_to_pfn(*reg);
 #endif
             }
-            mfn = get_gfn_untyped(v->domain, gfn);
-            rc = new_guest_cr3(mfn);
-            put_gfn(v->domain, gfn);
+            page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+            rc = page ? new_guest_cr3(page_to_mfn(page)) : 0;
+            if ( page )
+                put_page(page);
             domain_unlock(v->domain);
             if ( rc == 0 ) /* not okay */
                 goto fail;
diff -r 107285938c50 xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/include/asm-x86/p2m.h	Thu Apr 26 22:00:25 2012 +0100
@@ -192,7 +192,10 @@ typedef unsigned int p2m_query_t;
 /* Per-p2m-table state */
 struct p2m_domain {
     /* Lock that protects updates to the p2m */
-    mm_lock_t          lock;
+    rwlock_t           lock;
+    int                wcpu;
+    int                wcount;
+    const char        *wfunc;
 
     /* Shadow translated domain: p2m mapping */
     pagetable_t        phys_table;
@@ -377,6 +380,33 @@ static inline mfn_t get_gfn_query_unlock
     return __get_gfn_type_access(p2m_get_hostp2m(d), gfn, t, &a, 0, NULL, 0);
 }
 
+/* Atomically look up a GFN and take a reference count on the backing page.
+ * This makes sure the page doesn't get freed (or shared) underfoot,
+ * and should be used by any path that intends to write to the backing page.
+ * Returns NULL if the page is not backed by RAM.
+ * The caller is responsible for calling put_page() afterwards. */
+struct page_info *get_page_from_gfn_p2m(struct domain *d,
+                                        struct p2m_domain *p2m,
+                                        unsigned long gfn,
+                                        p2m_type_t *t, p2m_access_t *a,
+                                        p2m_query_t q);
+
+static inline struct page_info *get_page_from_gfn(
+    struct domain *d, unsigned long gfn, p2m_type_t *t, p2m_query_t q)
+{
+    struct page_info *page;
+
+    if ( paging_mode_translate(d) )
+        return get_page_from_gfn_p2m(d, p2m_get_hostp2m(d), gfn, t, NULL, q);
+
+    /* Non-translated guests see 1-1 RAM mappings everywhere */
+    if (t)
+        *t = p2m_ram_rw;
+    page = __mfn_to_page(gfn);
+    return get_page(page, d) ? page : NULL;
+}
+
+
 /* General conversion function from mfn to gfn */
 static inline unsigned long mfn_to_gfn(struct domain *d, mfn_t mfn)
 {
diff -r 107285938c50 xen/xsm/flask/hooks.c
--- a/xen/xsm/flask/hooks.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/xsm/flask/hooks.c	Thu Apr 26 22:00:25 2012 +0100
@@ -1318,6 +1318,7 @@ static int flask_mmu_normal_update(struc
     struct domain_security_struct *dsec;
     u32 fsid;
     struct avc_audit_data ad;
+    struct page_info *page;
 
     if (d != t)
         rc = domain_has_perm(d, t, SECCLASS_MMU, MMU__REMOTE_REMAP);
@@ -1333,7 +1334,8 @@ static int flask_mmu_normal_update(struc
         map_perms |= MMU__MAP_WRITE;
 
     AVC_AUDIT_DATA_INIT(&ad, MEMORY);
-    fmfn = get_gfn_untyped(f, l1e_get_pfn(l1e_from_intpte(fpte)));
+    page = get_page_from_gfn(f, l1e_get_pfn(l1e_from_intpte(fpte)), NULL, P2M_ALLOC);
+    fmfn = page ? page_to_mfn(page) : INVALID_MFN;
 
     ad.sdom = d;
     ad.tdom = f;
@@ -1342,7 +1344,8 @@ static int flask_mmu_normal_update(struc
 
     rc = get_mfn_sid(fmfn, &fsid);
 
-    put_gfn(f, fmfn);
+    if ( page )
+        put_page(page);
 
     if ( rc )
         return rc;

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread
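As a quick reference for the interface the patch above adds in xen/include/asm-x86/p2m.h, here is a minimal caller-side sketch of the intended pattern: take a page reference via get_page_from_gfn() instead of holding the gfn lock across the access, and drop it with put_page() when done. This is illustrative only -- demo_write_gfn is a made-up name, and the hvm.c, viridian.c and mm.c hunks above are the authoritative conversions.

/* Minimal sketch, not part of the patch: demo_write_gfn is hypothetical. */
static int demo_write_gfn(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;
    struct page_info *page;
    void *p;

    page = get_page_from_gfn(d, gfn, &t, P2M_UNSHARE);
    if ( !page || !p2m_is_ram(t) )
    {
        if ( page )              /* e.g. a grant or still-shared page */
            put_page(page);
        return -EINVAL;
    }

    p = __map_domain_page(page); /* the reference keeps the page live */
    /* ... write through p ... */
    unmap_domain_page(p);

    put_page(page);              /* drop the reference taken by the lookup */
    return 0;
}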

* Re: lock in vhpet
  2012-04-26 21:25                                         ` Tim Deegan
@ 2012-04-27  0:46                                           ` Zhang, Yang Z
  2012-04-27  0:51                                             ` Andres Lagar-Cavilla
  2012-04-27  3:02                                           ` Andres Lagar-Cavilla
  2012-05-16 11:36                                           ` Zhang, Yang Z
  2 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-27  0:46 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, andres

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Friday, April 27, 2012 5:26 AM
> To: Zhang, Yang Z
> Cc: andres@lagarcavilla.org; Keir Fraser; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] lock in vhpet
> 
> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> > > > But actually, the first cs introduced this issue is 24770. When
> > > > win8 booting and if hpet is enabled, it will use hpet as the time
> > > > source and there have lots of hpet access and EPT violation. In
> > > > EPT violation handler, it call get_gfn_type_access to get the mfn.
> > > > The cs 24770 introduces the gfn_lock for p2m lookups, and then the issue
> happens.
> > > > After I removed the gfn_lock, the issue goes. But in latest xen,
> > > > even I remove this lock, it still shows high cpu utilization.
> > >
> > > It would seem then that even the briefest lock-protected critical
> > > section would cause this? In the mmio case, the p2m lock taken in
> > > the hap fault handler is held during the actual lookup, and for a
> > > couple of branch instructions afterwards.
> > >
> > > In latest Xen, with lock removed for get_gfn, on which lock is time spent?
> > Still the p2m_lock.
> 
> Can you please try the attached patch?  I think you'll need this one plus the
> ones that take the locks out of the hpet code.
> 
> This patch makes the p2m lock into an rwlock and adjusts a number of the
> paths that don't update the p2m so they only take the read lock.  It's a bit
> rough but I can boot 16-way win7 guest with it.

This is really great work! Now the win7 guest boots very fast and I never saw the BSOD again.
But the changes in your patch are quite large, so I think we need more sanity testing to avoid any regressions. After you finish all the work, I'd like to do a full round of testing. :)

> N.B. Since rwlocks don't show up in the existing lock profiling, please don't try
> to use the lock-profiling numbers to see if it's helping!
> 
> Andres, this is basically the big-hammer version of your "take a pagecount"
> changes, plus the change you made to hvmemul_rep_movs().
> If this works I intend to follow it up with a patch to make some of the
> read-modify-write paths avoid taking the lock (by using a compare-exchange
> operation so they only take the lock on a write).  If that succeeds I might drop
> put_gfn() altogether.
> 
> But first it will need a lot of tidying up.  Noticeably missing:
>  - SVM code equivalents to the vmx.c changes
>  - grant-table operations still use the lock, because frankly I
>    could not follow the current code, and it's quite late in the evening.
> I also have a long list of uglinesses in the mm code that I found while writing
> this lot.
> 
> Keir, I have no objection to later replacing this with something better than an
> rwlock. :)  Or with making a NUMA-friendly rwlock implementation, since I
> really expect this to be heavily read-mostly when paging/sharing/pod are not
> enabled.
> 
> Cheers,
> 
> Tim.

best regards
yang

^ permalink raw reply	[flat|nested] 45+ messages in thread
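The compare-exchange follow-up Tim mentions in the message above is not part of the posted patch; the sketch below is only a rough illustration of the shape such a path could take on top of the new rwlock. The entry helpers (p2m_read_entry, p2m_adjust_entry, p2m_entry_ptr) are hypothetical names, not existing Xen functions; p2m_read_lock/unlock come from the patch and cmpxchg() is the usual Xen macro.

/* Rough, hypothetical sketch: read-modify-write a p2m entry while holding
 * only the read side of the p2m rwlock, letting cmpxchg() on the entry
 * resolve races.  Structural changes (e.g. allocating table pages) would
 * still take p2m_lock() as in the patch. */
static void p2m_rmw_sketch(struct p2m_domain *p2m, unsigned long gfn)
{
    uint64_t old, new;

    p2m_read_lock(p2m);
    do {
        old = p2m_read_entry(p2m, gfn);  /* hypothetical lookup helper */
        new = p2m_adjust_entry(old);     /* hypothetical, e.g. set A/D bits */
    } while ( new != old &&
              cmpxchg(p2m_entry_ptr(p2m, gfn), old, new) != old );
    p2m_read_unlock(p2m);
}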

* Re: lock in vhpet
  2012-04-27  0:46                                           ` Zhang, Yang Z
@ 2012-04-27  0:51                                             ` Andres Lagar-Cavilla
  2012-04-27  1:24                                               ` Zhang, Yang Z
  2012-04-27  8:36                                               ` Zhang, Yang Z
  0 siblings, 2 replies; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-27  0:51 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: George.Dunlap, Keir Fraser, olaf, xen-devel, Tim Deegan

>> -----Original Message-----
>> From: Tim Deegan [mailto:tim@xen.org]
>> Sent: Friday, April 27, 2012 5:26 AM
>> To: Zhang, Yang Z
>> Cc: andres@lagarcavilla.org; Keir Fraser; xen-devel@lists.xensource.com
>> Subject: Re: [Xen-devel] lock in vhpet
>>
>> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
>> > > > But actually, the first cs introduced this issue is 24770. When
>> > > > win8 booting and if hpet is enabled, it will use hpet as the time
>> > > > source and there have lots of hpet access and EPT violation. In
>> > > > EPT violation handler, it call get_gfn_type_access to get the mfn.
>> > > > The cs 24770 introduces the gfn_lock for p2m lookups, and then the
>> issue
>> happens.
>> > > > After I removed the gfn_lock, the issue goes. But in latest xen,
>> > > > even I remove this lock, it still shows high cpu utilization.
>> > >
>> > > It would seem then that even the briefest lock-protected critical
>> > > section would cause this? In the mmio case, the p2m lock taken in
>> > > the hap fault handler is held during the actual lookup, and for a
>> > > couple of branch instructions afterwards.
>> > >
>> > > In latest Xen, with lock removed for get_gfn, on which lock is time
>> spent?
>> > Still the p2m_lock.
>>
>> Can you please try the attached patch?  I think you'll need this one
>> plus the
>> ones that take the locks out of the hpet code.
>>
>> This patch makes the p2m lock into an rwlock and adjusts a number of the
>> paths that don't update the p2m so they only take the read lock.  It's a
>> bit
>> rough but I can boot 16-way win7 guest with it.

That is great news.

Tim, thanks for the amazing work. I'm poring over the patch presently, and
will shoot at it with everything I've got testing-wise.

I'm taking the liberty of pulling in Olaf (paging) and George (PoD) as the
changeset affects those subsystems.

Andres

>
> This is really great work! Now the win7 guest boots very fast and I
> never saw the BSOD again.
> But the changes in your patch are quite large, so I think we need more
> sanity testing to avoid any regressions. After you finish all the work,
> I'd like to do a full round of testing. :)
>
>> N.B. Since rwlocks don't show up in the existing lock profiling, please
>> don't try
>> to use the lock-profiling numbers to see if it's helping!
>>
>> Andres, this is basically the big-hammer version of your "take a
>> pagecount"
>> changes, plus the change you made to hvmemul_rep_movs().
>> If this works I intend to follow it up with a patch to make some of the
>> read-modify-write paths avoid taking the lock (by using a
>> compare-exchange
>> operation so they only take the lock on a write).  If that succeeds I
>> might drop
>> put_gfn() altogether.
>>
>> But first it will need a lot of tidying up.  Noticeably missing:
>>  - SVM code equivalents to the vmx.c changes
>>  - grant-table operations still use the lock, because frankly I
>>    could not follow the current code, and it's quite late in the
>> evening.
>> I also have a long list of uglinesses in the mm code that I found while
>> writing
>> this lot.
>>
>> Keir, I have no objection to later replacing this with something better
>> than an
>> rwlock. :)  Or with making a NUMA-friendly rwlock implementation, since
>> I
>> really expect this to be heavily read-mostly when paging/sharing/pod are
>> not
>> enabled.
>>
>> Cheers,
>>
>> Tim.
>
> best regards
> yang
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-27  0:51                                             ` Andres Lagar-Cavilla
@ 2012-04-27  1:24                                               ` Zhang, Yang Z
  2012-04-27  8:36                                               ` Zhang, Yang Z
  1 sibling, 0 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-27  1:24 UTC (permalink / raw)
  To: andres; +Cc: George.Dunlap, Keir Fraser, olaf, xen-devel, Tim Deegan


> -----Original Message-----
> From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> Sent: Friday, April 27, 2012 8:52 AM
> To: Zhang, Yang Z
> Cc: Tim Deegan; Keir Fraser; xen-devel@lists.xensource.com; olaf@aepfle.de;
> George.Dunlap@eu.citrix.com
> Subject: RE: [Xen-devel] lock in vhpet
> 
> >> -----Original Message-----
> >> From: Tim Deegan [mailto:tim@xen.org]
> >> Sent: Friday, April 27, 2012 5:26 AM
> >> To: Zhang, Yang Z
> >> Cc: andres@lagarcavilla.org; Keir Fraser;
> >> xen-devel@lists.xensource.com
> >> Subject: Re: [Xen-devel] lock in vhpet
> >>
> >> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> >> > > > But actually, the first cs introduced this issue is 24770. When
> >> > > > win8 booting and if hpet is enabled, it will use hpet as the
> >> > > > time source and there have lots of hpet access and EPT
> >> > > > violation. In EPT violation handler, it call get_gfn_type_access to get
> the mfn.
> >> > > > The cs 24770 introduces the gfn_lock for p2m lookups, and then
> >> > > > the
> >> issue
> >> happens.
> >> > > > After I removed the gfn_lock, the issue goes. But in latest
> >> > > > xen, even I remove this lock, it still shows high cpu utilization.
> >> > >
> >> > > It would seem then that even the briefest lock-protected critical
> >> > > section would cause this? In the mmio case, the p2m lock taken in
> >> > > the hap fault handler is held during the actual lookup, and for a
> >> > > couple of branch instructions afterwards.
> >> > >
> >> > > In latest Xen, with lock removed for get_gfn, on which lock is
> >> > > time
> >> spent?
> >> > Still the p2m_lock.
> >>
> >> Can you please try the attached patch?  I think you'll need this one
> >> plus the ones that take the locks out of the hpet code.
> >>
> >> This patch makes the p2m lock into an rwlock and adjusts a number of
> >> the paths that don't update the p2m so they only take the read lock.
> >> It's a bit rough but I can boot 16-way win7 guest with it.
> 
> That is great news.
> 
> Tim, thanks for the amazing work. I'm poring over the patch presently, and will
> shoot at it with everything I've got testing-wise.
> 
> I'm taking the liberty of pulling in Olaf (paging) and George (PoD) as the
> changeset affects those subsystems.

But the win8 guest shows a BSOD with 32 VCPUs. :(
The reason for the BSOD is: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (ACPI.sys)


best regards
yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-26 21:25                                         ` Tim Deegan
  2012-04-27  0:46                                           ` Zhang, Yang Z
@ 2012-04-27  3:02                                           ` Andres Lagar-Cavilla
  2012-04-27  9:26                                             ` Tim Deegan
  2012-05-16 11:36                                           ` Zhang, Yang Z
  2 siblings, 1 reply; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-27  3:02 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
>> > > But actually, the first cs introduced this issue is 24770. When win8
>> > > booting and if hpet is enabled, it will use hpet as the time source
>> > > and there have lots of hpet access and EPT violation. In EPT
>> violation
>> > > handler, it call get_gfn_type_access to get the mfn. The cs 24770
>> > > introduces the gfn_lock for p2m lookups, and then the issue happens.
>> > > After I removed the gfn_lock, the issue goes. But in latest xen,
>> even
>> > > I remove this lock, it still shows high cpu utilization.
>> >
>> > It would seem then that even the briefest lock-protected critical
>> section would
>> > cause this? In the mmio case, the p2m lock taken in the hap fault
>> handler is
>> > held during the actual lookup, and for a couple of branch instructions
>> > afterwards.
>> >
>> > In latest Xen, with lock removed for get_gfn, on which lock is time
>> spent?
>> Still the p2m_lock.
>
> Can you please try the attached patch?  I think you'll need this one
> plus the ones that take the locks out of the hpet code.

Right off the bat I'm getting a multitude of
(XEN) mm.c:3294:d0 Error while clearing mfn 100cbb7
And a hung dom0 during initramfs. I'm a little baffled as to why, but it's
there (32 bit dom0, XenServer6).

>
> This patch makes the p2m lock into an rwlock and adjusts a number of the
> paths that don't update the p2m so they only take the read lock.  It's a
> bit rough but I can boot 16-way win7 guest with it.
>
> N.B. Since rwlocks don't show up in the existing lock profiling, please
> don't try to use the lock-profiling numbers to see if it's helping!
>
> Andres, this is basically the big-hammer version of your "take a
> pagecount" changes, plus the change you made to hvmemul_rep_movs().
> If this works I intend to follow it up with a patch to make some of the
> read-modify-write paths avoid taking the lock (by using a
> compare-exchange operation so they only take the lock on a write).  If
> that succeeds I might drop put_gfn() altogether.

You mean cmpxchg the whole p2m entry? I don't think I parse the plan.
There are code paths that do get_gfn_query -> p2m_change_type -> put_gfn.
But I guess those could lock the p2m up-front if they become the only
consumers of put_gfn left.
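
A minimal sketch of the shape I mean (the helpers are the real ones, the
surrounding caller is made up for illustration):

    /* illustrative caller only */
    p2m_type_t t;
    (void) get_gfn_query(d, gfn, &t);       /* takes the gfn lock */
    if ( t == p2m_ram_logdirty )
        p2m_change_type(d, gfn, p2m_ram_logdirty, p2m_ram_rw);
    put_gfn(d, gfn);                        /* drops the gfn lock */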

>
> But first it will need a lot of tidying up.  Noticeably missing:
>  - SVM code equivalents to the vmx.c changes

load_pdptrs doesn't exist on svm. There are a couple of debugging
get_gfn_query calls that can be made unlocked. And svm_vmcb_restore needs
similar treatment to what you did in vmx.c. The question is who will be
able to test it...

>  - grant-table operations still use the lock, because frankly I
>    could not follow the current code, and it's quite late in the evening.

It's pretty complex with serious nesting, and ifdef's for arm and 32 bit.
gfn_to_mfn_private callers will suffer from altering the current meaning,
as put_gfn resolves to the right thing for the ifdef'ed arch. The other
user is grant_transfer which also relies on the page *not* having an extra
ref in steal_page. So it's a prime candidate to be left alone.

> I also have a long list of uglinesses in the mm code that I found

Uh, ugly stuff, how could that have happened?

I have a few preliminary observations on the patch. Pasting relevant bits
here, since the body of the patch seems to have been lost by the email
thread:

@@ -977,23 +976,25 @@ int arch_set_info_guest(
...
+
+        if (!paging_mode_refcounts(d)
+            && !get_page_and_type(cr3_page, d, PGT_l3_page_table) )
replace with && !get_page_type() )

@@ -2404,32 +2373,33 @@ static enum hvm_copy_result __hvm_copy(
             gfn = addr >> PAGE_SHIFT;
         }

-        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
+        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
replace with (flags & HVMCOPY_to_guest) ? P2M_UNSHARE : P2M_ALLOC (and
same logic when checking p2m_is_shared). Not truly related to your patch,
but since we're at it.
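
Something along these lines is what I have in mind (sketch only):

    /* only force an unshare when we are actually going to write */
    page = get_page_from_gfn(curr->domain, gfn, &p2mt,
                             (flags & HVMCOPY_to_guest) ? P2M_UNSHARE
                                                        : P2M_ALLOC);
    if ( p2m_is_shared(p2mt) && (flags & HVMCOPY_to_guest) )
    {
        if ( page )
            put_page(page);
        return HVMCOPY_gfn_shared;
    }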

Same, further down
-        if ( !p2m_is_ram(p2mt) )
+        if ( !page )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
Last two lines are redundant

@@ -4019,35 +3993,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
    case HVMOP_modified_memory: a lot of error checking has been removed.
At the very least:
if ( page )
{ ...
} else {
    rc = -EINVAL;
    goto param_fail3;
}

arch/x86/mm.c:do_mmu_update -> you blew up all the paging/sharing checking
for target gfns of mmu updates of l2/3/4 entries. It seems that this
wouldn't work anyways, that's why you killed it?

+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -54,34 +54,37 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
...
+    if ( !top_page )
     {
         pfec[0] &= ~PFEC_page_present;
-        __put_gfn(p2m, top_gfn);
+        put_page(top_page);
top_page is NULL here, remove put_page

get_page_from_gfn_p2m, slow path: no need for p2m_lock/unlock since
locking is already done by get_gfn_type_access/__put_gfn

(hope those observations made sense without inlining them in the actual
patch)

Andres

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-27  0:51                                             ` Andres Lagar-Cavilla
  2012-04-27  1:24                                               ` Zhang, Yang Z
@ 2012-04-27  8:36                                               ` Zhang, Yang Z
  1 sibling, 0 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-04-27  8:36 UTC (permalink / raw)
  To: Zhang, Yang Z, andres
  Cc: George.Dunlap, Keir Fraser, olaf, xen-devel, Tim Deegan

> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Friday, April 27, 2012 9:25 AM
> To: andres@lagarcavilla.org
> Cc: Tim Deegan; Keir Fraser; xen-devel@lists.xensource.com; olaf@aepfle.de;
> George.Dunlap@eu.citrix.com
> Subject: RE: [Xen-devel] lock in vhpet
> 
> 
> > -----Original Message-----
> > From: Andres Lagar-Cavilla [mailto:andres@lagarcavilla.org]
> > Sent: Friday, April 27, 2012 8:52 AM
> > To: Zhang, Yang Z
> > Cc: Tim Deegan; Keir Fraser; xen-devel@lists.xensource.com;
> > olaf@aepfle.de; George.Dunlap@eu.citrix.com
> > Subject: RE: [Xen-devel] lock in vhpet
> >
> > >> -----Original Message-----
> > >> From: Tim Deegan [mailto:tim@xen.org]
> > >> Sent: Friday, April 27, 2012 5:26 AM
> > >> To: Zhang, Yang Z
> > >> Cc: andres@lagarcavilla.org; Keir Fraser;
> > >> xen-devel@lists.xensource.com
> > >> Subject: Re: [Xen-devel] lock in vhpet
> > >>
> > >> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> > >> > > > But actually, the first cs introduced this issue is 24770.
> > >> > > > When
> > >> > > > win8 booting and if hpet is enabled, it will use hpet as the
> > >> > > > time source and there have lots of hpet access and EPT
> > >> > > > violation. In EPT violation handler, it call
> > >> > > > get_gfn_type_access to get
> > the mfn.
> > >> > > > The cs 24770 introduces the gfn_lock for p2m lookups, and
> > >> > > > then the
> > >> issue
> > >> happens.
> > >> > > > After I removed the gfn_lock, the issue goes. But in latest
> > >> > > > xen, even I remove this lock, it still shows high cpu utilization.
> > >> > >
> > >> > > It would seem then that even the briefest lock-protected
> > >> > > critical section would cause this? In the mmio case, the p2m
> > >> > > lock taken in the hap fault handler is held during the actual
> > >> > > lookup, and for a couple of branch instructions afterwards.
> > >> > >
> > >> > > In latest Xen, with lock removed for get_gfn, on which lock is
> > >> > > time
> > >> spent?
> > >> > Still the p2m_lock.
> > >>
> > >> Can you please try the attached patch?  I think you'll need this
> > >> one plus the ones that take the locks out of the hpet code.
> > >>
> > >> This patch makes the p2m lock into an rwlock and adjusts a number
> > >> of the paths that don't update the p2m so they only take the read lock.
> > >> It's a bit rough but I can boot 16-way win7 guest with it.
> >
> > That is great news.
> >
> > Tim, thanks for the amazing work. I'm poring over the patch presently,
> > and will shoot at it with everything I've got testing-wise.
> >
> > I'm taking the liberty of pulling in Olaf (paging) and George (PoD) as
> > the changeset affects those subsystems.
> 
> But the win8 guest shows a BSOD with 32 VCPUs. :( The reason for the BSOD is:
> SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (ACPI.sys)
> 
Um..... I find this issue is related to xl, not the hypervisor.
Will send a patch to fix it later.

yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-27  3:02                                           ` Andres Lagar-Cavilla
@ 2012-04-27  9:26                                             ` Tim Deegan
  2012-04-27 14:17                                               ` Andres Lagar-Cavilla
  2012-04-27 21:08                                               ` Andres Lagar-Cavilla
  0 siblings, 2 replies; 45+ messages in thread
From: Tim Deegan @ 2012-04-27  9:26 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 5717 bytes --]

At 20:02 -0700 on 26 Apr (1335470547), Andres Lagar-Cavilla wrote:
> > Can you please try the attached patch?  I think you'll need this one
> > plus the ones that take the locks out of the hpet code.
> 
> Right off the bat I'm getting a multitude of
> (XEN) mm.c:3294:d0 Error while clearing mfn 100cbb7
> And a hung dom0 during initramfs. I'm a little baffled as to why, but it's
> there (32 bit dom0, XenServer6).

Curses, I knew there'd be one somewhere.  I've been replacing
get_page_and_type_from_pagenr()s (which return 0 for success) with
old-school get_page_type()s (which return 1 for success) and not always
getting the right number of inversions.  That's a horrible horrible
beartrap of an API, BTW, which had me cursing at the screen, but I had
enough on my plate yesterday without touching _that_ code too!
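
For the record, the trap is just the inverted sense of the return values;
in miniature (stubbed prototypes, not the real declarations):

    int get_page_and_type_from_pagenr(/* ... */);  /* returns 0 on success */
    int get_page_type(/* ... */);                  /* returns 1 on success */

    /* old call site: non-zero meant failure ... */
    if ( get_page_and_type_from_pagenr(/* ... */) )
        return -EINVAL;

    /* ... so the converted call site needs the test inverted: */
    if ( !get_page_type(page, PGT_base_page_table) )
        return -EINVAL;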

> > Andres, this is basically the big-hammer version of your "take a
> > pagecount" changes, plus the change you made to hvmemul_rep_movs().
> > If this works I intend to follow it up with a patch to make some of the
> > read-modify-write paths avoid taking the lock (by using a
> > compare-exchange operation so they only take the lock on a write).  If
> > that succeeds I might drop put_gfn() altogether.
> 
> You mean cmpxchg the whole p2m entry? I don't think I parse the plan.
> There are code paths that do get_gfn_query -> p2m_change_type -> put_gfn.
> But I guess those could lock the p2m up-front if they become the only
> consumers of put_gfn left.

Well, that's more or less what happens now.  I was thinking of replacing
any remaining

 (implicit) lock ; read ; think a bit ; maybe write ; unlock

code with the fast-path-friendlier:

 read ; think ; maybe-cmpxchg (and on failure undo or retry)

which avoids taking the write lock altogether if there's no work to do. 
But maybe there aren't many of those left now.  Obviously any path
which will always write should just take the write-lock first. 
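
In pseudo-C, with made-up type and helper names (a sketch, not code from the
patch):

    /* sketch: treat the p2m entry as a single word that cmpxchg can swap */
    for ( ; ; )
    {
        uint64_t old = p2m_read_entry(p2m, gfn);    /* no write lock taken */
        uint64_t new = compute_new_entry(old);

        if ( new == old )
            break;                      /* fast path: nothing to write */

        if ( p2m_cmpxchg_entry(p2m, gfn, old, new) )
            break;                      /* entry updated atomically */
        /* otherwise someone raced us: re-read and retry */
    }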
 
> >  - grant-table operations still use the lock, because frankly I
> >    could not follow the current code, and it's quite late in the evening.
> 
> It's pretty complex with serious nesting, and ifdef's for arm and 32 bit.
> gfn_to_mfn_private callers will suffer from altering the current meaning,
> as put_gfn resolves to the right thing for the ifdef'ed arch. The other
> user is grant_transfer which also relies on the page *not* having an extra
> ref in steal_page. So it's a prime candidate to be left alone.

Sadly, I think it's not.  The PV backends will be doing lots of grant
ops, which shouldn't get serialized against all other P2M lookups. 

> > I also have a long list of uglinesses in the mm code that I found
> 
> Uh, ugly stuff, how could that have happened?

I can't imagine. :)  Certainly nothing to do with me thinking "I'll
clean that up when I get some time."

> I have a few preliminary observations on the patch. Pasting relevant bits
> here, since the body of the patch seems to have been lost by the email
> thread:
> 
> @@ -977,23 +976,25 @@ int arch_set_info_guest(
> ...
> +
> +        if (!paging_mode_refcounts(d)
> +            && !get_page_and_type(cr3_page, d, PGT_l3_page_table) )
> replace with && !get_page_type() )

Yep.

> @@ -2404,32 +2373,33 @@ static enum hvm_copy_result __hvm_copy(
>              gfn = addr >> PAGE_SHIFT;
>          }
> 
> -        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
> +        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
> replace with (flags & HVMCOPY_to_guest) ? P2M_UNSHARE : P2M_ALLOC (and
> same logic when checking p2m_is_shared). Not truly related to your patch,
> but since we're at it.

OK, but not in this patch.

> Same, further down
> -        if ( !p2m_is_ram(p2mt) )
> +        if ( !page )
>          {
> -            put_gfn(curr->domain, gfn);
> +            if ( page )
> +                put_page(page);
> Last two lines are redundant

Yep.

> @@ -4019,35 +3993,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
>     case HVMOP_modified_memory: a lot of error checking has been removed.

Yes, but it was bogus - there's a race between the actual modification
and the call, during which anything might have happened.  The best we
can do is throw log-dirty bits at everything, and the caller can't do
anything with the error anyway.

When I come to tidy up I'll just add a new mark_gfn_dirty function
and skip the pointless gfn->mfn->gfn translation on this path.
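
Roughly this shape at the HVMOP_modified_memory call site (mark_gfn_dirty is
hypothetical here; name and signature assumed):

    /* hypothetical helper: dirty-log a gfn without translating to an mfn */
    void mark_gfn_dirty(struct domain *d, unsigned long gfn);

    for ( pfn = a.first_pfn; pfn < a.first_pfn + a.nr; pfn++ )
        mark_gfn_dirty(d, pfn);         /* no gfn -> mfn -> gfn round trip */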

> arch/x86/mm.c:do_mmu_update -> you blew up all the paging/sharing checking
> for target gfns of mmu updates of l2/3/4 entries. It seems that this
> wouldn't work anyways, that's why you killed it?

Yeah - since only L1es can point at foreign mappings it was all just
noise, and even if there had been real p2m lookups on those paths there
was no equivalent to the translate-in-place that happens in
mod_l1_entry so it would have been broken in a much worse way.

> +++ b/xen/arch/x86/mm/hap/guest_walk.c
> @@ -54,34 +54,37 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
> ...
> +    if ( !top_page )
>      {
>          pfec[0] &= ~PFEC_page_present;
> -        __put_gfn(p2m, top_gfn);
> +        put_page(top_page);
> top_page is NULL here, remove put_page

Yep.

> get_page_from_gfn_p2m, slow path: no need for p2m_lock/unlock since
> locking is already done by get_gfn_type_access/__put_gfn

Yeah, but I was writing that with half an eye on killing that lock. :) 
I'll drop them for now.

> (hope those observations made sense without inlining them in the actual
> patch)

Yes, absolutely - thanks for the review!

If we can get this to work well enough I'll tidy it up into a sensible
series next week.  In the meantime, an updated version of the
monster patch is attached. 

Cheers,

Tim.

[-- Attachment #2: get-page-from-gfn --]
[-- Type: text/plain, Size: 75921 bytes --]

# HG changeset patch
# Parent 107285938c50f82667bd4d014820b439a077c22c

diff -r 107285938c50 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/domain.c	Fri Apr 27 10:23:28 2012 +0100
@@ -716,7 +716,7 @@ int arch_set_info_guest(
 {
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
-    unsigned long cr3_pfn = INVALID_MFN;
+    struct page_info *cr3_page;
     unsigned long flags, cr4;
     unsigned int i;
     int rc = 0, compat;
@@ -925,46 +925,45 @@ int arch_set_info_guest(
     if ( !compat )
     {
         cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
-        cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+        cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-        if ( !mfn_valid(cr3_pfn) ||
-             (paging_mode_refcounts(d)
-              ? !get_page(mfn_to_page(cr3_pfn), d)
-              : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                   PGT_base_page_table)) )
+        if ( !cr3_page )
         {
-            put_gfn(d, cr3_gfn);
+            destroy_gdt(v);
+            return -EINVAL;
+        }
+        if ( !paging_mode_refcounts(d)
+             && !get_page_type(cr3_page, PGT_base_page_table) )
+        {
+            put_page(cr3_page);
             destroy_gdt(v);
             return -EINVAL;
         }
 
-        v->arch.guest_table = pagetable_from_pfn(cr3_pfn);
-        put_gfn(d, cr3_gfn);
+        v->arch.guest_table = pagetable_from_page(cr3_page);
 #ifdef __x86_64__
         if ( c.nat->ctrlreg[1] )
         {
             cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
-            cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+            cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-            if ( !mfn_valid(cr3_pfn) ||
-                 (paging_mode_refcounts(d)
-                  ? !get_page(mfn_to_page(cr3_pfn), d)
-                  : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                       PGT_base_page_table)) )
+            if ( !cr3_page ||
+                 (!paging_mode_refcounts(d)
+                  && !get_page_type(cr3_page, PGT_base_page_table)) )
             {
-                cr3_pfn = pagetable_get_pfn(v->arch.guest_table);
+                if (cr3_page)
+                    put_page(cr3_page);
+                cr3_page = pagetable_get_page(v->arch.guest_table);
                 v->arch.guest_table = pagetable_null();
                 if ( paging_mode_refcounts(d) )
-                    put_page(mfn_to_page(cr3_pfn));
+                    put_page(cr3_page);
                 else
-                    put_page_and_type(mfn_to_page(cr3_pfn));
-                put_gfn(d, cr3_gfn); 
+                    put_page_and_type(cr3_page);
                 destroy_gdt(v);
                 return -EINVAL;
             }
 
-            v->arch.guest_table_user = pagetable_from_pfn(cr3_pfn);
-            put_gfn(d, cr3_gfn); 
+            v->arch.guest_table_user = pagetable_from_page(cr3_page);
         }
         else if ( !(flags & VGCF_in_kernel) )
         {
@@ -977,23 +976,25 @@ int arch_set_info_guest(
         l4_pgentry_t *l4tab;
 
         cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]);
-        cr3_pfn = get_gfn_untyped(d, cr3_gfn);
+        cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
 
-        if ( !mfn_valid(cr3_pfn) ||
-             (paging_mode_refcounts(d)
-              ? !get_page(mfn_to_page(cr3_pfn), d)
-              : !get_page_and_type(mfn_to_page(cr3_pfn), d,
-                                   PGT_l3_page_table)) )
+        if ( !cr3_page)
         {
-            put_gfn(d, cr3_gfn); 
+            destroy_gdt(v);
+            return -EINVAL;
+        }
+
+        if (!paging_mode_refcounts(d)
+            && !get_page_type(cr3_page, PGT_l3_page_table) )
+        {
+            put_page(cr3_page);
             destroy_gdt(v);
             return -EINVAL;
         }
 
         l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
-        *l4tab = l4e_from_pfn(
-            cr3_pfn, _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
-        put_gfn(d, cr3_gfn); 
+        *l4tab = l4e_from_pfn(page_to_mfn(cr3_page),
+            _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
 #endif
     }
 
@@ -1064,7 +1065,7 @@ map_vcpu_info(struct vcpu *v, unsigned l
     struct domain *d = v->domain;
     void *mapping;
     vcpu_info_t *new_info;
-    unsigned long mfn;
+    struct page_info *page;
     int i;
 
     if ( offset > (PAGE_SIZE - sizeof(vcpu_info_t)) )
@@ -1077,19 +1078,20 @@ map_vcpu_info(struct vcpu *v, unsigned l
     if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
         return -EINVAL;
 
-    mfn = get_gfn_untyped(d, gfn);
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
+    if ( !page )
+        return -EINVAL;
+
+    if ( !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gfn); 
+        put_page(page);
         return -EINVAL;
     }
 
-    mapping = map_domain_page_global(mfn);
+    mapping = __map_domain_page_global(page);
     if ( mapping == NULL )
     {
-        put_page_and_type(mfn_to_page(mfn));
-        put_gfn(d, gfn); 
+        put_page_and_type(page);
         return -ENOMEM;
     }
 
@@ -1106,7 +1108,7 @@ map_vcpu_info(struct vcpu *v, unsigned l
     }
 
     v->vcpu_info = new_info;
-    v->arch.pv_vcpu.vcpu_info_mfn = mfn;
+    v->arch.pv_vcpu.vcpu_info_mfn = page_to_mfn(page);
 
     /* Set new vcpu_info pointer /before/ setting pending flags. */
     wmb();
@@ -1119,7 +1121,6 @@ map_vcpu_info(struct vcpu *v, unsigned l
     for ( i = 0; i < BITS_PER_EVTCHN_WORD(d); i++ )
         set_bit(i, &vcpu_info(v, evtchn_pending_sel));
 
-    put_gfn(d, gfn); 
     return 0;
 }
 
diff -r 107285938c50 xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/domctl.c	Fri Apr 27 10:23:28 2012 +0100
@@ -202,16 +202,16 @@ long arch_do_domctl(
 
                 for ( j = 0; j < k; j++ )
                 {
-                    unsigned long type = 0, mfn = get_gfn_untyped(d, arr[j]);
+                    unsigned long type = 0;
 
-                    page = mfn_to_page(mfn);
+                    page = get_page_from_gfn(d, arr[j], NULL, P2M_ALLOC);
 
-                    if ( unlikely(!mfn_valid(mfn)) ||
-                         unlikely(is_xen_heap_mfn(mfn)) )
+                    if ( unlikely(!page) ||
+                         unlikely(is_xen_heap_page(page)) )
                         type = XEN_DOMCTL_PFINFO_XTAB;
                     else if ( xsm_getpageframeinfo(page) != 0 )
                         ;
-                    else if ( likely(get_page(page, d)) )
+                    else
                     {
                         switch( page->u.inuse.type_info & PGT_type_mask )
                         {
@@ -231,13 +231,10 @@ long arch_do_domctl(
 
                         if ( page->u.inuse.type_info & PGT_pinned )
                             type |= XEN_DOMCTL_PFINFO_LPINTAB;
+                    }
 
+                    if ( page )
                         put_page(page);
-                    }
-                    else
-                        type = XEN_DOMCTL_PFINFO_XTAB;
-
-                    put_gfn(d, arr[j]);
                     arr[j] = type;
                 }
 
@@ -304,21 +301,21 @@ long arch_do_domctl(
             {      
                 struct page_info *page;
                 unsigned long gfn = arr32[j];
-                unsigned long mfn = get_gfn_untyped(d, gfn);
 
-                page = mfn_to_page(mfn);
+                page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
 
                 if ( domctl->cmd == XEN_DOMCTL_getpageframeinfo3)
                     arr32[j] = 0;
 
-                if ( unlikely(!mfn_valid(mfn)) ||
-                     unlikely(is_xen_heap_mfn(mfn)) )
+                if ( unlikely(!page) ||
+                     unlikely(is_xen_heap_page(page)) )
                     arr32[j] |= XEN_DOMCTL_PFINFO_XTAB;
                 else if ( xsm_getpageframeinfo(page) != 0 )
                 {
-                    put_gfn(d, gfn); 
+                    put_page(page);
                     continue;
-                } else if ( likely(get_page(page, d)) )
+                }
+                else
                 {
                     unsigned long type = 0;
 
@@ -341,12 +338,10 @@ long arch_do_domctl(
                     if ( page->u.inuse.type_info & PGT_pinned )
                         type |= XEN_DOMCTL_PFINFO_LPINTAB;
                     arr32[j] |= type;
+                }
+
+                if ( page )
                     put_page(page);
-                }
-                else
-                    arr32[j] |= XEN_DOMCTL_PFINFO_XTAB;
-
-                put_gfn(d, gfn); 
             }
 
             if ( copy_to_guest_offset(domctl->u.getpageframeinfo2.array,
@@ -419,7 +414,7 @@ long arch_do_domctl(
     {
         struct domain *d = rcu_lock_domain_by_id(domctl->domain);
         unsigned long gmfn = domctl->u.hypercall_init.gmfn;
-        unsigned long mfn;
+        struct page_info *page;
         void *hypercall_page;
 
         ret = -ESRCH;
@@ -433,26 +428,25 @@ long arch_do_domctl(
             break;
         }
 
-        mfn = get_gfn_untyped(d, gmfn);
+        page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
 
         ret = -EACCES;
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+        if ( !page || !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(d, gmfn); 
+            if ( page )
+                put_page(page);
             rcu_unlock_domain(d);
             break;
         }
 
         ret = 0;
 
-        hypercall_page = map_domain_page(mfn);
+        hypercall_page = __map_domain_page(page);
         hypercall_page_initialise(d, hypercall_page);
         unmap_domain_page(hypercall_page);
 
-        put_page_and_type(mfn_to_page(mfn));
+        put_page_and_type(page);
 
-        put_gfn(d, gmfn); 
         rcu_unlock_domain(d);
     }
     break;
diff -r 107285938c50 xen/arch/x86/hvm/emulate.c
--- a/xen/arch/x86/hvm/emulate.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/emulate.c	Fri Apr 27 10:23:28 2012 +0100
@@ -60,34 +60,25 @@ static int hvmemul_do_io(
     ioreq_t *p = get_ioreq(curr);
     unsigned long ram_gfn = paddr_to_pfn(ram_gpa);
     p2m_type_t p2mt;
-    mfn_t ram_mfn;
+    struct page_info *ram_page;
     int rc;
 
     /* Check for paged out page */
-    ram_mfn = get_gfn_unshare(curr->domain, ram_gfn, &p2mt);
+    ram_page = get_page_from_gfn(curr->domain, ram_gfn, &p2mt, P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(curr->domain, ram_gfn); 
+        if ( ram_page )
+            put_page(ram_page);
         p2m_mem_paging_populate(curr->domain, ram_gfn);
         return X86EMUL_RETRY;
     }
     if ( p2m_is_shared(p2mt) )
     {
-        put_gfn(curr->domain, ram_gfn); 
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_RETRY;
     }
 
-    /* Maintain a ref on the mfn to ensure liveness. Put the gfn
-     * to avoid potential deadlock wrt event channel lock, later. */
-    if ( mfn_valid(mfn_x(ram_mfn)) )
-        if ( !get_page(mfn_to_page(mfn_x(ram_mfn)),
-             curr->domain) )
-        {
-            put_gfn(curr->domain, ram_gfn);
-            return X86EMUL_RETRY;
-        }
-    put_gfn(curr->domain, ram_gfn);
-
     /*
      * Weird-sized accesses have undefined behaviour: we discard writes
      * and read all-ones.
@@ -98,8 +89,8 @@ static int hvmemul_do_io(
         ASSERT(p_data != NULL); /* cannot happen with a REP prefix */
         if ( dir == IOREQ_READ )
             memset(p_data, ~0, size);
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -120,8 +111,8 @@ static int hvmemul_do_io(
             unsigned int bytes = vio->mmio_large_write_bytes;
             if ( (addr >= pa) && ((addr + size) <= (pa + bytes)) )
             {
-                if ( mfn_valid(mfn_x(ram_mfn)) )
-                    put_page(mfn_to_page(mfn_x(ram_mfn)));
+                if ( ram_page )
+                    put_page(ram_page);
                 return X86EMUL_OKAY;
             }
         }
@@ -133,8 +124,8 @@ static int hvmemul_do_io(
             {
                 memcpy(p_data, &vio->mmio_large_read[addr - pa],
                        size);
-                if ( mfn_valid(mfn_x(ram_mfn)) )
-                    put_page(mfn_to_page(mfn_x(ram_mfn)));
+                if ( ram_page )
+                    put_page(ram_page);
                 return X86EMUL_OKAY;
             }
         }
@@ -148,8 +139,8 @@ static int hvmemul_do_io(
         vio->io_state = HVMIO_none;
         if ( p_data == NULL )
         {
-            if ( mfn_valid(mfn_x(ram_mfn)) )
-                put_page(mfn_to_page(mfn_x(ram_mfn)));
+            if ( ram_page )
+                put_page(ram_page);
             return X86EMUL_UNHANDLEABLE;
         }
         goto finish_access;
@@ -159,13 +150,13 @@ static int hvmemul_do_io(
              (addr == (vio->mmio_large_write_pa +
                        vio->mmio_large_write_bytes)) )
         {
-            if ( mfn_valid(mfn_x(ram_mfn)) )
-                put_page(mfn_to_page(mfn_x(ram_mfn)));
+            if ( ram_page )
+                put_page(ram_page);
             return X86EMUL_RETRY;
         }
     default:
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -173,8 +164,8 @@ static int hvmemul_do_io(
     {
         gdprintk(XENLOG_WARNING, "WARNING: io already pending (%d)?\n",
                  p->state);
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return X86EMUL_UNHANDLEABLE;
     }
 
@@ -226,8 +217,8 @@ static int hvmemul_do_io(
 
     if ( rc != X86EMUL_OKAY )
     {
-        if ( mfn_valid(mfn_x(ram_mfn)) )
-            put_page(mfn_to_page(mfn_x(ram_mfn)));
+        if ( ram_page )
+            put_page(ram_page);
         return rc;
     }
 
@@ -263,8 +254,8 @@ static int hvmemul_do_io(
         }
     }
 
-    if ( mfn_valid(mfn_x(ram_mfn)) )
-        put_page(mfn_to_page(mfn_x(ram_mfn)));
+    if ( ram_page )
+        put_page(ram_page);
     return X86EMUL_OKAY;
 }
 
@@ -686,7 +677,6 @@ static int hvmemul_rep_movs(
     p2m_type_t sp2mt, dp2mt;
     int rc, df = !!(ctxt->regs->eflags & X86_EFLAGS_DF);
     char *buf;
-    struct two_gfns tg;
 
     rc = hvmemul_virtual_to_linear(
         src_seg, src_offset, bytes_per_rep, reps, hvm_access_read,
@@ -714,25 +704,17 @@ static int hvmemul_rep_movs(
     if ( rc != X86EMUL_OKAY )
         return rc;
 
-    get_two_gfns(current->domain, sgpa >> PAGE_SHIFT, &sp2mt, NULL, NULL,
-                 current->domain, dgpa >> PAGE_SHIFT, &dp2mt, NULL, NULL,
-                 P2M_ALLOC, &tg);
+    /* Check for MMIO ops */
+    (void) get_gfn_query_unlocked(current->domain, sgpa >> PAGE_SHIFT, &sp2mt);
+    (void) get_gfn_query_unlocked(current->domain, dgpa >> PAGE_SHIFT, &dp2mt);
 
-    if ( !p2m_is_ram(sp2mt) && !p2m_is_grant(sp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+    if ( sp2mt == p2m_mmio_dm )
+        return hvmemul_do_mmio(
             sgpa, reps, bytes_per_rep, dgpa, IOREQ_READ, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }
 
-    if ( !p2m_is_ram(dp2mt) && !p2m_is_grant(dp2mt) )
-    {
-        rc = hvmemul_do_mmio(
+    if ( dp2mt == p2m_mmio_dm )
+        return hvmemul_do_mmio(
             dgpa, reps, bytes_per_rep, sgpa, IOREQ_WRITE, df, NULL);
-        put_two_gfns(&tg);
-        return rc;
-    }
 
     /* RAM-to-RAM copy: emulate as equivalent of memmove(dgpa, sgpa, bytes). */
     bytes = *reps * bytes_per_rep;
@@ -747,10 +729,7 @@ static int hvmemul_rep_movs(
      * can be emulated by a source-to-buffer-to-destination block copy.
      */
     if ( ((dgpa + bytes_per_rep) > sgpa) && (dgpa < (sgpa + bytes)) )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }
 
     /* Adjust destination address for reverse copy. */
     if ( df )
@@ -759,10 +738,7 @@ static int hvmemul_rep_movs(
     /* Allocate temporary buffer. Fall back to slow emulation if this fails. */
     buf = xmalloc_bytes(bytes);
     if ( buf == NULL )
-    {
-        put_two_gfns(&tg);
         return X86EMUL_UNHANDLEABLE;
-    }
 
     /*
      * We do a modicum of checking here, just for paranoia's sake and to
@@ -773,7 +749,6 @@ static int hvmemul_rep_movs(
         rc = hvm_copy_to_guest_phys(dgpa, buf, bytes);
 
     xfree(buf);
-    put_two_gfns(&tg);
 
     if ( rc == HVMCOPY_gfn_paged_out )
         return X86EMUL_RETRY;
diff -r 107285938c50 xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/hvm.c	Fri Apr 27 10:23:28 2012 +0100
@@ -395,48 +395,41 @@ int prepare_ring_for_helper(
 {
     struct page_info *page;
     p2m_type_t p2mt;
-    unsigned long mfn;
     void *va;
 
-    mfn = mfn_x(get_gfn_unshare(d, gmfn, &p2mt));
-    if ( !p2m_is_ram(p2mt) )
-    {
-        put_gfn(d, gmfn);
-        return -EINVAL;
-    }
+    page = get_page_from_gfn(d, gmfn, &p2mt, P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(d, gmfn);
+        if ( page )
+            put_page(page);
         p2m_mem_paging_populate(d, gmfn);
         return -ENOENT;
     }
     if ( p2m_is_shared(p2mt) )
     {
-        put_gfn(d, gmfn);
+        if ( page )
+            put_page(page);
         return -ENOENT;
     }
-    ASSERT(mfn_valid(mfn));
-
-    page = mfn_to_page(mfn);
-    if ( !get_page_and_type(page, d, PGT_writable_page) )
+    if ( !page )
+        return -EINVAL;
+
+    if ( !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn);
+        put_page(page);
         return -EINVAL;
     }
 
-    va = map_domain_page_global(mfn);
+    va = __map_domain_page_global(page);
     if ( va == NULL )
     {
         put_page_and_type(page);
-        put_gfn(d, gmfn);
         return -ENOMEM;
     }
 
     *_va = va;
     *_page = page;
 
-    put_gfn(d, gmfn);
-
     return 0;
 }
 
@@ -1607,8 +1600,8 @@ int hvm_mov_from_cr(unsigned int cr, uns
 int hvm_set_cr0(unsigned long value)
 {
     struct vcpu *v = current;
-    p2m_type_t p2mt;
-    unsigned long gfn, mfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
+    unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
+    struct page_info *page;
 
     HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR0 value = %lx", value);
 
@@ -1647,23 +1640,20 @@ int hvm_set_cr0(unsigned long value)
         {
             /* The guest CR3 must be pointing to the guest physical. */
             gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
-            mfn = mfn_x(get_gfn(v->domain, gfn, &p2mt));
-            if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) ||
-                 !get_page(mfn_to_page(mfn), v->domain))
+            page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+            if ( !page )
             {
-                put_gfn(v->domain, gfn);
-                gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx (mfn=%lx)\n",
-                         v->arch.hvm_vcpu.guest_cr[3], mfn);
+                gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
+                         v->arch.hvm_vcpu.guest_cr[3]);
                 domain_crash(v->domain);
                 return X86EMUL_UNHANDLEABLE;
             }
 
             /* Now arch.guest_table points to machine physical. */
-            v->arch.guest_table = pagetable_from_pfn(mfn);
+            v->arch.guest_table = pagetable_from_page(page);
 
             HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx, mfn = %lx",
-                        v->arch.hvm_vcpu.guest_cr[3], mfn);
-            put_gfn(v->domain, gfn);
+                        v->arch.hvm_vcpu.guest_cr[3], page_to_mfn(page));
         }
     }
     else if ( !(value & X86_CR0_PG) && (old_value & X86_CR0_PG) )
@@ -1738,26 +1728,21 @@ int hvm_set_cr0(unsigned long value)
 
 int hvm_set_cr3(unsigned long value)
 {
-    unsigned long mfn;
-    p2m_type_t p2mt;
     struct vcpu *v = current;
+    struct page_info *page;
 
     if ( hvm_paging_enabled(v) && !paging_mode_hap(v->domain) &&
          (value != v->arch.hvm_vcpu.guest_cr[3]) )
     {
         /* Shadow-mode CR3 change. Check PDBR and update refcounts. */
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "CR3 value = %lx", value);
-        mfn = mfn_x(get_gfn(v->domain, value >> PAGE_SHIFT, &p2mt));
-        if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) ||
-             !get_page(mfn_to_page(mfn), v->domain) )
-        {
-              put_gfn(v->domain, value >> PAGE_SHIFT);
-              goto bad_cr3;
-        }
+        page = get_page_from_gfn(v->domain, value >> PAGE_SHIFT,
+                                 NULL, P2M_ALLOC);
+        if ( !page )
+            goto bad_cr3;
 
         put_page(pagetable_get_page(v->arch.guest_table));
-        v->arch.guest_table = pagetable_from_pfn(mfn);
-        put_gfn(v->domain, value >> PAGE_SHIFT);
+        v->arch.guest_table = pagetable_from_page(page);
 
         HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx", value);
     }
@@ -1914,46 +1899,29 @@ int hvm_virtual_to_linear_addr(
 static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable)
 {
     void *map;
-    unsigned long mfn;
     p2m_type_t p2mt;
-    struct page_info *pg;
+    struct page_info *page;
     struct domain *d = current->domain;
-    int rc;
-
-    mfn = mfn_x(writable
-                ? get_gfn_unshare(d, gfn, &p2mt)
-                : get_gfn(d, gfn, &p2mt));
-    if ( (p2m_is_shared(p2mt) && writable) || !p2m_is_ram(p2mt) )
+
+    page = get_page_from_gfn(d, gfn, &p2mt,
+                             writable ? P2M_UNSHARE : P2M_ALLOC);
+    if ( (p2m_is_shared(p2mt) && writable) || !page )
     {
-        put_gfn(d, gfn);
+        if ( page )
+            put_page(page);
         return NULL;
     }
     if ( p2m_is_paging(p2mt) )
     {
-        put_gfn(d, gfn);
+        put_page(page);
         p2m_mem_paging_populate(d, gfn);
         return NULL;
     }
 
-    ASSERT(mfn_valid(mfn));
-
     if ( writable )
-        paging_mark_dirty(d, mfn);
-
-    /* Get a ref on the page, considering that it could be shared */
-    pg = mfn_to_page(mfn);
-    rc = get_page(pg, d);
-    if ( !rc && !writable )
-        /* Page could be shared */
-        rc = get_page(pg, dom_cow);
-    if ( !rc )
-    {
-        put_gfn(d, gfn);
-        return NULL;
-    }
-
-    map = map_domain_page(mfn);
-    put_gfn(d, gfn);
+        paging_mark_dirty(d, page_to_mfn(page));
+
+    map = __map_domain_page(page);
     return map;
 }
 
@@ -2358,7 +2326,8 @@ static enum hvm_copy_result __hvm_copy(
     void *buf, paddr_t addr, int size, unsigned int flags, uint32_t pfec)
 {
     struct vcpu *curr = current;
-    unsigned long gfn, mfn;
+    unsigned long gfn;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
@@ -2402,32 +2371,33 @@ static enum hvm_copy_result __hvm_copy(
             gfn = addr >> PAGE_SHIFT;
         }
 
-        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
+        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
 
         if ( p2m_is_paging(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             p2m_mem_paging_populate(curr->domain, gfn);
             return HVMCOPY_gfn_paged_out;
         }
         if ( p2m_is_shared(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_gfn_shared;
         }
         if ( p2m_is_grant(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_unhandleable;
         }
-        if ( !p2m_is_ram(p2mt) )
+        if ( !page )
         {
-            put_gfn(curr->domain, gfn);
             return HVMCOPY_bad_gfn_to_mfn;
         }
-        ASSERT(mfn_valid(mfn));
-
-        p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
+
+        p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( flags & HVMCOPY_to_guest )
         {
@@ -2437,12 +2407,12 @@ static enum hvm_copy_result __hvm_copy(
                 if ( xchg(&lastpage, gfn) != gfn )
                     gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
                              " memory page. gfn=%#lx, mfn=%#lx\n",
-                             gfn, mfn);
+                             gfn, page_to_mfn(page));
             }
             else
             {
                 memcpy(p, buf, count);
-                paging_mark_dirty(curr->domain, mfn);
+                paging_mark_dirty(curr->domain, page_to_mfn(page));
             }
         }
         else
@@ -2455,7 +2425,7 @@ static enum hvm_copy_result __hvm_copy(
         addr += count;
         buf  += count;
         todo -= count;
-        put_gfn(curr->domain, gfn);
+        put_page(page);
     }
 
     return HVMCOPY_okay;
@@ -2464,7 +2434,8 @@ static enum hvm_copy_result __hvm_copy(
 static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
 {
     struct vcpu *curr = current;
-    unsigned long gfn, mfn;
+    unsigned long gfn;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
@@ -2500,32 +2471,35 @@ static enum hvm_copy_result __hvm_clear(
             return HVMCOPY_bad_gva_to_gfn;
         }
 
-        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
+        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
 
         if ( p2m_is_paging(p2mt) )
         {
+            if ( page )
+                put_page(page);
             p2m_mem_paging_populate(curr->domain, gfn);
-            put_gfn(curr->domain, gfn);
             return HVMCOPY_gfn_paged_out;
         }
         if ( p2m_is_shared(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_gfn_shared;
         }
         if ( p2m_is_grant(p2mt) )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_unhandleable;
         }
-        if ( !p2m_is_ram(p2mt) )
+        if ( !page )
         {
-            put_gfn(curr->domain, gfn);
+            if ( page )
+                put_page(page);
             return HVMCOPY_bad_gfn_to_mfn;
         }
-        ASSERT(mfn_valid(mfn));
-
-        p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
+
+        p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( p2mt == p2m_ram_ro )
         {
@@ -2533,19 +2507,19 @@ static enum hvm_copy_result __hvm_clear(
             if ( xchg(&lastpage, gfn) != gfn )
                 gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
                         " memory page. gfn=%#lx, mfn=%#lx\n",
-                        gfn, mfn);
+                         gfn, page_to_mfn(page));
         }
         else
         {
             memset(p, 0x00, count);
-            paging_mark_dirty(curr->domain, mfn);
+            paging_mark_dirty(curr->domain, page_to_mfn(page));
         }
 
         unmap_domain_page(p);
 
         addr += count;
         todo -= count;
-        put_gfn(curr->domain, gfn);
+        put_page(page);
     }
 
     return HVMCOPY_okay;
@@ -4000,35 +3974,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
 
         for ( pfn = a.first_pfn; pfn < a.first_pfn + a.nr; pfn++ )
         {
-            p2m_type_t t;
-            mfn_t mfn = get_gfn_unshare(d, pfn, &t);
-            if ( p2m_is_paging(t) )
+            struct page_info *page;
+            page = get_page_from_gfn(d, pfn, NULL, P2M_UNSHARE);
+            if ( page )
             {
-                put_gfn(d, pfn);
-                p2m_mem_paging_populate(d, pfn);
-                rc = -EINVAL;
-                goto param_fail3;
-            }
-            if( p2m_is_shared(t) )
-            {
-                /* If it insists on not unsharing itself, crash the domain 
-                 * rather than crashing the host down in mark dirty */
-                gdprintk(XENLOG_WARNING,
-                         "shared pfn 0x%lx modified?\n", pfn);
-                domain_crash(d);
-                put_gfn(d, pfn);
-                rc = -EINVAL;
-                goto param_fail3;
-            }
-            
-            if ( mfn_x(mfn) != INVALID_MFN )
-            {
-                paging_mark_dirty(d, mfn_x(mfn));
+                paging_mark_dirty(d, page_to_mfn(page));
                 /* These are most probably not page tables any more */
                 /* don't take a long time and don't die either */
-                sh_remove_shadows(d->vcpu[0], mfn, 1, 0);
+                sh_remove_shadows(d->vcpu[0], _mfn(page_to_mfn(page)), 1, 0);
+                put_page(page);
             }
-            put_gfn(d, pfn);
         }
 
     param_fail3:
diff -r 107285938c50 xen/arch/x86/hvm/stdvga.c
--- a/xen/arch/x86/hvm/stdvga.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/stdvga.c	Fri Apr 27 10:23:28 2012 +0100
@@ -482,7 +482,8 @@ static int mmio_move(struct hvm_hw_stdvg
                 if ( hvm_copy_to_guest_phys(data, &tmp, p->size) !=
                      HVMCOPY_okay )
                 {
-                    (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
+                    struct page_info *dp = get_page_from_gfn(
+                            d, data >> PAGE_SHIFT, &p2mt, P2M_ALLOC);
                     /*
                      * The only case we handle is vga_mem <-> vga_mem.
                      * Anything else disables caching and leaves it to qemu-dm.
@@ -490,11 +491,12 @@ static int mmio_move(struct hvm_hw_stdvg
                     if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                          ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
                     {
-                        put_gfn(d, data >> PAGE_SHIFT);
+                        if ( dp )
+                            put_page(dp);
                         return 0;
                     }
+                    ASSERT(!dp);
                     stdvga_mem_write(data, tmp, p->size);
-                    put_gfn(d, data >> PAGE_SHIFT);
                 }
                 data += sign * p->size;
                 addr += sign * p->size;
@@ -508,15 +510,16 @@ static int mmio_move(struct hvm_hw_stdvg
                 if ( hvm_copy_from_guest_phys(&tmp, data, p->size) !=
                      HVMCOPY_okay )
                 {
-                    (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
+                    struct page_info *dp = get_page_from_gfn(
+                        d, data >> PAGE_SHIFT, &p2mt, P2M_ALLOC);
                     if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                          ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
                     {
-                        put_gfn(d, data >> PAGE_SHIFT);
+                        if ( dp )
+                            put_page(dp);
                         return 0;
                     }
                     tmp = stdvga_mem_read(data, p->size);
-                    put_gfn(d, data >> PAGE_SHIFT);
                 }
                 stdvga_mem_write(addr, tmp, p->size);
                 data += sign * p->size;
diff -r 107285938c50 xen/arch/x86/hvm/viridian.c
--- a/xen/arch/x86/hvm/viridian.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/viridian.c	Fri Apr 27 10:23:28 2012 +0100
@@ -134,18 +134,19 @@ void dump_apic_assist(struct vcpu *v)
 static void enable_hypercall_page(struct domain *d)
 {
     unsigned long gmfn = d->arch.hvm_domain.viridian.hypercall_gpa.fields.pfn;
-    unsigned long mfn = get_gfn_untyped(d, gmfn);
+    struct page_info *page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
     uint8_t *p;
 
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    if ( !page || !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn); 
-        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn);
+        if ( page )
+            put_page(page);
+        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn,
+                 page_to_mfn(page));
         return;
     }
 
-    p = map_domain_page(mfn);
+    p = __map_domain_page(page);
 
     /*
      * We set the bit 31 in %eax (reserved field in the Viridian hypercall
@@ -162,15 +163,14 @@ static void enable_hypercall_page(struct
 
     unmap_domain_page(p);
 
-    put_page_and_type(mfn_to_page(mfn));
-    put_gfn(d, gmfn); 
+    put_page_and_type(page);
 }
 
 void initialize_apic_assist(struct vcpu *v)
 {
     struct domain *d = v->domain;
     unsigned long gmfn = v->arch.hvm_vcpu.viridian.apic_assist.fields.pfn;
-    unsigned long mfn = get_gfn_untyped(d, gmfn);
+    struct page_info *page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
     uint8_t *p;
 
     /*
@@ -183,22 +183,22 @@ void initialize_apic_assist(struct vcpu 
      * details of how Windows uses the page.
      */
 
-    if ( !mfn_valid(mfn) ||
-         !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+    if ( !page || !get_page_type(page, PGT_writable_page) )
     {
-        put_gfn(d, gmfn); 
-        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn);
+        if ( page )
+            put_page(page);
+        gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn,
+                 page_to_mfn(page));
         return;
     }
 
-    p = map_domain_page(mfn);
+    p = __map_domain_page(page);
 
     *(u32 *)p = 0;
 
     unmap_domain_page(p);
 
-    put_page_and_type(mfn_to_page(mfn));
-    put_gfn(d, gmfn); 
+    put_page_and_type(page);
 }
 
 int wrmsr_viridian_regs(uint32_t idx, uint64_t val)
diff -r 107285938c50 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Fri Apr 27 10:23:28 2012 +0100
@@ -480,17 +480,16 @@ static void vmx_vmcs_save(struct vcpu *v
 static int vmx_restore_cr0_cr3(
     struct vcpu *v, unsigned long cr0, unsigned long cr3)
 {
-    unsigned long mfn = 0;
-    p2m_type_t p2mt;
+    struct page_info *page = NULL;
 
     if ( paging_mode_shadow(v->domain) )
     {
         if ( cr0 & X86_CR0_PG )
         {
-            mfn = mfn_x(get_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt));
-            if ( !p2m_is_ram(p2mt) || !get_page(mfn_to_page(mfn), v->domain) )
+            page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT,
+                                     NULL, P2M_ALLOC);
+            if ( !page )
             {
-                put_gfn(v->domain, cr3 >> PAGE_SHIFT);
                 gdprintk(XENLOG_ERR, "Invalid CR3 value=0x%lx\n", cr3);
                 return -EINVAL;
             }
@@ -499,9 +498,8 @@ static int vmx_restore_cr0_cr3(
         if ( hvm_paging_enabled(v) )
             put_page(pagetable_get_page(v->arch.guest_table));
 
-        v->arch.guest_table = pagetable_from_pfn(mfn);
-        if ( cr0 & X86_CR0_PG )
-            put_gfn(v->domain, cr3 >> PAGE_SHIFT);
+        v->arch.guest_table =
+            page ? pagetable_from_page(page) : pagetable_null();
     }
 
     v->arch.hvm_vcpu.guest_cr[0] = cr0 | X86_CR0_ET;
@@ -1026,8 +1024,9 @@ static void vmx_set_interrupt_shadow(str
 
 static void vmx_load_pdptrs(struct vcpu *v)
 {
-    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3], mfn;
+    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
     uint64_t *guest_pdptrs;
+    struct page_info *page;
     p2m_type_t p2mt;
     char *p;
 
@@ -1038,24 +1037,19 @@ static void vmx_load_pdptrs(struct vcpu 
     if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
         goto crash;
 
-    mfn = mfn_x(get_gfn_unshare(v->domain, cr3 >> PAGE_SHIFT, &p2mt));
-    if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) || 
-         /* If we didn't succeed in unsharing, get_page will fail
-          * (page still belongs to dom_cow) */
-         !get_page(mfn_to_page(mfn), v->domain) )
+    page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);
+    if ( !page )
     {
         /* Ideally you don't want to crash but rather go into a wait 
          * queue, but this is the wrong place. We're holding at least
          * the paging lock */
         gdprintk(XENLOG_ERR,
-                 "Bad cr3 on load pdptrs gfn %lx mfn %lx type %d\n",
-                 cr3 >> PAGE_SHIFT, mfn, (int) p2mt);
-        put_gfn(v->domain, cr3 >> PAGE_SHIFT);
+                 "Bad cr3 on load pdptrs gfn %lx type %d\n",
+                 cr3 >> PAGE_SHIFT, (int) p2mt);
         goto crash;
     }
-    put_gfn(v->domain, cr3 >> PAGE_SHIFT);
-
-    p = map_domain_page(mfn);
+
+    p = __map_domain_page(page);
 
     guest_pdptrs = (uint64_t *)(p + (cr3 & ~PAGE_MASK));
 
@@ -1081,7 +1075,7 @@ static void vmx_load_pdptrs(struct vcpu 
     vmx_vmcs_exit(v);
 
     unmap_domain_page(p);
-    put_page(mfn_to_page(mfn));
+    put_page(page);
     return;
 
  crash:
diff -r 107285938c50 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm.c	Fri Apr 27 10:23:28 2012 +0100
@@ -651,7 +651,8 @@ int map_ldt_shadow_page(unsigned int off
 {
     struct vcpu *v = current;
     struct domain *d = v->domain;
-    unsigned long gmfn, mfn;
+    unsigned long gmfn;
+    struct page_info *page;
     l1_pgentry_t l1e, nl1e;
     unsigned long gva = v->arch.pv_vcpu.ldt_base + (off << PAGE_SHIFT);
     int okay;
@@ -663,28 +664,24 @@ int map_ldt_shadow_page(unsigned int off
         return 0;
 
     gmfn = l1e_get_pfn(l1e);
-    mfn = get_gfn_untyped(d, gmfn);
-    if ( unlikely(!mfn_valid(mfn)) )
+    page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
+    if ( unlikely(!page) )
+        return 0;
+
+    okay = get_page_type(page, PGT_seg_desc_page);
+    if ( unlikely(!okay) )
     {
-        put_gfn(d, gmfn); 
+        put_page(page);
         return 0;
     }
 
-    okay = get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page);
-    if ( unlikely(!okay) )
-    {
-        put_gfn(d, gmfn); 
-        return 0;
-    }
-
-    nl1e = l1e_from_pfn(mfn, l1e_get_flags(l1e) | _PAGE_RW);
+    nl1e = l1e_from_pfn(page_to_mfn(page), l1e_get_flags(l1e) | _PAGE_RW);
 
     spin_lock(&v->arch.pv_vcpu.shadow_ldt_lock);
     l1e_write(&v->arch.perdomain_ptes[off + 16], nl1e);
     v->arch.pv_vcpu.shadow_ldt_mapcnt++;
     spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock);
 
-    put_gfn(d, gmfn); 
     return 1;
 }
 
@@ -1819,7 +1816,6 @@ static int mod_l1_entry(l1_pgentry_t *pl
 {
     l1_pgentry_t ol1e;
     struct domain *pt_dom = pt_vcpu->domain;
-    p2m_type_t p2mt;
     int rc = 0;
 
     if ( unlikely(__copy_from_user(&ol1e, pl1e, sizeof(ol1e)) != 0) )
@@ -1835,22 +1831,21 @@ static int mod_l1_entry(l1_pgentry_t *pl
     if ( l1e_get_flags(nl1e) & _PAGE_PRESENT )
     {
         /* Translate foreign guest addresses. */
-        unsigned long mfn, gfn;
-        gfn = l1e_get_pfn(nl1e);
-        mfn = mfn_x(get_gfn(pg_dom, gfn, &p2mt));
-        if ( !p2m_is_ram(p2mt) || unlikely(mfn == INVALID_MFN) )
+        struct page_info *page = NULL;
+        if ( paging_mode_translate(pg_dom) )
         {
-            put_gfn(pg_dom, gfn);
-            return -EINVAL;
+            page = get_page_from_gfn(pg_dom, l1e_get_pfn(nl1e), NULL, P2M_ALLOC);
+            if ( !page )
+                return -EINVAL;
+            nl1e = l1e_from_pfn(page_to_mfn(page), l1e_get_flags(nl1e));
         }
-        ASSERT((mfn & ~(PADDR_MASK >> PAGE_SHIFT)) == 0);
-        nl1e = l1e_from_pfn(mfn, l1e_get_flags(nl1e));
 
         if ( unlikely(l1e_get_flags(nl1e) & l1_disallow_mask(pt_dom)) )
         {
             MEM_LOG("Bad L1 flags %x",
                     l1e_get_flags(nl1e) & l1_disallow_mask(pt_dom));
-            put_gfn(pg_dom, gfn);
+            if ( page )
+                put_page(page);
             return -EINVAL;
         }
 
@@ -1860,15 +1855,21 @@ static int mod_l1_entry(l1_pgentry_t *pl
             adjust_guest_l1e(nl1e, pt_dom);
             if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                               preserve_ad) )
+            {
+                if ( page )
+                    put_page(page);
                 return 0;
-            put_gfn(pg_dom, gfn);
+            }
+            if ( page )
+                put_page(page);
             return -EBUSY;
         }
 
         switch ( rc = get_page_from_l1e(nl1e, pt_dom, pg_dom) )
         {
         default:
-            put_gfn(pg_dom, gfn);
+            if ( page )
+                put_page(page);
             return rc;
         case 0:
             break;
@@ -1876,7 +1877,9 @@ static int mod_l1_entry(l1_pgentry_t *pl
             l1e_remove_flags(nl1e, _PAGE_RW);
             break;
         }
-        
+        if ( page )
+            put_page(page);
+
         adjust_guest_l1e(nl1e, pt_dom);
         if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                                     preserve_ad)) )
@@ -1884,7 +1887,6 @@ static int mod_l1_entry(l1_pgentry_t *pl
             ol1e = nl1e;
             rc = -EBUSY;
         }
-        put_gfn(pg_dom, gfn);
     }
     else if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
                                      preserve_ad)) )
@@ -3042,7 +3044,6 @@ int do_mmuext_op(
             type = PGT_l4_page_table;
 
         pin_page: {
-            unsigned long mfn;
             struct page_info *page;
 
             /* Ignore pinning of invalid paging levels. */
@@ -3052,25 +3053,28 @@ int do_mmuext_op(
             if ( paging_mode_refcounts(pg_owner) )
                 break;
 
-            mfn = get_gfn_untyped(pg_owner, op.arg1.mfn);
-            rc = get_page_and_type_from_pagenr(mfn, type, pg_owner, 0, 1);
+            page = get_page_from_gfn(pg_owner, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!page) )
+            {
+                rc = -EINVAL;
+                break;
+            }
+
+            rc = get_page_type_preemptible(page, type);
             okay = !rc;
             if ( unlikely(!okay) )
             {
                 if ( rc == -EINTR )
                     rc = -EAGAIN;
                 else if ( rc != -EAGAIN )
-                    MEM_LOG("Error while pinning mfn %lx", mfn);
-                put_gfn(pg_owner, op.arg1.mfn);
+                    MEM_LOG("Error while pinning mfn %lx", page_to_mfn(page));
+                put_page(page);
                 break;
             }
 
-            page = mfn_to_page(mfn);
-
             if ( (rc = xsm_memory_pin_page(d, page)) != 0 )
             {
                 put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
                 okay = 0;
                 break;
             }
@@ -3078,16 +3082,15 @@ int do_mmuext_op(
             if ( unlikely(test_and_set_bit(_PGT_pinned,
                                            &page->u.inuse.type_info)) )
             {
-                MEM_LOG("Mfn %lx already pinned", mfn);
+                MEM_LOG("Mfn %lx already pinned", page_to_mfn(page));
                 put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
                 okay = 0;
                 break;
             }
 
             /* A page is dirtied when its pin status is set. */
-            paging_mark_dirty(pg_owner, mfn);
-           
+            paging_mark_dirty(pg_owner, page_to_mfn(page));
+
             /* We can race domain destruction (domain_relinquish_resources). */
             if ( unlikely(pg_owner != d) )
             {
@@ -3099,35 +3102,29 @@ int do_mmuext_op(
                 spin_unlock(&pg_owner->page_alloc_lock);
                 if ( drop_ref )
                     put_page_and_type(page);
-                put_gfn(pg_owner, op.arg1.mfn);
             }
 
             break;
         }
 
         case MMUEXT_UNPIN_TABLE: {
-            unsigned long mfn;
             struct page_info *page;
 
             if ( paging_mode_refcounts(pg_owner) )
                 break;
 
-            mfn = get_gfn_untyped(pg_owner, op.arg1.mfn);
-            if ( unlikely(!(okay = get_page_from_pagenr(mfn, pg_owner))) )
+            page = get_page_from_gfn(pg_owner, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!page) )
             {
-                put_gfn(pg_owner, op.arg1.mfn);
-                MEM_LOG("Mfn %lx bad domain", mfn);
+                MEM_LOG("Mfn %lx bad domain", op.arg1.mfn);
                 break;
             }
 
-            page = mfn_to_page(mfn);
-
             if ( !test_and_clear_bit(_PGT_pinned, &page->u.inuse.type_info) )
             {
                 okay = 0;
                 put_page(page);
-                put_gfn(pg_owner, op.arg1.mfn);
-                MEM_LOG("Mfn %lx not pinned", mfn);
+                MEM_LOG("Mfn %lx not pinned", op.arg1.mfn);
                 break;
             }
 
@@ -3135,40 +3132,43 @@ int do_mmuext_op(
             put_page(page);
 
             /* A page is dirtied when its pin status is cleared. */
-            paging_mark_dirty(pg_owner, mfn);
-
-            put_gfn(pg_owner, op.arg1.mfn);
+            paging_mark_dirty(pg_owner, page_to_mfn(page));
+
             break;
         }
 
         case MMUEXT_NEW_BASEPTR:
-            okay = new_guest_cr3(get_gfn_untyped(d, op.arg1.mfn));
-            put_gfn(d, op.arg1.mfn);
+            okay = (!paging_mode_translate(d)
+                    && new_guest_cr3(op.arg1.mfn));
             break;
+
         
 #ifdef __x86_64__
         case MMUEXT_NEW_USER_BASEPTR: {
-            unsigned long old_mfn, mfn;
-
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            if ( mfn != 0 )
+            unsigned long old_mfn;
+
+            if ( paging_mode_translate(current->domain) )
+            {
+                okay = 0;
+                break;
+            }
+
+            if ( op.arg1.mfn != 0 )
             {
                 if ( paging_mode_refcounts(d) )
-                    okay = get_page_from_pagenr(mfn, d);
+                    okay = get_page_from_pagenr(op.arg1.mfn, d);
                 else
                     okay = !get_page_and_type_from_pagenr(
-                        mfn, PGT_root_page_table, d, 0, 0);
+                        op.arg1.mfn, PGT_root_page_table, d, 0, 0);
                 if ( unlikely(!okay) )
                 {
-                    put_gfn(d, op.arg1.mfn);
-                    MEM_LOG("Error while installing new mfn %lx", mfn);
+                    MEM_LOG("Error while installing new mfn %lx", op.arg1.mfn);
                     break;
                 }
             }
 
             old_mfn = pagetable_get_pfn(curr->arch.guest_table_user);
-            curr->arch.guest_table_user = pagetable_from_pfn(mfn);
-            put_gfn(d, op.arg1.mfn);
+            curr->arch.guest_table_user = pagetable_from_pfn(op.arg1.mfn);
 
             if ( old_mfn != 0 )
             {
@@ -3283,28 +3283,26 @@ int do_mmuext_op(
         }
 
         case MMUEXT_CLEAR_PAGE: {
-            unsigned long mfn;
+            struct page_info *page;
             unsigned char *ptr;
 
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            okay = !get_page_and_type_from_pagenr(
-                mfn, PGT_writable_page, d, 0, 0);
-            if ( unlikely(!okay) )
+            page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
+            if ( !page || !get_page_type(page, PGT_writable_page) )
             {
-                put_gfn(d, op.arg1.mfn);
-                MEM_LOG("Error while clearing mfn %lx", mfn);
+                if ( page )
+                    put_page(page);
+                MEM_LOG("Error while clearing mfn %lx", op.arg1.mfn);
                 break;
             }
 
             /* A page is dirtied when it's being cleared. */
-            paging_mark_dirty(d, mfn);
-
-            ptr = fixmap_domain_page(mfn);
+            paging_mark_dirty(d, page_to_mfn(page));
+
+            ptr = fixmap_domain_page(page_to_mfn(page));
             clear_page(ptr);
             fixunmap_domain_page(ptr);
 
-            put_page_and_type(mfn_to_page(mfn));
-            put_gfn(d, op.arg1.mfn);
+            put_page_and_type(page);
             break;
         }
 
@@ -3312,42 +3310,38 @@ int do_mmuext_op(
         {
             const unsigned char *src;
             unsigned char *dst;
-            unsigned long src_mfn, mfn;
-
-            src_mfn = get_gfn_untyped(d, op.arg2.src_mfn);
-            okay = get_page_from_pagenr(src_mfn, d);
+            struct page_info *src_page, *dst_page;
+
+            src_page = get_page_from_gfn(d, op.arg2.src_mfn, NULL, P2M_ALLOC);
+            if ( unlikely(!src_page) )
+            {
+                okay = 0;
+                MEM_LOG("Error while copying from mfn %lx", op.arg2.src_mfn);
+                break;
+            }
+
+            dst_page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
+            okay = (dst_page && get_page_type(dst_page, PGT_writable_page));
             if ( unlikely(!okay) )
             {
-                put_gfn(d, op.arg2.src_mfn);
-                MEM_LOG("Error while copying from mfn %lx", src_mfn);
+                put_page(src_page);
+                if ( dst_page )
+                    put_page(dst_page);
+                MEM_LOG("Error while copying to mfn %lx", op.arg1.mfn);
                 break;
             }
 
-            mfn = get_gfn_untyped(d, op.arg1.mfn);
-            okay = !get_page_and_type_from_pagenr(
-                mfn, PGT_writable_page, d, 0, 0);
-            if ( unlikely(!okay) )
-            {
-                put_gfn(d, op.arg1.mfn);
-                put_page(mfn_to_page(src_mfn));
-                put_gfn(d, op.arg2.src_mfn);
-                MEM_LOG("Error while copying to mfn %lx", mfn);
-                break;
-            }
-
             /* A page is dirtied when it's being copied to. */
-            paging_mark_dirty(d, mfn);
-
-            src = map_domain_page(src_mfn);
-            dst = fixmap_domain_page(mfn);
+            paging_mark_dirty(d, page_to_mfn(dst_page));
+
+            src = __map_domain_page(src_page);
+            dst = fixmap_domain_page(page_to_mfn(dst_page));
             copy_page(dst, src);
             fixunmap_domain_page(dst);
             unmap_domain_page(src);
 
-            put_page_and_type(mfn_to_page(mfn));
-            put_gfn(d, op.arg1.mfn);
-            put_page(mfn_to_page(src_mfn));
-            put_gfn(d, op.arg2.src_mfn);
+            put_page_and_type(dst_page);
+            put_page(src_page);
             break;
         }
 
@@ -3538,29 +3532,26 @@ int do_mmu_update(
 
             req.ptr -= cmd;
             gmfn = req.ptr >> PAGE_SHIFT;
-            mfn = mfn_x(get_gfn(pt_owner, gmfn, &p2mt));
-            if ( !p2m_is_valid(p2mt) )
-                mfn = INVALID_MFN;
+            page = get_page_from_gfn(pt_owner, gmfn, &p2mt, P2M_ALLOC);
 
             if ( p2m_is_paged(p2mt) )
             {
-                put_gfn(pt_owner, gmfn);
+                ASSERT(!page);
                 p2m_mem_paging_populate(pg_owner, gmfn);
                 rc = -ENOENT;
                 break;
             }
 
-            if ( unlikely(!get_page_from_pagenr(mfn, pt_owner)) )
+            if ( unlikely(!page) )
             {
                 MEM_LOG("Could not get page for normal update");
-                put_gfn(pt_owner, gmfn);
                 break;
             }
 
+            mfn = page_to_mfn(page);
             va = map_domain_page_with_cache(mfn, &mapcache);
             va = (void *)((unsigned long)va +
                           (unsigned long)(req.ptr & ~PAGE_MASK));
-            page = mfn_to_page(mfn);
 
             if ( page_lock(page) )
             {
@@ -3569,22 +3560,23 @@ int do_mmu_update(
                 case PGT_l1_page_table:
                 {
                     l1_pgentry_t l1e = l1e_from_intpte(req.val);
-                    p2m_type_t l1e_p2mt;
-                    unsigned long l1egfn = l1e_get_pfn(l1e), l1emfn;
-    
-                    l1emfn = mfn_x(get_gfn(pg_owner, l1egfn, &l1e_p2mt));
+                    p2m_type_t l1e_p2mt = p2m_ram_rw;
+                    struct page_info *target = NULL;
+
+                    if ( paging_mode_translate(pg_owner) )
+                        target = get_page_from_gfn(pg_owner, l1e_get_pfn(l1e),
+                                                   &l1e_p2mt, P2M_ALLOC);
 
                     if ( p2m_is_paged(l1e_p2mt) )
                     {
-                        put_gfn(pg_owner, l1egfn);
+                        if ( target )
+                            put_page(target);
                         p2m_mem_paging_populate(pg_owner, l1e_get_pfn(l1e));
                         rc = -ENOENT;
                         break;
                     }
-                    else if ( p2m_ram_paging_in == l1e_p2mt && 
-                                !mfn_valid(l1emfn) )
+                    else if ( p2m_ram_paging_in == l1e_p2mt && !target )
                     {
-                        put_gfn(pg_owner, l1egfn);
                         rc = -ENOENT;
                         break;
                     }
@@ -3601,7 +3593,8 @@ int do_mmu_update(
                             rc = mem_sharing_unshare_page(pg_owner, gfn, 0); 
                             if ( rc )
                             {
-                                put_gfn(pg_owner, l1egfn);
+                                if ( target )
+                                    put_page(target);
                                 /* Notify helper, don't care about errors, will not
                                  * sleep on wq, since we're a foreign domain. */
                                 (void)mem_sharing_notify_enomem(pg_owner, gfn, 0);
@@ -3614,112 +3607,22 @@ int do_mmu_update(
                     rc = mod_l1_entry(va, l1e, mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v,
                                       pg_owner);
-                    put_gfn(pg_owner, l1egfn);
+                    if ( target )
+                        put_page(target);
                 }
                 break;
                 case PGT_l2_page_table:
-                {
-                    l2_pgentry_t l2e = l2e_from_intpte(req.val);
-                    p2m_type_t l2e_p2mt;
-                    unsigned long l2egfn = l2e_get_pfn(l2e), l2emfn;
-
-                    l2emfn = mfn_x(get_gfn(pg_owner, l2egfn, &l2e_p2mt));
-
-                    if ( p2m_is_paged(l2e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        p2m_mem_paging_populate(pg_owner, l2egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l2e_p2mt && 
-                                !mfn_valid(l2emfn) )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l2e_p2mt )
-                    {
-                        put_gfn(pg_owner, l2egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-
-                    rc = mod_l2_entry(va, l2e, mfn,
+                    rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    put_gfn(pg_owner, l2egfn);
-                }
-                break;
+                    break;
                 case PGT_l3_page_table:
-                {
-                    l3_pgentry_t l3e = l3e_from_intpte(req.val);
-                    p2m_type_t l3e_p2mt;
-                    unsigned long l3egfn = l3e_get_pfn(l3e), l3emfn;
-
-                    l3emfn = mfn_x(get_gfn(pg_owner, l3egfn, &l3e_p2mt));
-
-                    if ( p2m_is_paged(l3e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        p2m_mem_paging_populate(pg_owner, l3egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l3e_p2mt && 
-                                !mfn_valid(l3emfn) )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l3e_p2mt )
-                    {
-                        put_gfn(pg_owner, l3egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-                    rc = mod_l3_entry(va, l3e, mfn,
+                    rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
-                    put_gfn(pg_owner, l3egfn);
-                }
-                break;
+                    break;
 #if CONFIG_PAGING_LEVELS >= 4
                 case PGT_l4_page_table:
-                {
-                    l4_pgentry_t l4e = l4e_from_intpte(req.val);
-                    p2m_type_t l4e_p2mt;
-                    unsigned long l4egfn = l4e_get_pfn(l4e), l4emfn;
-
-                    l4emfn = mfn_x(get_gfn(pg_owner, l4egfn, &l4e_p2mt));
-
-                    if ( p2m_is_paged(l4e_p2mt) )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        p2m_mem_paging_populate(pg_owner, l4egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_paging_in == l4e_p2mt && 
-                                !mfn_valid(l4emfn) )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        rc = -ENOENT;
-                        break;
-                    }
-                    else if ( p2m_ram_shared == l4e_p2mt )
-                    {
-                        put_gfn(pg_owner, l4egfn);
-                        MEM_LOG("Unexpected attempt to map shared page.\n");
-                        break;
-                    }
-
-                    rc = mod_l4_entry(va, l4e, mfn,
+                    rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
-                    put_gfn(pg_owner, l4egfn);
-                }
                 break;
 #endif
                 case PGT_writable_page:
@@ -3742,7 +3645,6 @@ int do_mmu_update(
 
             unmap_domain_page_with_cache(va, &mapcache);
             put_page(page);
-            put_gfn(pt_owner, gmfn);
         }
         break;
 
diff -r 107285938c50 xen/arch/x86/mm/guest_walk.c
--- a/xen/arch/x86/mm/guest_walk.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/guest_walk.c	Fri Apr 27 10:23:28 2012 +0100
@@ -94,39 +94,37 @@ static inline void *map_domain_gfn(struc
                                    p2m_type_t *p2mt,
                                    uint32_t *rc) 
 {
-    p2m_access_t p2ma;
+    struct page_info *page;
     void *map;
 
     /* Translate the gfn, unsharing if shared */
-    *mfn = get_gfn_type_access(p2m, gfn_x(gfn), p2mt, &p2ma, 
-                               P2M_ALLOC | P2M_UNSHARE, NULL);
+    page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), p2mt, NULL,
+                                  P2M_ALLOC | P2M_UNSHARE);
     if ( p2m_is_paging(*p2mt) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
-        __put_gfn(p2m, gfn_x(gfn));
+        if ( page )
+            put_page(page);
         p2m_mem_paging_populate(p2m->domain, gfn_x(gfn));
         *rc = _PAGE_PAGED;
         return NULL;
     }
     if ( p2m_is_shared(*p2mt) )
     {
-        __put_gfn(p2m, gfn_x(gfn));
+        if ( page )
+            put_page(page);
         *rc = _PAGE_SHARED;
         return NULL;
     }
-    if ( !p2m_is_ram(*p2mt) ) 
+    if ( !page )
     {
-        __put_gfn(p2m, gfn_x(gfn));
         *rc |= _PAGE_PRESENT;
         return NULL;
     }
+    *mfn = _mfn(page_to_mfn(page));
     ASSERT(mfn_valid(mfn_x(*mfn)));
-    
-    /* Get an extra ref to the page to ensure liveness of the map.
-     * Then we can safely put gfn */
-    page_get_owner_and_reference(mfn_to_page(mfn_x(*mfn)));
+
     map = map_domain_page(mfn_x(*mfn));
-    __put_gfn(p2m, gfn_x(gfn));
     return map;
 }
 
diff -r 107285938c50 xen/arch/x86/mm/hap/guest_walk.c
--- a/xen/arch/x86/mm/hap/guest_walk.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/hap/guest_walk.c	Fri Apr 27 10:23:28 2012 +0100
@@ -54,34 +54,36 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
     mfn_t top_mfn;
     void *top_map;
     p2m_type_t p2mt;
-    p2m_access_t p2ma;
     walk_t gw;
     unsigned long top_gfn;
+    struct page_info *top_page;
 
     /* Get the top-level table's MFN */
     top_gfn = cr3 >> PAGE_SHIFT;
-    top_mfn = get_gfn_type_access(p2m, top_gfn, &p2mt, &p2ma, 
-                                  P2M_ALLOC | P2M_UNSHARE, NULL);
+    top_page = get_page_from_gfn_p2m(p2m->domain, p2m, top_gfn,
+                                     &p2mt, NULL, P2M_ALLOC | P2M_UNSHARE);
     if ( p2m_is_paging(p2mt) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
         pfec[0] = PFEC_page_paged;
-        __put_gfn(p2m, top_gfn);
+        if ( top_page )
+            put_page(top_page);
         p2m_mem_paging_populate(p2m->domain, cr3 >> PAGE_SHIFT);
         return INVALID_GFN;
     }
     if ( p2m_is_shared(p2mt) )
     {
         pfec[0] = PFEC_page_shared;
-        __put_gfn(p2m, top_gfn);
+        if ( top_page )
+            put_page(top_page);
         return INVALID_GFN;
     }
-    if ( !p2m_is_ram(p2mt) )
+    if ( !top_page )
     {
         pfec[0] &= ~PFEC_page_present;
-        __put_gfn(p2m, top_gfn);
         return INVALID_GFN;
     }
+    top_mfn = _mfn(page_to_mfn(top_page));
 
     /* Map the top-level table and call the tree-walker */
     ASSERT(mfn_valid(mfn_x(top_mfn)));
@@ -91,31 +93,30 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
 #endif
     missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
     unmap_domain_page(top_map);
-    __put_gfn(p2m, top_gfn);
+    put_page(top_page);
 
     /* Interpret the answer */
     if ( missing == 0 )
     {
         gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
-        (void)get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma,
-                                  P2M_ALLOC | P2M_UNSHARE, NULL); 
+        struct page_info *page;
+        page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
+                                     NULL, P2M_ALLOC | P2M_UNSHARE);
+        if ( page )
+            put_page(page);
         if ( p2m_is_paging(p2mt) )
         {
             ASSERT(!p2m_is_nestedp2m(p2m));
             pfec[0] = PFEC_page_paged;
-            __put_gfn(p2m, gfn_x(gfn));
             p2m_mem_paging_populate(p2m->domain, gfn_x(gfn));
             return INVALID_GFN;
         }
         if ( p2m_is_shared(p2mt) )
         {
             pfec[0] = PFEC_page_shared;
-            __put_gfn(p2m, gfn_x(gfn));
             return INVALID_GFN;
         }
 
-        __put_gfn(p2m, gfn_x(gfn));
-
         if ( page_order )
             *page_order = guest_walk_to_page_order(&gw);
 
diff -r 107285938c50 xen/arch/x86/mm/mm-locks.h
--- a/xen/arch/x86/mm/mm-locks.h	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/mm-locks.h	Fri Apr 27 10:23:28 2012 +0100
@@ -166,13 +166,39 @@ declare_mm_lock(nestedp2m)
  * and later mutate it.
  */
 
-declare_mm_lock(p2m)
-#define p2m_lock(p)           mm_lock_recursive(p2m, &(p)->lock)
-#define gfn_lock(p,g,o)       mm_lock_recursive(p2m, &(p)->lock)
-#define p2m_unlock(p)         mm_unlock(&(p)->lock)
-#define gfn_unlock(p,g,o)     mm_unlock(&(p)->lock)
-#define p2m_locked_by_me(p)   mm_locked_by_me(&(p)->lock)
-#define gfn_locked_by_me(p,g) mm_locked_by_me(&(p)->lock)
+/* The P2M lock has become an rwlock, purely so we can implement
+ * get_page_from_gfn.  The mess below is a ghastly hack to make a
+ * recursive rwlock.  If it works I'll come back and fix up the
+ * order-constraints magic. */
+
+static inline void p2m_lock(struct p2m_domain *p)
+{
+    if ( p->wcpu != current->processor )
+    {
+        write_lock(&p->lock);
+        p->wcpu = current->processor;
+        ASSERT(p->wcount == 0);
+    }
+    p->wcount++;
+}
+
+static inline void p2m_unlock(struct p2m_domain *p)
+{
+    ASSERT(p->wcpu == current->processor);
+    if (--(p->wcount) == 0)
+    {
+        p->wcpu = -1;
+        write_unlock(&p->lock);
+    }
+}
+
+#define gfn_lock(p,g,o)       p2m_lock(p)
+#define gfn_unlock(p,g,o)     p2m_unlock(p)
+#define p2m_read_lock(p)      read_lock(&(p)->lock)
+#define p2m_read_unlock(p)    read_unlock(&(p)->lock)
+#define p2m_locked_by_me(p)   ((p)->wcpu == current->processor)
+#define gfn_locked_by_me(p,g) p2m_locked_by_me(p)
+
 
 /* Sharing per page lock
  *
@@ -203,8 +229,8 @@ declare_mm_order_constraint(per_page_sha
  * counts, page lists, sweep parameters. */
 
 declare_mm_lock(pod)
-#define pod_lock(p)           mm_lock(pod, &(p)->pod.lock)
-#define pod_unlock(p)         mm_unlock(&(p)->pod.lock)
+#define pod_lock(p) do { p2m_lock(p); mm_lock(pod, &(p)->pod.lock); } while (0)
+#define pod_unlock(p) do { mm_unlock(&(p)->pod.lock); p2m_unlock(p);} while (0)
 #define pod_locked_by_me(p)   mm_locked_by_me(&(p)->pod.lock)
 
 /* Page alloc lock (per-domain)
diff -r 107285938c50 xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/mm/p2m.c	Fri Apr 27 10:23:28 2012 +0100
@@ -71,7 +71,9 @@ boolean_param("hap_2mb", opt_hap_2mb);
 /* Init the datastructures for later use by the p2m code */
 static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
 {
-    mm_lock_init(&p2m->lock);
+    rwlock_init(&p2m->lock);
+    p2m->wcount = 0;
+    p2m->wcpu = -1;
     mm_lock_init(&p2m->pod.lock);
     INIT_LIST_HEAD(&p2m->np2m_list);
     INIT_PAGE_LIST_HEAD(&p2m->pages);
@@ -207,6 +209,59 @@ void __put_gfn(struct p2m_domain *p2m, u
     gfn_unlock(p2m, gfn, 0);
 }
 
+/* Atomically look up a GFN and take a reference count on the backing page. */
+struct page_info *get_page_from_gfn_p2m(
+    struct domain *d, struct p2m_domain *p2m, unsigned long gfn,
+    p2m_type_t *t, p2m_access_t *a, p2m_query_t q)
+{
+    struct page_info *page = NULL;
+    p2m_access_t _a;
+    p2m_type_t _t;
+    mfn_t mfn;
+
+    /* Allow t or a to be NULL */
+    t = t ?: &_t;
+    a = a ?: &_a;
+
+    if ( likely(!p2m_locked_by_me(p2m)) )
+    {
+        /* Fast path: look up and get out */
+        p2m_read_lock(p2m);
+        mfn = __get_gfn_type_access(p2m, gfn, t, a, 0, NULL, 0);
+        if ( (p2m_is_ram(*t) || p2m_is_grant(*t))
+             && mfn_valid(mfn)
+             && !((q & P2M_UNSHARE) && p2m_is_shared(*t)) )
+        {
+            page = mfn_to_page(mfn);
+            if ( !get_page(page, d)
+                 /* Page could be shared */
+                 && !get_page(page, dom_cow) )
+                page = NULL;
+        }
+        p2m_read_unlock(p2m);
+
+        if ( page )
+            return page;
+
+        /* Error path: not a suitable GFN at all */
+        if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_magic(*t) )
+            return NULL;
+    }
+
+    /* Slow path: take the write lock and do fixups */
+    mfn = get_gfn_type_access(p2m, gfn, t, a, q, NULL);
+    if ( p2m_is_ram(*t) && mfn_valid(mfn) )
+    {
+        page = mfn_to_page(mfn);
+        if ( !get_page(page, d) )
+            page = NULL;
+    }
+    put_gfn(d, gfn);
+
+    return page;
+}
+
+
 int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, 
                   unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma)
 {
diff -r 107285938c50 xen/arch/x86/physdev.c
--- a/xen/arch/x86/physdev.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/physdev.c	Fri Apr 27 10:23:28 2012 +0100
@@ -306,26 +306,27 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
     case PHYSDEVOP_pirq_eoi_gmfn_v1: {
         struct physdev_pirq_eoi_gmfn info;
         unsigned long mfn;
+        struct page_info *page;
 
         ret = -EFAULT;
         if ( copy_from_guest(&info, arg, 1) != 0 )
             break;
 
         ret = -EINVAL;
-        mfn = get_gfn_untyped(current->domain, info.gmfn);
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), v->domain,
-                                PGT_writable_page) )
+        page = get_page_from_gfn(current->domain, info.gmfn, NULL, P2M_ALLOC);
+        if ( !page )
+            break;
+        if ( !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(current->domain, info.gmfn);
+            put_page(page);
             break;
         }
+        mfn = page_to_mfn(page);
 
         if ( cmpxchg(&v->domain->arch.pv_domain.pirq_eoi_map_mfn,
                      0, mfn) != 0 )
         {
             put_page_and_type(mfn_to_page(mfn));
-            put_gfn(current->domain, info.gmfn);
             ret = -EBUSY;
             break;
         }
@@ -335,14 +336,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         {
             v->domain->arch.pv_domain.pirq_eoi_map_mfn = 0;
             put_page_and_type(mfn_to_page(mfn));
-            put_gfn(current->domain, info.gmfn);
             ret = -ENOSPC;
             break;
         }
         if ( cmd == PHYSDEVOP_pirq_eoi_gmfn_v1 )
             v->domain->arch.pv_domain.auto_unmask = 1;
 
-        put_gfn(current->domain, info.gmfn);
         ret = 0;
         break;
     }
diff -r 107285938c50 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/arch/x86/traps.c	Fri Apr 27 10:23:28 2012 +0100
@@ -662,9 +662,9 @@ int wrmsr_hypervisor_regs(uint32_t idx, 
     case 0:
     {
         void *hypercall_page;
-        unsigned long mfn;
         unsigned long gmfn = val >> 12;
         unsigned int idx  = val & 0xfff;
+        struct page_info *page;
 
         if ( idx > 0 )
         {
@@ -674,24 +674,23 @@ int wrmsr_hypervisor_regs(uint32_t idx, 
             return 0;
         }
 
-        mfn = get_gfn_untyped(d, gmfn);
-
-        if ( !mfn_valid(mfn) ||
-             !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) )
+        page = get_page_from_gfn(d, gmfn, NULL, P2M_ALLOC);
+
+        if ( !page || !get_page_type(page, PGT_writable_page) )
         {
-            put_gfn(d, gmfn);
+            if ( page )
+                put_page(page);
             gdprintk(XENLOG_WARNING,
                      "Bad GMFN %lx (MFN %lx) to MSR %08x\n",
-                     gmfn, mfn, base + idx);
+                     gmfn, page ? page_to_mfn(page) : INVALID_MFN, base + idx);
             return 0;
         }
 
-        hypercall_page = map_domain_page(mfn);
+        hypercall_page = __map_domain_page(page);
         hypercall_page_initialise(d, hypercall_page);
         unmap_domain_page(hypercall_page);
 
-        put_page_and_type(mfn_to_page(mfn));
-        put_gfn(d, gmfn);
+        put_page_and_type(page);
         break;
     }
 
@@ -2374,7 +2373,8 @@ static int emulate_privileged_op(struct 
             break;
 
         case 3: {/* Write CR3 */
-            unsigned long mfn, gfn;
+            unsigned long gfn;
+            struct page_info *page;
             domain_lock(v->domain);
             if ( !is_pv_32on64_vcpu(v) )
             {
@@ -2384,9 +2384,10 @@ static int emulate_privileged_op(struct 
                 gfn = compat_cr3_to_pfn(*reg);
 #endif
             }
-            mfn = get_gfn_untyped(v->domain, gfn);
-            rc = new_guest_cr3(mfn);
-            put_gfn(v->domain, gfn);
+            page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+            rc = page ? new_guest_cr3(page_to_mfn(page)) : 0;
+            if ( page )
+                put_page(page);
             domain_unlock(v->domain);
             if ( rc == 0 ) /* not okay */
                 goto fail;
diff -r 107285938c50 xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/include/asm-x86/p2m.h	Fri Apr 27 10:23:28 2012 +0100
@@ -192,7 +192,10 @@ typedef unsigned int p2m_query_t;
 /* Per-p2m-table state */
 struct p2m_domain {
     /* Lock that protects updates to the p2m */
-    mm_lock_t          lock;
+    rwlock_t           lock;
+    int                wcpu;
+    int                wcount;
+    const char        *wfunc;
 
     /* Shadow translated domain: p2m mapping */
     pagetable_t        phys_table;
@@ -377,6 +380,33 @@ static inline mfn_t get_gfn_query_unlock
     return __get_gfn_type_access(p2m_get_hostp2m(d), gfn, t, &a, 0, NULL, 0);
 }
 
+/* Atomically look up a GFN and take a reference count on the backing page.
+ * This makes sure the page doesn't get freed (or shared) underfoot,
+ * and should be used by any path that intends to write to the backing page.
+ * Returns NULL if the page is not backed by RAM.
+ * The caller is responsible for calling put_page() afterwards. */
+struct page_info *get_page_from_gfn_p2m(struct domain *d,
+                                        struct p2m_domain *p2m,
+                                        unsigned long gfn,
+                                        p2m_type_t *t, p2m_access_t *a,
+                                        p2m_query_t q);
+
+static inline struct page_info *get_page_from_gfn(
+    struct domain *d, unsigned long gfn, p2m_type_t *t, p2m_query_t q)
+{
+    struct page_info *page;
+
+    if ( paging_mode_translate(d) )
+        return get_page_from_gfn_p2m(d, p2m_get_hostp2m(d), gfn, t, NULL, q);
+
+    /* Non-translated guests see 1-1 RAM mappings everywhere */
+    if (t)
+        *t = p2m_ram_rw;
+    page = __mfn_to_page(gfn);
+    return get_page(page, d) ? page : NULL;
+}
+
+
 /* General conversion function from mfn to gfn */
 static inline unsigned long mfn_to_gfn(struct domain *d, mfn_t mfn)
 {
diff -r 107285938c50 xen/xsm/flask/hooks.c
--- a/xen/xsm/flask/hooks.c	Thu Apr 26 10:03:08 2012 +0100
+++ b/xen/xsm/flask/hooks.c	Fri Apr 27 10:23:28 2012 +0100
@@ -1318,6 +1318,7 @@ static int flask_mmu_normal_update(struc
     struct domain_security_struct *dsec;
     u32 fsid;
     struct avc_audit_data ad;
+    struct page_info *page;
 
     if (d != t)
         rc = domain_has_perm(d, t, SECCLASS_MMU, MMU__REMOTE_REMAP);
@@ -1333,7 +1334,8 @@ static int flask_mmu_normal_update(struc
         map_perms |= MMU__MAP_WRITE;
 
     AVC_AUDIT_DATA_INIT(&ad, MEMORY);
-    fmfn = get_gfn_untyped(f, l1e_get_pfn(l1e_from_intpte(fpte)));
+    page = get_page_from_gfn(f, l1e_get_pfn(l1e_from_intpte(fpte)), NULL, P2M_ALLOC);
+    fmfn = page ? page_to_mfn(page) : INVALID_MFN;
 
     ad.sdom = d;
     ad.tdom = f;
@@ -1342,7 +1344,8 @@ static int flask_mmu_normal_update(struc
 
     rc = get_mfn_sid(fmfn, &fsid);
 
-    put_gfn(f, fmfn);
+    if ( page )
+        put_page(page);
 
     if ( rc )
         return rc;

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-27  9:26                                             ` Tim Deegan
@ 2012-04-27 14:17                                               ` Andres Lagar-Cavilla
  2012-04-27 21:08                                               ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-27 14:17 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, xen-devel, Keir Fraser

> At 20:02 -0700 on 26 Apr (1335470547), Andres Lagar-Cavilla wrote:
>> > Can you please try the attached patch?  I think you'll need this one
>> > plus the ones that take the locks out of the hpet code.
>>
>> Right off the bat I'm getting a multitude of
>> (XEN) mm.c:3294:d0 Error while clearing mfn 100cbb7
>> And a hung dom0 during initramfs. I'm a little baffled as to why, but
>> it's
>> there (32 bit dom0, XenServer6).
>
> Curses, I knew there'd be one somewhere.  I've been replacing
> get_page_and_type_from_pagenr()s (which return 0 for success) with
> old-school get_page_type()s (which return 1 for success) and not always
> getting the right number of inversions.  That's a horrible horrible
> beartrap of an API, BTW, which had me cursing at the screen, but I had
> enough on my plate yesterday without touching _that_ code too!
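
For anyone else wading through it, the trap is simply the two success
conventions sitting side by side.  Roughly (simplified from the call
sites in the patch; the variable names here are generic):

    /* get_page_and_type_from_pagenr(): 0 means success, so the result
     * has to be inverted to get a boolean. */
    rc = get_page_and_type_from_pagenr(mfn, type, d, 0, 0);
    okay = !rc;

    /* get_page_type(): 1 means success, so no inversion. */
    okay = get_page_type(page, type);
    if ( !okay )
        rc = -EINVAL;

Miss one inversion and the error path silently turns into the success
path.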

Found more bugs. Some predate this patch (xsm not calling put_gfn after
get_gfn_untyped, etc.). Some are new (get_gfn_untyped not being
arch-independent).

I will shortly be sending you a small patch series on top of your monster
patch, purely as an RFC. Feel free to merge it all, pick out the bug fixes, etc.

Andres
>
>> > Andres, this is basically the big-hammer version of your "take a
>> > pagecount" changes, plus the change you made to hvmemul_rep_movs().
>> > If this works I intend to follow it up with a patch to make some of
>> > the
>> > read-modify-write paths avoid taking the lock (by using a
>> > compare-exchange operation so they only take the lock on a write).  If
>> > that succeeds I might drop put_gfn() altogether.
>>
>> You mean cmpxchg the whole p2m entry? I don't think I parse the plan.
>> There are code paths that do get_gfn_query -> p2m_change_type ->
>> put_gfn.
>> But I guess those could lock the p2m up-front if they become the only
>> consumers of put_gfn left.
>
> Well, that's more or less what happens now.  I was thinking of replacing
> any remaining
>
>  (implicit) lock ; read ; think a bit ; maybe write ; unlock
>
> code with the fast-path-friendlier:
>
>  read ; think ; maybe-cmpxchg (and on failure undo or retry)
>
> which avoids taking the write lock altogether if there's no work to do.
> But maybe there aren't many of those left now.  Obviously any path
> which will always write should just take the write-lock first.
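
In other words (sketch only, to check I understand the plan; "entry" and
compute_update() are stand-ins, not real code):

    intpte_t old, new;

    do {
        old = l1e_get_intpte(*entry);      /* read; no write lock taken      */
        new = compute_update(old);         /* think a bit                    */
        if ( new == old )
            break;                         /* nothing to write, nothing to
                                            * serialise: pure fast path      */
    } while ( cmpxchg(&entry->l1, old, new) != old );

i.e. the lock-free variant; the alternative being to take the write lock
only when the cmpxchg is actually needed.
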
>
>> >  - grant-table operations still use the lock, because frankly I
>> >    could not follow the current code, and it's quite late in the
>> evening.
>>
>> It's pretty complex with serious nesting, and ifdef's for arm and 32
>> bit.
>> gfn_to_mfn_private callers will suffer from altering the current
>> meaning,
>> as put_gfn resolves to the right thing for the ifdef'ed arch. The other
>> user is grant_transfer which also relies on the page *not* having an
>> extra
>> ref in steal_page. So it's a prime candidate to be left alone.
>
> Sadly, I think it's not.  The PV backends will be doing lots of grant
> ops, which shouldn't get serialized against all other P2M lookups.
>
>> > I also have a long list of uglinesses in the mm code that I found
>>
>> Uh, ugly stuff, how could that have happened?
>
> I can't imagine. :)  Certainly nothing to do with me thinking "I'll
> clean that up when I get some time."
>
>> I have a few preliminary observations on the patch. Pasting relevant
>> bits
>> here, since the body of the patch seems to have been lost by the email
>> thread:
>>
>> @@ -977,23 +976,25 @@ int arch_set_info_guest(
>> ...
>> +
>> +        if (!paging_mode_refcounts(d)
>> +            && !get_page_and_type(cr3_page, d, PGT_l3_page_table) )
>> replace with && !get_page_type() )
>
> Yep.
>
>> @@ -2404,32 +2373,33 @@ static enum hvm_copy_result __hvm_copy(
>>              gfn = addr >> PAGE_SHIFT;
>>          }
>>
>> -        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
>> +        page = get_page_from_gfn(curr->domain, gfn, &p2mt,
>> P2M_UNSHARE);
>> replace with (flags & HVMCOPY_to_guest) ? P2M_UNSHARE : P2M_ALLOC (and
>> same logic when checking p2m_is_shared). Not truly related to your patch
>> bit since we're at it.
>
> OK, but not in this patch.
>
>> Same, further down
>> -        if ( !p2m_is_ram(p2mt) )
>> +        if ( !page )
>>          {
>> -            put_gfn(curr->domain, gfn);
>> +            if ( page )
>> +                put_page(page);
>> Last two lines are redundant
>
> Yep.
>
>> @@ -4019,35 +3993,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
>>     case HVMOP_modified_memory: a lot of error checking has been
>> removed.
>
> Yes, but it was bogus - there's a race between the actual modification
> and the call, during which anything might have happened.  The best we
> can do is throw log-dirty bits at everything, and the caller can't do
> anything with the error anyway.
>
> When I come to tidy up I'll just add a new mark_gfn_dirty function
> and skip the pointless gfn->mfn->gfn translation on this path.
>
>> arch/x86/mm.c:do_mmu_update -> you blew up all the paging/sharing
>> checking
>> for target gfns of mmu updates of l2/3/4 entries. It seems that this
>> wouldn't work anyways, that's why you killed it?
>
> Yeah - since only L1es can point at foreign mappings it was all just
> noise, and even if there had been real p2m lookups on those paths there
> was no equivalent to the translate-in-place that happens in
> mod_l1_entry so it would have been broken in a much worse way.
>
>> +++ b/xen/arch/x86/mm/hap/guest_walk.c
>> @@ -54,34 +54,37 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
>> ...
>> +    if ( !top_page )
>>      {
>>          pfec[0] &= ~PFEC_page_present;
>> -        __put_gfn(p2m, top_gfn);
>> +        put_page(top_page);
>> top_page is NULL here, remove put_page
>
> Yep.
>
>> get_page_from_gfn_p2m, slow path: no need for p2m_lock/unlock since
>> locking is already done by get_gfn_type_access/__put_gfn
>
> Yeah, but I was writing that with half an eye on killing that lock. :)
> I'll drop them for now.
>
>> (hope those observations made sense without inlining them in the actual
>> patch)
>
> Yes, absolutely - thanks for the review!
>
> If we can get this to work well enough I'll tidy it up into a sensible
> series next week.  In the meantime, an updated version of the
> monster patch is attached.
>
> Cheers,
>
> Tim.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-27  9:26                                             ` Tim Deegan
  2012-04-27 14:17                                               ` Andres Lagar-Cavilla
@ 2012-04-27 21:08                                               ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 45+ messages in thread
From: Andres Lagar-Cavilla @ 2012-04-27 21:08 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, oalf, xen-devel, Keir Fraser, George.Dunlap

> At 20:02 -0700 on 26 Apr (1335470547), Andres Lagar-Cavilla wrote:
>> > Can you please try the attached patch?  I think you'll need this one
>> > plus the ones that take the locks out of the hpet code.
>>
>> Right off the bat I'm getting a multitude of
>> (XEN) mm.c:3294:d0 Error while clearing mfn 100cbb7
>> And a hung dom0 during initramfs. I'm a little baffled as to why, but
>> it's
>> there (32 bit dom0, XenServer6).
>
> Curses, I knew there'd be one somewhere.  I've been replacing
> get_page_and_type_from_pagenr()s (which return 0 for success) with
> old-school get_page_type()s (which return 1 for success) and not always
> getting the right number of inversions.  That's a horrible horrible
> beartrap of an API, BTW, which had me cursing at the screen, but I had
> enough on my plate yesterday without touching _that_ code too!
>

I am now quite pleased with the testing results on our end. I have a
four-patch series to top up your monster patch, which I'll submit shortly.
I encourage everyone interested to test this (obviously a lot of code is
touched), including on AMD, as I've expanded the code to touch SVM too.

>> > Andres, this is basically the big-hammer version of your "take a
>> > pagecount" changes, plus the change you made to hvmemul_rep_movs().
>> > If this works I intend to follow it up with a patch to make some of
>> > the
>> > read-modify-write paths avoid taking the lock (by using a
>> > compare-exchange operation so they only take the lock on a write).  If
>> > that succeeds I might drop put_gfn() altogether.
>>
>> You mean cmpxchg the whole p2m entry? I don't think I parse the plan.
>> There are code paths that do get_gfn_query -> p2m_change_type ->
>> put_gfn.
>> But I guess those could lock the p2m up-front if they become the only
>> consumers of put_gfn left.
>
> Well, that's more or less what happens now.  I was thinking of replacing
> any remaining
>
>  (implicit) lock ; read ; think a bit ; maybe write ; unlock
>
> code with the fast-path-friendlier:
>
>  read ; think ; maybe-cmpxchg (and on failure undo or retry)
>
> which avoids taking the write lock altogether if there's no work to do.
> But maybe there aren't many of those left now.  Obviously any path
> which will always write should just take the write-lock first.

After my four patches there aren't really any paths like the above left
(disclaimer: I haven't audited exhaustively). I believe one or two
iterative paths (like
HVMOP_set_mem_type) could be optimized to take the p2m lock up front,
instead of many get_gfn's.
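
By "up front" I mean roughly this shape (sketch only; handle_one_gfn() is
a stand-in for the per-gfn work, and start/nr/t are just illustrative
locals):

    p2m_lock(p2m_get_hostp2m(d));
    for ( pfn = start; pfn < start + nr; pfn++ )
    {
        mfn = get_gfn_query_unlocked(d, pfn, &t);
        handle_one_gfn(d, pfn, mfn, t);
    }
    p2m_unlock(p2m_get_hostp2m(d));

instead of a get_gfn_query()/put_gfn() pair per iteration.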

The nice thing about get_gfn/put_gfn is that it will allow for a seamless
(har har) transition to a fine-grained p2m. But then maybe we won't ever
need that with a p2m rwlock.
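
For reference, the two idioms being traded off look roughly like this
(sketch; do_something() is a stand-in for the caller's work):

    /* get_gfn/put_gfn: the gfn lock is held across the whole critical
     * section, which is what a fine-grained p2m lock would slot into. */
    mfn = mfn_x(get_gfn(d, gfn, &t));
    if ( p2m_is_ram(t) && mfn_valid(mfn) )
        do_something(mfn);
    put_gfn(d, gfn);

    /* get_page_from_gfn: take a page reference instead and drop the lock
     * straight away, so readers don't serialise on the p2m. */
    page = get_page_from_gfn(d, gfn, &t, P2M_ALLOC);
    if ( page )
    {
        do_something(page_to_mfn(page));
        put_page(page);
    }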

>
>> >  - grant-table operations still use the lock, because frankly I
>> >    could not follow the current code, and it's quite late in the
>> evening.
>>
>> It's pretty complex with serious nesting, and ifdef's for arm and 32
>> bit.
>> gfn_to_mfn_private callers will suffer from altering the current
>> meaning,
>> as put_gfn resolves to the right thing for the ifdef'ed arch. The other
>> user is grant_transfer which also relies on the page *not* having an
>> extra
>> ref in steal_page. So it's a prime candidate to be left alone.
>
> Sadly, I think it's not.  The PV backends will be doing lots of grant
> ops, which shouldn't get serialized against all other P2M lookups.
>

Those are addressed in my patch series now, which should cause waves of panic.

Andres

>> > I also have a long list of uglinesses in the mm code that I found
>>
>> Uh, ugly stuff, how could that have happened?
>
> I can't imagine. :)  Certainly nothing to do with me thinking "I'll
> clean that up when I get some time."
>
>> I have a few preliminary observations on the patch. Pasting relevant
>> bits
>> here, since the body of the patch seems to have been lost by the email
>> thread:
>>
>> @@ -977,23 +976,25 @@ int arch_set_info_guest(
>> ...
>> +
>> +        if (!paging_mode_refcounts(d)
>> +            && !get_page_and_type(cr3_page, d, PGT_l3_page_table) )
>> replace with && !get_page_type() )
>
> Yep.
>
>> @@ -2404,32 +2373,33 @@ static enum hvm_copy_result __hvm_copy(
>>              gfn = addr >> PAGE_SHIFT;
>>          }
>>
>> -        mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
>> +        page = get_page_from_gfn(curr->domain, gfn, &p2mt,
>> P2M_UNSHARE);
>> replace with (flags & HVMCOPY_to_guest) ? P2M_UNSHARE : P2M_ALLOC (and
>> same logic when checking p2m_is_shared). Not truly related to your patch
>> bit since we're at it.
>
> OK, but not in this patch.
>
>> Same, further down
>> -        if ( !p2m_is_ram(p2mt) )
>> +        if ( !page )
>>          {
>> -            put_gfn(curr->domain, gfn);
>> +            if ( page )
>> +                put_page(page);
>> Last two lines are redundant
>
> Yep.
>
>> @@ -4019,35 +3993,16 @@ long do_hvm_op(unsigned long op, XEN_GUE
>>     case HVMOP_modified_memory: a lot of error checking has been
>> removed.
>
> Yes, but it was bogus - there's a race between the actual modification
> and the call, during which anything might have happened.  The best we
> can do is throw log-dirty bits at everything, and the caller can't do
> anything with the error anyway.
>
> When I come to tidy up I'll just add a new mark_gfn_dirty function
> and skip the pointless gfn->mfn->gfn translation on this path.
>
>> arch/x86/mm.c:do_mmu_update -> you blew up all the paging/sharing
>> checking
>> for target gfns of mmu updates of l2/3/4 entries. It seems that this
>> wouldn't work anyways, that's why you killed it?
>
> Yeah - since only L1es can point at foreign mappings it was all just
> noise, and even if there had been real p2m lookups on those paths there
> was no equivalent to the translate-in-place that happens in
> mod_l1_entry so it would have been broken in a much worse way.
>
>> +++ b/xen/arch/x86/mm/hap/guest_walk.c
>> @@ -54,34 +54,37 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
>> ...
>> +    if ( !top_page )
>>      {
>>          pfec[0] &= ~PFEC_page_present;
>> -        __put_gfn(p2m, top_gfn);
>> +        put_page(top_page);
>> top_page is NULL here, remove put_page
>
> Yep.
>
>> get_page_from_gfn_p2m, slow path: no need for p2m_lock/unlock since
>> locking is already done by get_gfn_type_access/__put_gfn
>
> Yeah, but I was writing that with half an eye on killing that lock. :)
> I'll drop them for now.
>
>> (hope those observations made sense without inlining them in the actual
>> patch)
>
> Yes, absolutely - thanks for the review!
>
> If we can get this to work well enough I'll tidy it up into a sensible
> series next week.  In the meantime, an updated version of the
> monster patch is attached.
>
> Cheers,
>
> Tim.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-04-26 21:25                                         ` Tim Deegan
  2012-04-27  0:46                                           ` Zhang, Yang Z
  2012-04-27  3:02                                           ` Andres Lagar-Cavilla
@ 2012-05-16 11:36                                           ` Zhang, Yang Z
  2012-05-16 12:36                                             ` Tim Deegan
  2 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yang Z @ 2012-05-16 11:36 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, andres

Hi tim,

Has the attached patch been applied to upstream xen? I tried the latest xen and still saw the high cpu utilization.

best regards
yang

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Friday, April 27, 2012 5:26 AM
> To: Zhang, Yang Z
> Cc: andres@lagarcavilla.org; Keir Fraser; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] lock in vhpet
> 
> At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> > > > But actually, the first cs that introduced this issue is 24770. When win8
> > > > is booting and hpet is enabled, it will use hpet as the time source,
> > > > and there are lots of hpet accesses and EPT violations. In the EPT violation
> > > > handler, it calls get_gfn_type_access to get the mfn. The cs 24770
> > > > introduces the gfn_lock for p2m lookups, and then the issue happens.
> > > > After I removed the gfn_lock, the issue went away. But in the latest xen,
> > > > even if I remove this lock, it still shows high cpu utilization.
> > >
> > > It would seem then that even the briefest lock-protected critical section
> would
> > > cause this? In the mmio case, the p2m lock taken in the hap fault handler is
> > > held during the actual lookup, and for a couple of branch instructions
> > > afterwards.
> > >
> > > In latest Xen, with lock removed for get_gfn, on which lock is time spent?
> > Still the p2m_lock.
> 
> Can you please try the attached patch?  I think you'll need this one
> plus the ones that take the locks out of the hpet code.
> 
> This patch makes the p2m lock into an rwlock and adjusts a number of the
> paths that don't update the p2m so they only take the read lock.  It's a
> bit rough but I can boot a 16-way win7 guest with it.
> 
> > N.B. Since rwlocks don't show up in the existing lock profiling, please
> don't try to use the lock-profiling numbers to see if it's helping!
> 
> Andres, this is basically the big-hammer version of your "take a
> pagecount" changes, plus the change you made to hvmemul_rep_movs().
> If this works I intend to follow it up with a patch to make some of the
> read-modify-write paths avoid taking the lock (by using a
> compare-exchange operation so they only take the lock on a write).  If
> that succeeds I might drop put_gfn() altogether.
> 
> But first it will need a lot of tidying up.  Noticeably missing:
>  - SVM code equivalents to the vmx.c changes
>  - grant-table operations still use the lock, because frankly I
>    could not follow the current code, and it's quite late in the evening.
> I also have a long list of uglinesses in the mm code that I found while
> writing this lot.
> 
> Keir, I have no objection to later replacing this with something better
> than an rwlock. :)  Or with making a NUMA-friendly rwlock
> implementation, since I really expect this to be heavily read-mostly
> when paging/sharing/pod are not enabled.
> 
> Cheers,
> 
> Tim.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-05-16 11:36                                           ` Zhang, Yang Z
@ 2012-05-16 12:36                                             ` Tim Deegan
  2012-05-17 10:57                                               ` Tim Deegan
  0 siblings, 1 reply; 45+ messages in thread
From: Tim Deegan @ 2012-05-16 12:36 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser, andres

Hi,

At 11:36 +0000 on 16 May (1337168174), Zhang, Yang Z wrote:
> Hi tim,
> 
> Has the attached patch been applied to upstream xen? I tried the latest xen
> and still saw the high cpu utilization.

The patch is not yet applied.  It's been cleaned up into a patch series
that I posted last Thursday, and will probably be applied later this
week. 

Cheers,

Tim.

> best regards
> yang
> 
> > -----Original Message-----
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: Friday, April 27, 2012 5:26 AM
> > To: Zhang, Yang Z
> > Cc: andres@lagarcavilla.org; Keir Fraser; xen-devel@lists.xensource.com
> > Subject: Re: [Xen-devel] lock in vhpet
> > 
> > At 02:36 +0000 on 25 Apr (1335321409), Zhang, Yang Z wrote:
> > > > > But actually, the first cs that introduced this issue is 24770. When win8
> > > > > is booting and hpet is enabled, it uses hpet as the time source, and there
> > > > > are lots of hpet accesses and EPT violations. In the EPT violation
> > > > > handler, it calls get_gfn_type_access to get the mfn. The cs 24770
> > > > > introduces the gfn_lock for p2m lookups, and then the issue happens.
> > > > > After I removed the gfn_lock, the issue went away. But in the latest xen,
> > > > > even if I remove this lock, it still shows high cpu utilization.
> > > >
> > > > It would seem then that even the briefest lock-protected critical section
> > would
> > > > cause this? In the mmio case, the p2m lock taken in the hap fault handler is
> > > > held during the actual lookup, and for a couple of branch instructions
> > > > afterwards.
> > > >
> > > > In latest Xen, with lock removed for get_gfn, on which lock is time spent?
> > > Still the p2m_lock.
> > 
> > Can you please try the attached patch?  I think you'll need this one
> > plus the ones that take the locks out of the hpet code.
> > 
> > This patch makes the p2m lock into an rwlock and adjusts a number of the
> > paths that don't update the p2m so they only take the read lock.  It's a
> > bit rough but I can boot a 16-way win7 guest with it.
> > 
> > N.B. Since rwlocks don't show up in the existing lock profiling, please
> > don't try to use the lock-profiling numbers to see if it's helping!
> > 
> > Andres, this is basically the big-hammer version of your "take a
> > pagecount" changes, plus the change you made to hvmemul_rep_movs().
> > If this works I intend to follow it up with a patch to make some of the
> > read-modify-write paths avoid taking the lock (by using a
> > compare-exchange operation so they only take the lock on a write).  If
> > that succeeds I might drop put_gfn() altogether.
> > 
> > But first it will need a lot of tidying up.  Noticeably missing:
> >  - SVM code equivalents to the vmx.c changes
> >  - grant-table operations still use the lock, because frankly I
> >    could not follow the current code, and it's quite late in the evening.
> > I also have a long list of uglinesses in the mm code that I found while
> > writing this lot.
> > 
> > Keir, I have no objection to later replacing this with something better
> > than an rwlock. :)  Or with making a NUMA-friendly rwlock
> > implementation, since I really expect this to be heavily read-mostly
> > when paging/sharing/pod are not enabled.
> > 
> > Cheers,
> > 
> > Tim.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-05-16 12:36                                             ` Tim Deegan
@ 2012-05-17 10:57                                               ` Tim Deegan
  2012-05-28  6:54                                                 ` Zhang, Yang Z
  0 siblings, 1 reply; 45+ messages in thread
From: Tim Deegan @ 2012-05-17 10:57 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: xen-devel, Keir Fraser, andres

Hi,

At 13:36 +0100 on 16 May (1337175361), Tim Deegan wrote:
> At 11:36 +0000 on 16 May (1337168174), Zhang, Yang Z wrote:
> > Has the attached patch been applied to upstream xen? I tried the latest xen
> > and still saw the high cpu utilization.
> 
> The patch is not yet applied.  It's been cleaned up into a patch series
> that I posted last Thursday, and will probably be applied later this
> week. 

It's now been applied to the staging tree, as csets 25350--25360.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: lock in vhpet
  2012-05-17 10:57                                               ` Tim Deegan
@ 2012-05-28  6:54                                                 ` Zhang, Yang Z
  0 siblings, 0 replies; 45+ messages in thread
From: Zhang, Yang Z @ 2012-05-28  6:54 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, andres

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, May 17, 2012 6:58 PM
> To: Zhang, Yang Z
> Cc: xen-devel@lists.xensource.com; Keir Fraser; andres@lagarcavilla.org
> Subject: Re: [Xen-devel] lock in vhpet
> 
> Hi,
> 
> At 13:36 +0100 on 16 May (1337175361), Tim Deegan wrote:
> > At 11:36 +0000 on 16 May (1337168174), Zhang, Yang Z wrote:
> > > Has the attached patch been applied to upstream xen? I tried the latest xen
> > > and still saw the high cpu utilization.
> >
> > The patch is not yet applied.  It's been cleaned up into a patch series
> > that I posted last Thursday, and will probably be applied later this
> > week.
> 
> It's now been applied to the staging tree, as csets 25350--25360.

It's great work! After one week of testing, we didn't find any regressions with those patches.

best regards
yang

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-05-28  6:54 UTC | newest]

Thread overview: 45+ messages
2012-04-17  3:26 lock in vhpet Zhang, Yang Z
2012-04-17  7:27 ` Keir Fraser
2012-04-18  0:52   ` Zhang, Yang Z
2012-04-18  7:13     ` Keir Fraser
2012-04-18  7:55       ` Zhang, Yang Z
2012-04-18  8:29         ` Keir Fraser
2012-04-18  9:14           ` Keir Fraser
2012-04-18  9:30             ` Keir Fraser
2012-04-19  5:19               ` Zhang, Yang Z
2012-04-19  8:27                 ` Tim Deegan
2012-04-19  8:47                   ` Keir Fraser
2012-04-23  7:36                     ` Zhang, Yang Z
2012-04-23  7:43                       ` Jan Beulich
2012-04-23  8:15                         ` Zhang, Yang Z
2012-04-23  8:22                           ` Keir Fraser
2012-04-23  9:14                       ` Tim Deegan
2012-04-23 15:26                         ` Andres Lagar-Cavilla
2012-04-24  9:15                           ` Tim Deegan
2012-04-24 13:28                             ` Andres Lagar-Cavilla
2012-04-23 17:18                         ` Andres Lagar-Cavilla
2012-04-24  8:58                           ` Zhang, Yang Z
2012-04-24  9:16                             ` Tim Deegan
2012-04-25  0:27                               ` Zhang, Yang Z
2012-04-25  1:40                                 ` Andres Lagar-Cavilla
2012-04-25  1:48                                   ` Zhang, Yang Z
2012-04-25  2:31                                     ` Andres Lagar-Cavilla
2012-04-25  2:36                                       ` Zhang, Yang Z
2012-04-25  2:42                                         ` Andres Lagar-Cavilla
2012-04-25  3:12                                           ` Zhang, Yang Z
2012-04-25  3:34                                             ` Andres Lagar-Cavilla
2012-04-25  5:18                                               ` Zhang, Yang Z
2012-04-25  8:07                                               ` Jan Beulich
2012-04-26 21:25                                         ` Tim Deegan
2012-04-27  0:46                                           ` Zhang, Yang Z
2012-04-27  0:51                                             ` Andres Lagar-Cavilla
2012-04-27  1:24                                               ` Zhang, Yang Z
2012-04-27  8:36                                               ` Zhang, Yang Z
2012-04-27  3:02                                           ` Andres Lagar-Cavilla
2012-04-27  9:26                                             ` Tim Deegan
2012-04-27 14:17                                               ` Andres Lagar-Cavilla
2012-04-27 21:08                                               ` Andres Lagar-Cavilla
2012-05-16 11:36                                           ` Zhang, Yang Z
2012-05-16 12:36                                             ` Tim Deegan
2012-05-17 10:57                                               ` Tim Deegan
2012-05-28  6:54                                                 ` Zhang, Yang Z
