* [PATCH v2] x86/PoD: defer nested P2M flushes
@ 2021-10-19 12:52 Jan Beulich
  2021-10-19 13:53 ` Roger Pau Monné
  2021-10-20  7:27 ` Tian, Kevin
  0 siblings, 2 replies; 5+ messages in thread
From: Jan Beulich @ 2021-10-19 12:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Roger Pau Monné, Kevin Tian

With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
order violation when the PoD lock is held around it. Hence such flushing
needs to be deferred. Steal the approach from p2m_change_type_range().
(Note that strictly speaking the change at the out_of_memory label is
not needed, as the domain gets crashed there anyway. The change is being
made nevertheless to avoid setting up a trap for someone meaning to
deal with that case better than by domain_crash().)
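
For illustration only (a sketch of the nesting, not code from the tree; it
assumes the nested p2m lock is ordered ahead of the PoD lock in mm-locks.h),
the chain that trips the ordering check looks roughly like

    /* NPT/shadow case, PoD lock already held: */
    pod_lock(p2m);
        p2m_set_entry(...);
            p2m_pt_set_entry(...);
                write_p2m_entry(...);
                    p2m_flush_nestedp2m(d);   /* takes the lower-ordered
                                                 nested p2m lock -> ASSERT */
    pod_unlock(p2m);

which is why the flush has to move past pod_unlock().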

Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
p2m_flush_nestedp2m() invocation conditional. Note that this then also
alters behavior of p2m_change_type_range() on EPT, deferring the nested
flushes there as well. I think this should have been that way from the
introduction of the flag.

Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
    out_of_memory label. Extend description to cover these.

--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
     {
         if ( p2m_is_nestedp2m(p2m) )
             ept = &p2m_get_hostp2m(d)->ept;
-        else
+        else if ( !p2m->defer_nested_flush )
             p2m_flush_nestedp2m(d);
     }
 
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -24,6 +24,7 @@
 #include <xen/mm.h>
 #include <xen/sched.h>
 #include <xen/trace.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/page.h>
 #include <asm/paging.h>
 #include <asm/p2m.h>
@@ -494,6 +495,13 @@ p2m_pod_offline_or_broken_replace(struct
 static int
 p2m_pod_zero_check_superpage(struct p2m_domain *p2m, gfn_t gfn);
 
+static void pod_unlock_and_flush(struct p2m_domain *p2m)
+{
+    pod_unlock(p2m);
+    p2m->defer_nested_flush = false;
+    if ( nestedhvm_enabled(p2m->domain) )
+        p2m_flush_nestedp2m(p2m->domain);
+}
 
 /*
  * This function is needed for two reasons:
@@ -514,6 +522,7 @@ p2m_pod_decrease_reservation(struct doma
 
     gfn_lock(p2m, gfn, order);
     pod_lock(p2m);
+    p2m->defer_nested_flush = true;
 
     /*
      * If we don't have any outstanding PoD entries, let things take their
@@ -665,7 +674,7 @@ out_entry_check:
     }
 
 out_unlock:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     gfn_unlock(p2m, gfn, order);
     return ret;
 }
@@ -1144,8 +1153,10 @@ p2m_pod_demand_populate(struct p2m_domai
      * won't start until we're done.
      */
     if ( unlikely(d->is_dying) )
-        goto out_fail;
-
+    {
+        pod_unlock(p2m);
+        return false;
+    }
 
     /*
      * Because PoD does not have cache list for 1GB pages, it has to remap
@@ -1167,6 +1178,8 @@ p2m_pod_demand_populate(struct p2m_domai
                               p2m_populate_on_demand, p2m->default_access);
     }
 
+    p2m->defer_nested_flush = true;
+
     /* Only reclaim if we're in actual need of more cache. */
     if ( p2m->pod.entry_count > p2m->pod.count )
         pod_eager_reclaim(p2m);
@@ -1229,22 +1242,25 @@ p2m_pod_demand_populate(struct p2m_domai
         __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
     }
 
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     return true;
+
 out_of_memory:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
 
     printk("%s: Dom%d out of PoD memory! (tot=%"PRIu32" ents=%ld dom%d)\n",
            __func__, d->domain_id, domain_tot_pages(d),
            p2m->pod.entry_count, current->domain->domain_id);
     domain_crash(d);
     return false;
+
 out_fail:
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
     return false;
+
 remap_and_retry:
     BUG_ON(order != PAGE_ORDER_2M);
-    pod_unlock(p2m);
+    pod_unlock_and_flush(p2m);
 
     /*
      * Remap this 2-meg region in singleton chunks. See the comment on the




* Re: [PATCH v2] x86/PoD: defer nested P2M flushes
  2021-10-19 12:52 [PATCH v2] x86/PoD: defer nested P2M flushes Jan Beulich
@ 2021-10-19 13:53 ` Roger Pau Monné
  2021-10-19 15:06   ` Jan Beulich
  2021-10-20  7:27 ` Tian, Kevin
  1 sibling, 1 reply; 5+ messages in thread
From: Roger Pau Monné @ 2021-10-19 13:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, George Dunlap, Kevin Tian

On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> order violation when the PoD lock is held around it. Hence such flushing
> needs to be deferred. Steal the approach from p2m_change_type_range().
> (Note that strictly speaking the change at the out_of_memory label is
> not needed, as the domain gets crashed there anyway. The change is being
> made nevertheless to avoid setting up a trap for someone meaning to
> deal with that case better than by domain_crash().)
> 
> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> flushes there as well. I think this should have been that way from the
> introduction of the flag.
> 
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
>     out_of_memory label. Extend description to cover these.
> 
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>      {
>          if ( p2m_is_nestedp2m(p2m) )
>              ept = &p2m_get_hostp2m(d)->ept;
> -        else
> +        else if ( !p2m->defer_nested_flush )
>              p2m_flush_nestedp2m(d);

I find this model slightly concerning, as we don't actually notify the
caller that a nested flush has been deferred, so we must make sure that
whoever sets defer_nested_flush also performs a flush unconditionally
when clearing the flag.
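
To make that concrete, a hypothetical (not in-tree) caller which sets the
flag but forgets the flush on its clear path would be left with stale
nested translations, roughly:

    p2m->defer_nested_flush = true;
    p2m_set_entry(...);        /* ept_sync_domain_prepare() now skips
                                  p2m_flush_nestedp2m()               */
    p2m->defer_nested_flush = false;
    /* Missing: an unconditional p2m_flush_nestedp2m(d) here, so the
       nested p2ms keep whatever they had cached for the changed GFNs. */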

Thanks, Roger.



* Re: [PATCH v2] x86/PoD: defer nested P2M flushes
  2021-10-19 13:53 ` Roger Pau Monné
@ 2021-10-19 15:06   ` Jan Beulich
  2021-10-20  7:52     ` Roger Pau Monné
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2021-10-19 15:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, George Dunlap, Kevin Tian

On 19.10.2021 15:53, Roger Pau Monné wrote:
> On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
>> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
>> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
>> order violation when the PoD lock is held around it. Hence such flushing
>> needs to be deferred. Steal the approach from p2m_change_type_range().
>> (Note that strictly speaking the change at the out_of_memory label is
>> not needed, as the domain gets crashed there anyway. The change is being
>> made nevertheless to avoid setting up a trap for someone meaning to
>> deal with that case better than by domain_crash().)
>>
>> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
>> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
>> p2m_flush_nestedp2m() invocation conditional. Note that this then also
>> alters behavior of p2m_change_type_range() on EPT, deferring the nested
>> flushes there as well. I think this should have been that way from the
>> introduction of the flag.
>>
>> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>>      {
>>          if ( p2m_is_nestedp2m(p2m) )
>>              ept = &p2m_get_hostp2m(d)->ept;
>> -        else
>> +        else if ( !p2m->defer_nested_flush )
>>              p2m_flush_nestedp2m(d);
> 
> I find this model slightly concerning, as we don't actually notify the
> caller that a nested flush has been deferred, so we must make sure that
> whoever sets defer_nested_flush also performs a flush unconditionally
> when clearing the flag.

Well, this _is_ the model used for now. Until this change there was
just a single party setting the flag. And like here, any new party
setting the flag will also need to invoke a flush upon clearing it.
It's not clear to me what alternative model you may have in mind.
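
For reference, the existing user in p2m_change_type_range() has roughly
this shape (a sketch from memory, not the exact code), and
pod_unlock_and_flush() in the patch mirrors it:

    p2m_lock(p2m);
    p2m->defer_nested_flush = true;
    /* ... batch of entry changes; per-entry nested flushes are skipped ... */
    p2m->defer_nested_flush = false;
    if ( nestedhvm_enabled(d) )
        p2m_flush_nestedp2m(d);    /* one flush covering the whole batch */
    p2m_unlock(p2m);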

Jan




* RE: [PATCH v2] x86/PoD: defer nested P2M flushes
  2021-10-19 12:52 [PATCH v2] x86/PoD: defer nested P2M flushes Jan Beulich
  2021-10-19 13:53 ` Roger Pau Monné
@ 2021-10-20  7:27 ` Tian, Kevin
  1 sibling, 0 replies; 5+ messages in thread
From: Tian, Kevin @ 2021-10-20  7:27 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Cooper, Andrew, George Dunlap, Roger Pau Monné

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, October 19, 2021 8:52 PM
> 
> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> order violation when the PoD lock is held around it. Hence such flushing
> needs to be deferred. Steal the approach from p2m_change_type_range().
> (Note that strictly speaking the change at the out_of_memory label is
> not needed, as the domain gets crashed there anyway. The change is being
> made nevertheless to avoid setting up a trap for someone meaning to
> deal with that case better than by domain_crash().)
> 
> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> flushes there as well. I think this should have been that way from the
> introduction of the flag.
> 
> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v2: Also adjust ept_sync_domain_prepare(). Also convert the flush at the
>     out_of_memory label. Extend description to cover these.
> 
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
>      {
>          if ( p2m_is_nestedp2m(p2m) )
>              ept = &p2m_get_hostp2m(d)->ept;
> -        else
> +        else if ( !p2m->defer_nested_flush )
>              p2m_flush_nestedp2m(d);
>      }
> 
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -24,6 +24,7 @@
>  #include <xen/mm.h>
>  #include <xen/sched.h>
>  #include <xen/trace.h>
> +#include <asm/hvm/nestedhvm.h>
>  #include <asm/page.h>
>  #include <asm/paging.h>
>  #include <asm/p2m.h>
> @@ -494,6 +495,13 @@ p2m_pod_offline_or_broken_replace(struct
>  static int
>  p2m_pod_zero_check_superpage(struct p2m_domain *p2m, gfn_t gfn);
> 
> +static void pod_unlock_and_flush(struct p2m_domain *p2m)
> +{
> +    pod_unlock(p2m);
> +    p2m->defer_nested_flush = false;
> +    if ( nestedhvm_enabled(p2m->domain) )
> +        p2m_flush_nestedp2m(p2m->domain);
> +}
> 
>  /*
>   * This function is needed for two reasons:
> @@ -514,6 +522,7 @@ p2m_pod_decrease_reservation(struct doma
> 
>      gfn_lock(p2m, gfn, order);
>      pod_lock(p2m);
> +    p2m->defer_nested_flush = true;
> 
>      /*
>       * If we don't have any outstanding PoD entries, let things take their
> @@ -665,7 +674,7 @@ out_entry_check:
>      }
> 
>  out_unlock:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      gfn_unlock(p2m, gfn, order);
>      return ret;
>  }
> @@ -1144,8 +1153,10 @@ p2m_pod_demand_populate(struct p2m_domai
>       * won't start until we're done.
>       */
>      if ( unlikely(d->is_dying) )
> -        goto out_fail;
> -
> +    {
> +        pod_unlock(p2m);
> +        return false;
> +    }
> 
>      /*
>       * Because PoD does not have cache list for 1GB pages, it has to remap
> @@ -1167,6 +1178,8 @@ p2m_pod_demand_populate(struct p2m_domai
>                                p2m_populate_on_demand, p2m->default_access);
>      }
> 
> +    p2m->defer_nested_flush = true;
> +
>      /* Only reclaim if we're in actual need of more cache. */
>      if ( p2m->pod.entry_count > p2m->pod.count )
>          pod_eager_reclaim(p2m);
> @@ -1229,22 +1242,25 @@ p2m_pod_demand_populate(struct p2m_domai
>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>      }
> 
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      return true;
> +
>  out_of_memory:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
> 
>      printk("%s: Dom%d out of PoD memory! (tot=%"PRIu32" ents=%ld dom%d)\n",
>             __func__, d->domain_id, domain_tot_pages(d),
>             p2m->pod.entry_count, current->domain->domain_id);
>      domain_crash(d);
>      return false;
> +
>  out_fail:
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
>      return false;
> +
>  remap_and_retry:
>      BUG_ON(order != PAGE_ORDER_2M);
> -    pod_unlock(p2m);
> +    pod_unlock_and_flush(p2m);
> 
>      /*
>       * Remap this 2-meg region in singleton chunks. See the comment on the



* Re: [PATCH v2] x86/PoD: defer nested P2M flushes
  2021-10-19 15:06   ` Jan Beulich
@ 2021-10-20  7:52     ` Roger Pau Monné
  0 siblings, 0 replies; 5+ messages in thread
From: Roger Pau Monné @ 2021-10-20  7:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, George Dunlap, Kevin Tian

On Tue, Oct 19, 2021 at 05:06:27PM +0200, Jan Beulich wrote:
> On 19.10.2021 15:53, Roger Pau Monné wrote:
> > On Tue, Oct 19, 2021 at 02:52:27PM +0200, Jan Beulich wrote:
> >> With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
> >> write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
> >> order violation when the PoD lock is held around it. Hence such flushing
> >> needs to be deferred. Steal the approach from p2m_change_type_range().
> >> (Note that strictly speaking the change at the out_of_memory label is
> >> not needed, as the domain gets crashed there anyway. The change is being
> >> made nevertheless to avoid setting up a trap for someone meaning to
> >> deal with that case better than by domain_crash().)
> >>
> >> Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
> >> ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
> >> p2m_flush_nestedp2m() invocation conditional. Note that this then also
> >> alters behavior of p2m_change_type_range() on EPT, deferring the nested
> >> flushes there as well. I think this should have been that way from the
> >> introduction of the flag.
> >>
> >> Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> >> --- a/xen/arch/x86/mm/p2m-ept.c
> >> +++ b/xen/arch/x86/mm/p2m-ept.c
> >> @@ -1253,7 +1253,7 @@ static void ept_sync_domain_prepare(stru
> >>      {
> >>          if ( p2m_is_nestedp2m(p2m) )
> >>              ept = &p2m_get_hostp2m(d)->ept;
> >> -        else
> >> +        else if ( !p2m->defer_nested_flush )
> >>              p2m_flush_nestedp2m(d);
> > 
> > I find this model slightly concerning, as we don't actually notify the
> > caller that a nested flush has been deferred, so we must make sure that
> > whoever sets defer_nested_flush also performs a flush unconditionally
> > when clearing the flag.
> 
> Well, this _is_ the model used for now. Until this change there was
> just a single party setting the flag. And like here, any new party
> setting the flag will also need to invoke a flush upon clearing it.
> It's not clear to me what alternative model you may have in mind.

I was mostly thinking of something similar to the
defer_flush/need_flush pair, where the need for a flush is signaled in
need_flush, so setting defer_flush doesn't automatically imply a flush
when clearing it.
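
Roughly what I had in mind, as a sketch only (need_nested_flush would be a
new companion flag, it doesn't exist today):

    /* Wherever a nested flush would otherwise be issued: */
    if ( p2m->defer_nested_flush )
        p2m->need_nested_flush = true;    /* hypothetical flag */
    else
        p2m_flush_nestedp2m(d);

    /* When the deferring section ends: */
    p2m->defer_nested_flush = false;
    if ( p2m->need_nested_flush )
    {
        p2m->need_nested_flush = false;
        p2m_flush_nestedp2m(d);
    }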

Anyway, this is simple enough that it doesn't warrant the use of the
more complicated logic.

Thanks, Roger.

