All of lore.kernel.org
 help / color / mirror / Atom feed
* [Xen-devel] [PATCH 0/2] x86/pvshim: improve tlb flush performance
@ 2019-11-25 17:22 Roger Pau Monne
  2019-11-25 17:22 ` [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary Roger Pau Monne
  2019-11-25 17:22 ` [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode Roger Pau Monne
  0 siblings, 2 replies; 8+ messages in thread
From: Roger Pau Monne @ 2019-11-25 17:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monne

Hello,

Enabling PGE in CR4 causes a huge performance penalty when running the
shim on AMD hardware, this patch series avoids enabling PGE when in
shim mode, and makes a small adjustment in do_tlb_flush to perform a
flush by writing to CR3 if PGE is not enabled.

Roger Pau Monne (2):
  x86/tlbflush: do not toggle the PGE CR4 bit unless necessary
  x86/pvshim: do not enable global pages in shim mode

 xen/arch/x86/flushtlb.c  | 9 +++++----
 xen/arch/x86/pv/domain.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

-- 
2.24.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary
  2019-11-25 17:22 [Xen-devel] [PATCH 0/2] x86/pvshim: improve tlb flush performance Roger Pau Monne
@ 2019-11-25 17:22 ` Roger Pau Monne
  2019-11-29 12:04   ` Jan Beulich
  2019-11-25 17:22 ` [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode Roger Pau Monne
  1 sibling, 1 reply; 8+ messages in thread
From: Roger Pau Monne @ 2019-11-25 17:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monne

When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.

Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4 specially on AMD hardware, which makes such
accesses much more expensive.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/flushtlb.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index c1ae0d9467..540209c856 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -84,6 +84,7 @@ static void post_flush(u32 t)
 static void do_tlb_flush(void)
 {
     unsigned long flags;
+    unsigned long cr4;
     u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
@@ -93,13 +94,13 @@ static void do_tlb_flush(void)
 
     if ( use_invpcid )
         invpcid_flush_all();
-    else
+    else if ( (cr4 = read_cr4()) & X86_CR4_PGE )
     {
-        unsigned long cr4 = read_cr4();
-
-        write_cr4(cr4 ^ X86_CR4_PGE);
+        write_cr4(cr4 & ~X86_CR4_PGE);
         write_cr4(cr4);
     }
+    else
+        write_cr3(read_cr3());
 
     post_flush(t);
 
-- 
2.24.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode
  2019-11-25 17:22 [Xen-devel] [PATCH 0/2] x86/pvshim: improve tlb flush performance Roger Pau Monne
  2019-11-25 17:22 ` [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary Roger Pau Monne
@ 2019-11-25 17:22 ` Roger Pau Monne
  2019-11-29 12:09   ` Jan Beulich
  1 sibling, 1 reply; 8+ messages in thread
From: Roger Pau Monne @ 2019-11-25 17:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monne

When using global pages a full tlb flush can only be performed by
toggling the PGE bit in CR4, which is usually quite expensive in terms
of performance when running virtualized. This is specially relevant on
AMD hardware, which doesn't have the ability to do selective CR4
trapping, but can also be relevant on Intel if the underlying
hypervisor also traps on accesses to the PGE CR4 bit.

In order to avoid this performance penalty, do not use global pages
when running in shim mode. Note this is done when running on both
Intel or AMD hardware, since older versions of Xen capable of running
the shim don't make use of Intel selective CR4 trapping feature and
will vmexit on every access to CR4.

The above figures are from a PV shim running on AMD hardware with
32 vCPUs:

PGE enabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804b01c0, lockval=1adb1adb, not locked
(XEN)   lock:1841883(1375128998543), block:1658716(10193054890781)

Average lock time:   746588ns
Average block time: 6145147ns

PGE disabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804af1c0, lockval=a8bfa8bf, not locked
(XEN)   lock:2730175(657505389886), block:2039716(2963768247738)

Average lock time:   240829ns
Average block time: 1453029ns

As seen from the above figures the block time of the flush lock is
reduced to approximately 1/3 of the original value.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/pv/domain.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 4b6f48dea2..36f3903dcb 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -13,6 +13,7 @@
 #include <asm/invpcid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/pv/domain.h>
+#include <asm/pv/shim.h>
 #include <asm/shadow.h>
 
 static __read_mostly enum {
@@ -130,7 +131,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
      */
     if ( d->arch.pv.pcid )
         cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv.xpti )
+    else if ( !d->arch.pv.xpti && !pv_shim )
         cr4 |= X86_CR4_PGE;
 
     /*
-- 
2.24.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary
  2019-11-25 17:22 ` [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary Roger Pau Monne
@ 2019-11-29 12:04   ` Jan Beulich
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2019-11-29 12:04 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Wei Liu, Andrew Cooper

On 25.11.2019 18:22, Roger Pau Monne wrote:
> When PCID is not available Xen does a full tlbflush by toggling the
> PGE bit in CR4. This is not necessary if PGE is not enabled, since a
> flush can be performed by writing to CR3 in that case.
> 
> Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
> it's already enabled, otherwise do the tlb flush by writing to CR3.
> This is relevant when running virtualized, since hypervisors don't
> usually trap accesses to CR3 when using hardware assisted paging, but
> do trap accesses to CR4 specially on AMD hardware, which makes such
> accesses much more expensive.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one remark:

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -84,6 +84,7 @@ static void post_flush(u32 t)
>  static void do_tlb_flush(void)
>  {
>      unsigned long flags;
> +    unsigned long cr4;

This would better be merged with the adjacent declaration. Can surely
be done while committing.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode
  2019-11-25 17:22 ` [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode Roger Pau Monne
@ 2019-11-29 12:09   ` Jan Beulich
  2019-11-29 12:12     ` Andrew Cooper
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2019-11-29 12:09 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Wei Liu, Andrew Cooper

On 25.11.2019 18:22, Roger Pau Monne wrote:
> When using global pages a full tlb flush can only be performed by
> toggling the PGE bit in CR4, which is usually quite expensive in terms
> of performance when running virtualized. This is specially relevant on
> AMD hardware, which doesn't have the ability to do selective CR4
> trapping, but can also be relevant on Intel if the underlying
> hypervisor also traps on accesses to the PGE CR4 bit.
> 
> In order to avoid this performance penalty, do not use global pages
> when running in shim mode. Note this is done when running on both
> Intel or AMD hardware, since older versions of Xen capable of running
> the shim don't make use of Intel selective CR4 trapping feature and
> will vmexit on every access to CR4.

So here you say you do this uniformly because of older Xen.
What about newer Xen? Is this still a win (or at least not a
loss) there? Independent of underlying hardware? In case of
any kind of doubt I think this would want to be command line
controllable.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode
  2019-11-29 12:09   ` Jan Beulich
@ 2019-11-29 12:12     ` Andrew Cooper
  2019-11-29 14:36       ` Roger Pau Monné
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2019-11-29 12:12 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: xen-devel, Wei Liu

On 29/11/2019 12:09, Jan Beulich wrote:
> On 25.11.2019 18:22, Roger Pau Monne wrote:
>> When using global pages a full tlb flush can only be performed by
>> toggling the PGE bit in CR4, which is usually quite expensive in terms
>> of performance when running virtualized. This is specially relevant on
>> AMD hardware, which doesn't have the ability to do selective CR4
>> trapping, but can also be relevant on Intel if the underlying
>> hypervisor also traps on accesses to the PGE CR4 bit.
>>
>> In order to avoid this performance penalty, do not use global pages
>> when running in shim mode. Note this is done when running on both
>> Intel or AMD hardware, since older versions of Xen capable of running
>> the shim don't make use of Intel selective CR4 trapping feature and
>> will vmexit on every access to CR4.
> So here you say you do this uniformly because of older Xen.
> What about newer Xen? Is this still a win (or at least not a
> loss) there? Independent of underlying hardware? In case of
> any kind of doubt I think this would want to be command line
> controllable.

Older Xen has VMExits for all CR4.PGE flips.

Newer Xen (since 4.10? iirc) on Intel hardware (with HAP) arranged for
CR4.PGE flips not to vmexit.

There is no ability to cause CR4.PGE flips to not vmexit on AMD, other
than to give the guest full control of CR4 which is a BadThing(tm).

I agree that this wants a command line control, but it wants to be
enabled by default any time we find ourselves nested on AMD hardware,
not just in shim.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode
  2019-11-29 12:12     ` Andrew Cooper
@ 2019-11-29 14:36       ` Roger Pau Monné
  2019-11-29 14:46         ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Roger Pau Monné @ 2019-11-29 14:36 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich; +Cc: xen-devel, Wei Liu

On Fri, Nov 29, 2019 at 12:12:51PM +0000, Andrew Cooper wrote:
> On 29/11/2019 12:09, Jan Beulich wrote:
> > On 25.11.2019 18:22, Roger Pau Monne wrote:
> >> When using global pages a full tlb flush can only be performed by
> >> toggling the PGE bit in CR4, which is usually quite expensive in terms
> >> of performance when running virtualized. This is specially relevant on
> >> AMD hardware, which doesn't have the ability to do selective CR4
> >> trapping, but can also be relevant on Intel if the underlying
> >> hypervisor also traps on accesses to the PGE CR4 bit.
> >>
> >> In order to avoid this performance penalty, do not use global pages
> >> when running in shim mode. Note this is done when running on both
> >> Intel or AMD hardware, since older versions of Xen capable of running
> >> the shim don't make use of Intel selective CR4 trapping feature and
> >> will vmexit on every access to CR4.
> > So here you say you do this uniformly because of older Xen.
> > What about newer Xen? Is this still a win (or at least not a
> > loss) there? Independent of underlying hardware?

So on Intel hw that don't trap CR4 PGE accesses disabling PGE seems to
make performance slightly worse when doing a `make -j8 xen` on a 8 vCPU
pv-shim guest:

CR4 PGE enabled:
real	2m40.889s
real	2m41.700s
real	2m40.453s

CR4 PGE disabled:
real	2m43.197s
real	2m41.940s
real	2m42.727s

> > In case of
> > any kind of doubt I think this would want to be command line
> > controllable.
> 
> Older Xen has VMExits for all CR4.PGE flips.
> 
> Newer Xen (since 4.10? iirc) on Intel hardware (with HAP) arranged for
> CR4.PGE flips not to vmexit.
> 
> There is no ability to cause CR4.PGE flips to not vmexit on AMD, other
> than to give the guest full control of CR4 which is a BadThing(tm).
> 
> I agree that this wants a command line control, but it wants to be
> enabled by default any time we find ourselves nested on AMD hardware,
> not just in shim.

Only on AMD hardware? Newer versions of Xen don't trap CR4 PGE writes,
but what about other hypervisors?

I think it would be better to avoid using PGE when the hypervisor
CPUID bit is set, regardless of whether the hardware is AMD or not.
The performance penalty doesn't seem that bad, taking into account
that using PGE when CR4 is trapped is much worse. Alternatively we
could try to detect how slow a flush from CR4 is and act accordingly,
but that seems tricky.

I can add a command line option to force or prevent the usage of PGE.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode
  2019-11-29 14:36       ` Roger Pau Monné
@ 2019-11-29 14:46         ` Jan Beulich
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2019-11-29 14:46 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, Wei Liu, xen-devel

On 29.11.2019 15:36, Roger Pau Monné  wrote:
> On Fri, Nov 29, 2019 at 12:12:51PM +0000, Andrew Cooper wrote:
>> I agree that this wants a command line control, but it wants to be
>> enabled by default any time we find ourselves nested on AMD hardware,
>> not just in shim.
> 
> Only on AMD hardware? Newer versions of Xen don't trap CR4 PGE writes,
> but what about other hypervisors?

I guess all hypervisors will strive to force as little exits as
possible.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-11-29 14:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-25 17:22 [Xen-devel] [PATCH 0/2] x86/pvshim: improve tlb flush performance Roger Pau Monne
2019-11-25 17:22 ` [Xen-devel] [PATCH 1/2] x86/tlbflush: do not toggle the PGE CR4 bit unless necessary Roger Pau Monne
2019-11-29 12:04   ` Jan Beulich
2019-11-25 17:22 ` [Xen-devel] [PATCH 2/2] x86/pvshim: do not enable global pages in shim mode Roger Pau Monne
2019-11-29 12:09   ` Jan Beulich
2019-11-29 12:12     ` Andrew Cooper
2019-11-29 14:36       ` Roger Pau Monné
2019-11-29 14:46         ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.