* [PATCH 0/1] x86/pv: Split pv_hypercall() in two
@ 2021-10-11 18:05 Andrew Cooper
  2021-10-11 18:05 ` [PATCH] " Andrew Cooper
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Cooper @ 2021-10-11 18:05 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

Full perf analysis.  Time is raw TSC cycles for a xen_version() hypercall,
compared across the change in patch 1, with obvious outliers excluded,
i.e. the idealised best case improvement.

Some general notes.  pv64 is `syscall`, while pv32 is `int $0x82` and
therefore has more overhead to begin with.  Consequently, dropping two lfences
is less of an overall change in the path.
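
For reference, a minimal sketch of the kind of measurement loop behind these
numbers (illustrative only: the real harness is not part of this posting, and
HYPERVISOR_xen_version() stands in for whatever guest-side hypercall wrapper
is available):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  static inline uint64_t rdtsc_serialised(void)
  {
      uint32_t lo, hi;

      /* lfence stops rdtsc from being executed early under speculation. */
      asm volatile ("lfence; rdtsc" : "=a" (lo), "=d" (hi));

      return ((uint64_t)hi << 32) | lo;
  }

  int main(void)
  {
      for ( unsigned int i = 0; i < 100; ++i )
      {
          uint64_t start = rdtsc_serialised();

          /* Hypothetical wrapper for the xen_version hypercall. */
          HYPERVISOR_xen_version(XENVER_version, NULL);

          printf("%" PRIu64 "\n", rdtsc_serialised() - start);
      }

      return 0;
  }

Each printed delta is one sample as fed to ministat below.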

First, AMD Milan (Zen3):

  $ ministat -A milan-hcall-pv64-{before,after}
  x milan-hcall-pv64-before
  + milan-hcall-pv64-after
      N           Min           Max        Median           Avg        Stddev
  x  98           420           460           440     438.97959     6.6564899
  +  98           360           440           380     370.81633      12.57337
  Difference at 95.0% confidence
          -68.1633 +/- 2.81674
          -15.5277% +/- 0.641656%
          (Student's t, pooled s = 10.0598)

  $ ministat -A milan-hcall-pv32-{before,after}
  x milan-hcall-pv32-before
  + milan-hcall-pv32-after
      N           Min           Max        Median           Avg        Stddev
  x  98          1900          2100          1980     1984.2857     22.291416
  +  96          1740          1960          1760        1767.5     35.688713
  Difference at 95.0% confidence
          -216.786 +/- 8.35522
          -10.9251% +/- 0.421069%
          (Student's t, pooled s = 29.6859)

Second, AMD Naples (Zen1):

  $ ministat -A naples-hcall-pv64-{before,after}
  x naples-hcall-pv64-before
  + naples-hcall-pv64-after
      N           Min           Max        Median           Avg        Stddev
  x  97           294           336           315     311.75258     10.207259
  +  97           252           273           252     257.41237     9.2328135
  Difference at 95.0% confidence
          -54.3402 +/- 2.73904
          -17.4306% +/- 0.878593%
          (Student's t, pooled s = 9.73224)

  $ ministat -A naples-hcall-pv32-{before,after}
  x naples-hcall-pv32-before
  + naples-hcall-pv32-after
      N           Min           Max        Median           Avg        Stddev
  x  98          1260          1470          1260     1276.2857     42.913483
  +  95          1218          1470          1239     1250.9368     52.491298
  Difference at 95.0% confidence
          -25.3489 +/- 13.5082
          -1.98614% +/- 1.0584%
          (Student's t, pooled s = 47.8673)

Third, Intel Coffeelake-R:

  $ ministat -A cflr-hcall-pv64-{before,after}
  x cflr-hcall-pv64-before
  + cflr-hcall-pv64-after
      N           Min           Max        Median           Avg        Stddev
  x 100           774          1024           792        825.04     73.608563
  +  95           734           966           756     787.74737     70.580114
  Difference at 95.0% confidence
          -37.2926 +/- 20.2602
          -4.5201% +/- 2.45567%
          (Student's t, pooled s = 72.1494)

  $ ministat -A cflr-hcall-pv32-{before,after}
  x cflr-hcall-pv32-before
  + cflr-hcall-pv32-after
      N           Min           Max        Median           Avg        Stddev
  x 100          2176          3816          2198       2288.84     196.18218
  +  99          2180          2434          2198     2232.4646     75.867677
  Difference at 95.0% confidence
          -56.3754 +/- 41.4084
          -2.46305% +/- 1.80914%
          (Student's t, pooled s = 149.013)

Fourth, Intel Skylake Server:

  $ ministat -A skx-hcall-pv64-{before,after}
  x skx-hcall-pv64-before
  + skx-hcall-pv64-after
      N           Min           Max        Median           Avg        Stddev
  x  99          5642          5720          5686      5686.303     17.909896
  +  98          5520          5544          5540     5536.0816       8.20821
  Difference at 95.0% confidence
          -150.221 +/- 3.89729
          -2.64181% +/- 0.0685382%
          (Student's t, pooled s = 13.9542)

  $ ministat -A skx-hcall-pv32-{before,after}
  x skx-hcall-pv32-before
  + skx-hcall-pv32-after
      N           Min           Max        Median           Avg        Stddev
  x  99          9296          9500          9308     9309.3131     20.418402
  +  96          9110          9266          9180     9175.2292     27.860358
  Difference at 95.0% confidence
          -134.084 +/- 6.84111
          -1.44032% +/- 0.0734868%
          (Student's t, pooled s = 24.3673)


I'm honestly not sure why Naples pv32's improvement is so small, but I've
double-checked the numbers.  Clearly there's something on the `int $0x82` path
which has radically higher overhead on Naples vs Milan.

For the Intel numbers, both setups are writing to MSR_SPEC_CTRL on entry/exit,
but for Skylake it is the microcode implementation whereas for CFL-R it is
the hardware implementation.  Skylake also has XPTI adding further overhead to
the paths.

Andrew Cooper (1):
  x86/pv: Split pv_hypercall() in two

 xen/arch/x86/pv/hypercall.c | 24 +++++++++++++++++++-----
 xen/arch/x86/pv/traps.c     | 11 -----------
 2 files changed, 19 insertions(+), 16 deletions(-)

-- 
2.11.0




* [PATCH] x86/pv: Split pv_hypercall() in two
  2021-10-11 18:05 [PATCH 0/1] x86/pv: Split pv_hypercall() in two Andrew Cooper
@ 2021-10-11 18:05 ` Andrew Cooper
  2021-10-12  9:26   ` Jan Beulich
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Cooper @ 2021-10-11 18:05 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
individual path through the function.  There is very little code common
between compat and native, and context-dependent conditionals predict very
badly for a period of time after context switch.
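
For context, the hardening pattern responsible for those lfences looks
roughly like this (a simplified sketch of Xen's evaluate_nospec(), see the
nospec.h headers; the real code is gated on
CONFIG_SPECULATIVE_HARDEN_BRANCH):

  /*
   * An lfence is dispatched on both outcomes of the condition, so the
   * pipeline cannot run ahead on a mispredicted value of the predicate.
   */
  static always_inline bool barrier_nospec_true(void)
  {
      asm volatile ("lfence" ::: "memory");
      return true;
  }

  #define evaluate_nospec(condition)                    \
      ((condition) ? barrier_nospec_true()              \
                   : !barrier_nospec_true())

Each is_pv_32bit_vcpu() check therefore executes one lfence at runtime,
whichever way the branch goes, which is where the two-per-path cost comes
from.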

Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
_pv_hypercall() to be static and forced inline.  The delta is:

  add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
  Function                                     old     new   delta
  do_entry_int82                                50     350    +300
  pv_hypercall                                 579     297    -282

which is tiny, but the perf implications are large:

  Guest | Naples | Milan  | SKX    | CFL-R  |
  ------+--------+--------+--------+--------+
  pv64  |  17.4% |  15.5% |   2.6% |   4.5% |
  pv32  |   1.9% |  10.9% |   1.4% |   2.5% |

These are percentage improvements in raw TSC deltas for a xen_version
hypercall, with obvious outliers excluded.  Therefore, they represent an
idealised best case improvement.

The pv64 path uses `syscall`, while the pv32 path uses `int $0x82` and so
necessarily has higher overhead.  Therefore, dropping the lfences yields less
of an overall improvement.

I don't know why the Naples pv32 improvement is so small, but I've
double-checked the numbers and they're correct.  Something on this path
incurs a large overhead in the pipeline.

On the Intel side, both systems are writing to MSR_SPEC_CTRL on
entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
the hardware implementation), while SKX is suffering further from XPTI for
Meltdown protection.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/pv/hypercall.c | 24 +++++++++++++++++++-----
 xen/arch/x86/pv/traps.c     | 11 -----------
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/pv/hypercall.c b/xen/arch/x86/pv/hypercall.c
index 9765e674cf60..3579ba905c1c 100644
--- a/xen/arch/x86/pv/hypercall.c
+++ b/xen/arch/x86/pv/hypercall.c
@@ -23,6 +23,7 @@
 #include <xen/hypercall.h>
 #include <xen/nospec.h>
 #include <xen/trace.h>
+#include <asm/apic.h>
 #include <asm/multicall.h>
 #include <irq_vectors.h>
 
@@ -109,15 +110,15 @@ const pv_hypercall_table_t pv_hypercall_table[] = {
 #undef COMPAT_CALL
 #undef HYPERCALL
 
-void pv_hypercall(struct cpu_user_regs *regs)
+/* Forced inline to cause 'compat' to be evaluated at compile time. */
+static void always_inline
+_pv_hypercall(struct cpu_user_regs *regs, bool compat)
 {
     struct vcpu *curr = current;
-    unsigned long eax;
+    unsigned long eax = compat ? regs->eax : regs->rax;
 
     ASSERT(guest_kernel_mode(curr, regs));
 
-    eax = is_pv_32bit_vcpu(curr) ? regs->eax : regs->rax;
-
     BUILD_BUG_ON(ARRAY_SIZE(pv_hypercall_table) >
                  ARRAY_SIZE(hypercall_args_table));
 
@@ -137,7 +138,7 @@ void pv_hypercall(struct cpu_user_regs *regs)
 
     curr->hcall_preempted = false;
 
-    if ( !is_pv_32bit_vcpu(curr) )
+    if ( !compat )
     {
         unsigned long rdi = regs->rdi;
         unsigned long rsi = regs->rsi;
@@ -348,8 +349,21 @@ void pv_ring1_init_hypercall_page(void *p)
         *(u8  *)(p+ 7) = 0xc3;    /* ret */
     }
 }
+
+void do_entry_int82(struct cpu_user_regs *regs)
+{
+    if ( unlikely(untrusted_msi) )
+        check_for_unexpected_msi((uint8_t)regs->entry_vector);
+
+    _pv_hypercall(regs, true /* compat */);
+}
 #endif
 
+void pv_hypercall(struct cpu_user_regs *regs)
+{
+    _pv_hypercall(regs, false /* native */);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/pv/traps.c b/xen/arch/x86/pv/traps.c
index 764773c02104..1e05a9f1cdad 100644
--- a/xen/arch/x86/pv/traps.c
+++ b/xen/arch/x86/pv/traps.c
@@ -24,22 +24,11 @@
 #include <xen/lib.h>
 #include <xen/softirq.h>
 
-#include <asm/apic.h>
 #include <asm/pv/trace.h>
 #include <asm/shared.h>
 #include <asm/traps.h>
 #include <irq_vectors.h>
 
-#ifdef CONFIG_PV32
-void do_entry_int82(struct cpu_user_regs *regs)
-{
-    if ( unlikely(untrusted_msi) )
-        check_for_unexpected_msi((uint8_t)regs->entry_vector);
-
-    pv_hypercall(regs);
-}
-#endif
-
 void pv_inject_event(const struct x86_event *event)
 {
     struct vcpu *curr = current;
-- 
2.11.0




* Re: [PATCH] x86/pv: Split pv_hypercall() in two
  2021-10-11 18:05 ` [PATCH] " Andrew Cooper
@ 2021-10-12  9:26   ` Jan Beulich
  0 siblings, 0 replies; 3+ messages in thread
From: Jan Beulich @ 2021-10-12  9:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 11.10.2021 20:05, Andrew Cooper wrote:
> The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
> individual path through the function.  There is very little code common
> between compat and native, and context-dependent conditionals predict very
> badly for a period of time after context switch.
> 
> Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
> _pv_hypercall() to be static and forced inline.  The delta is:
> 
>   add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
>   Function                                     old     new   delta
>   do_entry_int82                                50     350    +300
>   pv_hypercall                                 579     297    -282
> 
> which is tiny, but the perf implications are large:
> 
>   Guest | Naples | Milan  | SKX    | CFL-R  |
>   ------+--------+--------+--------+--------+
>   pv64  |  17.4% |  15.5% |   2.6% |   4.5% |
>   pv32  |   1.9% |  10.9% |   1.4% |   2.5% |
> 
> These are percentage improvements in raw TSC deltas for a xen_version
> hypercall, with obvious outliers excluded.  Therefore, they represent an
> idealised best case improvement.
> 
> The pv64 path uses `syscall`, while the pv32 path uses `int $0x82` and so
> necessarily has higher overhead.  Therefore, dropping the lfences yields less
> of an overall improvement.
> 
> I don't know why the Naples pv32 improvement is so small, but I've
> double-checked the numbers and they're correct.  Something on this path
> incurs a large overhead in the pipeline.
> 
> On the Intel side, both systems are writing to MSR_SPEC_CTRL on
> entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
> the hardware implementation), while SKX is suffering further from XPTI for
> Meltdown protection.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


