* [PATCH 0/1] x86/pv: Split pv_hypercall() in two
@ 2021-10-11 18:05 Andrew Cooper
2021-10-11 18:05 ` [PATCH] " Andrew Cooper
0 siblings, 1 reply; 3+ messages in thread
From: Andrew Cooper @ 2021-10-11 18:05 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Full perf analysis. Time is raw TSC cycles for a xen_version() hypercall,
compared across the change in patch 1, with obvious outliers excluded,
i.e. an idealised best-case improvement.
Some general notes. pv64 uses `syscall`, while pv32 uses `int $0x82` and
therefore has more overhead to begin with. Consequently, dropping two lfences
is a smaller fraction of the overall path.
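For reference, below is a minimal sketch of the kind of in-guest loop used to
collect raw TSC deltas. The hypercall_xen_version() wrapper and SAMPLES count
are hypothetical stand-ins for whatever the real harness provides (e.g. an
XTF-style test); the point is just the rdtsc-around-hypercall shape:

    #include <stdint.h>

    #define XENVER_version 0   /* From Xen's public version.h. */
    #define SAMPLES        100

    /* Hypothetical guest-side wrapper around HYPERVISOR_xen_version(). */
    extern long hypercall_xen_version(int cmd, void *arg);

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;

        /* lfence serialises rdtsc against preceding instructions. */
        asm volatile ( "lfence; rdtsc" : "=a" (lo), "=d" (hi) );

        return ((uint64_t)hi << 32) | lo;
    }

    void collect_deltas(uint64_t deltas[SAMPLES])
    {
        for ( unsigned int i = 0; i < SAMPLES; ++i )
        {
            uint64_t start = rdtsc();

            /* XENVER_version takes no argument, making it a minimal
             * hypercall - ideal for measuring entry/exit overhead. */
            hypercall_xen_version(XENVER_version, NULL);

            deltas[i] = rdtsc() - start;
        }
    }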
First, AMD Milan (Zen3):
$ ministat -A milan-hcall-pv64-{before,after}
x milan-hcall-pv64-before
+ milan-hcall-pv64-after
N Min Max Median Avg Stddev
x 98 420 460 440 438.97959 6.6564899
+ 98 360 440 380 370.81633 12.57337
Difference at 95.0% confidence
-68.1633 +/- 2.81674
-15.5277% +/- 0.641656%
(Student's t, pooled s = 10.0598)
$ ministat -A milan-hcall-pv32-{before,after}
x milan-hcall-pv32-before
+ milan-hcall-pv32-after
N Min Max Median Avg Stddev
x 98 1900 2100 1980 1984.2857 22.291416
+ 96 1740 1960 1760 1767.5 35.688713
Difference at 95.0% confidence
-216.786 +/- 8.35522
-10.9251% +/- 0.421069%
(Student's t, pooled s = 29.6859)
Second, AMD Naples (Zen1):
$ ministat -A naples-hcall-pv64-{before,after}
x naples-hcall-pv64-before
+ naples-hcall-pv64-after
N Min Max Median Avg Stddev
x 97 294 336 315 311.75258 10.207259
+ 97 252 273 252 257.41237 9.2328135
Difference at 95.0% confidence
-54.3402 +/- 2.73904
-17.4306% +/- 0.878593%
(Student's t, pooled s = 9.73224)
$ ministat -A naples-hcall-pv32-{before,after}
x naples-hcall-pv32-before
+ naples-hcall-pv32-after
N Min Max Median Avg Stddev
x 98 1260 1470 1260 1276.2857 42.913483
+ 95 1218 1470 1239 1250.9368 52.491298
Difference at 95.0% confidence
-25.3489 +/- 13.5082
-1.98614% +/- 1.0584%
(Student's t, pooled s = 47.8673)
Third, Intel Coffeelake-R:
$ ministat -A cflr-hcall-pv64-{before,after}
x cflr-hcall-pv64-before
+ cflr-hcall-pv64-after
N Min Max Median Avg Stddev
x 100 774 1024 792 825.04 73.608563
+ 95 734 966 756 787.74737 70.580114
Difference at 95.0% confidence
-37.2926 +/- 20.2602
-4.5201% +/- 2.45567%
(Student's t, pooled s = 72.1494)
$ ministat -A cflr-hcall-pv32-{before,after}
x cflr-hcall-pv32-before
+ cflr-hcall-pv32-after
N Min Max Median Avg Stddev
x 100 2176 3816 2198 2288.84 196.18218
+ 99 2180 2434 2198 2232.4646 75.867677
Difference at 95.0% confidence
-56.3754 +/- 41.4084
-2.46305% +/- 1.80914%
(Student's t, pooled s = 149.013)
Fourth, Intel Skylake Server:
$ ministat -A skx-hcall-pv64-{before,after}
x skx-hcall-pv64-before
+ skx-hcall-pv64-after
N Min Max Median Avg Stddev
x 99 5642 5720 5686 5686.303 17.909896
+ 98 5520 5544 5540 5536.0816 8.20821
Difference at 95.0% confidence
-150.221 +/- 3.89729
-2.64181% +/- 0.0685382%
(Student's t, pooled s = 13.9542)
$ ministat -A skx-hcall-pv32-{before,after}
x skx-hcall-pv32-before
+ skx-hcall-pv32-after
N Min Max Median Avg Stddev
x 99 9296 9500 9308 9309.3131 20.418402
+ 96 9110 9266 9180 9175.2292 27.860358
Difference at 95.0% confidence
-134.084 +/- 6.84111
-1.44032% +/- 0.0734868%
(Student's t, pooled s = 24.3673)
I'm honestly not sure why the Naples pv32 improvement is so small, but I've
double-checked the numbers. Clearly there's something on the `int $0x82` path
which has radically higher overhead on Naples vs Milan.
For the Intel numbers, both setups are writing to MSR_SPEC_CTRL on entry/exit,
but Skylake uses the microcode implementation whereas CFL-R uses the hardware
implementation. Skylake also has XPTI adding further overhead to the paths.
Andrew Cooper (1):
x86/pv: Split pv_hypercall() in two
xen/arch/x86/pv/hypercall.c | 24 +++++++++++++++++++-----
xen/arch/x86/pv/traps.c | 11 -----------
2 files changed, 19 insertions(+), 16 deletions(-)
--
2.11.0
* [PATCH] x86/pv: Split pv_hypercall() in two
2021-10-11 18:05 [PATCH 0/1] x86/pv: Split pv_hypercall() in two Andrew Cooper
@ 2021-10-11 18:05 ` Andrew Cooper
2021-10-12 9:26 ` Jan Beulich
0 siblings, 1 reply; 3+ messages in thread
From: Andrew Cooper @ 2021-10-11 18:05 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu
The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
individual path through the function. There is very little code common
between compat and native, and context-dependent conditionals predict very
badly for a period of time after context switch.
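For context, the lfences come from the speculation-hardened predicate
evaluation underneath is_pv_32bit_vcpu(). Roughly paraphrased (simplified, not
the literal Xen source, which can patch the lfence in or out via
alternatives):

    #include <stdbool.h>

    /* Sketch of evaluate_nospec() from xen/nospec.h. */
    static inline bool barrier_nospec_true(void)
    {
        asm volatile ( "lfence" ::: "memory" );
        return true;
    }

    /* An lfence sits on *both* arms, so exactly one executes per
     * evaluation, and execution cannot speculate past the branch. */
    #define evaluate_nospec(cond) \
        ((cond) ? barrier_nospec_true() : !barrier_nospec_true())

With two is_pv_32bit_vcpu() checks in the function, that is four lfences in
the generated code and two executed on any individual path, as described
above.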
Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
_pv_hypercall() to be static and forced inline. The delta is:
add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
Function old new delta
do_entry_int82 50 350 +300
pv_hypercall 579 297 -282
which is tiny, but the perf implications are large:
Guest | Naples | Milan | SKX | CFL-R |
------+--------+--------+--------+--------+
pv64 | 17.4% | 15.5% | 2.6% | 4.5% |
pv32 | 1.9% | 10.9% | 1.4% | 2.5% |
These are percentage improvements in raw TSC deltas for a xen_version
hypercall, with obvious outliers excluded, i.e. an idealised best-case
improvement.
The pv64 path uses `syscall`, while the pv32 path uses `int $0x82`, so
necessarily has higher overhead. Therefore, dropping the lfences is less of
an overall improvement.
I don't know why the Naples pv32 improvement is so small, but I've
double-checked the numbers and they're correct. Something on the `int $0x82`
path is evidently a large overhead in the pipeline.
On the Intel side, both systems are writing to MSR_SPEC_CTRL on
entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
the hardware implementation), while SKX is suffering further from XPTI for
Meltdown protection.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
xen/arch/x86/pv/hypercall.c | 24 +++++++++++++++++++-----
xen/arch/x86/pv/traps.c | 11 -----------
2 files changed, 19 insertions(+), 16 deletions(-)
diff --git a/xen/arch/x86/pv/hypercall.c b/xen/arch/x86/pv/hypercall.c
index 9765e674cf60..3579ba905c1c 100644
--- a/xen/arch/x86/pv/hypercall.c
+++ b/xen/arch/x86/pv/hypercall.c
@@ -23,6 +23,7 @@
#include <xen/hypercall.h>
#include <xen/nospec.h>
#include <xen/trace.h>
+#include <asm/apic.h>
#include <asm/multicall.h>
#include <irq_vectors.h>
@@ -109,15 +110,15 @@ const pv_hypercall_table_t pv_hypercall_table[] = {
#undef COMPAT_CALL
#undef HYPERCALL
-void pv_hypercall(struct cpu_user_regs *regs)
+/* Forced inline to cause 'compat' to be evaluated at compile time. */
+static void always_inline
+_pv_hypercall(struct cpu_user_regs *regs, bool compat)
{
struct vcpu *curr = current;
- unsigned long eax;
+ unsigned long eax = compat ? regs->eax : regs->rax;
ASSERT(guest_kernel_mode(curr, regs));
- eax = is_pv_32bit_vcpu(curr) ? regs->eax : regs->rax;
-
BUILD_BUG_ON(ARRAY_SIZE(pv_hypercall_table) >
ARRAY_SIZE(hypercall_args_table));
@@ -137,7 +138,7 @@ void pv_hypercall(struct cpu_user_regs *regs)
curr->hcall_preempted = false;
- if ( !is_pv_32bit_vcpu(curr) )
+ if ( !compat )
{
unsigned long rdi = regs->rdi;
unsigned long rsi = regs->rsi;
@@ -348,8 +349,21 @@ void pv_ring1_init_hypercall_page(void *p)
*(u8 *)(p+ 7) = 0xc3; /* ret */
}
}
+
+void do_entry_int82(struct cpu_user_regs *regs)
+{
+ if ( unlikely(untrusted_msi) )
+ check_for_unexpected_msi((uint8_t)regs->entry_vector);
+
+ _pv_hypercall(regs, true /* compat */);
+}
#endif
+void pv_hypercall(struct cpu_user_regs *regs)
+{
+ _pv_hypercall(regs, false /* native */);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/pv/traps.c b/xen/arch/x86/pv/traps.c
index 764773c02104..1e05a9f1cdad 100644
--- a/xen/arch/x86/pv/traps.c
+++ b/xen/arch/x86/pv/traps.c
@@ -24,22 +24,11 @@
#include <xen/lib.h>
#include <xen/softirq.h>
-#include <asm/apic.h>
#include <asm/pv/trace.h>
#include <asm/shared.h>
#include <asm/traps.h>
#include <irq_vectors.h>
-#ifdef CONFIG_PV32
-void do_entry_int82(struct cpu_user_regs *regs)
-{
- if ( unlikely(untrusted_msi) )
- check_for_unexpected_msi((uint8_t)regs->entry_vector);
-
- pv_hypercall(regs);
-}
-#endif
-
void pv_inject_event(const struct x86_event *event)
{
struct vcpu *curr = current;
--
2.11.0
* Re: [PATCH] x86/pv: Split pv_hypercall() in two
2021-10-11 18:05 ` [PATCH] " Andrew Cooper
@ 2021-10-12 9:26 ` Jan Beulich
0 siblings, 0 replies; 3+ messages in thread
From: Jan Beulich @ 2021-10-12 9:26 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel
On 11.10.2021 20:05, Andrew Cooper wrote:
> The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
> individual path through the function. There is very little code common
> between compat and native, and context-dependent conditionals predict very
> badly for a period of time after context switch.
>
> Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
> _pv_hypercall() to be static and forced inline. The delta is:
>
> add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
> Function old new delta
> do_entry_int82 50 350 +300
> pv_hypercall 579 297 -282
>
> which is tiny, but the perf implications are large:
>
> Guest | Naples | Milan | SKX | CFL-R |
> ------+--------+--------+--------+--------+
> pv64 | 17.4% | 15.5% | 2.6% | 4.5% |
> pv32 | 1.9% | 10.9% | 1.4% | 2.5% |
>
> These are percentage improvements in raw TSC deltas for a xen_version
> hypercall, with obvious outliers excluded, i.e. an idealised best-case
> improvement.
>
> The pv64 path uses `syscall`, while the pv32 path uses `int $0x82`, so
> necessarily has higher overhead. Therefore, dropping the lfences is less of
> an overall improvement.
>
> I don't know why the Naples pv32 improvement is so small, but I've
> double-checked the numbers and they're correct. Something on the
> `int $0x82` path is evidently a large overhead in the pipeline.
>
> On the Intel side, both systems are writing to MSR_SPEC_CTRL on
> entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
> the hardware implementation), while SKX is suffering further from XPTI for
> Meltdown protection.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>