* [RFC PATCH 0/3] generic hypercall support @ 2009-05-05 13:24 Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins ` (3 more replies) 0 siblings, 4 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw) To: linux-kernel; +Cc: kvm, avi (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a) Please see patch 1/3 for a description. This has been tested with a KVM guest on x86_64 and appears to work properly. Comments, please. -Greg --- Gregory Haskins (3): kvm: add pv_cpu_ops.hypercall support to the guest x86: add generic hypercall support add generic hypercall support arch/Kconfig | 3 + arch/x86/Kconfig | 1 arch/x86/include/asm/paravirt.h | 13 ++++++ arch/x86/include/asm/processor.h | 6 +++ arch/x86/kernel/kvm.c | 22 ++++++++++ include/linux/hypercall.h | 83 ++++++++++++++++++++++++++++++++++++++ 6 files changed, 128 insertions(+), 0 deletions(-) create mode 100644 include/linux/hypercall.h -- Signature ^ permalink raw reply [flat|nested] 115+ messages in thread
* [RFC PATCH 1/3] add generic hypercall support 2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins @ 2009-05-05 13:24 ` Gregory Haskins 2009-05-05 17:03 ` Hollis Blanchard 2009-05-06 13:52 ` Anthony Liguori 2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins ` (2 subsequent siblings) 3 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw) To: linux-kernel; +Cc: kvm, avi We add a generic hypercall() mechanism for use by IO code which is compatible with a variety of hypervisors, but which prefers to use hypercalls over other types of hypervisor traps for performance and/or feature reasons. For instance, consider an emulated PCI device in KVM. Today we can choose to do IO over MMIO or PIO infrastructure, but they each have their own distinct disadvantages: *) MMIO causes a page-fault, which must be decoded by the hypervisor and is therefore fairly expensive. *) PIO is more direct than MMIO, but it poses other problems such as: a) can have a small limited address space (x86 is 2^16) b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a time) c) not available on all archs (PCI mentions ppc as problematic) and is therefore recommended to avoid. Hypercalls, on the other hand, offer a direct access path like PIOs, yet do not suffer the same drawbacks such as a limited address space or a narrow-band interface. Hypercalls are much more friendly to software-to-software interaction since we can pack multiple registers in a way that is natural and simple for software to utilize. The problem with hypercalls today is that there is no generic support. There is various support for hypervisor-specific implementations (for instance, see kvm_hypercall0() in arch/x86/include/asm/kvm_para.h). This makes it difficult to implement a device that is hypervisor-agnostic since it would not only need to know the hypercall ABI, but also which platform-specific function call it should make. 
If we can convey a dynamic binding to a specific hypercall vector in a generic way (out of the scope of this patch series), then an IO driver could utilize that dynamic binding to communicate without requiring hypervisor specific knowledge. Therefore, we implement a system wide hypercall() interface based on a variable length list of unsigned longs (representing registers to pack) and expect that various arch/hypervisor implementations can fill in the details, if supported. This is expected to be done as part of the pv_ops infrastructure, which is the natural hook-point for hypervisor specific code. Note, however, that the generic hypercall() interface does not require the implementation to use pv_ops if so desired. Example use case: ------------------ Consider a PCI device "X". It can already advertise MMIO/PIO regions via its BAR infrastructure. With this new model it could also advertise a hypercall vector in its device-specific upper configuration space. (The allocation and assignment of this vector on the backend is beyond the scope of this series). The guest-side driver for device "X" would sense (via something like a feature-bit) if the hypercall was available and valid, read the value with a configuration cycle, and proceed to ignore the BARs in favor of using the hypercall() interface. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- include/linux/hypercall.h | 83 +++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 83 insertions(+), 0 deletions(-) create mode 100644 include/linux/hypercall.h diff --git a/include/linux/hypercall.h b/include/linux/hypercall.h new file mode 100644 index 0000000..c8a1492 --- /dev/null +++ b/include/linux/hypercall.h @@ -0,0 +1,83 @@ +/* + * Copyright 2009 Novell. All Rights Reserved. 
+ * + * Author: + * Gregory Haskins <ghaskins@novell.com> + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +#ifndef _LINUX_HYPERCALL_H +#define _LINUX_HYPERCALL_H + +#ifdef CONFIG_HAVE_HYPERCALL + +long hypercall(unsigned long nr, unsigned long *args, size_t count); + +#else + +static inline long +hypercall(unsigned long nr, unsigned long *args, size_t count) +{ + return -EINVAL; +} + +#endif /* CONFIG_HAVE_HYPERCALL */ + +#define hypercall0(nr) hypercall(nr, NULL, 0) +#define hypercall1(nr, a1) \ + ({ \ + unsigned long __args[] = { a1, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) +#define hypercall2(nr, a1, a2) \ + ({ \ + unsigned long __args[] = { a1, a2, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) +#define hypercall3(nr, a1, a2, a3) \ + ({ \ + unsigned long __args[] = { a1, a2, a3, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) +#define hypercall4(nr, a1, a2, a3, a4) \ + ({ \ + unsigned long __args[] = { a1, a2, a3, a4, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) +#define hypercall5(nr, a1, a2, a3, a4, a5) \ + ({ \ + unsigned long __args[] = { a1, a2, a3, a4, a5, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) +#define 
hypercall6(nr, a1, a2, a3, a4, a5, a6) \ + ({ \ + unsigned long __args[] = { a1, a2, a3, a4, a5, a6, }; \ + long __ret; \ + __ret = hypercall(nr, __args, ARRAY_SIZE(__args)); \ + __ret; \ + }) + + +#endif /* _LINUX_HYPERCALL_H */ ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 1/3] add generic hypercall support 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins @ 2009-05-05 17:03 ` Hollis Blanchard 2009-05-06 13:52 ` Anthony Liguori 1 sibling, 0 replies; 115+ messages in thread From: Hollis Blanchard @ 2009-05-05 17:03 UTC (permalink / raw) To: Gregory Haskins; +Cc: linux-kernel, kvm, avi On Tue, 2009-05-05 at 09:24 -0400, Gregory Haskins wrote: > > *) PIO is more direct than MMIO, but it poses other problems such as: > a) can have a small limited address space (x86 is 2^16) > b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a time) > c) not available on all archs (PCI mentions ppc as problematic) and > is therefore recommended to avoid. Side note: I don't know what PCI has to do with this, and "problematic" isn't the word I would use. ;) As far as I know, x86 is the only still-alive architecture that implements instructions for a separate IO space (not even ia64 does). -- Hollis Blanchard IBM Linux Technology Center ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 1/3] add generic hypercall support 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins 2009-05-05 17:03 ` Hollis Blanchard @ 2009-05-06 13:52 ` Anthony Liguori 2009-05-06 15:16 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Anthony Liguori @ 2009-05-06 13:52 UTC (permalink / raw) To: Gregory Haskins; +Cc: linux-kernel, kvm, avi Gregory Haskins wrote: > We add a generic hypercall() mechanism for use by IO code which is > compatible with a variety of hypervisors, but which prefers to use > hypercalls over other types of hypervisor traps for performance and/or > feature reasons. > > For instance, consider an emulated PCI device in KVM. Today we can chose > to do IO over MMIO or PIO infrastructure, but they each have their own > distinct disadvantages: > > *) MMIO causes a page-fault, which must be decoded by the hypervisor and is > therefore fairly expensive. > > *) PIO is more direct than MMIO, but it poses other problems such as: > a) can have a small limited address space (x86 is 2^16) > b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a time) > c) not available on all archs (PCI mentions ppc as problematic) and > is therefore recommended to avoid. > > Hypercalls, on the other hand, offer a direct access path like PIOs, yet > do not suffer the same drawbacks such as a limited address space or a > narrow-band interface. Hypercalls are much more friendly to software > to software interaction since we can pack multiple registers in a way > that is natural and simple for software to utilize. > No way. Hypercalls are just about awful because they cannot be implemented sanely with VT/SVM as Intel/AMD couldn't agree on a common instruction for it. This means you either need a hypercall page, which I'm pretty sure makes transparent migration impossible, or you need to do hypercall patching which is going to throw off attestation. 
If anything, I'd argue that we shouldn't use hypercalls for anything in KVM because it will break run-time attestation. Hypercalls cannot pass any data either. We pass data with hypercalls by relying on external state (like register state). It's just as easy to do this with PIO. VMware does this with vmport, for instance. However, in general, you do not want to pass that much data with a notification. It's better to rely on some external state (like a ring queue) and have the "hypercall" act simply as a notification mechanism. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 1/3] add generic hypercall support 2009-05-06 13:52 ` Anthony Liguori @ 2009-05-06 15:16 ` Gregory Haskins 2009-05-06 16:52 ` Arnd Bergmann 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-06 15:16 UTC (permalink / raw) To: Anthony Liguori; +Cc: linux-kernel, kvm, avi [-- Attachment #1: Type: text/plain, Size: 5719 bytes --] Anthony Liguori wrote: > Gregory Haskins wrote: >> We add a generic hypercall() mechanism for use by IO code which is >> compatible with a variety of hypervisors, but which prefers to use >> hypercalls over other types of hypervisor traps for performance and/or >> feature reasons. >> >> For instance, consider an emulated PCI device in KVM. Today we can >> chose >> to do IO over MMIO or PIO infrastructure, but they each have their own >> distinct disadvantages: >> >> *) MMIO causes a page-fault, which must be decoded by the hypervisor >> and is >> therefore fairly expensive. >> >> *) PIO is more direct than MMIO, but it poses other problems such as: >> a) can have a small limited address space (x86 is 2^16) >> b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a >> time) >> c) not available on all archs (PCI mentions ppc as problematic) >> and >> is therefore recommended to avoid. >> >> Hypercalls, on the other hand, offer a direct access path like PIOs, yet >> do not suffer the same drawbacks such as a limited address space or a >> narrow-band interface. Hypercalls are much more friendly to software >> to software interaction since we can pack multiple registers in a way >> that is natural and simple for software to utilize. >> > > No way. Hypercalls are just about awful because they cannot be > implemented sanely with VT/SVM as Intel/AMD couldn't agree on a common > instruction for it. This means you either need a hypercall page, > which I'm pretty sure makes transparent migration impossible, or you > need to do hypercall patching which is going to throw off attestation. 
This is irrelevant since KVM already does this patching today and we therefore have to deal with this fact. This new work is riding on the existing infrastructure so it is of negligible cost to add new vectors, at least w.r.t. your concerns above. > > If anything, I'd argue that we shouldn't use hypercalls for anything > in KVM because it will break run-time attestation. The cat's already out of the bag on that one. Besides, even if the hypercall infrastructure does break attestation (which I don't think has been proven as fact), not everyone cares about migration. At some point, someone may want to make a decision to trade performance for the ability to migrate (or vice versa). I am ok with that. > > > Hypercalls cannot pass any data either. We pass data with hypercalls > by relying on external state (like register state). This is a silly argument. The CALL or SYSENTER instructions do not pass data either. Rather, they branch the IP and/or execution context, predicated on the notion that the new IP/context can still access the relevant machine state prior to the call. VMCALL type instructions are just one more logical extension of that same construct. PIO on the other hand is different. The architectural assumption is that the target endpoint does not have access to the machine state. Sure, we can "cheat" in virtualization by knowing that it really will have access, but then we are still constrained by all the things I have already mentioned that are disadvantageous to PIOs, plus.... > It's just as easy to do this with PIO. ..you are glossing over the fact that we already have the infrastructure to do proper register setup in kvm_hypercallX(). We would need to come up with a similar (arch-specific) "pio_callX()" as well. Oh wait, no we wouldn't, since PIOs apparently only work on x86. ;) Ok, so we would need to come up with these pio_calls for x86, and no other arch can use the infrastructure (but wait, PPC can use PCI too, so how does that work? 
It must be either MMIO emulation or it's not supported? That puts us back to square one). In addition, we can have at most 8k unique vectors on x86_64 (2^16/8 bytes per port). Even this is a gross overestimation because it assumes the entire address space is available for pio_call, which it isn't, and it assumes all allocations are in the precise width of the address space (8-bytes), which they won't be. It doesn't exactly sound like a winner to me. > VMware does this with vmport, for instance. However, in general, you > do not want to pass that much data with a notification. It's better > to rely on some external state (like a ring queue) and have the > "hypercall" act simply as a notification mechanism. Disagree completely. Things like shared-memory rings provide a really nice asynchronous mechanism, and for the majority of your IO and for the fast path that's probably exactly what we should do. However, there are plenty of patterns that fit better with a synchronous model. Case in point: see VIRTIO_VBUS_FUNC_GET_FEATURES, SET_STATUS, and RESET functions in the virtio-vbus driver I posted. And as an aside, the ring notification rides on the same fundamental synchronous transport. So sure, you could argue that you could also make an extra "call" ring, place your request in the ring, and use a flush/block primitive to wait for that item to execute. But that sounds a bit complicated when all I want is the ability to invoke simple synchronous calls. Therefore, let's just get this "right" and build the primitives to express both patterns easily. In the end, I don't specifically care if the "call()" vehicle is a PIO or a hypercall per se, as long as we can meet the following: a) is feasible at least anywhere PCI works b) provides a robust and uniform interface so drivers do not need to care about the implementation. c) has equivalent performance (e.g. if PPC maps PIO as MMIO, that is no-good). 
Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 1/3] add generic hypercall support 2009-05-06 15:16 ` Gregory Haskins @ 2009-05-06 16:52 ` Arnd Bergmann 0 siblings, 0 replies; 115+ messages in thread From: Arnd Bergmann @ 2009-05-06 16:52 UTC (permalink / raw) To: Gregory Haskins; +Cc: Anthony Liguori, linux-kernel, kvm, avi On Wednesday 06 May 2009, Gregory Haskins wrote: > Ok, so we would > need to come up with these pio_calls for x86, and no other arch can use > the infrastructure (but wait, PPC can use PCI too, so how does that > work? It must be either MMIO emulation or its not supported? That puts > us back to square one). PowerPC already has an abstraction for PIO and MMIO because certain broken hardware chips don't do what they should, see arch/powerpc/platforms/cell/io-workarounds.c for the only current user. If you need to, you could do the same on x86 (or generalize the code), but please don't add another level of indirection on top of this. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
* [RFC PATCH 2/3] x86: add generic hypercall support 2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins @ 2009-05-05 13:24 ` Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins 2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity 3 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw) To: linux-kernel; +Cc: kvm, avi This adds a hypercall() vector to x86 pv_cpu_ops to be optionally filled in by a hypervisor driver as it loads its other pv_ops components. We also declare x86 as CONFIG_HAVE_HYPERCALL to enable the generic hypercall code whenever the user builds for x86. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- arch/Kconfig | 3 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/paravirt.h | 13 +++++++++++++ arch/x86/include/asm/processor.h | 6 ++++++ 4 files changed, 23 insertions(+), 0 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 78a35e9..239b658 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -112,3 +112,6 @@ config HAVE_DMA_API_DEBUG config HAVE_DEFAULT_NO_SPIN_MUTEXES bool + +config HAVE_HYPERCALL + bool diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index df9e885..3c609cf 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -46,6 +46,7 @@ config X86 select HAVE_KERNEL_GZIP select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_LZMA + select HAVE_HYPERCALL config ARCH_DEFCONFIG string diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 378e369..ed22c84 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -6,6 +6,7 @@ #ifdef CONFIG_PARAVIRT #include <asm/pgtable_types.h> #include <asm/asm.h> +#include <asm/errno.h> /* Bitmask of what can be clobbered: usually at least eax. 
*/ #define CLBR_NONE 0 @@ -203,6 +204,8 @@ struct pv_cpu_ops { void (*swapgs)(void); + long (*hypercall)(unsigned long nr, unsigned long *args, size_t count); + struct pv_lazy_ops lazy_mode; }; @@ -723,6 +726,16 @@ static inline void __cpuid(unsigned int *eax, unsigned int *ebx, PVOP_VCALL4(pv_cpu_ops.cpuid, eax, ebx, ecx, edx); } +static inline long hypercall(unsigned long nr, + unsigned long *args, + size_t count) +{ + if (!pv_cpu_ops.hypercall) + return -EINVAL; + + return pv_cpu_ops.hypercall(nr, args, count); +} + /* * These special macros can be used to get or set a debugging register */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index c2cceae..8fa988d 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -570,6 +570,12 @@ static inline void native_swapgs(void) #define __cpuid native_cpuid #define paravirt_enabled() 0 +static inline long +hypercall(unsigned long nr, unsigned long *args, size_t count) +{ + return -EINVAL; +} + /* * These special macros can be used to get or set a debugging register */ ^ permalink raw reply related [flat|nested] 115+ messages in thread
* [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest 2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins @ 2009-05-05 13:24 ` Gregory Haskins 2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity 3 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw) To: linux-kernel; +Cc: kvm, avi Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- arch/x86/kernel/kvm.c | 22 ++++++++++++++++++++++ 1 files changed, 22 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 33019dd..d299ed5 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -50,6 +50,26 @@ static void kvm_io_delay(void) { } +static long _kvm_hypercall(unsigned long nr, + unsigned long *args, + size_t count) +{ + switch (count) { + case 0: + return kvm_hypercall0(nr); + case 1: + return kvm_hypercall1(nr, args[0]); + case 2: + return kvm_hypercall2(nr, args[0], args[1]); + case 3: + return kvm_hypercall3(nr, args[0], args[1], args[2]); + case 4: + return kvm_hypercall4(nr, args[0], args[1], args[2], args[3]); + default: + return -EINVAL; + } +} + static void kvm_mmu_op(void *buffer, unsigned len) { int r; @@ -207,6 +227,8 @@ static void paravirt_ops_setup(void) if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY)) pv_cpu_ops.io_delay = kvm_io_delay; + pv_cpu_ops.hypercall = _kvm_hypercall; + if (kvm_para_has_feature(KVM_FEATURE_MMU_OP)) { pv_mmu_ops.set_pte = kvm_set_pte; pv_mmu_ops.set_pte_at = kvm_set_pte_at; ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins ` (2 preceding siblings ...) 2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins @ 2009-05-05 13:36 ` Avi Kivity 2009-05-05 13:40 ` Gregory Haskins 3 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-05 13:36 UTC (permalink / raw) To: Gregory Haskins; +Cc: linux-kernel, kvm Gregory Haskins wrote: > (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a) > > Please see patch 1/3 for a description. This has been tested with a KVM > guest on x86_64 and appears to work properly. Comments, please. > What about the hypercalls in include/asm/kvm_para.h? In general, hypercalls cannot be generic since each hypervisor implements its own ABI. The abstraction needs to be at a higher level (pv_ops is such a level). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity @ 2009-05-05 13:40 ` Gregory Haskins 2009-05-05 14:00 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 13:40 UTC (permalink / raw) To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 753 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a) >> >> Please see patch 1/3 for a description. This has been tested with a KVM >> guest on x86_64 and appears to work properly. Comments, please. >> > > What about the hypercalls in include/asm/kvm_para.h? > > In general, hypercalls cannot be generic since each hypervisor > implements its own ABI. Please see the prologue to 1/3. It's all described there, including a use case which I think answers your questions. If there is still ambiguity, let me know. > The abstraction needs to be at a higher level (pv_ops is such a level). Yep, agreed. That's exactly what this series is doing, actually. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 13:40 ` Gregory Haskins @ 2009-05-05 14:00 ` Avi Kivity 2009-05-05 14:14 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-05 14:00 UTC (permalink / raw) To: Gregory Haskins; +Cc: linux-kernel, kvm Gregory Haskins wrote: > Avi Kivity wrote: > >> Gregory Haskins wrote: >> >>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a) >>> >>> Please see patch 1/3 for a description. This has been tested with a KVM >>> guest on x86_64 and appears to work properly. Comments, please. >>> >>> >> What about the hypercalls in include/asm/kvm_para.h? >> >> In general, hypercalls cannot be generic since each hypervisor >> implements its own ABI. >> > Please see the prologue to 1/3. Its all described there, including a > use case which I think answers your questions. If there is still > ambiguity, let me know. > > Yeah, sorry. >> The abstraction needs to be at a higher level (pv_ops is such a level). >> > Yep, agreed. Thats exactly what this series is doing, actually. > No, it doesn't. It makes "making hypercalls" a pv_op, but hypervisors don't implement the same ABI. pv_ops all _use_ hypercalls to implement higher level operations, like set_pte (probably the only place set_pte can be considered a high level operation). In this case, the higher level event could be hypervisor_dynamic_event(number); each pv_ops implementation would use its own hypercalls to implement that. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 14:00 ` Avi Kivity @ 2009-05-05 14:14 ` Gregory Haskins 2009-05-05 14:21 ` Gregory Haskins ` (2 more replies) 0 siblings, 3 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 14:14 UTC (permalink / raw) To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 2615 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> Gregory Haskins wrote: >>> >>>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a) >>>> >>>> Please see patch 1/3 for a description. This has been tested with >>>> a KVM >>>> guest on x86_64 and appears to work properly. Comments, please. >>>> >>> What about the hypercalls in include/asm/kvm_para.h? >>> >>> In general, hypercalls cannot be generic since each hypervisor >>> implements its own ABI. >>> >> Please see the prologue to 1/3. Its all described there, including a >> use case which I think answers your questions. If there is still >> ambiguity, let me know. >> >> > > Yeah, sorry. > >>> The abstraction needs to be at a higher level (pv_ops is such a >>> level). >>> >> Yep, agreed. Thats exactly what this series is doing, actually. >> > > No, it doesn't. It makes "making hypercalls" a pv_op, but hypervisors > don't implement the same ABI. Yes, that is true, but I think the issue right now is more of semantics. I think we are on the same page. So you would never have someone making a generic hypercall(KVM_HC_MMU_OP). I agree. What I am proposing here is more akin to PIO-BAR + iowrite()/ioread(). E.g. the infrastructure sets up the "addressing" (where in PIO this is literally an address, and for hypercalls this is a vector), but the "device" defines the ABI at that address. So it's really the "device end-point" that is defining the ABI here, not the hypervisor (per se) and that's why I thought it's ok to declare these "generic". But to your point below... 
> > pv_ops all _use_ hypercalls to implement higher level operations, like > set_pte (probably the only place set_pte can be considered a high > level operation). > > In this case, the higher level event could be > hypervisor_dynamic_event(number); each pv_ops implementation would use > its own hypercalls to implement that. I see. I had designed it slightly differently, where KVM could assign any top-level vector it wanted and thus that drove the guest-side interface you see here to be more "generic hypercall". However, I think your proposal is perfectly fine too and it makes sense to more narrowly focus these calls as specifically "dynamic"...as those are the only vectors that we could technically use like this anyway. So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC" to kvm_para.h, and I will change the interface to follow suit (something like s/hypercall/dynhc). Sound good? Thanks, Avi, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 14:14 ` Gregory Haskins @ 2009-05-05 14:21 ` Gregory Haskins 2009-05-05 15:02 ` Avi Kivity 2009-05-05 23:17 ` Chris Wright 2 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-05 14:21 UTC (permalink / raw) To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 729 bytes --] Gregory Haskins wrote: > So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC" > to kvm_para.h, and I will change the interface to follow suit (something > like s/hypercall/dynhc). Sound good? > A small ramification of this change will be that I will need to do something like add a feature-bit to cpuid for detecting if HC_DYNAMIC is supported on the backend or not. The current v1 design doesn't suffer from this requirement because the presence of the dynamic vector itself is enough to know it's supported. I like Avi's proposal enough to say that it's worth this minor inconvenience, but FYI I will have to additionally submit a userspace patch for v2 if we go this route. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 14:14 ` Gregory Haskins 2009-05-05 14:21 ` Gregory Haskins @ 2009-05-05 15:02 ` Avi Kivity 2009-05-05 23:17 ` Chris Wright 2 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-05 15:02 UTC (permalink / raw) To: Gregory Haskins; +Cc: Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: > I see. I had designed it slightly different where KVM could assign any > top level vector it wanted and thus that drove the guest-side interface > you see here to be more "generic hypercall". However, I think your > proposal is perfectly fine too and it makes sense to more narrowly focus > these calls as specifically "dynamic"...as thats the only vectors that > we could technically use like this anyway. > > So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC" > to kvm_para.h, and I will change the interface to follow suit (something > like s/hypercall/dynhc). Sound good? > Yeah. Another couple of points: - on the host side, we'd rig this to hit an eventfd. Nothing stops us from rigging pio to hit an eventfd as well, giving us kernel handling for pio trigger points. - pio actually has an advantage over hypercalls with nested guests. Since hypercalls don't have an associated port number, the lowermost hypervisor must interpret a hypercall as going to a guest's hypervisor, and not any lower-level hypervisors. What it boils down to is that you cannot use device assignment to give a guest access to a virtio/vbus device from a lower level hypervisor. (Bah, that's totally unreadable. What I want is instead of hypervisor[eth0/virtio-server] ----> intermediate[virtio-driver/virtio-server] ----> guest[virtio-driver] do hypervisor[eth0/virtio-server] ----> intermediate[assign virtio device] ----> guest[virtio-driver] well, it's probably still unreadable) -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 14:14 ` Gregory Haskins 2009-05-05 14:21 ` Gregory Haskins 2009-05-05 15:02 ` Avi Kivity @ 2009-05-05 23:17 ` Chris Wright 2009-05-06 3:51 ` Gregory Haskins 2 siblings, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-05 23:17 UTC (permalink / raw) To: Gregory Haskins; +Cc: Avi Kivity, Gregory Haskins, linux-kernel, kvm * Gregory Haskins (gregory.haskins@gmail.com) wrote: > So you would never have someone making a generic > hypercall(KVM_HC_MMU_OP). I agree. Which is why I think the interface proposal you've made is wrong. There are already hypercall interfaces w/ specific ABI and semantic meaning (which are typically called directly/indirectly from an existing pv op hook). But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) means hypercall number and arg list must be the same in order for code to call hypercall() in a hypervisor agnostic way. The pv_ops level needs to have semantic meaning, not be a free-form hypercall multiplexor. thanks, -chris ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-05 23:17 ` Chris Wright @ 2009-05-06 3:51 ` Gregory Haskins 2009-05-06 7:22 ` Chris Wright 2009-05-06 13:56 ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-06 3:51 UTC (permalink / raw) To: Chris Wright; +Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 4182 bytes --] Chris Wright wrote: > * Gregory Haskins (gregory.haskins@gmail.com) wrote: > >> So you would never have someone making a generic >> hypercall(KVM_HC_MMU_OP). I agree. >> > > Which is why I think the interface proposal you've made is wrong. I respectfully disagree. It's only wrong in that the name chosen for the interface was perhaps too broad/vague. I still believe the concept is sound, and the general layering is appropriate. > There's > already hypercall interfaces w/ specific ABI and semantic meaning (which > are typically called directly/indirectly from an existing pv op hook). > Yes, these are different, thus the new interface. > But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) > means hypercall number and arg list must be the same in order for code > to call hypercall() in a hypervisor agnostic way. > Yes, and that is exactly the intention. I think it's perhaps the point you are missing. I am well aware that historically the things we do over a hypercall interface would inherently have meaning only to a specific hypervisor (e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()). However, this doesn't in any way imply that it is the only use for the general concept. It's just the only way they have been exploited to date. While I acknowledge that the hypervisor certainly must be coordinated with their use, in their essence hypercalls are just another form of IO joining the ranks of things like MMIO and PIO. 
This is an attempt to bring them out of the bowels of CONFIG_PARAVIRT to make them a first class citizen. The thing I am building here is really not a general hypercall in the broad sense. Rather, it's a subset of the hypercall vector namespace. It is designed specifically for dynamically binding a synchronous call() interface to things like virtual devices, and it is therefore these virtual device models that define the particular ABI within that namespace. Thus the ABI in question is explicitly independent of the underlying hypervisor. I therefore stand by the proposed design to have this interface described above the hypervisor support layer (i.e. pv_ops) (albeit with perhaps a better name like "dynamic hypercall" as per my later discussion with Avi). Consider PIO: The hypervisor (or hardware) and OS negotiate a port address, but the two end-points are the driver and the device-model (or real device). The driver doesn't have to say: if (kvm) kvm_iowrite32(addr, ..); else if (lguest) lguest_iowrite32(addr, ...); else native_iowrite32(addr, ...); Instead, it just says "iowrite32(addr, ...);" and the address is used to route the message appropriately by the platform. The ABI of that message, however, is specific to the driver/device and is not interpreted by kvm/lguest/native-hw infrastructure on the way. Today, there is no equivalent of a platform agnostic "iowrite32()" for hypercalls, so the driver would look like the pseudocode above except substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal is to allow the hypervisor to assign a dynamic vector to resources in the backend and convey this vector to the guest (such as in PCI config-space as mentioned in my example use-case). This provides the "address negotiation" function that would normally be done for something like a pio port-address. The hypervisor agnostic driver can then use this globally recognized address-token coupled with other device-private ABI parameters to communicate with the device. 
This can all occur without the core hypervisor needing to understand the details beyond the addressing. What this means to our interface design is that the only thing the hypervisor really cares about is the first "nr" parameter. This acts as our address-token. The optional/variable list of args is just payload as far as the core infrastructure is concerned and is coupled only to our device ABI. They were chosen to be an array of ulongs (vs. something like varargs) to reflect the fact that hypercalls are typically passed by packing registers. Hope this helps, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
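[Editorial note: a minimal sketch of the routing model Greg describes above — the core interprets only the "nr" address-token, while the ulong payload is opaque to everything but the device model. All names here (dynhc_register, dynhc_dispatch, DYNHC_MAX, sum_handler) are hypothetical illustrations, not from the actual patch series.]

```c
#include <stddef.h>

#define DYNHC_MAX 16 /* illustrative size of the dynamic vector namespace */

typedef long (*dynhc_fn)(unsigned long *args, size_t count);

static dynhc_fn dynhc_table[DYNHC_MAX];

/* A backend registers at a dynamically assigned vector, which the
 * hypervisor would then advertise to the guest (e.g. via PCI
 * config-space, as in the example use-case). */
long dynhc_register(unsigned long nr, dynhc_fn fn)
{
	if (nr >= DYNHC_MAX || dynhc_table[nr])
		return -1;
	dynhc_table[nr] = fn;
	return 0;
}

/* Core demux: only "nr" is interpreted here; the packed-register
 * payload passes through untouched to the device-private ABI,
 * just as a pio port-address routes an iowrite32() without the
 * platform interpreting the data. */
long dynhc_dispatch(unsigned long nr, unsigned long *args, size_t count)
{
	if (nr >= DYNHC_MAX || !dynhc_table[nr])
		return -1;
	return dynhc_table[nr](args, count);
}

/* Example device handler: its ABI (here, "sum the args") is known
 * only to the driver/device pair, not to the core. */
long sum_handler(unsigned long *args, size_t count)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < count; i++)
		sum += (long)args[i];
	return sum;
}
```

In a real guest the dispatch would of course be a vmcall-style trap rather than a table lookup; the table only stands in for the host-side demux.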
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 3:51 ` Gregory Haskins @ 2009-05-06 7:22 ` Chris Wright 2009-05-06 13:17 ` Gregory Haskins 2009-05-06 13:56 ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori 1 sibling, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-06 7:22 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm * Gregory Haskins (ghaskins@novell.com) wrote: > Chris Wright wrote: > > But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) > > means hypercall number and arg list must be the same in order for code > > to call hypercall() in a hypervisor agnostic way. > > Yes, and that is exactly the intention. I think its perhaps the point > you are missing. Yes, I was reading this as purely any hypercall, but it seems a bit more like: pv_io_ops->iomap() pv_io_ops->ioread() pv_io_ops->iowrite() <snip> > Today, there is no equivelent of a platform agnostic "iowrite32()" for > hypercalls so the driver would look like the pseudocode above except > substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal > is to allow the hypervisor to assign a dynamic vector to resources in > the backend and convey this vector to the guest (such as in PCI > config-space as mentioned in my example use-case). The provides the > "address negotiation" function that would normally be done for something > like a pio port-address. The hypervisor agnostic driver can then use > this globally recognized address-token coupled with other device-private > ABI parameters to communicate with the device. This can all occur > without the core hypervisor needing to understand the details beyond the > addressing. VF drivers can also have this issue (and typically use mmio). I at least have a better idea what your proposal is, thanks for explanation. Are you able to demonstrate concrete benefit with it yet (improved latency numbers for example)? 
thanks, -chris ^ permalink raw reply [flat|nested] 115+ messages in thread
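[Editorial note: Chris's reframing above — a pv_io_ops-style hook set rather than a free-form hypercall — can be sketched roughly as follows. The struct layout and names (pv_io_ops, iowrite32_pv, record_iowrite32) are illustrative only and do not match any actual kernel API; a recording backend stands in for a real native or hypercall implementation.]

```c
/* The driver calls one generic entry point; the backend is chosen
 * once (at boot, in a real pv_ops setup) instead of the driver
 * testing for kvm/lguest itself. */
struct pv_io_ops {
	void (*iowrite32)(unsigned int val, unsigned long addr);
};

/* Stand-in backend that just records the last write, so the
 * indirection can be demonstrated without real hardware. */
static unsigned long last_addr;
static unsigned int last_val;

static void record_iowrite32(unsigned int val, unsigned long addr)
{
	last_val = val;
	last_addr = addr;
}

static struct pv_io_ops pv_io_ops = { .iowrite32 = record_iowrite32 };

/* The hypervisor-agnostic call a driver would make. */
void iowrite32_pv(unsigned int val, unsigned long addr)
{
	pv_io_ops.iowrite32(val, addr);
}

unsigned long get_last_addr(void) { return last_addr; }
unsigned int get_last_val(void) { return last_val; }
```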
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 7:22 ` Chris Wright @ 2009-05-06 13:17 ` Gregory Haskins 2009-05-06 16:07 ` Chris Wright 2009-05-07 12:29 ` Avi Kivity 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-06 13:17 UTC (permalink / raw) To: Chris Wright; +Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 1934 bytes --] Chris Wright wrote: > * Gregory Haskins (ghaskins@novell.com) wrote: > >> Chris Wright wrote: >> >>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) >>> means hypercall number and arg list must be the same in order for code >>> to call hypercall() in a hypervisor agnostic way. >>> >> Yes, and that is exactly the intention. I think its perhaps the point >> you are missing. >> > > Yes, I was reading this as purely any hypercall, but it seems a bit > more like: > pv_io_ops->iomap() > pv_io_ops->ioread() > pv_io_ops->iowrite() > Right. > <snip> > >> Today, there is no equivelent of a platform agnostic "iowrite32()" for >> hypercalls so the driver would look like the pseudocode above except >> substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal >> is to allow the hypervisor to assign a dynamic vector to resources in >> the backend and convey this vector to the guest (such as in PCI >> config-space as mentioned in my example use-case). The provides the >> "address negotiation" function that would normally be done for something >> like a pio port-address. The hypervisor agnostic driver can then use >> this globally recognized address-token coupled with other device-private >> ABI parameters to communicate with the device. This can all occur >> without the core hypervisor needing to understand the details beyond the >> addressing. >> > > VF drivers can also have this issue (and typically use mmio). > I at least have a better idea what your proposal is, thanks for > explanation. 
Are you able to demonstrate concrete benefit with it yet > (improved latency numbers for example)? > I had a test-harness/numbers for this kind of thing, but it's a bit crufty since it's from ~1.5 years ago. I will dig it up, update it, and generate/post new numbers. Thanks Chris, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 13:17 ` Gregory Haskins @ 2009-05-06 16:07 ` Chris Wright 2009-05-07 17:03 ` Gregory Haskins 2009-05-07 12:29 ` Avi Kivity 1 sibling, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-06 16:07 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm * Gregory Haskins (ghaskins@novell.com) wrote: > Chris Wright wrote: > > VF drivers can also have this issue (and typically use mmio). > > I at least have a better idea what your proposal is, thanks for > > explanation. Are you able to demonstrate concrete benefit with it yet > > (improved latency numbers for example)? > > I had a test-harness/numbers for this kind of thing, but its a bit > crufty since its from ~1.5 years ago. I will dig it up, update it, and > generate/post new numbers. That would be useful, because I keep coming back to pio and shared page(s) when I think of why not to do this. Seems I'm not alone in that. thanks, -chris ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 16:07 ` Chris Wright @ 2009-05-07 17:03 ` Gregory Haskins 2009-05-07 18:05 ` Avi Kivity 2009-05-07 23:35 ` Marcelo Tosatti 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 17:03 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1186 bytes --] Chris Wright wrote: > * Gregory Haskins (ghaskins@novell.com) wrote: > >> Chris Wright wrote: >> >>> VF drivers can also have this issue (and typically use mmio). >>> I at least have a better idea what your proposal is, thanks for >>> explanation. Are you able to demonstrate concrete benefit with it yet >>> (improved latency numbers for example)? >>> >> I had a test-harness/numbers for this kind of thing, but its a bit >> crufty since its from ~1.5 years ago. I will dig it up, update it, and >> generate/post new numbers. >> > > That would be useful, because I keep coming back to pio and shared > page(s) when think of why not to do this. Seems I'm not alone in that. > > thanks, > -chris > I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' "show me the numbers" and Anthony's "Why reinvent the wheel?" questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 17:03 ` Gregory Haskins @ 2009-05-07 18:05 ` Avi Kivity 2009-05-07 18:08 ` Gregory Haskins 2009-05-07 23:35 ` Marcelo Tosatti 1 sibling, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 18:05 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: > I completed the resurrection of the test and wrote up a little wiki on > the subject, which you can find here: > > http://developer.novell.com/wiki/index.php/WhyHypercalls > > Hopefully this answers Chris' "show me the numbers" and Anthony's "Why > reinvent the wheel?" questions. > > I will include this information when I publish the updated v2 series > with the s/hypercall/dynhc changes. > > Let me know if you have any questions. > Well, 420 ns is not to be sneezed at. What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:05 ` Avi Kivity @ 2009-05-07 18:08 ` Gregory Haskins 2009-05-07 18:12 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 18:08 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 856 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Well, 420 ns is not to be sneezed at. > > What do you think of my mmio hypercall? That will speed up all mmio > to be as fast as a hypercall, and then we can use ordinary mmio/pio > writes to trigger things. > I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:08 ` Gregory Haskins @ 2009-05-07 18:12 ` Avi Kivity 2009-05-07 18:16 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 18:12 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: >> What do you think of my mmio hypercall? That will speed up all mmio >> to be as fast as a hypercall, and then we can use ordinary mmio/pio >> writes to trigger things. >> >> > I like it! > > Bigger question is what kind of work goes into making mmio a pv_op (or > is this already done)? > > Looks like it isn't there. But it isn't any different than set_pte - convert a write into a hypercall. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:12 ` Avi Kivity @ 2009-05-07 18:16 ` Gregory Haskins 2009-05-07 18:24 ` Avi Kivity 2009-05-07 20:00 ` Arnd Bergmann 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 18:16 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 939 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >>> What do you think of my mmio hypercall? That will speed up all mmio >>> to be as fast as a hypercall, and then we can use ordinary mmio/pio >>> writes to trigger things. >>> >>> >> I like it! >> >> Bigger question is what kind of work goes into making mmio a pv_op (or >> is this already done)? >> >> > > Looks like it isn't there. But it isn't any different than set_pte - > convert a write into a hypercall. > > I guess technically mmio can just be a simple access of the page, which would be problematic to trap locally without a PF. However, it seems that most mmio passes through an ioread()/iowrite() call, so this is perhaps the hook point. If we put a stake in the ground that mmios which go through some other mechanism, like PFs, simply hit the "slow path" as an acceptable casualty, I think we can make that work. Thoughts? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:16 ` Gregory Haskins @ 2009-05-07 18:24 ` Avi Kivity 2009-05-07 18:37 ` Gregory Haskins 2009-05-07 20:00 ` Arnd Bergmann 1 sibling, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 18:24 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: > I guess technically mmio can just be a simple access of the page which > would be problematic to trap locally without a PF. However it seems > that most mmio always passes through a ioread()/iowrite() call so this > is perhaps the hook point. If we set the stake in the ground that mmios > that go through some other mechanism like PFs can just hit the "slow > path" are an acceptable casualty, I think we can make that work. > That's my thinking exactly. Note we can cheat further. kvm already has a "coalesced mmio" feature where side-effect-free mmios are collected in the kernel and passed to userspace only when some other significant event happens. We could pass those addresses to the guest and let it queue those writes itself, avoiding the hypercall completely. Though it's probably pointless: if the guest is paravirtualized enough to have the mmio hypercall, then it shouldn't be using e1000. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
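[Editorial note: the "coalesced mmio" behavior Avi describes above — side-effect-free writes collected and only drained on a significant event — can be modeled with a simple ring. This is a toy illustration; the names and sizes (coalesce_write, coalesce_flush, COALESCED_RING_SZ) are ours, not kvm's.]

```c
#include <stddef.h>

#define COALESCED_RING_SZ 8 /* illustrative; kvm uses a shared page */

struct coalesced_write {
	unsigned long addr;
	unsigned int val;
};

static struct coalesced_write ring[COALESCED_RING_SZ];
static size_t ring_len;

/* Queue a side-effect-free write without exiting. A full ring
 * returns -1, signalling that a real (expensive) exit is needed
 * to drain it. */
int coalesce_write(unsigned long addr, unsigned int val)
{
	if (ring_len == COALESCED_RING_SZ)
		return -1;
	ring[ring_len].addr = addr;
	ring[ring_len].val = val;
	ring_len++;
	return 0;
}

/* Drain on the next significant event; returns how many deferred
 * writes were replayed. A real implementation would replay each
 * entry into the device model here. */
size_t coalesce_flush(void)
{
	size_t n = ring_len;

	ring_len = 0;
	return n;
}
```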
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:24 ` Avi Kivity @ 2009-05-07 18:37 ` Gregory Haskins 2009-05-07 19:00 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 18:37 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1593 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> > > That's my thinking exactly. Cool, I will code this up and submit it. While I'm at it, I'll run it through the "nullio" wringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. > > Note we can cheat further. kvm already has a "coalesced mmio" feature > where side-effect-free mmios are collected in the kernel and passed to > userspace only when some other significant event happens. We could > pass those addresses to the guest and let it queue those writes > itself, avoiding the hypercall completely. > > Though it's probably pointless: if the guest is paravirtualized enough > to have the mmio hypercall, then it shouldn't be using e1000. Yeah...plus at least for my vbus purposes, all my guest->host transitions are explicitly to cause side-effects, or I wouldn't be doing them in the first place ;) I suspect virtio-pci is exactly the same. I.e. the coalescing has already been done at a higher layer for platforms running "PV" code. Still a cool feature, tho. 
-Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:37 ` Gregory Haskins @ 2009-05-07 19:00 ` Avi Kivity 2009-05-07 19:05 ` Gregory Haskins 2009-05-07 19:07 ` Chris Wright 0 siblings, 2 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-07 19:00 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: > Avi Kivity wrote: > >> Gregory Haskins wrote: >> >>> I guess technically mmio can just be a simple access of the page which >>> would be problematic to trap locally without a PF. However it seems >>> that most mmio always passes through a ioread()/iowrite() call so this >>> is perhaps the hook point. If we set the stake in the ground that mmios >>> that go through some other mechanism like PFs can just hit the "slow >>> path" are an acceptable casualty, I think we can make that work. >>> >>> >> That's my thinking exactly. >> > > Cool, I will code this up and submit it. While Im at it, Ill run it > through the "nullio" ringer, too. ;) It would be cool to see the > pv-mmio hit that 2.07us number. I can't think of any reason why this > will not be the case. > Don't - it's broken. It will also catch device assignment mmio and hypercall them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:00 ` Avi Kivity @ 2009-05-07 19:05 ` Gregory Haskins 2009-05-07 19:43 ` Avi Kivity 2009-05-07 19:07 ` Chris Wright 1 sibling, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 19:05 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1080 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> Gregory Haskins wrote: >>> >>>> I guess technically mmio can just be a simple access of the page which >>>> would be problematic to trap locally without a PF. However it seems >>>> that most mmio always passes through a ioread()/iowrite() call so this >>>> is perhaps the hook point. If we set the stake in the ground that >>>> mmios >>>> that go through some other mechanism like PFs can just hit the "slow >>>> path" are an acceptable casualty, I think we can make that work. >>>> >>> That's my thinking exactly. >>> >> >> Cool, I will code this up and submit it. While Im at it, Ill run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. >> > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. > Ah. Crap. Would you be amenable to my continuing along with the dynhc() approach, then? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:05 ` Gregory Haskins @ 2009-05-07 19:43 ` Avi Kivity 2009-05-07 20:07 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 19:43 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: >> Don't - it's broken. It will also catch device assignment mmio and >> hypercall them. >> >> > Ah. Crap. > > Would you be conducive if I continue along with the dynhc() approach then? > Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. And btw, given that eventfd and the underlying infrastructure are so flexible, it's probably better to go back to your original "irqfd gets fd from userspace" just to be consistent everywhere. (no, I'm not deliberately making you rewrite that patch again and again... it's going to be a key piece of infrastructure so I want to get it right) Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. 
Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). virtio-net-server wires notifyfd according to its liking. It may schedule a thread, or it may execute directly. And they all lived happily ever after. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
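[Editorial note: the kick/notify plumbing in Avi's walkthrough rests on eventfd's counter semantics — each signal adds to a 64-bit counter, and a read returns the accumulated count and resets it, so back-to-back kicks coalesce into one wakeup. A minimal Linux-specific demonstration; the wrapper names (make_notify_fd, kick, drain) are ours, not kvm's.]

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int make_notify_fd(void)
{
	return eventfd(0, 0);
}

/* Guest-notification path: each kick adds 1 to the fd's counter. */
int kick(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Server path: read returns the accumulated count and resets it,
 * so multiple kicks between wakeups cost only one read. */
uint64_t drain(int fd)
{
	uint64_t n = 0;

	if (read(fd, &n, sizeof(n)) != sizeof(n))
		return 0;
	return n;
}
```

In the flow above, the "notify" fds would be signalled from the guest exit path and drained by virtio-net-server, while the "irq" fds run in the opposite direction.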
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:43 ` Avi Kivity @ 2009-05-07 20:07 ` Gregory Haskins 2009-05-07 20:15 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 20:07 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 3090 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >>> Don't - it's broken. It will also catch device assignment mmio and >>> hypercall them. >>> >>> >> Ah. Crap. >> >> Would you be conducive if I continue along with the dynhc() approach >> then? >> > > Oh yes. But don't call it dynhc - like Chris says it's the wrong > semantic. > > Since we want to connect it to an eventfd, call it HC_NOTIFY or > HC_EVENT or something along these lines. You won't be able to pass > any data, but that's fine. Registers are saved to memory anyway. Ok, but how would you access the registers since you would presumably only be getting a waitq::func callback on the eventfd? Or were you saying that more data, if required, is saved in a side-band memory location? I can see the latter working. I can't wrap my head around the former. > > And btw, given that eventfd and the underlying infrastructure are so > flexible, it's probably better to go back to your original "irqfd gets > fd from userspace" just to be consistent everywhere. > > (no, I'm not deliberately making you rewrite that patch again and > again... it's going to be a key piece of infrastructure so I want to > get it right) Ok, np. Actually now that Davide showed me the waitq::func trick, the fd technically doesn't even need to be an eventfd per se. We can just plain-old "fget()" it and attach via the f_ops->poll() as I do in v5. I'll submit this later today. > > > Just to make sure we have everything plumbed down, here's how I see > things working out (using qemu and virtio, use sed to taste): > > 1. qemu starts up, sets up the VM > 2. 
qemu creates virtio-net-server > 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx > ring, one set for rx ring) > 4. qemu connects the six eventfd to the data-available, > data-not-available, and kick ports of virtio-net-server > 5. the guest starts up and configures virtio-net in pci pin mode > 6. qemu notices and decides it will manage interrupts in user space > since this is complicated (shared level triggered interrupts) > 7. the guest OS boots, loads device driver > 8. device driver switches virtio-net to msix mode > 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the > notify fds as notifyfd > 10. look ma, no hands. > > Under the hood, the following takes place. > > kvm wires the irqfds to schedule a work item which fires the > interrupt. One day the kvm developers get their act together and > change it to inject the interrupt directly when the irqfd is signalled > (which could be from the net softirq or somewhere similarly nasty). > > virtio-net-server wires notifyfd according to its liking. It may > schedule a thread, or it may execute directly. > > And they all lived happily ever after. Ack. I hope when it's all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) But even if not, this should provide enough plumbing that we can all coexist together peacefully. Thanks, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:07 ` Gregory Haskins @ 2009-05-07 20:15 ` Avi Kivity 2009-05-07 20:26 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 20:15 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: >> Oh yes. But don't call it dynhc - like Chris says it's the wrong >> semantic. >> >> Since we want to connect it to an eventfd, call it HC_NOTIFY or >> HC_EVENT or something along these lines. You won't be able to pass >> any data, but that's fine. Registers are saved to memory anyway. >> > Ok, but how would you access the registers since you would presumably > only be getting a waitq::func callback on the eventfd. Or were you > saying that more data, if required, is saved in a side-band memory > location? I can see the latter working. Yeah. You basically have that side-band in vbus shmem (or the virtio ring). > I can't wrap my head around > the former. > I only meant that registers aren't faster than memory, since they are just another memory location. In fact registers are accessed through a function call (not that that takes any time these days). >> Just to make sure we have everything plumbed down, here's how I see >> things working out (using qemu and virtio, use sed to taste): >> >> 1. qemu starts up, sets up the VM >> 2. qemu creates virtio-net-server >> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx >> ring, one set for rx ring) >> 4. qemu connects the six eventfd to the data-available, >> data-not-available, and kick ports of virtio-net-server >> 5. the guest starts up and configures virtio-net in pci pin mode >> 6. qemu notices and decides it will manage interrupts in user space >> since this is complicated (shared level triggered interrupts) >> 7. the guest OS boots, loads device driver >> 8. device driver switches virtio-net to msix mode >> 9. 
qemu notices, plumbs the irq fds as msix interrupts, plumbs the >> notify fds as notifyfd >> 10. look ma, no hands. >> >> Under the hood, the following takes place. >> >> kvm wires the irqfds to schedule a work item which fires the >> interrupt. One day the kvm developers get their act together and >> change it to inject the interrupt directly when the irqfd is signalled >> (which could be from the net softirq or somewhere similarly nasty). >> >> virtio-net-server wires notifyfd according to its liking. It may >> schedule a thread, or it may execute directly. >> >> And they all lived happily ever after. >> > > Ack. I hope when its all said and done I can convince you that the > framework to code up those virtio backends in the kernel is vbus ;) If vbus doesn't bring significant performance advantages, I'll prefer virtio because of existing investment. > But > even if not, this should provide enough plumbing that we can all coexist > together peacefully. > Yes, vbus and virtio can compete on their merits without bias from some maintainer getting in the way. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
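The "registers are saved to memory anyway" point is why an HC_NOTIFY-style hypercall can carry no payload at all: the operands go into shared memory (the vbus shmem or the virtio ring) and the hypercall is purely a doorbell. A userspace model of that side-band, where mailbox, guest_post and host_consume are invented names for illustration:

```c
/*
 * Model of the "no payload" notification: the guest publishes its
 * operands to shared memory, then signals (HC_NOTIFY / eventfd); the
 * host reads the operands back from memory in its callback.  The
 * single-word mailbox stands in for a real ring.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct mailbox {
    _Atomic uint32_t seq;   /* bumped after the payload is published  */
    uint32_t data;          /* side-band payload (stand-in for a ring) */
};

static void guest_post(struct mailbox *mb, uint32_t v)
{
    mb->data = v;
    /* release: make the payload visible before the sequence bump */
    atomic_fetch_add_explicit(&mb->seq, 1, memory_order_release);
    /* ...the guest would now issue the empty hypercall / doorbell... */
}

static uint32_t host_consume(struct mailbox *mb)
{
    /* acquire: pairs with the release above (the waitq::func side) */
    (void)atomic_load_explicit(&mb->seq, memory_order_acquire);
    return mb->data;
}
```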
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:15 ` Avi Kivity @ 2009-05-07 20:26 ` Gregory Haskins 2009-05-08 8:35 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 20:26 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 3230 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >>> Oh yes. But don't call it dynhc - like Chris says it's the wrong >>> semantic. >>> >>> Since we want to connect it to an eventfd, call it HC_NOTIFY or >>> HC_EVENT or something along these lines. You won't be able to pass >>> any data, but that's fine. Registers are saved to memory anyway. >>> >> Ok, but how would you access the registers since you would presumably >> only be getting a waitq::func callback on the eventfd. Or were you >> saying that more data, if required, is saved in a side-band memory >> location? I can see the latter working. > > Yeah. You basically have that side-band in vbus shmem (or the virtio > ring). Ok, got it. > >> I can't wrap my head around >> the former. >> > > I only meant that registers aren't faster than memory, since they are > just another memory location. > > In fact registers are accessed through a function call (not that that > takes any time these days). > > >>> Just to make sure we have everything plumbed down, here's how I see >>> things working out (using qemu and virtio, use sed to taste): >>> >>> 1. qemu starts up, sets up the VM >>> 2. qemu creates virtio-net-server >>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx >>> ring, one set for rx ring) >>> 4. qemu connects the six eventfd to the data-available, >>> data-not-available, and kick ports of virtio-net-server >>> 5. the guest starts up and configures virtio-net in pci pin mode >>> 6. 
qemu notices and decides it will manage interrupts in user space >>> since this is complicated (shared level triggered interrupts) >>> 7. the guest OS boots, loads device driver >>> 8. device driver switches virtio-net to msix mode >>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the >>> notify fds as notifyfd >>> 10. look ma, no hands. >>> >>> Under the hood, the following takes place. >>> >>> kvm wires the irqfds to schedule a work item which fires the >>> interrupt. One day the kvm developers get their act together and >>> change it to inject the interrupt directly when the irqfd is signalled >>> (which could be from the net softirq or somewhere similarly nasty). >>> >>> virtio-net-server wires notifyfd according to its liking. It may >>> schedule a thread, or it may execute directly. >>> >>> And they all lived happily ever after. >>> >> >> Ack. I hope when its all said and done I can convince you that the >> framework to code up those virtio backends in the kernel is vbus ;) > > If vbus doesn't bring significant performance advantages, I'll prefer > virtio because of existing investment. Just to clarify: vbus is just the container/framework for the in-kernel models. You can implement and deploy virtio devices inside the container (tho I haven't had a chance to sit down and implement one yet). Note that I did publish a virtio transport in the last few series to demonstrate how that might work, so its just ripe for the picking if someone is so inclined. So really the question is whether you implement the in-kernel virtio backend in vbus, in some other framework, or just do it standalone. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:26 ` Gregory Haskins @ 2009-05-08 8:35 ` Avi Kivity 2009-05-08 11:29 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-08 8:35 UTC (permalink / raw) To: Gregory Haskins Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori Gregory Haskins wrote: >>> Ack. I hope when its all said and done I can convince you that the >>> framework to code up those virtio backends in the kernel is vbus ;) >>> >> If vbus doesn't bring significant performance advantages, I'll prefer >> virtio because of existing investment. >> > > Just to clarify: vbus is just the container/framework for the in-kernel > models. You can implement and deploy virtio devices inside the > container (tho I haven't had a chance to sit down and implement one > yet). Note that I did publish a virtio transport in the last few series > to demonstrate how that might work, so its just ripe for the picking if > someone is so inclined. > > Yeah I keep getting confused over this. > So really the question is whether you implement the in-kernel virtio > backend in vbus, in some other framework, or just do it standalone. > I prefer the standalone model. Keep the glue in userspace. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 8:35 ` Avi Kivity @ 2009-05-08 11:29 ` Gregory Haskins 0 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 11:29 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 2304 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: > > >>>> Ack. I hope when its all said and done I can convince you that the >>>> framework to code up those virtio backends in the kernel is vbus ;) >>>> >>> If vbus doesn't bring significant performance advantages, I'll prefer >>> virtio because of existing investment. >>> >> >> Just to clarify: vbus is just the container/framework for the in-kernel >> models. You can implement and deploy virtio devices inside the >> container (tho I haven't had a chance to sit down and implement one >> yet). Note that I did publish a virtio transport in the last few series >> to demonstrate how that might work, so its just ripe for the picking if >> someone is so inclined. >> >> > > Yeah I keep getting confused over this. > >> So really the question is whether you implement the in-kernel virtio >> backend in vbus, in some other framework, or just do it standalone. >> > > I prefer the standalone model. Keep the glue in userspace. Just to keep the facts straight: the "glue in userspace" and "standalone model" questions are independent variables. E.g. you can have the glue in userspace for vbus, too. It's not written that way today for KVM, but it's moving in that direction as we work through these subtopics like irqfd, dynhc, etc. What vbus buys you as a core technology is that you can write one backend that works "everywhere" (you only need a glue layer for each environment you want to support). You might say "I can make my backends work everywhere too", and to that I would say "by the time you get it to work, you will have duplicated almost my exact effort on vbus" ;). 
Of course, you may also say "I don't care if it works anywhere else but KVM", which is a perfectly valid (if unfortunate) position to take. I think the confusion point is possibly a result of the name "vbus". The vbus core isn't really a true bus in the traditional sense. It's just a host-side kernel-based container for these device models. That is all I am talking about here. There is, of course, also an LDM "bus" for rendering vbus devices in the guest as a function of the current kvm-specific glue layer I've written. Note that this glue layer could render them as PCI in the future, TBD. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:00 ` Avi Kivity 2009-05-07 19:05 ` Gregory Haskins @ 2009-05-07 19:07 ` Chris Wright 2009-05-07 19:12 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-07 19:07 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori * Avi Kivity (avi@redhat.com) wrote: > Gregory Haskins wrote: >> Cool, I will code this up and submit it. While Im at it, Ill run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. Not necessarily. It just needs to be creative w/ IO_COND ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:07 ` Chris Wright @ 2009-05-07 19:12 ` Gregory Haskins 2009-05-07 19:21 ` Chris Wright 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 19:12 UTC (permalink / raw) To: Chris Wright Cc: Avi Kivity, Gregory Haskins, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 652 bytes --] Chris Wright wrote: > * Avi Kivity (avi@redhat.com) wrote: > >> Gregory Haskins wrote: >> >>> Cool, I will code this up and submit it. While Im at it, Ill run it >>> through the "nullio" ringer, too. ;) It would be cool to see the >>> pv-mmio hit that 2.07us number. I can't think of any reason why this >>> will not be the case. >>> >> Don't - it's broken. It will also catch device assignment mmio and >> hypercall them. >> > > Not necessarily. It just needs to be creative w/ IO_COND > Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:12 ` Gregory Haskins @ 2009-05-07 19:21 ` Chris Wright 2009-05-07 19:26 ` Avi Kivity 2009-05-07 19:29 ` Gregory Haskins 0 siblings, 2 replies; 115+ messages in thread From: Chris Wright @ 2009-05-07 19:21 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Avi Kivity, Gregory Haskins, linux-kernel, kvm, Anthony Liguori * Gregory Haskins (ghaskins@novell.com) wrote: > Chris Wright wrote: > > * Avi Kivity (avi@redhat.com) wrote: > >> Gregory Haskins wrote: > >>> Cool, I will code this up and submit it. While Im at it, Ill run it > >>> through the "nullio" ringer, too. ;) It would be cool to see the > >>> pv-mmio hit that 2.07us number. I can't think of any reason why this > >>> will not be the case. > >>> > >> Don't - it's broken. It will also catch device assignment mmio and > >> hypercall them. > > > > Not necessarily. It just needs to be creative w/ IO_COND > > Hi Chris, > Could you elaborate? How would you know which pages to hypercall and > which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). ^ permalink raw reply [flat|nested] 115+ messages in thread
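For context: the kernel's lib/iomap.c IO_COND macro already does exactly this kind of token test, treating small values as PIO cookies and everything else as MMIO pointers. Extending the trick with a third, "paravirtual" token range might look like the following sketch — classify(), PVIO_FLAG and the exact bit layout are invented here, not kernel code, and any real flag bit would have to be provably unused in kernel pointers:

```c
/*
 * Hypothetical "creative mangling" of the address: small values are
 * PIO cookies (as lib/iomap.c already assumes), a reserved high bit
 * marks a paravirtual region that should hypercall, and everything
 * else is a plain MMIO pointer.
 */
#include <assert.h>
#include <stdint.h>

#define PIO_LIMIT 0x10000u                /* x86 port space is 2^16 */
#define PVIO_FLAG (UINT64_C(1) << 63)     /* assumed-free token bit  */

enum io_kind { IO_PIO, IO_PVIO, IO_MMIO };

static enum io_kind classify(uint64_t token)
{
    if (token < PIO_LIMIT)
        return IO_PIO;    /* dispatch to inl/outl                     */
    if (token & PVIO_FLAG)
        return IO_PVIO;   /* dispatch to a hypercall                  */
    return IO_MMIO;       /* plain load/store, may fault to the host  */
}
```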
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:21 ` Chris Wright @ 2009-05-07 19:26 ` Avi Kivity 2009-05-07 19:44 ` Avi Kivity 2009-05-07 19:29 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 19:26 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Chris Wright wrote: > * Gregory Haskins (ghaskins@novell.com) wrote: > >> Chris Wright wrote: >> >>> * Avi Kivity (avi@redhat.com) wrote: >>> >>>> Gregory Haskins wrote: >>>> >>>>> Cool, I will code this up and submit it. While Im at it, Ill run it >>>>> through the "nullio" ringer, too. ;) It would be cool to see the >>>>> pv-mmio hit that 2.07us number. I can't think of any reason why this >>>>> will not be the case. >>>>> >>>>> >>>> Don't - it's broken. It will also catch device assignment mmio and >>>> hypercall them. >>>> >>> Not necessarily. It just needs to be creative w/ IO_COND >>> >> Hi Chris, >> Could you elaborate? How would you know which pages to hypercall and >> which to let PF? >> > > Was just thinking of some ugly mangling of the addr (I'm not entirely > sure what would work best). > I think we just past the "too complicated" threshold. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:26 ` Avi Kivity @ 2009-05-07 19:44 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-07 19:44 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Avi Kivity wrote: > > I think we just past the "too complicated" threshold. > And the "can't spel" threshold in the same sentence. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:21 ` Chris Wright 2009-05-07 19:26 ` Avi Kivity @ 2009-05-07 19:29 ` Gregory Haskins 2009-05-07 20:25 ` Chris Wright 2009-05-07 20:54 ` Arnd Bergmann 1 sibling, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 19:29 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1913 bytes --] Chris Wright wrote: > * Gregory Haskins (ghaskins@novell.com) wrote: > >> Chris Wright wrote: >> >>> * Avi Kivity (avi@redhat.com) wrote: >>> >>>> Gregory Haskins wrote: >>>> >>>>> Cool, I will code this up and submit it.  While Im at it, Ill run it >>>>> through the "nullio" ringer, too. ;) It would be cool to see the >>>>> pv-mmio hit that 2.07us number. I can't think of any reason why this >>>>> will not be the case. >>>>> >>>>> >>>> Don't - it's broken. It will also catch device assignment mmio and >>>> hypercall them. >>>> >>> Not necessarily. It just needs to be creative w/ IO_COND >>> >> Hi Chris, >> Could you elaborate? How would you know which pages to hypercall and >> which to let PF? >> > > Was just thinking of some ugly mangling of the addr (I'm not entirely > sure what would work best). > Right, I get the part about flagging the address and then keying off that flag in IO_COND (like we do for PIO vs MMIO). What I am not clear on is how you would know to flag the address to begin with. Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it would never be a real device). This means we route any io requests from virtio-pci through pv_io_ops->mmio(), but not unflagged addresses. This is not as slick as boosting *everyone's* mmio speed as Avi's original idea would have, but it is perhaps a good tradeoff between the entirely new namespace created by my original dynhc() proposal and leaving them all PF based. 
This way, it's just like using my dynhc() proposal except the mmio-addr is the substitute address-token (instead of the dynhc-vector). Additionally, if you do not PV the kernel, the IO_COND/pv_io_op is ignored and it just slow-paths through the PF as it does today. Dynhc() would be dependent on pv_ops. Thoughts? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:29 ` Gregory Haskins @ 2009-05-07 20:25 ` Chris Wright 2009-05-07 20:34 ` Gregory Haskins 2009-05-07 20:54 ` Arnd Bergmann 1 sibling, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-07 20:25 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori * Gregory Haskins (gregory.haskins@gmail.com) wrote: > What I am not clear on is how you would know to flag the address to > begin with. That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect would get called on IORESOURCE_PVIO type. This isn't really transparent though (only virtio devices basically), kind of like you're saying below. > Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it > would never be a real device). This means we route any io requests from > virtio-pci though pv_io_ops->mmio(), but not unflagged addresses. This > is not as slick as boosting *everyones* mmio speed as Avi's original > idea would have, but it is perhaps a good tradeoff between the entirely > new namespace created by my original dynhc() proposal and leaving them > all PF based. > > This way, its just like using my dynhc() proposal except the mmio-addr > is the substitute address-token (instead of the dynhc-vector). > Additionally, if you do not PV the kernel the IO_COND/pv_io_op is > ignored and it just slow-paths through the PF as it does today. Dynhc() > would be dependent on pv_ops. > > Thoughts? ^ permalink raw reply [flat|nested] 115+ messages in thread
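A rough model of what a pv_io_ops->iomap() keyed off an IORESOURCE_PVIO flag could do. IORESOURCE_PVIO, PVIO_TOKEN_FLAG and the struct layouts here are all hypothetical; only the idea — a PV-aware driver marks its resource, and the map operation tags the returned token — is from the thread:

```c
/*
 * Hypothetical iomap that tags tokens for paravirtual resources.  A
 * driver that knows it can never be physical hardware (e.g. virtio-pci)
 * would set IORESOURCE_PVIO on its BAR; later accessors dispatch on
 * the tag.  None of these names is existing kernel API.
 */
#include <assert.h>
#include <stdint.h>

#define IORESOURCE_PVIO 0x80000000u          /* assumed new flag       */
#define PVIO_TOKEN_FLAG (UINT64_C(1) << 63)  /* assumed-free token bit */

struct resource {
    uint64_t start;
    uint32_t flags;
};

static uint64_t pv_iomap(const struct resource *res)
{
    uint64_t token = res->start;

    if (res->flags & IORESOURCE_PVIO)
        token |= PVIO_TOKEN_FLAG;   /* later accesses take the hcall path */
    return token;                   /* otherwise: ordinary MMIO token     */
}
```

Unflagged resources come back unchanged, so pass-through and emulated-but-unconverted devices keep the slow PF path untouched.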
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:25 ` Chris Wright @ 2009-05-07 20:34 ` Gregory Haskins 0 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 20:34 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 412 bytes --] Chris Wright wrote: > * Gregory Haskins (gregory.haskins@gmail.com) wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect > would get called on IORESOURCE_PVIO type. Yeah, this wasn't clear at the time, but I totally get what you meant now in retrospect. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 19:29 ` Gregory Haskins 2009-05-07 20:25 ` Chris Wright @ 2009-05-07 20:54 ` Arnd Bergmann 2009-05-07 21:13 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Arnd Bergmann @ 2009-05-07 20:54 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori On Thursday 07 May 2009, Gregory Haskins wrote: > What I am not clear on is how you would know to flag the address to > begin with. pci_iomap could look at the bus device that the PCI function sits on. If it detects a PCI bridge that has a certain property (config space setting, vendor/device ID, ...), it assumes that the device itself will be emulated and it should set the address flag for IO_COND. This implies that all pass-through devices need to be on a different PCI bridge from the emulated devices, which should be fairly straightforward to enforce. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
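Arnd's heuristic reduces the flagging decision to a property of the parent bridge. A toy model of that check — the marker IDs and structures are invented (0x1af4 is borrowed from the virtio vendor space purely as a plausible example); real code would walk struct pci_dev->bus->self and read its config space:

```c
/*
 * Toy model of the bridge heuristic: a PCI function is treated as
 * emulated (and its BARs flagged for hypercall dispatch) iff its
 * parent bridge advertises a known marker vendor/device ID.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PV_BRIDGE_VENDOR 0x1af4   /* assumed marker vendor ID */
#define PV_BRIDGE_DEVICE 0x0001   /* assumed marker device ID */

struct bridge { uint16_t vendor, device; };
struct pci_fn { const struct bridge *parent; };

static bool fn_is_emulated(const struct pci_fn *fn)
{
    /* Pass-through devices live behind a different bridge, so this
     * test cleanly separates the two populations. */
    return fn->parent &&
           fn->parent->vendor == PV_BRIDGE_VENDOR &&
           fn->parent->device == PV_BRIDGE_DEVICE;
}
```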
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:54 ` Arnd Bergmann @ 2009-05-07 21:13 ` Gregory Haskins 2009-05-07 21:57 ` Chris Wright 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 21:13 UTC (permalink / raw) To: Arnd Bergmann Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 902 bytes --] Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > pci_iomap could look at the bus device that the PCI function sits on. > If it detects a PCI bridge that has a certain property (config space > setting, vendor/device ID, ...), it assumes that the device itself > will be emulated and it should set the address flag for IO_COND. > > This implies that all pass-through devices need to be on a different > PCI bridge from the emulated devices, which should be fairly > straightforward to enforce. > That's actually a pretty good idea. Chris, is that issue with the non ioread/iowrite access of a mangled pointer still an issue here? I would think so, but I am a bit fuzzy on whether there is still an issue of non-wrapped MMIO ever occurring. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 21:13 ` Gregory Haskins @ 2009-05-07 21:57 ` Chris Wright 2009-05-07 22:11 ` Arnd Bergmann 0 siblings, 1 reply; 115+ messages in thread From: Chris Wright @ 2009-05-07 21:57 UTC (permalink / raw) To: Gregory Haskins Cc: Arnd Bergmann, Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori * Gregory Haskins (gregory.haskins@gmail.com) wrote: > Arnd Bergmann wrote: > > pci_iomap could look at the bus device that the PCI function sits on. > > If it detects a PCI bridge that has a certain property (config space > > setting, vendor/device ID, ...), it assumes that the device itself > > will be emulated and it should set the address flag for IO_COND. > > > > This implies that all pass-through devices need to be on a different > > PCI bridge from the emulated devices, which should be fairly > > straightforward to enforce. Hmm, this gets to the grey area of the ABI. I think this would mean an upgrade of the host would suddenly break when the mgmt tool does: (qemu) pci_add pci_addr=0:6 host host=01:10.0 > Thats actually a pretty good idea. > > Chris, is that issue with the non ioread/iowrite access of a mangled > pointer still an issue here? I would think so, but I am a bit fuzzy on > whether there is still an issue of non-wrapped MMIO ever occuring. Arnd was saying it's a bug for other reasons, so perhaps it would work out fine. thanks, -chris ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 21:57 ` Chris Wright @ 2009-05-07 22:11 ` Arnd Bergmann 2009-05-08 22:33 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 115+ messages in thread From: Arnd Bergmann @ 2009-05-07 22:11 UTC (permalink / raw) To: Chris Wright Cc: Gregory Haskins, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori, Benjamin Herrenschmidt On Thursday 07 May 2009, Chris Wright wrote: > > > Chris, is that issue with the non ioread/iowrite access of a mangled > > pointer still an issue here? I would think so, but I am a bit fuzzy on > > whether there is still an issue of non-wrapped MMIO ever occuring. > > Arnd was saying it's a bug for other reasons, so perhaps it would work > out fine. Well, maybe. I only said that __raw_writel and pointer dereference is bad, but not writel. IIRC when we had that discussion about io-workarounds on powerpc, the outcome was that passing an IORESOURCE_MEM resource into pci_iomap must still result in something that can be passed into writel in addition to iowrite32, while an IORESOURCE_IO resource may or may not be valid for writel and/or outl. Unfortunately, this means that either readl/writel needs to be adapted in some way (e.g. the address also ioremapped to the mangled pointer) or the mechanism will be limited to I/O space accesses. Maybe BenH remembers the details better than me. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 22:11 ` Arnd Bergmann @ 2009-05-08 22:33 ` Benjamin Herrenschmidt 2009-05-11 13:01 ` Arnd Bergmann 0 siblings, 1 reply; 115+ messages in thread From: Benjamin Herrenschmidt @ 2009-05-08 22:33 UTC (permalink / raw) To: Arnd Bergmann Cc: Chris Wright, Gregory Haskins, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori On Fri, 2009-05-08 at 00:11 +0200, Arnd Bergmann wrote: > On Thursday 07 May 2009, Chris Wright wrote: > > > > > Chris, is that issue with the non ioread/iowrite access of a mangled > > > pointer still an issue here? I would think so, but I am a bit fuzzy on > > > whether there is still an issue of non-wrapped MMIO ever occuring. > > > > Arnd was saying it's a bug for other reasons, so perhaps it would work > > out fine. > > Well, maybe. I only said that __raw_writel and pointer dereference is > bad, but not writel. I have only vague recollection of that stuff, but basically, it boiled down to me attempting to use annotations to differentiate old style "ioremap" vs. new style iomap and effectively forbid mixing ioremap with iomap in either direction. This was shot down by a vast majority of people, with the outcome being an agreement that for IORESOURCE_MEM, pci_iomap and friends must return something that is strictly interchangeable with what ioremap would have returned. That means that readl and writel must work on the output of pci_iomap() and similar, but I don't see why __raw_writel would be excluded there, I think it's in there too. Direct dereference is illegal in all cases though. The token returned by pci_iomap for other types of resources (IO for example) is also only supported for use by iomap access functions (ioreadXX/iowriteXX), and IO ports cannot be passed directly to those either. Cheers, Ben. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 22:33 ` Benjamin Herrenschmidt @ 2009-05-11 13:01 ` Arnd Bergmann 2009-05-11 13:04 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Arnd Bergmann @ 2009-05-11 13:01 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Chris Wright, Gregory Haskins, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori On Saturday 09 May 2009, Benjamin Herrenschmidt wrote: > This was shot down by a vast majority of people, with the outcome being > an agreement that for IORESOURCE_MEM, pci_iomap and friends must return > something that is strictly interchangeable with what ioremap would have > returned. > > That means that readl and writel must work on the output of pci_iomap() > and similar, but I don't see why __raw_writel would be excluded there, I > think it's in there too. One of the ideas was to change pci_iomap to return a special token in case of virtual devices that causes iowrite32() to do an hcall, and to just define writel() to do iowrite32(). Unfortunately, there is no __raw_iowrite32(), although I guess we could add this generically if necessary. > Direct dereference is illegal in all cases though. right. > The token returned by pci_iomap for other type of resources (IO for > example) is also only supported for use by iomap access functions > (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those > neither. That still leaves the option to let drivers pass the IORESOURCE_PVIO for its own resources under some conditions, meaning that we will only use hcalls for I/O on these drivers but not on others, as Chris explained earlier. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
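Putting the pieces of Arnd's suggestion together: pci_iomap() hands back a specially tagged token for virtual devices, writel() is defined in terms of iowrite32(), and iowrite32() dispatches on the tag. A userspace sketch with stubbed backends — the my_* names, PVIO_TOKEN and the recording stubs are all invented for illustration:

```c
/*
 * Dispatch sketch: iowrite32() on a tagged token issues a hypercall,
 * on an untagged token it performs a normal MMIO store, and writel()
 * collapses onto the same path.  The hcall/mmio backends here just
 * record the value so the dispatch can be observed.
 */
#include <assert.h>
#include <stdint.h>

#define PVIO_TOKEN (UINT64_C(1) << 63)   /* assumed-free token bit */

static uint32_t last_hcall_val, last_mmio_val;

static void hcall_write32(uint64_t addr, uint32_t val)
{
    (void)addr;             /* would marshal (addr, val) into a hypercall */
    last_hcall_val = val;
}

static void mmio_write32(uint64_t addr, uint32_t val)
{
    (void)addr;             /* would be a real store to the mapping */
    last_mmio_val = val;
}

static void my_iowrite32(uint32_t val, uint64_t token)
{
    if (token & PVIO_TOKEN)
        hcall_write32(token & ~PVIO_TOKEN, val);
    else
        mmio_write32(token, val);
}

/* writel() defined in terms of iowrite32(), per the proposal above. */
static void my_writel(uint32_t val, uint64_t token)
{
    my_iowrite32(val, token);
}
```

A __raw_ variant would need its own entry point (there is no __raw_iowrite32, as noted above), which is why direct pointer dereference stays off the table.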
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 13:01 ` Arnd Bergmann @ 2009-05-11 13:04 ` Gregory Haskins 0 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-11 13:04 UTC (permalink / raw) To: Arnd Bergmann Cc: Benjamin Herrenschmidt, Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1519 bytes --] Arnd Bergmann wrote: > On Saturday 09 May 2009, Benjamin Herrenschmidt wrote: > >> This was shot down by a vast majority of people, with the outcome being >> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return >> something that is strictly interchangeable with what ioremap would have >> returned. >> >> That means that readl and writel must work on the output of pci_iomap() >> and similar, but I don't see why __raw_writel would be excluded there, I >> think it's in there too. >> > > One of the ideas was to change pci_iomap to return a special token > in case of virtual devices that causes iowrite32() to do an hcall, > and to just define writel() to do iowrite32(). > > Unfortunately, there is no __raw_iowrite32(), although I guess we > could add this generically if necessary. > > >> Direct dereference is illegal in all cases though. >> > > right. > > >> The token returned by pci_iomap for other type of resources (IO for >> example) is also only supported for use by iomap access functions >> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those >> neither. >> > > That still leaves the option to let drivers pass the IORESOURCE_PVIO > for its own resources under some conditions, meaning that we will > only use hcalls for I/O on these drivers but not on others, as Chris > explained earlier. > Between this, and Avi's "nesting" point, this is the direction I am leaning in right now. 
-Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 18:16 ` Gregory Haskins 2009-05-07 18:24 ` Avi Kivity @ 2009-05-07 20:00 ` Arnd Bergmann 2009-05-07 20:31 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Arnd Bergmann @ 2009-05-07 20:00 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori On Thursday 07 May 2009, Gregory Haskins wrote: > I guess technically mmio can just be a simple access of the page which > would be problematic to trap locally without a PF. However it seems > that most mmio always passes through a ioread()/iowrite() call so this > is perhaps the hook point. If we set the stake in the ground that mmios > that go through some other mechanism like PFs can just hit the "slow > path" are an acceptable casualty, I think we can make that work. > > Thoughts? An mmio that goes through a PF is a bug, it's certainly broken on a number of platforms, so performance should not be an issue there. Note that there are four commonly used interface classes for PIO/MMIO: 1. readl/writel: little-endian MMIO 2. inl/outl: little-endian PIO 3. ioread32/iowrite32: converged little-endian PIO/MMIO 4. __raw_readl/__raw_writel: native-endian MMIO without checks You don't need to worry about the __raw_* stuff, as this should never be used in device drivers. As a simplification, you could mandate that all drivers that want to use this get converted to the ioread/iowrite class of interfaces and leave the others slow. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:00 ` Arnd Bergmann @ 2009-05-07 20:31 ` Gregory Haskins 2009-05-07 20:42 ` Arnd Bergmann 2009-05-07 20:50 ` Chris Wright 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-07 20:31 UTC (permalink / raw) To: Arnd Bergmann Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 2369 bytes --] Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> >> Thoughts? >> > > An mmio that goes through a PF is a bug, it's certainly broken on > a number of platforms, so performance should not be an issue there. > This may be my own ignorance, but I thought a VMEXIT of type "PF" was how MMIO worked in VT/SVM. I didn't mean to imply that either the guest or the host took a traditional PF exception in their respective IDTs, if that is what you thought I meant here. Rather, the mmio region is unmapped in the guest MMU, access causes a VMEXIT to host-side KVM of type PF, and the host side code then consults the guest page-table to see if it's an MMIO or not. I could very well be mistaken as I have only a cursory understanding of what happens in KVM today with this path. After posting my numbers today, what I *can* tell you definitively is that it's significantly slower to VMEXIT via MMIO. I guess I do not really know the reason for sure. :) > Note that are four commonly used interface classes for PIO/MMIO: > > 1. readl/writel: little-endian MMIO > 2. inl/outl: little-endian PIO > 3. 
ioread32/iowrite32: converged little-endian PIO/MMIO > 4. __raw_readl/__raw_writel: native-endian MMIO without checks > > You don't need to worry about the __raw_* stuff, as this should never > be used in device drivers. > > As a simplification, you could mandate that all drivers that want to > use this get converted to the ioread/iowrite class of interfaces and > leave the others slow. > I guess the problem that was later pointed out is that we cannot discern which devices might be pass-through and therefore should not be revectored through a HC. But I am even less knowledgeable about how pass-through works than I am about the MMIO traps, so I might be completely off here. In any case, thank you kindly for the suggestions. Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:31 ` Gregory Haskins @ 2009-05-07 20:42 ` Arnd Bergmann 2009-05-07 20:47 ` Arnd Bergmann 2009-05-07 20:50 ` Chris Wright 1 sibling, 1 reply; 115+ messages in thread From: Arnd Bergmann @ 2009-05-07 20:42 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori On Thursday 07 May 2009, Gregory Haskins wrote: > Arnd Bergmann wrote: > > An mmio that goes through a PF is a bug, it's certainly broken on > > a number of platforms, so performance should not be an issue there. > > > > This may be my own ignorance, but I thought a VMEXIT of type "PF" was > how MMIO worked in VT/SVM. You are right that all MMIOs (and PIO on most non-x86 architectures) are handled this way in the end. What I meant was that an MMIO that traps because of a simple pointer dereference as in __raw_writel is a bug, while any actual writel() call could be diverted to do an hcall and therefore not cause a PF once the infrastructure is there. > I guess the problem that was later pointed out is that we cannot discern > which devices might be pass-through and therefore should not be > revectored through a HC. But I am even less knowledgeable about how > pass-through works than I am about the MMIO traps, so I might be > completely off here. An easy way to deal with the pass-through case might be to actually use __raw_writel there. In guest-to-guest communication, the two sides are known to have the same endianness (I assume) and you can still add the appropriate smp_mb() and such into the code. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:42 ` Arnd Bergmann @ 2009-05-07 20:47 ` Arnd Bergmann 0 siblings, 0 replies; 115+ messages in thread From: Arnd Bergmann @ 2009-05-07 20:47 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori On Thursday 07 May 2009, Arnd Bergmann wrote: > An easy way to deal with the pass-through case might be to actually use > __raw_writel there. In guest-to-guest communication, the two sides are > known to have the same endianess (I assume) and you can still add the > appropriate smp_mb() and such into the code. Ok, that was nonsense. I thought you meant pass-through to a memory range on the host that is potentially shared with other processes or guests. For pass-through to a real device, it obviously would not work. Arnd <>< ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 20:31 ` Gregory Haskins 2009-05-07 20:42 ` Arnd Bergmann @ 2009-05-07 20:50 ` Chris Wright 1 sibling, 0 replies; 115+ messages in thread From: Chris Wright @ 2009-05-07 20:50 UTC (permalink / raw) To: Gregory Haskins Cc: Arnd Bergmann, Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori * Gregory Haskins (gregory.haskins@gmail.com) wrote: > After posting my numbers today, what I *can* tell you definitively that > its significantly slower to VMEXIT via MMIO. I guess I do not really > know the reason for sure. :) there's certainly more work, including insn decoding ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 17:03 ` Gregory Haskins 2009-05-07 18:05 ` Avi Kivity @ 2009-05-07 23:35 ` Marcelo Tosatti 2009-05-07 23:43 ` Marcelo Tosatti ` (3 more replies) 1 sibling, 4 replies; 115+ messages in thread From: Marcelo Tosatti @ 2009-05-07 23:35 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > Chris Wright wrote: > > * Gregory Haskins (ghaskins@novell.com) wrote: > > > >> Chris Wright wrote: > >> > >>> VF drivers can also have this issue (and typically use mmio). > >>> I at least have a better idea what your proposal is, thanks for > >>> explanation. Are you able to demonstrate concrete benefit with it yet > >>> (improved latency numbers for example)? > >>> > >> I had a test-harness/numbers for this kind of thing, but its a bit > >> crufty since its from ~1.5 years ago. I will dig it up, update it, and > >> generate/post new numbers. > >> > > > > That would be useful, because I keep coming back to pio and shared > > page(s) when think of why not to do this. Seems I'm not alone in that. > > > > thanks, > > -chris > > > > I completed the resurrection of the test and wrote up a little wiki on > the subject, which you can find here: > > http://developer.novell.com/wiki/index.php/WhyHypercalls > > Hopefully this answers Chris' "show me the numbers" and Anthony's "Why > reinvent the wheel?" questions. > > I will include this information when I publish the updated v2 series > with the s/hypercall/dynhc changes. > > Let me know if you have any questions. Greg, I think the comparison is not entirely fair. You're using KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that (on Intel) to only one register read: nr = kvm_register_read(vcpu, VCPU_REGS_RAX); Whereas in a real hypercall for (say) PIO you would need the address, size, direction and data. 
Also for PIO/MMIO you're adding this unoptimized lookup to the measurement: pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); if (pio_dev) { kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); complete_pio(vcpu); return 1; } Whereas for hypercall measurement you don't. I believe a fair comparison would be to have a shared guest/host memory area where you store guest/host TSC values and then do, on guest: rdtscll(&shared_area->guest_tsc); pio/mmio/hypercall ... back to host rdtscll(&shared_area->host_tsc); And then calculate the difference (minus the guest's TSC_OFFSET, of course)? ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:35 ` Marcelo Tosatti @ 2009-05-07 23:43 ` Marcelo Tosatti 2009-05-08 3:17 ` Gregory Haskins 2009-05-08 7:55 ` Avi Kivity 2009-05-08 3:13 ` Gregory Haskins ` (2 subsequent siblings) 3 siblings, 2 replies; 115+ messages in thread From: Marcelo Tosatti @ 2009-05-07 23:43 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" Core2 Xeon 5130 @2.00Ghz, 4GB RAM. Also it would be interesting to see the MMIO comparison with EPT/NPT, it probably sucks much less than what you're seeing. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:43 ` Marcelo Tosatti @ 2009-05-08 3:17 ` Gregory Haskins 2009-05-08 7:55 ` Avi Kivity 1 sibling, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 3:17 UTC (permalink / raw) To: Marcelo Tosatti Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1217 bytes --] Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > >> Also for PIO/MMIO you're adding this unoptimized lookup to the >> measurement: >> >> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >> if (pio_dev) { >> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >> complete_pio(vcpu); >> return 1; >> } >> >> Whereas for hypercall measurement you don't. I believe a fair comparison >> would be have a shared guest/host memory area where you store guest/host >> TSC values and then do, on guest: >> >> rdtscll(&shared_area->guest_tsc); >> pio/mmio/hypercall >> ... back to host >> rdtscll(&shared_area->host_tsc); >> >> And then calculate the difference (minus guests TSC_OFFSET of course)? >> > > Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" > Core2 Xeon 5130 @2.00Ghz, 4GB RAM. > > Also it would be interesting to see the MMIO comparison with EPT/NPT, > it probably sucks much less than what you're seeing. > > Agreed. If you or someone on this thread has such a beast, please fire up my test and post the numbers. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:43 ` Marcelo Tosatti 2009-05-08 3:17 ` Gregory Haskins @ 2009-05-08 7:55 ` Avi Kivity [not found] ` <20090508103253.GC3011@amt.cnet> 2009-05-08 14:35 ` Marcelo Tosatti 1 sibling, 2 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-08 7:55 UTC (permalink / raw) To: Marcelo Tosatti Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Marcelo Tosatti wrote: > Also it would be interesting to see the MMIO comparison with EPT/NPT, > it probably sucks much less than what you're seeing. > Why would NPT improve mmio? If anything, it would be worse, since the processor has to do the nested walk. Of course, these are newer machines, so the absolute results as well as the difference will be smaller. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
[parent not found: <20090508103253.GC3011@amt.cnet>]
* Re: [RFC PATCH 0/3] generic hypercall support [not found] ` <20090508103253.GC3011@amt.cnet> @ 2009-05-08 11:37 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-08 11:37 UTC (permalink / raw) To: Marcelo Tosatti Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Marcelo Tosatti wrote: >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>> it probably sucks much less than what you're seeing. >>> >>> >> Why would NPT improve mmio? If anything, it would be worse, since the >> processor has to do the nested walk. >> > > I suppose the hardware is much more efficient than walk_addr? There's > all this kmalloc, spinlock, etc overhead in the fault path. > mmio still has to do a walk_addr, even with npt. We don't take the mmu lock during walk_addr. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 7:55 ` Avi Kivity [not found] ` <20090508103253.GC3011@amt.cnet> @ 2009-05-08 14:35 ` Marcelo Tosatti 2009-05-08 14:45 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Marcelo Tosatti @ 2009-05-08 14:35 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, Gregory Haskins, kvm, Anthony Liguori On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > Marcelo Tosatti wrote: >> Also it would be interesting to see the MMIO comparison with EPT/NPT, >> it probably sucks much less than what you're seeing. >> > > Why would NPT improve mmio? If anything, it would be worse, since the > processor has to do the nested walk. > > Of course, these are newer machines, so the absolute results as well as > the difference will be smaller. Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: NPT enabled: test 0: 3088633284634 - 3059375712321 = 29257572313 test 1: 3121754636397 - 3088633419760 = 33121216637 test 2: 3204666462763 - 3121754668573 = 82911794190 NPT disabled: test 0: 3638061646250 - 3609416811687 = 28644834563 test 1: 3669413430258 - 3638061771291 = 31351658967 test 2: 3736287253287 - 3669413463506 = 66873789781 ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 14:35 ` Marcelo Tosatti @ 2009-05-08 14:45 ` Gregory Haskins 2009-05-08 15:51 ` Marcelo Tosatti 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 14:45 UTC (permalink / raw) To: Marcelo Tosatti Cc: Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1440 bytes --] Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > >> Marcelo Tosatti wrote: >> >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>> it probably sucks much less than what you're seeing. >>> >>> >> Why would NPT improve mmio? If anything, it would be worse, since the >> processor has to do the nested walk. >> >> Of course, these are newer machines, so the absolute results as well as >> the difference will be smaller. >> > > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: > > NPT enabled: > test 0: 3088633284634 - 3059375712321 = 29257572313 > test 1: 3121754636397 - 3088633419760 = 33121216637 > test 2: 3204666462763 - 3121754668573 = 82911794190 > > NPT disabled: > test 0: 3638061646250 - 3609416811687 = 28644834563 > test 1: 3669413430258 - 3638061771291 = 31351658967 > test 2: 3736287253287 - 3669413463506 = 66873789781 > > Thanks for running that. It's interesting to see that NPT was in fact worse, as Avi predicted. Would you mind if I graphed the result and added this data to my wiki? If so, could you adjust the tsc result into IOPS using the proper time-base and the test_count you ran with? I can show a graph with the data as is and the relative differences will properly surface, but it would be nice to have apples to apples in terms of IOPS units with my other run. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 14:45 ` Gregory Haskins @ 2009-05-08 15:51 ` Marcelo Tosatti 2009-05-08 19:56 ` David S. Ahern 0 siblings, 1 reply; 115+ messages in thread From: Marcelo Tosatti @ 2009-05-08 15:51 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: > Marcelo Tosatti wrote: > > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > > > >> Marcelo Tosatti wrote: > >> > >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, > >>> it probably sucks much less than what you're seeing. > >>> > >>> > >> Why would NPT improve mmio? If anything, it would be worse, since the > >> processor has to do the nested walk. > >> > >> Of course, these are newer machines, so the absolute results as well as > >> the difference will be smaller. > >> > > > > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: > > > > NPT enabled: > > test 0: 3088633284634 - 3059375712321 = 29257572313 > > test 1: 3121754636397 - 3088633419760 = 33121216637 > > test 2: 3204666462763 - 3121754668573 = 82911794190 > > > > NPT disabled: > > test 0: 3638061646250 - 3609416811687 = 28644834563 > > test 1: 3669413430258 - 3638061771291 = 31351658967 > > test 2: 3736287253287 - 3669413463506 = 66873789781 > > > > > Thanks for running that. Its interesting to see that NPT was in fact > worse as Avi predicted. > > Would you mind if I graphed the result and added this data to my wiki? > If so, could you adjust the tsc result into IOPs using the proper > time-base and the test_count you ran with? I can show a graph with the > data as is and the relative differences will properly surface..but it > would be nice to have apples to apples in terms of IOPS units with my > other run. > > -Greg Please, that'll be nice. 
Quad-Core AMD Opteron(tm) Processor 2358 SE host: 2.6.30-rc2 guest: 2.6.29.1-102.fc11.x86_64 test_count=1000000, tsc freq=2402882804 Hz NPT disabled: test 0 = 2771200766 test 1 = 3018726738 test 2 = 6414705418 test 3 = 2890332864 NPT enabled: test 0 = 2908604045 test 1 = 3174687394 test 2 = 7912464804 test 3 = 3046085805 ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 15:51 ` Marcelo Tosatti @ 2009-05-08 19:56 ` David S. Ahern 2009-05-08 20:01 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: David S. Ahern @ 2009-05-08 19:56 UTC (permalink / raw) To: Marcelo Tosatti Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >> Marcelo Tosatti wrote: >>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: >>> >>>> Marcelo Tosatti wrote: >>>> >>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>>>> it probably sucks much less than what you're seeing. >>>>> >>>>> >>>> Why would NPT improve mmio? If anything, it would be worse, since the >>>> processor has to do the nested walk. >>>> >>>> Of course, these are newer machines, so the absolute results as well as >>>> the difference will be smaller. >>>> >>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: >>> >>> NPT enabled: >>> test 0: 3088633284634 - 3059375712321 = 29257572313 >>> test 1: 3121754636397 - 3088633419760 = 33121216637 >>> test 2: 3204666462763 - 3121754668573 = 82911794190 >>> >>> NPT disabled: >>> test 0: 3638061646250 - 3609416811687 = 28644834563 >>> test 1: 3669413430258 - 3638061771291 = 31351658967 >>> test 2: 3736287253287 - 3669413463506 = 66873789781 >>> >>> >> Thanks for running that. Its interesting to see that NPT was in fact >> worse as Avi predicted. >> >> Would you mind if I graphed the result and added this data to my wiki? >> If so, could you adjust the tsc result into IOPs using the proper >> time-base and the test_count you ran with? I can show a graph with the >> data as is and the relative differences will properly surface..but it >> would be nice to have apples to apples in terms of IOPS units with my >> other run. >> >> -Greg > > Please, that'll be nice. 
> > Quad-Core AMD Opteron(tm) Processor 2358 SE > > host: 2.6.30-rc2 > guest: 2.6.29.1-102.fc11.x86_64 > > test_count=1000000, tsc freq=2402882804 Hz > > NPT disabled: > > test 0 = 2771200766 > test 1 = 3018726738 > test 2 = 6414705418 > test 3 = 2890332864 > > NPT enabled: > > test 0 = 2908604045 > test 1 = 3174687394 > test 2 = 7912464804 > test 3 = 3046085805 > DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: host: 2.6.30-rc3 guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 with EPT test 0: 543617607291 - 518146439877 = 25471167414 test 1: 572568176856 - 543617703004 = 28950473852 test 2: 630182158139 - 572568269792 = 57613888347 without EPT test 0: 1383532195307 - 1358052032086 = 25480163221 test 1: 1411587055210 - 1383532318617 = 28054736593 test 2: 1471446356172 - 1411587194600 = 59859161572 david > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 19:56 ` David S. Ahern @ 2009-05-08 20:01 ` Gregory Haskins 2009-05-08 23:23 ` David S. Ahern 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 20:01 UTC (permalink / raw) To: David S. Ahern Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 2785 bytes --] David S. Ahern wrote: > Marcelo Tosatti wrote: > >> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >> >>> Marcelo Tosatti wrote: >>> >>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: >>>> >>>> >>>>> Marcelo Tosatti wrote: >>>>> >>>>> >>>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>>>>> it probably sucks much less than what you're seeing. >>>>>> >>>>>> >>>>>> >>>>> Why would NPT improve mmio? If anything, it would be worse, since the >>>>> processor has to do the nested walk. >>>>> >>>>> Of course, these are newer machines, so the absolute results as well as >>>>> the difference will be smaller. >>>>> >>>>> >>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: >>>> >>>> NPT enabled: >>>> test 0: 3088633284634 - 3059375712321 = 29257572313 >>>> test 1: 3121754636397 - 3088633419760 = 33121216637 >>>> test 2: 3204666462763 - 3121754668573 = 82911794190 >>>> >>>> NPT disabled: >>>> test 0: 3638061646250 - 3609416811687 = 28644834563 >>>> test 1: 3669413430258 - 3638061771291 = 31351658967 >>>> test 2: 3736287253287 - 3669413463506 = 66873789781 >>>> >>>> >>>> >>> Thanks for running that. Its interesting to see that NPT was in fact >>> worse as Avi predicted. >>> >>> Would you mind if I graphed the result and added this data to my wiki? >>> If so, could you adjust the tsc result into IOPs using the proper >>> time-base and the test_count you ran with? 
I can show a graph with the >>> data as is and the relative differences will properly surface..but it >>> would be nice to have apples to apples in terms of IOPS units with my >>> other run. >>> >>> -Greg >>> >> Please, that'll be nice. >> >> Quad-Core AMD Opteron(tm) Processor 2358 SE >> >> host: 2.6.30-rc2 >> guest: 2.6.29.1-102.fc11.x86_64 >> >> test_count=1000000, tsc freq=2402882804 Hz >> >> NPT disabled: >> >> test 0 = 2771200766 >> test 1 = 3018726738 >> test 2 = 6414705418 >> test 3 = 2890332864 >> >> NPT enabled: >> >> test 0 = 2908604045 >> test 1 = 3174687394 >> test 2 = 7912464804 >> test 3 = 3046085805 >> >> > > DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: > host: 2.6.30-rc3 > guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 > > with EPT > test 0: 543617607291 - 518146439877 = 25471167414 > test 1: 572568176856 - 543617703004 = 28950473852 > test 2: 630182158139 - 572568269792 = 57613888347 > > > without EPT > test 0: 1383532195307 - 1358052032086 = 25480163221 > test 1: 1411587055210 - 1383532318617 = 28054736593 > test 2: 1471446356172 - 1411587194600 = 59859161572 > > > Thank you kindly, David. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 20:01 ` Gregory Haskins @ 2009-05-08 23:23 ` David S. Ahern 2009-05-09 8:45 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: David S. Ahern @ 2009-05-08 23:23 UTC (permalink / raw) To: Gregory Haskins Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori Gregory Haskins wrote: > David S. Ahern wrote: >> Marcelo Tosatti wrote: >> >>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >>> >>>> Marcelo Tosatti wrote: >>>> >>>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: >>>>> >>>>> >>>>>> Marcelo Tosatti wrote: >>>>>> >>>>>> >>>>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>>>>>> it probably sucks much less than what you're seeing. >>>>>>> >>>>>>> >>>>>>> >>>>>> Why would NPT improve mmio? If anything, it would be worse, since the >>>>>> processor has to do the nested walk. >>>>>> >>>>>> Of course, these are newer machines, so the absolute results as well as >>>>>> the difference will be smaller. >>>>>> >>>>>> >>>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: >>>>> >>>>> NPT enabled: >>>>> test 0: 3088633284634 - 3059375712321 = 29257572313 >>>>> test 1: 3121754636397 - 3088633419760 = 33121216637 >>>>> test 2: 3204666462763 - 3121754668573 = 82911794190 >>>>> >>>>> NPT disabled: >>>>> test 0: 3638061646250 - 3609416811687 = 28644834563 >>>>> test 1: 3669413430258 - 3638061771291 = 31351658967 >>>>> test 2: 3736287253287 - 3669413463506 = 66873789781 >>>>> >>>>> >>>>> >>>> Thanks for running that. Its interesting to see that NPT was in fact >>>> worse as Avi predicted. >>>> >>>> Would you mind if I graphed the result and added this data to my wiki? >>>> If so, could you adjust the tsc result into IOPs using the proper >>>> time-base and the test_count you ran with? 
I can show a graph with the >>>> data as is and the relative differences will properly surface..but it >>>> would be nice to have apples to apples in terms of IOPS units with my >>>> other run. >>>> >>>> -Greg >>>> >>> Please, that'll be nice. >>> >>> Quad-Core AMD Opteron(tm) Processor 2358 SE >>> >>> host: 2.6.30-rc2 >>> guest: 2.6.29.1-102.fc11.x86_64 >>> >>> test_count=1000000, tsc freq=2402882804 Hz >>> >>> NPT disabled: >>> >>> test 0 = 2771200766 >>> test 1 = 3018726738 >>> test 2 = 6414705418 >>> test 3 = 2890332864 >>> >>> NPT enabled: >>> >>> test 0 = 2908604045 >>> test 1 = 3174687394 >>> test 2 = 7912464804 >>> test 3 = 3046085805 >>> >>> >> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: >> host: 2.6.30-rc3 >> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 >> >> with EPT >> test 0: 543617607291 - 518146439877 = 25471167414 >> test 1: 572568176856 - 543617703004 = 28950473852 >> test 2: 630182158139 - 572568269792 = 57613888347 >> >> >> without EPT >> test 0: 1383532195307 - 1358052032086 = 25480163221 >> test 1: 1411587055210 - 1383532318617 = 28054736593 >> test 2: 1471446356172 - 1411587194600 = 59859161572 >> >> >> > > Thank you kindly, David. > > -Greg I ran another test case with SMT disabled, and while I was at it converted TSC delta to operations/sec. The results without SMT are confusing -- to me anyways. I'm hoping someone can explain it. Basically, using a count of 10,000,000 (per your web page) with SMT disabled the guest detected a soft lockup on the CPU. So, I dropped the count down to 1,000,000. 
So, for 1e6 iterations: without SMT, with EPT: HC: 259,455 ops/sec PIO: 226,937 ops/sec MMIO: 113,180 ops/sec without SMT, without EPT: HC: 274,825 ops/sec PIO: 247,910 ops/sec MMIO: 111,535 ops/sec Converting the prior TSC deltas: with SMT, with EPT: HC: 994,655 ops/sec PIO: 875,116 ops/sec MMIO: 439,738 ops/sec with SMT, without EPT: HC: 994,304 ops/sec PIO: 903,057 ops/sec MMIO: 423,244 ops/sec Running the tests repeatedly I did notice a fair variability (as much as -10% down from these numbers). Also, just to make sure I converted the delta to ops/sec, the formula I used was cpu_freq / dTSC * count = operations/sec david ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 23:23 ` David S. Ahern @ 2009-05-09 8:45 ` Avi Kivity 2009-05-09 11:27 ` Gregory Haskins 2009-05-10 4:24 ` David S. Ahern 0 siblings, 2 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-09 8:45 UTC (permalink / raw) To: David S. Ahern Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, kvm, Anthony Liguori David S. Ahern wrote: > I ran another test case with SMT disabled, and while I was at it > converted TSC delta to operations/sec. The results without SMT are > confusing -- to me anyways. I'm hoping someone can explain it. > Basically, using a count of 10,000,000 (per your web page) with SMT > disabled the guest detected a soft lockup on the CPU. So, I dropped the > count down to 1,000,000. So, for 1e6 iterations: > > without SMT, with EPT: > HC: 259,455 ops/sec > PIO: 226,937 ops/sec > MMIO: 113,180 ops/sec > > without SMT, without EPT: > HC: 274,825 ops/sec > PIO: 247,910 ops/sec > MMIO: 111,535 ops/sec > > Converting the prior TSC deltas: > > with SMT, with EPT: > HC: 994,655 ops/sec > PIO: 875,116 ops/sec > MMIO: 439,738 ops/sec > > with SMT, without EPT: > HC: 994,304 ops/sec > PIO: 903,057 ops/sec > MMIO: 423,244 ops/sec > > Running the tests repeatedly I did notice a fair variability (as much as > -10% down from these numbers). > > Also, just to make sure I converted the delta to ops/sec, the formula I > used was cpu_freq / dTSC * count = operations/sec > > The only thing I can think of is cpu frequency scaling lying about the cpu frequency. Really the test needs to use time and not the time stamp counter. Are the results expressed in cycles/op more reasonable? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-09 8:45 ` Avi Kivity @ 2009-05-09 11:27 ` Gregory Haskins 2009-05-10 4:27 ` David S. Ahern 2009-05-10 4:24 ` David S. Ahern 1 sibling, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-09 11:27 UTC (permalink / raw) To: Avi Kivity Cc: David S. Ahern, Gregory Haskins, Marcelo Tosatti, Chris Wright, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1532 bytes --] Avi Kivity wrote: > David S. Ahern wrote: >> I ran another test case with SMT disabled, and while I was at it >> converted TSC delta to operations/sec. The results without SMT are >> confusing -- to me anyways. I'm hoping someone can explain it. >> Basically, using a count of 10,000,000 (per your web page) with SMT >> disabled the guest detected a soft lockup on the CPU. So, I dropped the >> count down to 1,000,000. So, for 1e6 iterations: >> >> without SMT, with EPT: >> HC: 259,455 ops/sec >> PIO: 226,937 ops/sec >> MMIO: 113,180 ops/sec >> >> without SMT, without EPT: >> HC: 274,825 ops/sec >> PIO: 247,910 ops/sec >> MMIO: 111,535 ops/sec >> >> Converting the prior TSC deltas: >> >> with SMT, with EPT: >> HC: 994,655 ops/sec >> PIO: 875,116 ops/sec >> MMIO: 439,738 ops/sec >> >> with SMT, without EPT: >> HC: 994,304 ops/sec >> PIO: 903,057 ops/sec >> MMIO: 423,244 ops/sec >> >> Running the tests repeatedly I did notice a fair variability (as much as >> -10% down from these numbers). >> >> Also, just to make sure I converted the delta to ops/sec, the formula I >> used was cpu_freq / dTSC * count = operations/sec >> >> > > The only think I can think of is cpu frequency scaling lying about the > cpu frequency. Really the test needs to use time and not the time > stamp counter. > > Are the results expressed in cycles/op more reasonable? 
FWIW: I always used kvm_stat instead of my tsc printk [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-09 11:27 ` Gregory Haskins @ 2009-05-10 4:27 ` David S. Ahern 2009-05-10 5:24 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: David S. Ahern @ 2009-05-10 4:27 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Gregory Haskins, Marcelo Tosatti, Chris Wright, kvm, Anthony Liguori Gregory Haskins wrote: > Avi Kivity wrote: >> David S. Ahern wrote: >>> I ran another test case with SMT disabled, and while I was at it >>> converted TSC delta to operations/sec. The results without SMT are >>> confusing -- to me anyways. I'm hoping someone can explain it. >>> Basically, using a count of 10,000,000 (per your web page) with SMT >>> disabled the guest detected a soft lockup on the CPU. So, I dropped the >>> count down to 1,000,000. So, for 1e6 iterations: >>> >>> without SMT, with EPT: >>> HC: 259,455 ops/sec >>> PIO: 226,937 ops/sec >>> MMIO: 113,180 ops/sec >>> >>> without SMT, without EPT: >>> HC: 274,825 ops/sec >>> PIO: 247,910 ops/sec >>> MMIO: 111,535 ops/sec >>> >>> Converting the prior TSC deltas: >>> >>> with SMT, with EPT: >>> HC: 994,655 ops/sec >>> PIO: 875,116 ops/sec >>> MMIO: 439,738 ops/sec >>> >>> with SMT, without EPT: >>> HC: 994,304 ops/sec >>> PIO: 903,057 ops/sec >>> MMIO: 423,244 ops/sec >>> >>> Running the tests repeatedly I did notice a fair variability (as much as >>> -10% down from these numbers). >>> >>> Also, just to make sure I converted the delta to ops/sec, the formula I >>> used was cpu_freq / dTSC * count = operations/sec >>> >>> >> The only think I can think of is cpu frequency scaling lying about the >> cpu frequency. Really the test needs to use time and not the time >> stamp counter. >> >> Are the results expressed in cycles/op more reasonable? > > FWIW: I always used kvm_stat instead of my tsc printk > kvm_stat shows same approximate numbers as with the TSC-->ops/sec conversions. 
Interestingly, MMIO writes are not showing up as mmio_exits in kvm_stat; they are showing up as insn_emulation. david ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-10 4:27 ` David S. Ahern @ 2009-05-10 5:24 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-10 5:24 UTC (permalink / raw) To: David S. Ahern Cc: Gregory Haskins, Gregory Haskins, Marcelo Tosatti, Chris Wright, kvm, Anthony Liguori David S. Ahern wrote: > kvm_stat shows same approximate numbers as with the TSC-->ops/sec > conversions. Interestingly, MMIO writes are not showing up as mmio_exits > in kvm_stat; they are showing up as insn_emulation. > That's a bug, mmio_exits ignores mmios that are handled in the kernel. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-09 8:45 ` Avi Kivity 2009-05-09 11:27 ` Gregory Haskins @ 2009-05-10 4:24 ` David S. Ahern 1 sibling, 0 replies; 115+ messages in thread From: David S. Ahern @ 2009-05-10 4:24 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, kvm, Anthony Liguori Avi Kivity wrote: > David S. Ahern wrote: >> I ran another test case with SMT disabled, and while I was at it >> converted TSC delta to operations/sec. The results without SMT are >> confusing -- to me anyways. I'm hoping someone can explain it. >> Basically, using a count of 10,000,000 (per your web page) with SMT >> disabled the guest detected a soft lockup on the CPU. So, I dropped the >> count down to 1,000,000. So, for 1e6 iterations: >> >> without SMT, with EPT: >> HC: 259,455 ops/sec >> PIO: 226,937 ops/sec >> MMIO: 113,180 ops/sec >> >> without SMT, without EPT: >> HC: 274,825 ops/sec >> PIO: 247,910 ops/sec >> MMIO: 111,535 ops/sec >> >> Converting the prior TSC deltas: >> >> with SMT, with EPT: >> HC: 994,655 ops/sec >> PIO: 875,116 ops/sec >> MMIO: 439,738 ops/sec >> >> with SMT, without EPT: >> HC: 994,304 ops/sec >> PIO: 903,057 ops/sec >> MMIO: 423,244 ops/sec >> >> Running the tests repeatedly I did notice a fair variability (as much as >> -10% down from these numbers). >> >> Also, just to make sure I converted the delta to ops/sec, the formula I >> used was cpu_freq / dTSC * count = operations/sec >> >> > > The only think I can think of is cpu frequency scaling lying about the > cpu frequency. Really the test needs to use time and not the time stamp > counter. > > Are the results expressed in cycles/op more reasonable? > Power settings seem to be the root cause. With this HP server the SMT mode must be disabling or overriding a power setting that is enabled in the bios. I found one power-based knob that gets non-SMT performance close to SMT numbers. 
Not very intuitive that SMT/non-SMT can differ so dramatically. david ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:35 ` Marcelo Tosatti 2009-05-07 23:43 ` Marcelo Tosatti @ 2009-05-08 3:13 ` Gregory Haskins 2009-05-08 7:59 ` Avi Kivity 2009-05-08 14:15 ` Gregory Haskins 3 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 3:13 UTC (permalink / raw) To: Marcelo Tosatti Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 4197 bytes --] Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > >> Chris Wright wrote: >> >>> * Gregory Haskins (ghaskins@novell.com) wrote: >>> >>> >>>> Chris Wright wrote: >>>> >>>> >>>>> VF drivers can also have this issue (and typically use mmio). >>>>> I at least have a better idea what your proposal is, thanks for >>>>> explanation. Are you able to demonstrate concrete benefit with it yet >>>>> (improved latency numbers for example)? >>>>> >>>>> >>>> I had a test-harness/numbers for this kind of thing, but its a bit >>>> crufty since its from ~1.5 years ago. I will dig it up, update it, and >>>> generate/post new numbers. >>>> >>>> >>> That would be useful, because I keep coming back to pio and shared >>> page(s) when think of why not to do this. Seems I'm not alone in that. >>> >>> thanks, >>> -chris >>> >>> >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Greg, > > I think comparison is not entirely fair. 
You're using > KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > (on Intel) to only one register read: > > nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > > Whereas in a real hypercall for (say) PIO you would need the address, > size, direction and data. > Hi Marcelo, I'll have to respectfully disagree with you here. What you are proposing is actually a different test: a 4th type I would call "PIO over HC". It is distinctly different than the existing MMIO, PIO, and HC tests already present. I assert that the current HC test remains valid because for pure hypercalls, the "nr" *is* the address. It identifies the function to be executed (e.g. VAPIC_POLL_IRQ = null), just like the PIO address of my nullio device identifies the function to be executed (i.e. nullio_write() = null) My argument is that the HC test emulates the "dynhc()" concept I have been talking about, whereas the PIOoHC is more like the pv_io_ops->iowrite approach. That said, your 4th test type would actually be a very interesting data-point to add to the suite (especially since we are still kicking around the notion of doing something like this). I will update the patches. > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. In theory they should both share about the same algorithmic complexity in the decode-stage, but due to the possible optimization you mention you may have a point. I need to take some steps to ensure the HC path isn't artificially simplified by GCC (like making the execute stage do some trivial work like you mention below). > I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... 
back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? > > I'm not sure I need that much complexity. I can probably just change the test harness to generate an ioread32(), and have the functions return the TSC value as a return parameter for all test types. The important thing is that we pick something extremely cheap (yet dynamic) to compute so the execution time doesn't invalidate the measurement granularity with a large constant. Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:35 ` Marcelo Tosatti 2009-05-07 23:43 ` Marcelo Tosatti 2009-05-08 3:13 ` Gregory Haskins @ 2009-05-08 7:59 ` Avi Kivity 2009-05-08 11:09 ` Gregory Haskins [not found] ` <20090508104228.GD3011@amt.cnet> 2009-05-08 14:15 ` Gregory Haskins 3 siblings, 2 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-08 7:59 UTC (permalink / raw) To: Marcelo Tosatti Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori Marcelo Tosatti wrote: > I think comparison is not entirely fair. You're using > KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > (on Intel) to only one register read: > > nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > > Whereas in a real hypercall for (say) PIO you would need the address, > size, direction and data. > Well, that's probably one of the reasons pio is slower, as the cpu has to set these up, and the kernel has to read them. > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > Since there are only one or two elements in the list, I don't see how it could be optimized. > Whereas for hypercall measurement you don't. I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? > I don't understand why you want host tsc? We're interested in round-trip latency, so you want guest tsc all the time. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 7:59 ` Avi Kivity @ 2009-05-08 11:09 ` Gregory Haskins [not found] ` <20090508104228.GD3011@amt.cnet> 1 sibling, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 11:09 UTC (permalink / raw) To: Avi Kivity Cc: Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 2899 bytes --] Avi Kivity wrote: > Marcelo Tosatti wrote: >> I think comparison is not entirely fair. You're using >> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that >> (on Intel) to only one register read: >> >> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); >> >> Whereas in a real hypercall for (say) PIO you would need the address, >> size, direction and data. >> > > Well, that's probably one of the reasons pio is slower, as the cpu has > to set these up, and the kernel has to read them. Right, that was the point I was trying to make. Its real-world overhead to measure how long it takes KVM to go round-trip in each of the respective trap types. > >> Also for PIO/MMIO you're adding this unoptimized lookup to the >> measurement: >> >> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >> if (pio_dev) { >> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >> complete_pio(vcpu); return 1; >> } >> > > Since there are only one or two elements in the list, I don't see how > it could be optimized. To Marcelo's point, I think he was more taking exception to the fact that the HC path was potentially completely optimized out if GCC was super-intuitive about the switch(nr) statement hitting the null vector. In theory, both the io_bus and the select(nr) are about equivalent in algorithmic complexity (and depth, I should say) which is why I think in general the test is "fair". IOW it represents the real-world decode cycle function for each transport. 
However, if one side was artificially optimized simply due to the triviality of my NULLIO test, that is not fair, and that is the point I believe he was making. In any case, I just wrote a new version of the test which hopefully addresses forces GCC to leave it as a more real-world decode. (FYI: I saw no difference). I will update the tarball/wiki shortly. > >> Whereas for hypercall measurement you don't. I believe a fair comparison >> would be have a shared guest/host memory area where you store guest/host >> TSC values and then do, on guest: >> >> rdtscll(&shared_area->guest_tsc); >> pio/mmio/hypercall >> ... back to host >> rdtscll(&shared_area->host_tsc); >> >> And then calculate the difference (minus guests TSC_OFFSET of course)? >> > > I don't understand why you want host tsc? We're interested in > round-trip latency, so you want guest tsc all the time. Yeah, I agree. My take is he was just trying to introduce a real workload so GCC wouldn't do that potential "cheater decode" in the HC path. After thinking about it, however, I realized we could do that with a simple "state++" operation, so the new test does this in each of the various test's "execute" cycle. The timing calculation remains unchanged. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
[parent not found: <20090508104228.GD3011@amt.cnet>]

* Re: [RFC PATCH 0/3] generic hypercall support [not found] ` <20090508104228.GD3011@amt.cnet> @ 2009-05-08 12:43 ` Gregory Haskins 2009-05-08 15:33 ` Marcelo Tosatti 2009-05-08 16:48 ` Paul E. McKenney 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 12:43 UTC (permalink / raw) To: Marcelo Tosatti Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori, paulmck [-- Attachment #1: Type: text/plain, Size: 2708 bytes --] Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: > >> Marcelo Tosatti wrote: >> >>> I think comparison is not entirely fair. You're using >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that >>> (on Intel) to only one register read: >>> >>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); >>> >>> Whereas in a real hypercall for (say) PIO you would need the address, >>> size, direction and data. >>> >>> >> Well, that's probably one of the reasons pio is slower, as the cpu has >> to set these up, and the kernel has to read them. >> >> >>> Also for PIO/MMIO you're adding this unoptimized lookup to the >>> measurement: >>> >>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >>> if (pio_dev) { >>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >>> complete_pio(vcpu); return 1; >>> } >>> >>> >> Since there are only one or two elements in the list, I don't see how it >> could be optimized. >> > > speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev > is probably the last in the io_bus list. > > Not sure if this one matters very much. Point is you should measure the > exit time only, not the pio path vs hypercall path in kvm. > The problem is the exit time in of itself isnt all that interesting to me. What I am interested in measuring is how long it takes KVM to process the request and realize that I want to execute function "X". 
Ultimately that is what matters in terms of execution latency and is thus the more interesting data. I think the exit time is possibly an interesting 5th data point, but its more of a side-bar IMO. In any case, I suspect that both exits will be approximately the same at the VT/SVM level. OTOH: If there is a patch out there to improve KVMs code (say specifically the PIO handling logic), that is fair-game here and we should benchmark it. For instance, if you have ideas on ways to improve the find_pio_dev performance, etc.... One item may be to replace the kvm->lock on the bus scan with an RCU or something.... (though PIOs are very frequent and the constant re-entry to an an RCU read-side CS may effectively cause a perpetual grace-period and may be too prohibitive). CC'ing pmck. FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that 140 can possibly be recouped. I currently suspect the lock acquisition in the iobus-scan is the bulk of that time, but that is admittedly a guess. The remaining 200-250ns is elsewhere in the PIO decode. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 12:43 ` Gregory Haskins @ 2009-05-08 15:33 ` Marcelo Tosatti 2009-05-08 19:02 ` Avi Kivity 2009-05-08 16:48 ` Paul E. McKenney 1 sibling, 1 reply; 115+ messages in thread From: Marcelo Tosatti @ 2009-05-08 15:33 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori, paulmck On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > The problem is the exit time in of itself isnt all that interesting to > me. What I am interested in measuring is how long it takes KVM to > process the request and realize that I want to execute function "X". > Ultimately that is what matters in terms of execution latency and is > thus the more interesting data. I think the exit time is possibly an > interesting 5th data point, but its more of a side-bar IMO. In any > case, I suspect that both exits will be approximately the same at the > VT/SVM level. > > OTOH: If there is a patch out there to improve KVMs code (say > specifically the PIO handling logic), that is fair-game here and we > should benchmark it. For instance, if you have ideas on ways to improve > the find_pio_dev performance, etc.... <guess mode on> One easy thing to try is to cache the last successful lookup on a pointer, to improve patterns where there's "device locality" (like nullio test). <guess mode off> > One item may be to replace the kvm->lock on the bus scan with an RCU > or something.... (though PIOs are very frequent and the constant > re-entry to an an RCU read-side CS may effectively cause a perpetual > grace-period and may be too prohibitive). CC'ing pmck. Yes, locking improvements are needed there badly (think for eg the cache bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP guests). > FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that > 140 can possibly be recouped. 
I currently suspect the lock acquisition > in the iobus-scan is the bulk of that time, but that is admittedly a > guess. The remaining 200-250ns is elsewhere in the PIO decode. vmcs_read is significantly expensive (http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html, likely that my measurements were foobar, Avi mentioned 50 cycles for vmcs_write). See for eg how vmx.c reads VM_EXIT_INTR_INFO twice on every exit. Also this one looks pretty bad for a 32-bit PAE guest (and you can get away with the unconditional GUEST_CR3 read too). /* Access CR3 don't cause VMExit in paging mode, so we need * to sync with guest real CR3. */ if (enable_ept && is_paging(vcpu)) { vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); ept_load_pdptrs(vcpu); } ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 15:33 ` Marcelo Tosatti @ 2009-05-08 19:02 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-08 19:02 UTC (permalink / raw) To: Marcelo Tosatti Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori, paulmck Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > >> The problem is the exit time in of itself isnt all that interesting to >> me. What I am interested in measuring is how long it takes KVM to >> process the request and realize that I want to execute function "X". >> Ultimately that is what matters in terms of execution latency and is >> thus the more interesting data. I think the exit time is possibly an >> interesting 5th data point, but its more of a side-bar IMO. In any >> case, I suspect that both exits will be approximately the same at the >> VT/SVM level. >> >> OTOH: If there is a patch out there to improve KVMs code (say >> specifically the PIO handling logic), that is fair-game here and we >> should benchmark it. For instance, if you have ideas on ways to improve >> the find_pio_dev performance, etc.... >> > > <guess mode on> > > One easy thing to try is to cache the last successful lookup on a > pointer, to improve patterns where there's "device locality" (like > nullio test). > We should do that everywhere, memory slots, pio slots, etc. Or even keep statistics on accesses and sort by that. > <guess mode off> > > I'd leave it on if I were you. >> One item may be to replace the kvm->lock on the bus scan with an RCU >> or something.... (though PIOs are very frequent and the constant >> re-entry to an an RCU read-side CS may effectively cause a perpetual >> grace-period and may be too prohibitive). CC'ing pmck. >> > > Yes, locking improvements are needed there badly (think for eg the cache > bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP > guests). 
> There's no reason for kvm->lock on pio. We should push the locking to devices. I'm going to rename slots_lock as slots_lock_please_reimplement_me_using_rcu, this keeps coming up. >> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that >> 140 can possibly be recouped. I currently suspect the lock acquisition >> in the iobus-scan is the bulk of that time, but that is admittedly a >> guess. The remaining 200-250ns is elsewhere in the PIO decode. >> > > vmcs_read is significantly expensive > (http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html, > likely that my measurements were foobar, Avi mentioned 50 cycles for > vmcs_write). > IIRC vmcs reads are pretty fast, and are being improved. > See for eg how vmx.c reads VM_EXIT_INTR_INFO twice on every exit. > Ugh. > Also this one looks pretty bad for a 32-bit PAE guest (and you can > get away with the unconditional GUEST_CR3 read too). > > /* Access CR3 don't cause VMExit in paging mode, so we need > * to sync with guest real CR3. */ > if (enable_ept && is_paging(vcpu)) { > vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); > ept_load_pdptrs(vcpu); > } > > We should use an accessor here just like with registers and segment registers. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 12:43 ` Gregory Haskins 2009-05-08 15:33 ` Marcelo Tosatti @ 2009-05-08 16:48 ` Paul E. McKenney 2009-05-08 19:55 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Paul E. McKenney @ 2009-05-08 16:48 UTC (permalink / raw) To: Gregory Haskins Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > Marcelo Tosatti wrote: > > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: > > > >> Marcelo Tosatti wrote: > >> > >>> I think comparison is not entirely fair. You're using > >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > >>> (on Intel) to only one register read: > >>> > >>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > >>> > >>> Whereas in a real hypercall for (say) PIO you would need the address, > >>> size, direction and data. > >>> > >>> > >> Well, that's probably one of the reasons pio is slower, as the cpu has > >> to set these up, and the kernel has to read them. > >> > >> > >>> Also for PIO/MMIO you're adding this unoptimized lookup to the > >>> measurement: > >>> > >>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > >>> if (pio_dev) { > >>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > >>> complete_pio(vcpu); return 1; > >>> } > >>> > >>> > >> Since there are only one or two elements in the list, I don't see how it > >> could be optimized. > >> > > > > speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev > > is probably the last in the io_bus list. > > > > Not sure if this one matters very much. Point is you should measure the > > exit time only, not the pio path vs hypercall path in kvm. > > > > The problem is the exit time in of itself isnt all that interesting to > me. What I am interested in measuring is how long it takes KVM to > process the request and realize that I want to execute function "X". 
> Ultimately that is what matters in terms of execution latency and is > thus the more interesting data. I think the exit time is possibly an > interesting 5th data point, but its more of a side-bar IMO. In any > case, I suspect that both exits will be approximately the same at the > VT/SVM level. > > OTOH: If there is a patch out there to improve KVMs code (say > specifically the PIO handling logic), that is fair-game here and we > should benchmark it. For instance, if you have ideas on ways to improve > the find_pio_dev performance, etc.... One item may be to replace the > kvm->lock on the bus scan with an RCU or something.... (though PIOs are > very frequent and the constant re-entry to an an RCU read-side CS may > effectively cause a perpetual grace-period and may be too prohibitive). > CC'ing pmck. Hello, Greg! Not a problem. ;-) A grace period only needs to wait on RCU read-side critical sections that started before the grace period started. As soon as those pre-existing RCU read-side critical get done, the grace period can end, regardless of how many RCU read-side critical sections might have started after the grace period started. If you find a situation where huge numbers of RCU read-side critical sections do indefinitely delay a grace period, then that is a bug in RCU that I need to fix. Of course, if you have a single RCU read-side critical section that runs for a very long time, that -will- delay a grace period. As long as you don't do it too often, this is not a problem, though if running a single RCU read-side critical section for more than a few milliseconds is probably not a good thing. Not as bad as holding a heavily contended spinlock for a few milliseconds, but still not a good thing. Thanx, Paul > FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that > 140 can possibly be recouped. I currently suspect the lock acquisition > in the iobus-scan is the bulk of that time, but that is admittedly a > guess. 
The remaining 200-250ns is elsewhere in the PIO decode. > > -Greg > > > > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 16:48 ` Paul E. McKenney @ 2009-05-08 19:55 ` Gregory Haskins 0 siblings, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 19:55 UTC (permalink / raw) To: paulmck Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 4007 bytes --] Paul E. McKenney wrote: > On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > >> Marcelo Tosatti wrote: >> >>> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: >>> >>> >>>> Marcelo Tosatti wrote: >>>> >>>> >>>>> I think comparison is not entirely fair. You're using >>>>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that >>>>> (on Intel) to only one register read: >>>>> >>>>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); >>>>> >>>>> Whereas in a real hypercall for (say) PIO you would need the address, >>>>> size, direction and data. >>>>> >>>>> >>>>> >>>> Well, that's probably one of the reasons pio is slower, as the cpu has >>>> to set these up, and the kernel has to read them. >>>> >>>> >>>> >>>>> Also for PIO/MMIO you're adding this unoptimized lookup to the >>>>> measurement: >>>>> >>>>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >>>>> if (pio_dev) { >>>>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >>>>> complete_pio(vcpu); return 1; >>>>> } >>>>> >>>>> >>>>> >>>> Since there are only one or two elements in the list, I don't see how it >>>> could be optimized. >>>> >>>> >>> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev >>> is probably the last in the io_bus list. >>> >>> Not sure if this one matters very much. Point is you should measure the >>> exit time only, not the pio path vs hypercall path in kvm. >>> >>> >> The problem is the exit time in of itself isnt all that interesting to >> me. 
What I am interested in measuring is how long it takes KVM to >> process the request and realize that I want to execute function "X". >> Ultimately that is what matters in terms of execution latency and is >> thus the more interesting data. I think the exit time is possibly an >> interesting 5th data point, but its more of a side-bar IMO. In any >> case, I suspect that both exits will be approximately the same at the >> VT/SVM level. >> >> OTOH: If there is a patch out there to improve KVMs code (say >> specifically the PIO handling logic), that is fair-game here and we >> should benchmark it. For instance, if you have ideas on ways to improve >> the find_pio_dev performance, etc.... One item may be to replace the >> kvm->lock on the bus scan with an RCU or something.... (though PIOs are >> very frequent and the constant re-entry to an an RCU read-side CS may >> effectively cause a perpetual grace-period and may be too prohibitive). >> CC'ing pmck. >> > > Hello, Greg! > > Not a problem. ;-) > > A grace period only needs to wait on RCU read-side critical sections that > started before the grace period started. As soon as those pre-existing > RCU read-side critical get done, the grace period can end, regardless > of how many RCU read-side critical sections might have started after > the grace period started. > > If you find a situation where huge numbers of RCU read-side critical > sections do indefinitely delay a grace period, then that is a bug in > RCU that I need to fix. > > Of course, if you have a single RCU read-side critical section that > runs for a very long time, that -will- delay a grace period. As long > as you don't do it too often, this is not a problem, though if running > a single RCU read-side critical section for more than a few milliseconds > is probably not a good thing. Not as bad as holding a heavily contended > spinlock for a few milliseconds, but still not a good thing. 
> Hey Paul, This makes sense, and it clears up a misconception I had about RCU. So thanks for that. Based on what Paul said, I think we can get some amount of gains in the PIO and PIOoHC stats from converting to RCU. I will do this next. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
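Following Paul's clarification, the conversion Greg is signing up for might look roughly like the kernel-style pseudocode below. This is illustrative only and will not build outside a kernel tree: the io_bus of this era is actually a small array rather than an RCU list, and none of these helper names come from the patch series — only the RCU primitives themselves (rcu_read_lock(), list_for_each_entry_rcu(), synchronize_rcu()) are real kernel API:

```c
/* Readers take rcu_read_lock() instead of kvm->lock, so concurrent PIO
 * exits never contend on a spinlock during the bus scan. */
static struct kvm_io_device *io_bus_find_dev_rcu(struct kvm_io_bus *bus,
                                                 gpa_t addr, int len,
                                                 int is_write)
{
    struct kvm_io_device *dev, *found = NULL;

    rcu_read_lock();
    list_for_each_entry_rcu(dev, &bus->devs, list) {
        if (dev->in_range(dev, addr, len, is_write)) {
            found = dev;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

/* The (rare) updater unlinks the device and waits one grace period --
 * which, per Paul, only has to drain readers that started before it. */
static void io_bus_unregister_dev_rcu(struct kvm_io_bus *bus,
                                      struct kvm_io_device *dev)
{
    list_del_rcu(&dev->list);
    synchronize_rcu();
    kvm_iodevice_destructor(dev);
}
```

The key property is exactly the one Paul describes: a constant stream of short read-side critical sections does not stall the grace period, so high PIO rates are not a problem for the updater.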
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 23:35 ` Marcelo Tosatti ` (2 preceding siblings ...) 2009-05-08 7:59 ` Avi Kivity @ 2009-05-08 14:15 ` Gregory Haskins 2009-05-08 14:53 ` Anthony Liguori 3 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 14:15 UTC (permalink / raw) To: Marcelo Tosatti Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori [-- Attachment #1: Type: text/plain, Size: 1655 bytes --] Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > >> Chris Wright wrote: >> >>> * Gregory Haskins (ghaskins@novell.com) wrote: >>> >>> >>>> Chris Wright wrote: >>>> >>>> >>>>> VF drivers can also have this issue (and typically use mmio). >>>>> I at least have a better idea what your proposal is, thanks for >>>>> explanation. Are you able to demonstrate concrete benefit with it yet >>>>> (improved latency numbers for example)? >>>>> >>>>> >>>> I had a test-harness/numbers for this kind of thing, but its a bit >>>> crufty since its from ~1.5 years ago. I will dig it up, update it, and >>>> generate/post new numbers. >>>> >>>> >>> That would be useful, because I keep coming back to pio and shared >>> page(s) when think of why not to do this. Seems I'm not alone in that. >>> >>> thanks, >>> -chris >>> >>> >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Greg, > > I think comparison is not entirely fair. <snip> FYI: I've update the test/wiki to (hopefully) address your concerns. 
http://developer.novell.com/wiki/index.php/WhyHypercalls Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 14:15 ` Gregory Haskins @ 2009-05-08 14:53 ` Anthony Liguori 2009-05-08 18:50 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Anthony Liguori @ 2009-05-08 14:53 UTC (permalink / raw) To: Gregory Haskins Cc: Marcelo Tosatti, Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm Gregory Haskins wrote: >> Greg, >> >> I think comparison is not entirely fair. >> > > <snip> > > FYI: I've updated the test/wiki to (hopefully) address your concerns. > > http://developer.novell.com/wiki/index.php/WhyHypercalls > And we're now getting close to the point where the difference is virtually meaningless. At .14us, in order to see 1% CPU overhead added from PIO vs HC, you need 71429 exits per second. If you have this many exits, the sheer cost of the base vmexit overhead is going to result in about 15% CPU overhead. To put this another way, if your workload was entirely bound by vmexits (which is virtually impossible), then when you were saturating your CPU at 100%, only 7% of that is the cost of PIO exits vs. HC. In real life workloads, if you're paying 15% overhead just to the cost of exits (not including the cost of heavyweight or post-exit processing), you're toast. I think it's going to be very difficult to construct a real scenario where you'll have a measurable (i.e. > 1%) performance overhead from using PIO vs. HC. And in the absence of that, I don't see the justification for adding additional infrastructure to Linux to support this. The non-x86 architecture argument isn't valid because other architectures either 1) don't use PCI at all (s390) and are already using hypercalls 2) use PCI, but do not have a dedicated hypercall instruction (PPC emb) or 3) have PIO (ia64). Regards, Anthony Liguori > Regards, > -Greg > > > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 14:53 ` Anthony Liguori @ 2009-05-08 18:50 ` Avi Kivity 2009-05-08 19:02 ` Anthony Liguori 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-08 18:50 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel, kvm Anthony Liguori wrote: > > And we're now getting close to the point where the difference is > virtually meaningless. > > At .14us, in order to see 1% CPU overhead added from PIO vs HC, you > need 71429 exits. > If I read things correctly, you want the difference between PIO and PIOoHC, which is 210ns. But your point stands, 50,000 exits/sec will add 1% cpu overhead. > > The non-x86 architecture argument isn't valid because other > architectures either 1) don't use PCI at all (s390) and are already > using hypercalls 2) use PCI, but do not have a dedicated hypercall > instruction (PPC emb) or 3) have PIO (ia64). ia64 uses mmio to emulate pio, so the cost may be different. I agree on x86 it's almost negligible. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 18:50 ` Avi Kivity @ 2009-05-08 19:02 ` Anthony Liguori 2009-05-08 19:06 ` Avi Kivity 2009-05-11 16:37 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 115+ messages in thread From: Anthony Liguori @ 2009-05-08 19:02 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel, kvm Avi Kivity wrote: > Anthony Liguori wrote: >> >> And we're now getting close to the point where the difference is >> virtually meaningless. >> >> At .14us, in order to see 1% CPU overhead added from PIO vs HC, you >> need 71429 exits. >> > > If I read things correctly, you want the difference between PIO and > PIOoHC, which is 210ns. But your point stands, 50,000 exits/sec will > add 1% cpu overhead. Right, the basic math still stands. >> >> The non-x86 architecture argument isn't valid because other >> architectures either 1) don't use PCI at all (s390) and are already >> using hypercalls 2) use PCI, but do not have a dedicated hypercall >> instruction (PPC emb) or 3) have PIO (ia64). > > ia64 uses mmio to emulate pio, so the cost may be different. I agree > on x86 it's almost negligible. Yes, I misunderstood that they actually emulated it like that. However, ia64 has no paravirtualization support today so surely, we aren't going to be justifying this via ia64, right? Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 19:02 ` Anthony Liguori @ 2009-05-08 19:06 ` Avi Kivity 2009-05-11 16:37 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-08 19:06 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel, kvm Anthony Liguori wrote: >> ia64 uses mmio to emulate pio, so the cost may be different. I agree >> on x86 it's almost negligible. > > Yes, I misunderstood that they actually emulated it like that. > However, ia64 has no paravirtualization support today so surely, we > aren't going to be justifying this via ia64, right? > Right. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 19:02 ` Anthony Liguori 2009-05-08 19:06 ` Avi Kivity @ 2009-05-11 16:37 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 115+ messages in thread From: Jeremy Fitzhardinge @ 2009-05-11 16:37 UTC (permalink / raw) To: Anthony Liguori Cc: Avi Kivity, Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel, kvm Anthony Liguori wrote: > Yes, I misunderstood that they actually emulated it like that. > However, ia64 has no paravirtualization support today so surely, we > aren't going to be justifying this via ia64, right? > Someone is actively putting a pvops infrastructure into the ia64 port, along with a Xen port. I think pieces of it got merged this last window. J ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 13:17 ` Gregory Haskins 2009-05-06 16:07 ` Chris Wright @ 2009-05-07 12:29 ` Avi Kivity 2009-05-08 14:59 ` Anthony Liguori 1 sibling, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-07 12:29 UTC (permalink / raw) To: Gregory Haskins; +Cc: Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: > Chris Wright wrote: > >> * Gregory Haskins (ghaskins@novell.com) wrote: >> >> >>> Chris Wright wrote: >>> >>> >>>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) >>>> means hypercall number and arg list must be the same in order for code >>>> to call hypercall() in a hypervisor agnostic way. >>>> >>>> >>> Yes, and that is exactly the intention. I think its perhaps the point >>> you are missing. >>> >>> >> Yes, I was reading this as purely any hypercall, but it seems a bit >> more like: >> pv_io_ops->iomap() >> pv_io_ops->ioread() >> pv_io_ops->iowrite() >> >> > > Right. > Hmm, reminds me of something I thought of a while back. We could implement an 'mmio hypercall' that does mmio reads/writes via a hypercall instead of an mmio operation. That will speed up mmio for emulated devices (say, e1000). It's easy to hook into Linux (readl/writel), is pci-friendly, non-x86 friendly, etc. It also makes the device work when hypercall support is not available (qemu/tcg); you simply fall back on mmio. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-07 12:29 ` Avi Kivity @ 2009-05-08 14:59 ` Anthony Liguori 2009-05-09 12:01 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Anthony Liguori @ 2009-05-08 14:59 UTC (permalink / raw) To: Avi Kivity Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm Avi Kivity wrote: > > Hmm, reminds me of something I thought of a while back. > > We could implement an 'mmio hypercall' that does mmio reads/writes via > a hypercall instead of an mmio operation. That will speed up mmio for > emulated devices (say, e1000). It's easy to hook into Linux > (readl/writel), is pci-friendly, non-x86 friendly, etc. By the time you get down to userspace for an emulated device, that 2us difference between mmio and hypercalls is simply not going to make a difference. I'm surprised so much effort is going into this, is there any indication that this is even close to a bottleneck in any circumstance? We have much, much lower hanging fruit to attack. The basic fact that we still copy data multiple times in the networking drivers is clearly more significant than a few hundred nanoseconds that should occur less than once per packet. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 14:59 ` Anthony Liguori @ 2009-05-09 12:01 ` Gregory Haskins 2009-05-10 18:38 ` Anthony Liguori 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-09 12:01 UTC (permalink / raw) To: Anthony Liguori Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 1854 bytes --] Anthony Liguori wrote: > Avi Kivity wrote: >> >> Hmm, reminds me of something I thought of a while back. >> >> We could implement an 'mmio hypercall' that does mmio reads/writes >> via a hypercall instead of an mmio operation. That will speed up >> mmio for emulated devices (say, e1000). It's easy to hook into Linux >> (readl/writel), is pci-friendly, non-x86 friendly, etc. > > By the time you get down to userspace for an emulated device, that 2us > difference between mmio and hypercalls is simply not going to make a > difference. I don't care about this path for emulated devices. I am interested in in-kernel vbus devices. > I'm surprised so much effort is going into this, is there any > indication that this is even close to a bottleneck in any circumstance? Yes. Each 1us of overhead is a 4% regression in something as trivial as a 25us UDP/ICMP rtt "ping". > > > We have much, much lower hanging fruit to attack. The basic fact that > we still copy data multiple times in the networking drivers is clearly > more significant than a few hundred nanoseconds that should occur less > than once per packet. for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring. Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit). We need numbers before we can really decide to abandon this optimization. 
If PPC mmio has no penalty over hypercall, I am not sure the 350ns on x86 is worth this effort (especially if I can shrink this with some RCU fixes). Otherwise, the margin is quite a bit larger. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-09 12:01 ` Gregory Haskins @ 2009-05-10 18:38 ` Anthony Liguori 2009-05-11 13:14 ` Gregory Haskins 2009-05-11 16:44 ` Hollis Blanchard 0 siblings, 2 replies; 115+ messages in thread From: Anthony Liguori @ 2009-05-10 18:38 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm, Hollis Blanchard Gregory Haskins wrote: > Anthony Liguori wrote: > > >> I'm surprised so much effort is going into this, is there any >> indication that this is even close to a bottleneck in any circumstance? >> > > Yes. Each 1us of overhead is a 4% regression in something as trivial as > a 25us UDP/ICMP rtt "ping". > It wasn't 1us, it was 350ns or something around there (i.e. ~1%). > for request-response, this is generally for *every* packet since you > cannot exploit buffering/deferring. > > Can you back up your claim that PPC has no difference in performance > with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" > like instructions, but clearly there are ways to cause a trap, so > presumably we can measure the difference between a PF exit and something > more explicit). > First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance. This is not the right place to focus wrt optimizations. And because there's no hardware assistance, there simply isn't a hypercall instruction. Are PFs the fastest type of exits? Probably not, but I honestly have no idea. I'm sure Hollis does though. Page faults are going to have tremendously different performance characteristics on PPC too because it's a software-managed TLB. There's no page table lookup like there is on x86. As a more general observation, we need numbers to justify an optimization, not to justify not including an optimization. 
In other words, the burden is on you to present a scenario where this optimization would result in a measurable improvement in a real world work load. Regards, Anthony Liguori > We need numbers before we can really decide to abandon this > optimization. If PPC mmio has no penalty over hypercall, I am not sure > the 350ns on x86 is worth this effort (especially if I can shrink this > with some RCU fixes). Otherwise, the margin is quite a bit larger. > > -Greg > > > > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-10 18:38 ` Anthony Liguori @ 2009-05-11 13:14 ` Gregory Haskins 2009-05-11 16:35 ` Hollis Blanchard 2009-05-11 17:31 ` Anthony Liguori 1 sibling, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-11 13:14 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm, Hollis Blanchard [-- Attachment #1: Type: text/plain, Size: 8778 bytes --] Anthony Liguori wrote: > Gregory Haskins wrote: >> Anthony Liguori wrote: >> >>> I'm surprised so much effort is going into this, is there any >>> indication that this is even close to a bottleneck in any circumstance? >>> >> >> Yes. Each 1us of overhead is a 4% regression in something as trivial as >> a 25us UDP/ICMP rtt "ping". > > It wasn't 1us, it was 350ns or something around there (i.e. ~1%). I wasn't referring to "it". I chose my words carefully. Let me rephrase for your clarity: *each* 1us of overhead introduced into the signaling path is a ~4% latency regression for a round trip on a high speed network (note that this can also affect throughput at some level, too). I believe this point has been lost on you from the very beginning of the vbus discussions. I specifically generalized my statement above because #1 I assume everyone here is smart enough to convert that nice round unit into the relevant figure. And #2, there are multiple potential latency sources at play which we need to factor in when looking at the big picture. For instance, the difference between a PF exit and an IO exit (2.58us on x86, to be precise). Or whether you need to take a heavy-weight exit. Or a context switch to qemu, to the kernel, back to qemu, and back to the vcpu. Or acquire a mutex. Or get head-of-lined on the VGA model's IO. 
I know you wish that this whole discussion would just go away, but these little "300ns here, 1600ns there" really add up in aggregate despite your dismissive attitude towards them. And it doesn't take much to affect the results in a measurable way. As stated, each 1us costs ~4%. My motivation is to reduce as many of these sources as possible. So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% improvement. So what? It's still an improvement. If that improvement were for free, would you object? And we all know that this change isn't "free" because we have to change some code (+128/-0, to be exact). But what is it specifically you are objecting to in the first place? Adding hypercall support as a pv_ops primitive isn't exactly hard or complex, or even very much code. Besides, I've already clearly stated multiple times (including in this very thread) that I agree that I am not yet sure if the 350ns/1.4% improvement alone is enough to justify a change. So if you are somehow trying to make me feel silly by pointing out the "~1%" above, you are being ridiculous. Rather, I was simply answering your question as to whether these latency sources are a real issue. The answer is "yes" (assuming you care about latency) and I gave you a specific example and a method to quantify the impact. It is duly noted that you do not care about this type of performance, but you also need to realize that your "blessing" or acknowledgment/denial of the problem domain has _zero_ bearing on whether the domain exists, or if there are others out there that do care about it. Sorry. 
>> >> Can you back up your claim that PPC has no difference in performance >> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" >> like instructions, but clearly there are ways to cause a trap, so >> presumably we can measure the difference between a PF exit and something >> more explicit). >> > > First, the PPC that KVM supports performs very poorly relatively > speaking because it receives no hardware assistance So wouldn't that be making the case that it could use as much help as possible? > this is not the right place to focus wrt optimizations. Odd choice of words. I am advocating the opposite (broad solution to many arches and many platforms (i.e. hypervisors)) and therefore I am not "focused" on it (or really any one arch) at all per se. I am _worried_ however, that we could be overlooking PPC (as an example) if we ignore the disparity between MMIO and HC since other higher performance options are not available like PIO. The goal on this particular thread is to come up with an IO interface that works reasonably well across as many hypervisors as possible. MMIO/PIO do not appear to fit that bill (at least not without tunneling them over HCs) If I am guilty of focusing anywhere too much it would be x86 since that is the only development platform I have readily available. > > > And because there's no hardware assistance, there simply isn't a > hypercall instruction. Are PFs the fastest type of exits? Probably > not but I honestly have no idea. I'm sure Hollis does though. > > Page faults are going to have tremendously different performance > characteristics on PPC too because it's a software managed TLB. > There's no page table lookup like there is on x86. The difference between MMIO and "HC", and whether it is cause for concern will continue to be pure speculation until we can find someone with a PPC box willing to run some numbers. 
I will point out that we both seem to theorize that PFs will yield lower output than alternatives, so it would seem you are actually making my point for me. > > As a more general observation, we need numbers to justify an > optimization, not to justify not including an optimization. > > In other words, the burden is on you to present a scenario where this > optimization would result in a measurable improvement in a real world > work load. I have already done this. You seem to have chosen to ignore my statements and results, but if you insist on rehashing: I started this project by analyzing system traces and finding some of the various bottlenecks in comparison to a native host. Throughput was already pretty decent, but latency was pretty bad (and recently got *really* bad, but I know you already have a handle on what's causing that). I digress...one of the conclusions of the research was that I wanted to focus on building an IO subsystem designed to minimize the quantity of exits, minimize the cost of each exit, and shorten the end-to-end signaling path to achieve optimal performance. I also wanted to build a system that was extensible enough to work with a variety of client types, on a variety of architectures, etc, so we would only need to solve these problems "once". The end result was vbus, and the first working example was venet. The measured performance data of this work was as follows: 802.x network, 9000 byte MTU, 2 8-core x86_64s connected back to back with Chelsio T3 10GE via crossover.

Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details: http://lkml.org/lkml/2009/4/21/408 You can download this today and run it, review it, compare it. Whatever you want. As part of that work, I measured IO performance in KVM and found HCs to be the superior performer. 
You can find these results here: http://developer.novell.com/wiki/index.php/WhyHypercalls. Without having access to platforms other than x86, but with an understanding of computer architecture, I speculate that the difference should be even more profound everywhere else in lieu of a PIO primitive. And even on the platform which should yield the least benefit (x86), the gain (~1.4%) is not huge, but its not zero either. Therefore, my data and findings suggest that this is not a bad optimization to consider IMO. My final results above do not indicate to me that I was completely wrong in my analysis. Now I know you have been quick in the past to dismiss my efforts, and to claim you can get the same results without needing the various tricks and optimizations I uncovered. But quite frankly, until you post some patches for community review and comparison (as I have done), it's just meaningless talk. Perhaps you are truly unimpressed with my results and will continue to insist that my work including my final results are "virtually meaningless". Or perhaps you have an agenda. You can keep working against me and try to block anything I suggest by coming up with what appears to be any excuse you can find, making rude replies on email threads and snide comments on IRC, etc. It's simply not necessary. Alternatively, you can work _with_ me to help try to improve KVM and Linux (e.g. I still need someone to implement a virtio-net backend, and who knows it better than you). The choice is yours. But lets cut the BS because it's counter productive, and frankly, getting old. Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 13:14 ` Gregory Haskins @ 2009-05-11 16:35 ` Hollis Blanchard 2009-05-11 17:06 ` Avi Kivity 2009-05-11 17:31 ` Anthony Liguori 1 sibling, 1 reply; 115+ messages in thread From: Hollis Blanchard @ 2009-05-11 16:35 UTC (permalink / raw) To: Gregory Haskins Cc: Anthony Liguori, Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm On Mon, 2009-05-11 at 09:14 -0400, Gregory Haskins wrote:

>>> for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring.
>>>
>>> Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit).
>>
>> First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance
>
> So wouldn't that be making the case that it could use as much help as possible?

I think he's referencing Amdahl's Law here. While I'd agree, this is relevant only for the current KVM PowerPC implementations. I think it would be short-sighted to design an IO architecture around that.

>> this is not the right place to focus wrt optimizations.
>
> Odd choice of words. I am advocating the opposite (broad solution to many arches and many platforms (i.e. hypervisors)) and therefore I am not "focused" on it (or really any one arch) at all per se. I am _worried_ however, that we could be overlooking PPC (as an example) if we ignore the disparity between MMIO and HC since other higher performance options are not available like PIO. The goal on this particular thread is to come up with an IO interface that works reasonably well across as many hypervisors as possible. 
> MMIO/PIO do not appear to fit that bill (at least not without tunneling them over HCs)

I haven't been following this conversation at all. With that in mind... AFAICS, a hypercall is clearly the higher-performing option, since you don't need the additional memory load (which could even cause a page fault in some circumstances) and instruction decode. That said, I'm willing to agree that this overhead is probably negligible compared to the IOp itself... Amdahl's Law again. -- Hollis Blanchard IBM Linux Technology Center ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 16:35 ` Hollis Blanchard @ 2009-05-11 17:06 ` Avi Kivity 2009-05-11 17:29 ` Gregory Haskins 2009-05-11 17:51 ` Anthony Liguori 0 siblings, 2 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-11 17:06 UTC (permalink / raw) To: Hollis Blanchard Cc: Gregory Haskins, Anthony Liguori, Gregory Haskins, Chris Wright, linux-kernel, kvm Hollis Blanchard wrote: > I haven't been following this conversation at all. With that in mind... > > AFAICS, a hypercall is clearly the higher-performing option, since you > don't need the additional memory load (which could even cause a page > fault in some circumstances) and instruction decode. That said, I'm > willing to agree that this overhead is probably negligible compared to > the IOp itself... Ahmdal's Law again. > It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default. Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7). Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. 
mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 17:06 ` Avi Kivity @ 2009-05-11 17:29 ` Gregory Haskins 2009-05-11 17:53 ` Avi Kivity 2009-05-11 17:51 ` Anthony Liguori 1 sibling, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-11 17:29 UTC (permalink / raw) To: Avi Kivity Cc: Hollis Blanchard, Anthony Liguori, Gregory Haskins, Chris Wright, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 2038 bytes --] Avi Kivity wrote: > Hollis Blanchard wrote: >> I haven't been following this conversation at all. With that in mind... >> >> AFAICS, a hypercall is clearly the higher-performing option, since you >> don't need the additional memory load (which could even cause a page >> fault in some circumstances) and instruction decode. That said, I'm >> willing to agree that this overhead is probably negligible compared to >> the IOp itself... Ahmdal's Law again. >> > > It's a question of cost vs. benefit. It's clear the benefit is low > (but that doesn't mean it's not worth having). The cost initially > appeared to be very low, until the nested virtualization wrench was > thrown into the works. Not that nested virtualization is a reality -- > even on svm where it is implemented it is not yet production quality > and is disabled by default. > > Now nested virtualization is beginning to look interesting, with > Windows 7's XP mode requiring virtualization extensions. Desktop > virtualization is also something likely to use device assignment > (though you probably won't assign a virtio device to the XP instance > inside Windows 7). > > Maybe we should revisit the mmio hypercall idea again, it might be > workable if we find a way to let the guest know if it should use the > hypercall or not for a given memory range. 
> > mmio hypercall is nice because > - it falls back nicely to pure mmio > - it optimizes an existing slow path, not just new device models > - it has preexisting semantics, so we have less ABI to screw up > - for nested virtualization + device assignment, we can drop it and > get a nice speed win (or rather, less speed loss) > Yeah, I agree with all this. I am still wrestling with how to deal with the device-assignment problem w.r.t. shunting io requests into a hypercall vs letting them PF. Are you saying we could simply ignore this case by disabling "MMIOoHC" when assignment is enabled? That would certainly make the problem much easier to solve. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 17:29 ` Gregory Haskins @ 2009-05-11 17:53 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-11 17:53 UTC (permalink / raw) To: Gregory Haskins Cc: Hollis Blanchard, Anthony Liguori, Gregory Haskins, Chris Wright, linux-kernel, kvm Gregory Haskins wrote: > Avi Kivity wrote: > >> Hollis Blanchard wrote: >> >>> I haven't been following this conversation at all. With that in mind... >>> >>> AFAICS, a hypercall is clearly the higher-performing option, since you >>> don't need the additional memory load (which could even cause a page >>> fault in some circumstances) and instruction decode. That said, I'm >>> willing to agree that this overhead is probably negligible compared to >>> the IOp itself... Ahmdal's Law again. >>> >>> >> It's a question of cost vs. benefit. It's clear the benefit is low >> (but that doesn't mean it's not worth having). The cost initially >> appeared to be very low, until the nested virtualization wrench was >> thrown into the works. Not that nested virtualization is a reality -- >> even on svm where it is implemented it is not yet production quality >> and is disabled by default. >> >> Now nested virtualization is beginning to look interesting, with >> Windows 7's XP mode requiring virtualization extensions. Desktop >> virtualization is also something likely to use device assignment >> (though you probably won't assign a virtio device to the XP instance >> inside Windows 7). >> >> Maybe we should revisit the mmio hypercall idea again, it might be >> workable if we find a way to let the guest know if it should use the >> hypercall or not for a given memory range. 
>>
>> mmio hypercall is nice because
>> - it falls back nicely to pure mmio
>> - it optimizes an existing slow path, not just new device models
>> - it has preexisting semantics, so we have less ABI to screw up
>> - for nested virtualization + device assignment, we can drop it and
>> get a nice speed win (or rather, less speed loss)
>>
> Yeah, I agree with all this.  I am still wrestling with how to deal with
> the device-assignment problem w.r.t. shunting io requests into a
> hypercall vs letting them PF.  Are you saying we could simply ignore
> this case by disabling "MMIOoHC" when assignment is enabled?  That would
> certainly make the problem much easier to solve.

No, we need to deal with hotplug.  Something like IO_COND that Chris
mentioned, but how to avoid turning this into a doctoral thesis.

(On the other hand, device assignment requires the iommu, and I think
you have to specify that up front?)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
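For readers following along: IO_COND is the dispatch macro in the kernel's lib/iomap.c that decides at runtime whether an iowrite hits a PIO port or an MMIO address. The guest-side analogue being floated here would dispatch per memory range between an "MMIO hypercall" fast path and plain MMIO. A toy model of that dispatch (every name below is invented for illustration; real code would live in the guest kernel's IO accessors, in C):

```python
# Toy model (not kernel code) of an IO_COND-style dispatch: each mapped
# region carries a flag saying whether the host accepts the "MMIO
# hypercall" shortcut for it; writes fall back to plain MMIO otherwise.

class Region:
    def __init__(self, base, size, hc_capable):
        self.base, self.size, self.hc_capable = base, size, hc_capable

log = []  # records which path each access took

def do_hypercall(addr, val):
    log.append(("hc", addr, val))    # fast path: one vmcall-like trap

def do_mmio(addr, val):
    log.append(("mmio", addr, val))  # slow path: page fault + decode

def mmio_write(region, offset, val):
    addr = region.base + offset
    if region.hc_capable:
        do_hypercall(addr, val)      # "falls back nicely to pure mmio"
    else:                            # when the host clears the flag
        do_mmio(addr, val)

# A host doing nested virt + device assignment would publish hc_capable=False
fast = Region(0xc0000000, 0x1000, hc_capable=True)
assigned = Region(0xd0000000, 0x1000, hc_capable=False)
mmio_write(fast, 0x10, 1)
mmio_write(assigned, 0x10, 1)
```

The per-range flag is exactly Avi's fallback property: a host that cannot honor the shortcut (nested virt with an assigned device, or a hotplug event changing the mapping) just withdraws the flag and the guest degrades to ordinary MMIO.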
* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:06 ` Avi Kivity
  2009-05-11 17:29 ` Gregory Haskins
@ 2009-05-11 17:51 ` Anthony Liguori
  2009-05-11 18:02 ` Avi Kivity
  1 sibling, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 17:51 UTC (permalink / raw)
To: Avi Kivity
Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright,
    linux-kernel, kvm

Avi Kivity wrote:
> Hollis Blanchard wrote:
>> I haven't been following this conversation at all. With that in mind...
>>
>> AFAICS, a hypercall is clearly the higher-performing option, since you
>> don't need the additional memory load (which could even cause a page
>> fault in some circumstances) and instruction decode. That said, I'm
>> willing to agree that this overhead is probably negligible compared to
>> the IOp itself... Amdahl's Law again.
>>
> It's a question of cost vs. benefit.  It's clear the benefit is low
> (but that doesn't mean it's not worth having).  The cost initially
> appeared to be very low, until the nested virtualization wrench was
> thrown into the works.  Not that nested virtualization is a reality --
> even on svm where it is implemented it is not yet production quality
> and is disabled by default.
>
> Now nested virtualization is beginning to look interesting, with
> Windows 7's XP mode requiring virtualization extensions.  Desktop
> virtualization is also something likely to use device assignment
> (though you probably won't assign a virtio device to the XP instance
> inside Windows 7).
>
> Maybe we should revisit the mmio hypercall idea again, it might be
> workable if we find a way to let the guest know if it should use the
> hypercall or not for a given memory range.
>
> mmio hypercall is nice because
> - it falls back nicely to pure mmio
> - it optimizes an existing slow path, not just new device models
> - it has preexisting semantics, so we have less ABI to screw up
> - for nested virtualization + device assignment, we can drop it and
> get a nice speed win (or rather, less speed loss)

If it's a PCI device, then we can also have an interrupt, which we
currently lack with vmcall-based hypercalls.  This would give us
guestcalls, upcalls, or whatever we've previously decided to call them.

Regards,

Anthony Liguori
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 17:51 ` Anthony Liguori @ 2009-05-11 18:02 ` Avi Kivity 2009-05-11 18:18 ` Anthony Liguori 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-11 18:02 UTC (permalink / raw) To: Anthony Liguori Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright, linux-kernel, kvm Anthony Liguori wrote: >> >> It's a question of cost vs. benefit. It's clear the benefit is low >> (but that doesn't mean it's not worth having). The cost initially >> appeared to be very low, until the nested virtualization wrench was >> thrown into the works. Not that nested virtualization is a reality >> -- even on svm where it is implemented it is not yet production >> quality and is disabled by default. >> >> Now nested virtualization is beginning to look interesting, with >> Windows 7's XP mode requiring virtualization extensions. Desktop >> virtualization is also something likely to use device assignment >> (though you probably won't assign a virtio device to the XP instance >> inside Windows 7). >> >> Maybe we should revisit the mmio hypercall idea again, it might be >> workable if we find a way to let the guest know if it should use the >> hypercall or not for a given memory range. >> >> mmio hypercall is nice because >> - it falls back nicely to pure mmio >> - it optimizes an existing slow path, not just new device models >> - it has preexisting semantics, so we have less ABI to screw up >> - for nested virtualization + device assignment, we can drop it and >> get a nice speed win (or rather, less speed loss) > > If it's a PCI device, then we can also have an interrupt which we > currently lack with vmcall-based hypercalls. This would give us > guestcalls, upcalls, or whatever we've previously decided to call them. Sorry, I totally failed to understand this. Please explain. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 18:02 ` Avi Kivity @ 2009-05-11 18:18 ` Anthony Liguori 0 siblings, 0 replies; 115+ messages in thread From: Anthony Liguori @ 2009-05-11 18:18 UTC (permalink / raw) To: Avi Kivity Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright, linux-kernel, kvm Avi Kivity wrote: > Anthony Liguori wrote: >>> >>> It's a question of cost vs. benefit. It's clear the benefit is low >>> (but that doesn't mean it's not worth having). The cost initially >>> appeared to be very low, until the nested virtualization wrench was >>> thrown into the works. Not that nested virtualization is a reality >>> -- even on svm where it is implemented it is not yet production >>> quality and is disabled by default. >>> >>> Now nested virtualization is beginning to look interesting, with >>> Windows 7's XP mode requiring virtualization extensions. Desktop >>> virtualization is also something likely to use device assignment >>> (though you probably won't assign a virtio device to the XP instance >>> inside Windows 7). >>> >>> Maybe we should revisit the mmio hypercall idea again, it might be >>> workable if we find a way to let the guest know if it should use the >>> hypercall or not for a given memory range. >>> >>> mmio hypercall is nice because >>> - it falls back nicely to pure mmio >>> - it optimizes an existing slow path, not just new device models >>> - it has preexisting semantics, so we have less ABI to screw up >>> - for nested virtualization + device assignment, we can drop it and >>> get a nice speed win (or rather, less speed loss) >> >> If it's a PCI device, then we can also have an interrupt which we >> currently lack with vmcall-based hypercalls. This would give us >> guestcalls, upcalls, or whatever we've previously decided to call them. > > Sorry, I totally failed to understand this. Please explain. I totally missed what you meant by MMIO hypercall. 
In what cases do you think MMIO hypercall would result in a net benefit?
I think the difference in MMIO vs hcall will be overshadowed by the
heavyweight transition to userspace.

The only thing I can think of where it may matter is for in-kernel
devices like the APIC, but that's a totally different path in Linux.

Regards,

Anthony Liguori
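Anthony's "overshadowed" argument can be roughed out numerically from figures quoted elsewhere in this thread (all approximate and informally measured, so treat this as a sketch rather than data): ~350ns for the PIO-vs-hypercall delta, ~2.56us for MMIO decode, and ~9.09us for a full round trip through the userspace device model.

```python
# Back-of-envelope comparison of per-exit savings against the full
# userspace round trip; all figures are approximations quoted in this
# thread, not fresh measurements.
pio_to_hc_ns = 350         # PIO exit vs hypercall exit delta
mmio_decode_ns = 2560      # extra cost of decoding an MMIO exit
userspace_rtt_ns = 9090    # qemu-mmio round trip (~110k IOPS)

hc_share = pio_to_hc_ns / userspace_rtt_ns      # ~4% of the exit path
mmio_share = mmio_decode_ns / userspace_rtt_ns  # ~28% of the exit path
print(f"HC delta: {hc_share:.1%}, MMIO decode: {mmio_share:.1%}")
```

On these assumed numbers the hypercall delta is a few percent of even a single userspace round trip, while avoiding the MMIO decode is a much larger slice — consistent with both positions being argued here.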
* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 13:14 ` Gregory Haskins
  2009-05-11 16:35 ` Hollis Blanchard
@ 2009-05-11 17:31 ` Anthony Liguori
  2009-05-13 10:53 ` Gregory Haskins
  2009-05-13 14:45 ` Gregory Haskins
  1 sibling, 2 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 17:31 UTC (permalink / raw)
To: Gregory Haskins
Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm,
    Hollis Blanchard

Gregory Haskins wrote:
> I specifically generalized my statement above because #1 I assume
> everyone here is smart enough to convert that nice round unit into the
> relevant figure.  And #2, there are multiple potential latency sources
> at play which we need to factor in when looking at the big picture.  For
> instance, the difference between PF exit, and an IO exit (2.58us on x86,
> to be precise).  Or whether you need to take a heavy-weight exit.  Or a
> context switch to qemu, then the kernel, back to qemu, and back to the
> vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO.
> I know you wish that this whole discussion would just go away, but these
> little "300ns here, 1600ns there" really add up in aggregate despite
> your dismissive attitude towards them.  And it doesn't take much to
> affect the results in a measurable way.  As stated, each 1us costs ~4%.
> My motivation is to reduce as many of these sources as possible.
>
> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
> improvement.  So what?  It's still an improvement.  If that improvement
> were for free, would you object?  And we all know that this change isn't
> "free" because we have to change some code (+128/-0, to be exact).  But
> what is it specifically you are objecting to in the first place?  Adding
> hypercall support as a pv_ops primitive isn't exactly hard or complex,
> or even very much code.

Where does 25us come from?  The numbers you post below are 33us and
66us.  This is part of what's frustrating me in this thread.
Things are way too theoretical.  Saying that "if packet latency was
25us, then it would be a 1.4% improvement" is close to misleading.

The numbers you've posted are also measuring on-box speeds.  What really
matters are off-box latencies, and that's just going to exaggerate this
effect.

IIUC, if you switched vbus to using PIO today, you would go from 66us to
65.65us, which you'd round to 66us for on-box latencies.  Even if you
didn't round, it's a 0.5% improvement in latency.

Adding hypercall support as a pv_ops primitive is adding a fair bit of
complexity.  You need a hypercall fd mechanism to plumb this down to
userspace; otherwise, you can't support migration from an in-kernel
backend to a non in-kernel backend.  You need some way to allocate
hypercalls to particular devices, which so far has been completely
ignored.

I've already mentioned why hypercalls are also unfortunate from a guest
perspective.  They require kernel patching and this is almost certainly
going to break at least Vista as a guest.  Certainly Windows 7.

So it's not at all fair to trivialize the complexity introduced here.
I'm simply asking for justification to introduce this complexity.  I
don't see why this is unfair for me to ask.

>> As a more general observation, we need numbers to justify an
>> optimization, not to justify not including an optimization.
>>
>> In other words, the burden is on you to present a scenario where this
>> optimization would result in a measurable improvement in a real world
>> work load.
>
> I have already done this.  You seem to have chosen to ignore my
> statements and results, but if you insist on rehashing:
>
> I started this project by analyzing system traces and finding some of
> the various bottlenecks in comparison to a native host.  Throughput was
> already pretty decent, but latency was pretty bad (and recently got
> *really* bad, but I know you already have a handle on what's causing
> that).
I digress...one of the conclusions of the research was that I > wanted to focus on building an IO subsystem designed to minimize the > quantity of exits, minimize the cost of each exit, and shorten the > end-to-end signaling path to achieve optimal performance. I also wanted > to build a system that was extensible enough to work with a variety of > client types, on a variety of architectures, etc, so we would only need > to solve these problems "once". The end result was vbus, and the first > working example was venet. The measured performance data of this work > was as follows: > > 802.x network, 9000 byte MTU, 2 8-core x86_64s connected back to back > with Chelsio T3 10GE via crossover. > > Bare metal : tput = 9717Mb/s, round-trip = 30396pps (33us rtt) > Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps (4016us rtt) > Venet (VBUS): tput = 5802Mb/s, round-trip = 15127 (66us rtt) > > For more details: http://lkml.org/lkml/2009/4/21/408 > Sending out a massive infrastructure change that does things wildly differently from how they're done today without any indication of why those changes were necessary is disruptive. If you could characterize all of the changes that vbus makes that are different from virtio, demonstrating at each stage why the change mattered and what benefit it brought, then we'd be having a completely different discussion. I have no problem throwing away virtio today if there's something else better. That's not what you've done though. You wrote a bunch of code without understanding why virtio does things the way it does and then dropped it all on the list. This isn't necessarily a bad exercise, but there's a ton of work necessary to determine which things vbus does differently actually matter. I'm not saying that you shouldn't have done vbus, but I'm saying there's a bunch of analysis work that you haven't done that needs to be done before we start making any changes in upstream code. 
I've been trying to argue why I don't think hypercalls are an important part of vbus from a performance perspective. I've tried to demonstrate why I don't think this is an important part of vbus. The frustration I have with this series is that you seem unwilling to compromise any aspect of vbus design. I understand you've made your decisions in vbus for some reasons and you think the way you've done things is better, but that's not enough. We have virtio today, it provides greater functionality than vbus does, it supports multiple guest types, and it's gotten quite a lot of testing. It has its warts, but most things that have been around for some time do. > Now I know you have been quick in the past to dismiss my efforts, and to > claim you can get the same results without needing the various tricks > and optimizations I uncovered. But quite frankly, until you post some > patches for community review and comparison (as I have done), it's just > meaningless talk. I can just as easily say that until you post a full series that covers all of the functionality that virtio has today, vbus is just meaningless talk. But I'm trying not to be dismissive in all of this because I do want to see you contribute to the KVM paravirtual IO infrastructure. Clearly, you have useful ideas. We can't just go rewriting things without a clear understanding of why something's better. What's missing is a detailed analysis of what virtio-net does today and what vbus does so that it's possible to draw some conclusions. For instance, this could look like: For a single packet delivery: 150ns are spent from PIO operation 320ns are spent in heavy-weight exit handler 150ns are spent transitioning to userspace 5us are spent contending on qemu_mutex 30us are spent copying data in tun/tap driver 40us are spent waiting for RX ... For vbus, it would look like: 130ns are spent from HC instruction 100ns are spent signaling TX thread ... But single packet delivery is just one part of the puzzle. 
Bulk transfers are also important.  CPU consumption is important.  How
we address things like live migration, non-privileged user
initialization, and userspace plumbing are all also important.

Right now, the whole discussion around this series is wildly speculative
and, quite frankly, counterproductive.  A few RTT benchmarks are not
sufficient to make any kind of forward progress here.  I certainly like
rewriting things as much as anyone else, but you need a substantial
amount of justification for it that so far hasn't been presented.

Do you understand what my concerns are and why I don't want to just
switch to a new large infrastructure?

Do you feel like you understand what sort of data I'm looking for to
justify the changes vbus is proposing to make?  Is this something you're
willing to do?  IMHO this is a prerequisite for any sort of merge
consideration.  The analysis of the virtio-net side of things is just as
important as the vbus side of things.

I've tried to explain this to you a number of times now and so far it
doesn't seem like I've been successful.  If it isn't clear, please let
me know.

Regards,

Anthony Liguori
* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:31 ` Anthony Liguori
@ 2009-05-13 10:53 ` Gregory Haskins
  2009-05-13 14:45 ` Gregory Haskins
  1 sibling, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-13 10:53 UTC (permalink / raw)
To: Anthony Liguori
Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm,
    Hollis Blanchard

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
>> improvement.  So what?  It's still an improvement.  If that improvement
>> were for free, would you object?  And we all know that this change isn't
>> "free" because we have to change some code (+128/-0, to be exact).  But
>> what is it specifically you are objecting to in the first place?  Adding
>> hypercall support as a pv_ops primitive isn't exactly hard or complex,
>> or even very much code.
>
> Where does 25us come from?  The numbers you post below are 33us and 66us.

<snip>

The 25us is approximately the max from an in-kernel harness strapped
directly to the driver, gathered informally during testing.  The 33us is
from formally averaging multiple runs of a userspace socket app in
preparation for publishing.

I consider the 25us the "target goal", since there is obviously overhead
that a socket application deals with that theoretically a guest bypasses
with the tap-device.  Note that the socket application often sees < 30us
itself...this was just a particularly "slow" set of runs that day.

Note that this is why I express the impact as "approximately" (e.g.
"~4%").  Sorry for the confusion.

-Greg
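The percentages argued over in this subthread all come from dividing the 350ns PIO-to-HC delta by whichever baseline rtt one picks. A quick check of the arithmetic, using the round-trip figures quoted in the thread:

```python
# The contested percentages are just the 350ns PIO->HC delta divided by
# the chosen round-trip baseline: 25us (Greg's in-kernel target), or
# 66us (the measured venet rtt Anthony uses).  Figures from the thread.
def improvement_pct(saving_us, rtt_us):
    return 100.0 * saving_us / rtt_us

per_us = improvement_pct(1.0, 25.0)   # each 1us saved: ~4% of a 25us rtt
hc_25 = improvement_pct(0.35, 25.0)   # ~1.4%, the figure Greg quotes
hc_66 = improvement_pct(0.35, 66.0)   # ~0.5%, the figure Anthony quotes
print(per_us, hc_25, hc_66)
```

So both sides' numbers are internally consistent; the disagreement is purely about which baseline rtt is the honest one.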
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 17:31 ` Anthony Liguori 2009-05-13 10:53 ` Gregory Haskins @ 2009-05-13 14:45 ` Gregory Haskins 1 sibling, 0 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-13 14:45 UTC (permalink / raw) To: Anthony Liguori Cc: Avi Kivity, Chris Wright, linux-kernel, kvm, Hollis Blanchard [-- Attachment #1: Type: text/plain, Size: 17947 bytes --] Anthony Liguori wrote: > Gregory Haskins wrote: >> I specifically generalized my statement above because #1 I assume >> everyone here is smart enough to convert that nice round unit into the >> relevant figure. And #2, there are multiple potential latency sources >> at play which we need to factor in when looking at the big picture. For >> instance, the difference between PF exit, and an IO exit (2.58us on x86, >> to be precise). Or whether you need to take a heavy-weight exit. Or a >> context switch to qemu, the the kernel, back to qemu, and back to the >> vcpu). Or acquire a mutex. Or get head-of-lined on the VGA models >> IO. I know you wish that this whole discussion would just go away, >> but these >> little "300ns here, 1600ns there" really add up in aggregate despite >> your dismissive attitude towards them. And it doesn't take much to >> affect the results in a measurable way. As stated, each 1us costs >> ~4%. My motivation is to reduce as many of these sources as possible. >> >> So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% >> improvement. So what? Its still an improvement. If that improvement >> were for free, would you object? And we all know that this change isn't >> "free" because we have to change some code (+128/-0, to be exact). But >> what is it specifically you are objecting to in the first place? Adding >> hypercall support as an pv_ops primitive isn't exactly hard or complex, >> or even very much code. >> > > Where does 25us come from? The number you post below are 33us and > 66us. 
This is part of what's frustrating me in this thread. Things > are way too theoretical. Saying that "if packet latency was 25us, > then it would be a 1.4% improvement" is close to misleading. [ answered in the last reply ] > The numbers you've posted are also measuring on-box speeds. What > really matters are off-box latencies and that's just going to exaggerate. I'm not 100% clear on what you mean with on-box vs off-box. These figures were gathered between two real machines connected via 10GE cross-over cable. The 5.8Gb/s and 33us (25us) values were gathered sending real data between these hosts. This sounds "off-box" to me, but I am not sure I truly understand your assertion. > > > IIUC, if you switched vbus to using PIO today, you would go from 66us > to to 65.65, which you'd round to 66us for on-box latencies. Even if > you didn't round, it's a 0.5% improvement in latency. I think part of what you are missing is that in order to create vbus, I needed to _create_ an in-kernel hook from scratch since there were no existing methods. Since I measured HC to be superior in performance (if by only a little), I wasn't going to chose the slower way if there wasn't a reason, and at the time I didn't see one. Now after community review, perhaps we do have a reason, but that is the point of the review process. So now we can push something like iofd as a PIO hook instead. But either way, something needed to be created. > > > Adding hypercall support as a pv_ops primitive is adding a fair bit of > complexity. You need a hypercall fd mechanism to plumb this down to > userspace otherwise, you can't support migration from in-kernel > backend to non in-kernel backend. I respectfully disagree. This is orthogonal to the simple issue of the IO type for the exit. Where you *do* have a point is that the bigger benefit comes from in-kernel termination (like the iofd stuff I posted yesterday). 
However, in-kernel termination is not strictly necessary to exploit some reduction in overhead in the IO latency. In either case we know we can shave off about 2.56us from an MMIO. Since I formally measured MMIO rtt to userspace yesterday, we now know that we can do qemu-mmio in about 110k IOPS, 9.09us rtt. Switching to pv_io_ops->mmio() alone would be a boost to approximately 153k IOPS, 6.53us rtt. This would have a tangible benefit to all models without any hypercall plumbing screwing up migration. Therefore I still stand by the assertion that the hypercall discussion alone doesn't add very much complexity. > You need some way to allocate hypercalls to particular devices which > so far, has been completely ignored. I'm sorry, but thats not true. Vbus already handles this mapping. > I've already mentioned why hypercalls are also unfortunate from a > guest perspective. They require kernel patching and this is almost > certainly going to break at least Vista as a guest. Certainly Windows 7. Yes, you have a point here. > > So it's not at all fair to trivialize the complexity introduce here. > I'm simply asking for justification to introduce this complexity. I > don't see why this is unfair for me to ask. In summary, I don't think there is really much complexity being added because this stuff really doesn't depend on the hypercallfd (iofd) interface in order to have some benefit, as you assert above. The hypercall page is a good point for attestation, but that issue exists already today and is not a newly created issue by this proposal. > >>> As a more general observation, we need numbers to justify an >>> optimization, not to justify not including an optimization. >>> >>> In other words, the burden is on you to present a scenario where this >>> optimization would result in a measurable improvement in a real world >>> work load. >>> >> >> I have already done this. 
You seem to have chosen to ignore my >> statements and results, but if you insist on rehashing: >> >> I started this project by analyzing system traces and finding some of >> the various bottlenecks in comparison to a native host. Throughput was >> already pretty decent, but latency was pretty bad (and recently got >> *really* bad, but I know you already have a handle on whats causing >> that). I digress...one of the conclusions of the research was that I >> wanted to focus on building an IO subsystem designed to minimize the >> quantity of exits, minimize the cost of each exit, and shorten the >> end-to-end signaling path to achieve optimal performance. I also wanted >> to build a system that was extensible enough to work with a variety of >> client types, on a variety of architectures, etc, so we would only need >> to solve these problems "once". The end result was vbus, and the first >> working example was venet. The measured performance data of this work >> was as follows: >> >> 802.x network, 9000 byte MTU, 2 8-core x86_64s connected back to back >> with Chelsio T3 10GE via crossover. >> >> Bare metal : tput = 9717Mb/s, round-trip = 30396pps (33us >> rtt) >> Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps (4016us rtt) >> Venet (VBUS): tput = 5802Mb/s, round-trip = 15127 (66us rtt) >> >> For more details: http://lkml.org/lkml/2009/4/21/408 >> > > Sending out a massive infrastructure change that does things wildly > differently from how they're done today without any indication of why > those changes were necessary is disruptive. Well, this is an unfortunate situation. I had to build my ideas before I could even know if they worked and therefore if they were worth discussing. So we are where we are because of that fact. Anyway, this is somewhat irrelevant here because the topic at hand in this thread is purely the hypercall interface. I only mentioned the backstory of vbus so we are focused/aware of the end-goal. 
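As a cross-check on the IOPS figures Greg cites in this message (110k IOPS at a 9.09us qemu-mmio rtt, ~153k IOPS after shaving the ~2.56us decode cost), the rtt/IOPS conversion is just a reciprocal:

```python
# IOPS and rtt are reciprocals: 1e6 microseconds per second divided by
# the per-operation rtt in microseconds.  Figures quoted in the thread.
def iops_from_rtt(rtt_us):
    return 1_000_000 / rtt_us

qemu_mmio_rtt_us = 9.09    # measured qemu-mmio round trip
decode_saving_us = 2.56    # MMIO decode cost avoided by a direct exit

print(round(iops_from_rtt(qemu_mmio_rtt_us)))                     # ~110k
print(round(iops_from_rtt(qemu_mmio_rtt_us - decode_saving_us)))  # ~153k
```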
> If you could characterize all of the changes that vbus makes that are
> different from virtio, demonstrating at each stage why the change
> mattered and what benefit it brought, then we'd be having a completely
> different discussion.

Splitting this up into small changes is already work in progress.
That's what we are discussing in this thread, after all.

> I have no problem throwing away virtio today if there's something
> else better.

I am not advocating this.  Virtio can (and does, though it needs testing
and backends written) run happily on top of vbus.  I am not even
advocating getting rid of the userspace virtio models.  They have a
place and could simply be run-time switched on or off, depending on the
user configuration.

> That's not what you've done though.  You wrote a bunch of code without
> understanding why virtio does things the way it does and then dropped
> it all on the list.

False.  I studied the system for the bottlenecks before I began.

> This isn't necessarily a bad exercise, but there's a ton of work
> necessary to determine which things vbus does differently actually
> matter.  I'm not saying that you shouldn't have done vbus, but I'm
> saying there's a bunch of analysis work that you haven't done that
> needs to be done before we start making any changes in upstream code.
>
> I've been trying to argue why I don't think hypercalls are an
> important part of vbus from a performance perspective.  I've tried to
> demonstrate why I don't think this is an important part of vbus.

Yes, let's just drop them.  I will use PIO and not worry about non-x86.
I generally don't like to ignore issues like this where at least
conjecture indicates I should try to think ahead.  However, it's not
worth this headache to continue the conversation.  If anything, we can
do the IOoHC idea, but let's just back-burner this whole HC thing for
now.

> The frustration I have with this series is that you seem unwilling
> to compromise any aspect of vbus design.
Thats an unfair assertion. As an example, I already started migrating to PCI even though I was pretty against it in the beginning. I also conceded that the 350ns is possibly a non-issue if we don't care about non-x86. > I understand you've made your decisions in vbus for some reasons > and you think the way you've done things is better, but that's not > enough. We have virtio today, it provides greater functionality than > vbus does, it supports multiple guest types, and it's gotten quite a > lot of testing. Yes, and I admit that is a compelling argument against changing. > It has its warts, but most things that have been around for some time do. > >> Now I know you have been quick in the past to dismiss my efforts, and to >> claim you can get the same results without needing the various tricks >> and optimizations I uncovered. But quite frankly, until you post some >> patches for community review and comparison (as I have done), it's just >> meaningless talk. > > I can just as easily say that until you post a full series that covers > all of the functionality that virtio has today, vbus is just > meaningless talk. That's a stretch. Its a series that offers a new way to do a device-model that we can re-use in multiple environments, and has performance considerations specifically for software-to-software interaction (such as guest/hypervisor) built into it. It will also hopefully support some features that we don't have today, like RDMA interfaces, but that is in the future. To your point, it is missing some features like live-migration, but IIUC thats how FOSS works. Start with an idea...If others think its good, you polish it up together. I don't believe KVM was 100% when it was released either, or we wouldn't be having this conversation. > But I'm trying not to be dismissive in all of this because I do want > to see you contribute to the KVM paravirtual IO infrastructure. > Clearly, you have useful ideas. 
> > We can't just go rewriting things without a clear understanding of why > something's better. What's missing is a detailed analysis of what > virtio-net does today and what vbus does so that it's possible to draw > some conclusions. This is ongoing. For one, it hops through userspace which as I've already demonstrated has a cost (that at least I care about). > > For instance, this could look like: > > For a single packet delivery: > > 150ns are spent from PIO operation > 320ns are spent in heavy-weight exit handler > 150ns are spent transitioning to userspace > 5us are spent contending on qemu_mutex > 30us are spent copying data in tun/tap driver > 40us are spent waiting for RX > ... > > For vbus, it would look like: > > 130ns are spent from HC instruction > 100ns are spent signaling TX thread > ... > > But single packet delivery is just one part of the puzzle. Bulk > transfers are also important. Both of those are covered today (and, I believe, superior)...see the throughput numbers in my post. > CPU consumption is important. Yes, this is a good point. I need to quantify this. I fully expect to be worse since one of the design points is that we should try to use as much parallel effort as possible in the face of these multi-core boxes. Note that this particular design decision is a attribute of venettap specifically. vbus itself doesn't force a threading policy. > How we address things like live migration Yes, this feature is not started. Looking for community help on that. > , non-privileged user initialization Nor is this. I think I need to write some patches for configfs to get there. > , and userspace plumbing are all also important. Basic userspace plumbing is complete, but I need to hook qemu into it > > Right now, the whole discussion around this series is wildly > speculative and quite frankly, counter productive. How? I've posted tangible patches you can test today that show an measurable improvement. 
If someone wants to investigate/propose an improvement in general, how else would you do it so it is productive and not speculative? > A few RTT benchmarks are not sufficient to make any kind of forward > progress here. I certainly like rewriting things as much as anyone > else, but you need a substantial amount of justification for it that > so far hasn't been presented. Well, unfortunately for me the real interesting stuff hasn't been completed yet nor articulated very well, so I can't fault you for not seeing the complete picture. In fact, as of right now we don't even know if the "real interesting" stuff will work yet. It's still being hashed out. What we do know is that the basic performance elements of the design (in-kernel, etc) seem to have made a substantial improvement. And more tweaks are coming to widen the gap. > > > Do you understand what my concerns are and why I don't want to just > switch to a new large infrastructure? Yes. However, to reiterate: it's not a switch. It's just another option in the mix to have in-kernel virtio or not. Today you have emulated+virtio; tomorrow you could have emulated+virtio+kern_virtio. > > > Do you feel like you understand what sort of data I'm looking for to > justify the changes vbus is proposing to make? Is this something you're > willing to do? Because IMHO this is a prerequisite for any sort of > merge consideration. The analysis of the virtio-net side of things is > just as important as the vbus side of things. Work in progress. > > > I've tried to explain this to you a number of times now and so far it > doesn't seem like I've been successful. If it isn't clear, please let > me know. I don't think this is accurate, and I don't really appreciate the patronizing tone. I've been gathering the figures for the kernel side of things, and last we spoke, *you* were gathering all the performance data and issues in qemu and were working on improvements. 
In fact, last I heard, you had reduced the 4000us rtt to 90us as part of that effort. While I have yet to see your patches, I was certainly not going to duplicate your work here. So to infer that I am somehow simply being obtuse w.r.t. your request/guidance is a bit disingenuous. That wasn't really what we discussed from my perspective. That said, I will certainly continue to analyze the gross path through userspace, as I have done with "doorbell", because that is directly relevant to my points. Figuring out something like the details of any qemu_mutex bottlenecks, on the other hand, is not. I already contend that the userspace path, at any cost, is just a waste. Ultimately we need to come back to the kernel anyway to do the IO, so there is no real use in digging deeper there IMO. Just skip it and inject the IO directly. If you look, venettap is about the same complexity as tuntap. Userspace is just acting as a marginal shim, so the arguments about the MMU protection in userspace, etc, are of diminished value. That said, if you want to submit your patches to me that improve the userspace overhead, I will include them in my benchmarks so we can gather the most accurate picture possible. The big question is: does the KVM community believe in in-kernel devices or not? If we don't, then the kvm+vbus conversation is done and I will just be forced to maintain the kvm-connector in my own tree, or (perhaps your preferred outcome?) let it die. If we *do* believe in in-kernel, then I hope that I can convince the relevant maintainers that vbus is a good infrastructure to provide a framework for it. Perhaps I need to generate more numbers before I can convince everyone of this, and that's fine. Everything else beyond that is just work to fill in the missing details. If you and others want to join me in implementing things like a virtio-net backend and adding live-migration support, that would be awesome. The choice is yours. 
For now, the plan that Avi and I have tentatively laid out is to work on making KVM more extensible via some of this plumbing (e.g. irqfd) so that these types of enhancements are possible without requiring "pollution" (my word, though Avi would likely concur) of the KVM core. Regards, -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-10 18:38 ` Anthony Liguori 2009-05-11 13:14 ` Gregory Haskins @ 2009-05-11 16:44 ` Hollis Blanchard 2009-05-11 17:54 ` Anthony Liguori 1 sibling, 1 reply; 115+ messages in thread From: Hollis Blanchard @ 2009-05-11 16:44 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote: > Gregory Haskins wrote: > > > > Can you back up your claim that PPC has no difference in performance > > with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" > > like instructions, but clearly there are ways to cause a trap, so > > presumably we can measure the difference between a PF exit and something > > more explicit). > > First, the PPC that KVM supports performs very poorly relatively > speaking because it receives no hardware assistance; this is not the > right place to focus wrt optimizations. > > And because there's no hardware assistance, there simply isn't a > hypercall instruction. Are PFs the fastest type of exits? Probably not, > but I honestly have no idea. I'm sure Hollis does though. Memory load from the guest context (for instruction decoding) is a *very* poorly performing path on most PowerPC, even considering server PowerPC with hardware virtualization support. No, I don't have any data for you, but switching the hardware MMU contexts requires some heavyweight synchronization instructions. > Page faults are going to have tremendously different performance > characteristics on PPC too because it's a software managed TLB. There's > no page table lookup like there is on x86. To clarify, software-managed TLBs are only found in embedded PowerPC. Server and classic PowerPC use hash tables, which are a third MMU type. -- Hollis Blanchard IBM Linux Technology Center ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-11 16:44 ` Hollis Blanchard @ 2009-05-11 17:54 ` Anthony Liguori 2009-05-11 19:24 ` PowerPC page faults Hollis Blanchard 0 siblings, 1 reply; 115+ messages in thread From: Anthony Liguori @ 2009-05-11 17:54 UTC (permalink / raw) To: Hollis Blanchard Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm Hollis Blanchard wrote: > On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote: > >> Gregory Haskins wrote: >> >>> Can you back up your claim that PPC has no difference in performance >>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" >>> like instructions, but clearly there are ways to cause a trap, so >>> presumably we can measure the difference between a PF exit and something >>> more explicit). >>> >> First, the PPC that KVM supports performs very poorly relatively >> speaking because it receives no hardware assistance this is not the >> right place to focus wrt optimizations. >> >> And because there's no hardware assistance, there simply isn't a >> hypercall instruction. Are PFs the fastest type of exits? Probably not >> but I honestly have no idea. I'm sure Hollis does though. >> > > Memory load from the guest context (for instruction decoding) is a > *very* poorly performing path on most PowerPC, even considering server > PowerPC with hardware virtualization support. No, I don't have any data > for you, but switching the hardware MMU contexts requires some > heavyweight synchronization instructions. > For current ppcemb, you would have to do a memory load no matter what, right? I guess you could have a dedicated interrupt vector or something... For future ppcemb's, do you know if there is an equivalent of a PF exit type? Does the hardware squirrel away the faulting address somewhere and set PC to the start of the instruction? If so, no guest memory load should be required. 
Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: PowerPC page faults 2009-05-11 17:54 ` Anthony Liguori @ 2009-05-11 19:24 ` Hollis Blanchard 2009-05-11 22:17 ` Anthony Liguori 0 siblings, 1 reply; 115+ messages in thread From: Hollis Blanchard @ 2009-05-11 19:24 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote: > For future ppcemb's, do you know if there is an equivalent of a PF exit > type? Does the hardware squirrel away the faulting address somewhere > and set PC to the start of the instruction? If so, no guest memory load > should be required. Ahhh... you're saying that the address itself (or offset within a page) is the hypercall token, totally separate from IO emulation, and so we could ignore the access size. I guess it looks like this:

page fault vector:
        if (faulting_address & PAGE_MASK) == vcpu->hcall_page
                handle_hcall(faulting_address & ~PAGE_MASK)
        else if (faulting_address is IO)
                emulate_io(faulting_address)
        else
                handle_pagefault(faulting_address)

Testing for hypercalls in the page fault handler path would add some overhead, and on processors with software-managed TLBs, the page fault path is *very* hot. Implementing the above pseudocode wouldn't be ideal, especially because Power processors with hardware virtualization support have a separate vector for hypercalls. However, I suspect it wouldn't be a show-stopper from a performance point of view. Note that other Power virtualization solutions (hypervisors from IBM, Sony, and Toshiba) use the dedicated hypercall instruction and interrupt vector, which after all is how the hardware was designed. To my knowledge, they also don't do IO emulation, so they avoid both conditionals in the above pseudocode. -- Hollis Blanchard IBM Linux Technology Center ^ permalink raw reply [flat|nested] 115+ messages in thread
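[Editorial note: the dispatch Hollis sketches translates directly into C. The version below is only an illustration of that control flow; the hypercall-page address and the MMIO window are invented values, and the `vcpu->hcall_page` of the pseudocode becomes a plain global.]

```c
#define PAGE_MASK (~0xfffUL)

enum fault_kind { FAULT_HCALL, FAULT_MMIO, FAULT_PAGEFAULT };

/* page the guest registered for hypercalls (illustrative value) */
unsigned long hcall_page = 0x100000UL;

/* stand-in for the "faulting_address is IO" test in the pseudocode */
int addr_is_io(unsigned long addr)
{
	return addr >= 0xe0000000UL;  /* assumed emulated-MMIO window */
}

/* the three-way branch Hollis describes in the page fault vector */
enum fault_kind classify_fault(unsigned long faulting_address)
{
	if ((faulting_address & PAGE_MASK) == hcall_page)
		return FAULT_HCALL;      /* handle_hcall(addr & ~PAGE_MASK) */
	else if (addr_is_io(faulting_address))
		return FAULT_MMIO;       /* emulate_io(addr) */
	else
		return FAULT_PAGEFAULT;  /* normal fault handling */
}
```

The point of contention in the thread is exactly these two extra comparisons sitting on a very hot path.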
* Re: PowerPC page faults 2009-05-11 19:24 ` PowerPC page faults Hollis Blanchard @ 2009-05-11 22:17 ` Anthony Liguori 2009-05-12 5:46 ` Liu Yu-B13201 2009-05-12 14:50 ` Hollis Blanchard 0 siblings, 2 replies; 115+ messages in thread From: Anthony Liguori @ 2009-05-11 22:17 UTC (permalink / raw) To: Hollis Blanchard Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm Hollis Blanchard wrote: > On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote: > >> For future ppcemb's, do you know if there is an equivalent of a PF exit >> type? Does the hardware squirrel away the faulting address somewhere >> and set PC to the start of the instruction? If so, no guest memory load >> should be required. >> > > Ahhh... you're saying that the address itself (or offset within a page) > is the hypercall token, totally separate from IO emulation, and so we > could ignore the access size. No, I'm not being nearly that clever. I was suggesting that hardware virtualization support in future PPC systems might contain a mechanism to intercept a guest-mode TLB miss. If it did, it would be useful if that guest-mode TLB miss "exit" contained extra information somewhere that included the PC of the faulting instruction, the address responsible for the fault, and enough information to handle the fault without instruction decoding. I assume all MMIO comes from the same set of instructions in PPC? Something like ld/st instructions? Presumably all you need to know from instruction decoding is the destination register and whether it was a read or write? Or am I oversimplifying? Does the current hardware spec contain this sort of assistance? 
* RE: PowerPC page faults 2009-05-11 22:17 ` Anthony Liguori @ 2009-05-12 5:46 ` Liu Yu-B13201 2009-05-12 14:50 ` Hollis Blanchard 0 siblings, 0 replies; 115+ messages in thread From: Liu Yu-B13201 @ 2009-05-12 5:46 UTC (permalink / raw) To: Anthony Liguori, Hollis Blanchard Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm > -----Original Message----- > From: kvm-owner@vger.kernel.org > [mailto:kvm-owner@vger.kernel.org] On Behalf Of Anthony Liguori > Sent: Tuesday, May 12, 2009 6:18 AM > To: Hollis Blanchard > Cc: Gregory Haskins; Avi Kivity; Chris Wright; Gregory > Haskins; linux-kernel@vger.kernel.org; kvm@vger.kernel.org > Subject: Re: PowerPC page faults > > Hollis Blanchard wrote: > > On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote: > > > >> For future ppcemb's, do you know if there is an equivalent > of a PF exit > >> type? Does the hardware squirrel away the faulting > address somewhere > >> and set PC to the start of the instruction? If so, no > guest memory load > >> should be required. > >> > > > > Ahhh... you're saying that the address itself (or offset > within a page) > > is the hypercall token, totally separate from IO emulation, > and so we > > could ignore the access size. > > No, I'm not being nearly that clever. > > I was suggesting that hardware virtualization support in future PPC > systems might contain a mechanism to intercept a guest-mode > TLB miss. > If it did, it would be useful if that guest-mode TLB miss "exit" > contained extra information somewhere that included the PC of the > faulting instruction, the address response for the fault, and enough > information to handle the fault without instruction decoding. I don't think this can come true if it is a software-managed TLB, since the guest will never know the content of its TLB without the help of software. > > I assume all MMIO comes from the same set of instructions in PPC? > Something like ld/st instructions? 
Presumably all you need > to know from > instruction decoding is the destination register and whether it was a > read or write? Yes, but in addition there is currently a search of the guest TLB to see if it is a guest TLB miss. That is a big overhead, especially for a large TLB that isn't organized by something like an rbtree. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: PowerPC page faults 2009-05-11 22:17 ` Anthony Liguori 2009-05-12 5:46 ` Liu Yu-B13201 @ 2009-05-12 14:50 ` Hollis Blanchard 1 sibling, 0 replies; 115+ messages in thread From: Hollis Blanchard @ 2009-05-12 14:50 UTC (permalink / raw) To: Anthony Liguori Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm On Monday 11 May 2009 17:17:53 Anthony Liguori wrote: > Hollis Blanchard wrote: > > On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote: > > > >> For future ppcemb's, do you know if there is an equivalent of a PF exit > >> type? Does the hardware squirrel away the faulting address somewhere > >> and set PC to the start of the instruction? If so, no guest memory load > >> should be required. > >> > > > > Ahhh... you're saying that the address itself (or offset within a page) > > is the hypercall token, totally separate from IO emulation, and so we > > could ignore the access size. > > No, I'm not being nearly that clever. I started thinking you weren't, but then I realized you must be working several levels above me and just forgot to explain... :) > I was suggesting that hardware virtualization support in future PPC > systems might contain a mechanism to intercept a guest-mode TLB miss. > If it did, it would be useful if that guest-mode TLB miss "exit" > contained extra information somewhere that included the PC of the > faulting instruction, the address response for the fault, and enough > information to handle the fault without instruction decoding. > > I assume all MMIO comes from the same set of instructions in PPC? > Something like ld/st instructions? Presumably all you need to know from > instruction decoding is the destination register and whether it was a > read or write? In addition to register source/target, we also need to know the access size of the memory reference. That information isn't stuffed into registers for us by hardware, and it's not in published specifications for future hardware. 
Now, if you wanted to define a hypercall as a byte access within a particular 4K page, where the register source/target is ignored, that could be interesting, but I don't know if that's relevant to this "hypercall vs MMIO" discussion. -- Hollis Blanchard IBM Linux Technology Center ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 3:51 ` Gregory Haskins 2009-05-06 7:22 ` Chris Wright @ 2009-05-06 13:56 ` Anthony Liguori 2009-05-06 16:03 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Anthony Liguori @ 2009-05-06 13:56 UTC (permalink / raw) To: Gregory Haskins Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm Gregory Haskins wrote: > Chris Wright wrote: > >> * Gregory Haskins (gregory.haskins@gmail.com) wrote: >> >> >>> So you would never have someone making a generic >>> hypercall(KVM_HC_MMU_OP). I agree. >>> >>> >> Which is why I think the interface proposal you've made is wrong. >> > > I respectfully disagree. It's only wrong in that the name chosen for the > interface was perhaps too broad/vague. I still believe the concept is > sound, and the general layering is appropriate. > > >> There's >> already hypercall interfaces w/ specific ABI and semantic meaning (which >> are typically called directly/indirectly from an existing pv op hook). >> >> > > Yes, these are different, thus the new interface. > > >> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) >> means hypercall number and arg list must be the same in order for code >> to call hypercall() in a hypervisor agnostic way. >> >> > > Yes, and that is exactly the intention. I think it's perhaps the point > you are missing. > > I am well aware that historically the things we do over a hypercall > interface would inherently have meaning only to a specific hypervisor > (e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()). However, this > doesn't in any way imply that it is the only use for the general > concept. It's just the only way they have been exploited to date. > > While I acknowledge that the hypervisor certainly must be coordinated > with their use, in their essence hypercalls are just another form of IO > joining the ranks of things like MMIO and PIO. 
This is an attempt to > bring them out of the bowels of CONFIG_PARAVIRT to make them a first > class citizen. > > The thing I am building here is really not a general hypercall in the > broad sense. Rather, it's a subset of the hypercall vector namespace. > It is designed specifically for dynamically binding a synchronous call() > interface to things like virtual devices, and it is therefore these > virtual device models that define the particular ABI within that > namespace. Thus the ABI in question is explicitly independent of the > underlying hypervisor. I therefore stand by the proposed design to have > this interface described above the hypervisor support layer (i.e. > pv_ops) (albeit with perhaps a better name like "dynamic hypercall" as > per my later discussion with Avi). > > Consider PIO: The hypervisor (or hardware) and OS negotiate a port > address, but the two end-points are the driver and the device-model (or > real device). The driver doesn't have to say:
> if (kvm)
>         kvm_iowrite32(addr, ...);
> else if (lguest)
>         lguest_iowrite32(addr, ...);
> else
>         native_iowrite32(addr, ...);
> Instead, it just says "iowrite32(addr, ...);" and the address is used to > route the message appropriately by the platform. The ABI of that > message, however, is specific to the driver/device and is not > interpreted by kvm/lguest/native-hw infrastructure on the way. > > Today, there is no equivalent of a platform agnostic "iowrite32()" for > hypercalls so the driver would look like the pseudocode above except > substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal > is to allow the hypervisor to assign a dynamic vector to resources in > the backend and convey this vector to the guest (such as in PCI > config-space as mentioned in my example use-case). This provides the > "address negotiation" function that would normally be done for something > like a pio port-address. 
The hypervisor agnostic driver can then use > this globally recognized address-token coupled with other device-private > ABI parameters to communicate with the device. This can all occur > without the core hypervisor needing to understand the details beyond the > addressing. > PCI already provides a hypervisor agnostic interface (via IO regions). You have a mechanism for devices to discover which regions they have allocated and to request remappings. It's supported by Linux and Windows. It works on the vast majority of architectures out there today. Why reinvent the wheel? Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
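[Editorial note: the "platform-agnostic hypercall" indirection Gregory describes might be sketched roughly as follows. Every name here — `hypercall_ops`, `hypercall_register`, and so on — is invented for illustration; this is not the API from the posted patches.]

```c
#include <stddef.h>

/* Hypothetical sketch of a hypervisor-agnostic hypercall(), analogous to
 * iowrite32(): the platform installs its transport once at boot, and
 * drivers call hypercall() with a dynamically assigned vector. */

struct hypercall_ops {
	long (*call)(unsigned long nr, unsigned long *args, size_t count);
};

/* default: no hypervisor present, so the call is simply unsupported */
static long null_call(unsigned long nr, unsigned long *args, size_t count)
{
	(void)nr; (void)args; (void)count;
	return -1;
}

static struct hypercall_ops hc_ops = { .call = null_call };

/* boot-time hook: kvm, lguest, etc. would each install their own ops */
void hypercall_register(const struct hypercall_ops *ops)
{
	hc_ops = *ops;
}

/* what a driver calls; 'nr' is the vector the backend advertised to the
 * guest, e.g. via PCI config-space as in Gregory's example */
long hypercall(unsigned long nr, unsigned long *args, size_t count)
{
	return hc_ops.call(nr, args, count);
}
```

On KVM the installed `.call` would issue VMCALL while the driver side stays unchanged; that is the intended analogy to iowrite32() being routed by the platform.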
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 13:56 ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori @ 2009-05-06 16:03 ` Gregory Haskins 2009-05-08 8:17 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-06 16:03 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 1689 bytes --] Anthony Liguori wrote: > Gregory Haskins wrote: >> >> Today, there is no equivalent of a platform agnostic "iowrite32()" for >> hypercalls so the driver would look like the pseudocode above except >> substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal >> is to allow the hypervisor to assign a dynamic vector to resources in >> the backend and convey this vector to the guest (such as in PCI >> config-space as mentioned in my example use-case). This provides the >> "address negotiation" function that would normally be done for something >> like a pio port-address. The hypervisor agnostic driver can then use >> this globally recognized address-token coupled with other device-private >> ABI parameters to communicate with the device. This can all occur >> without the core hypervisor needing to understand the details beyond the >> addressing. >> > > PCI already provides a hypervisor agnostic interface (via IO regions). > You have a mechanism for devices to discover which regions they have > allocated and to request remappings. It's supported by Linux and > Windows. It works on the vast majority of architectures out there today. > > Why reinvent the wheel? I suspect the current wheel is square. And the air is out. Plus it's pulling to the left when I accelerate, but to be fair that may be my alignment.... :) But I digress. 
See: http://patchwork.kernel.org/patch/21865/ To give PCI proper respect, I think its greatest value add here is the inherent IRQ routing (which is a huge/difficult component, as I experienced with dynirq in vbus v1). Beyond that, however, I think we can do better. HTH -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-06 16:03 ` Gregory Haskins @ 2009-05-08 8:17 ` Avi Kivity 2009-05-08 15:20 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-08 8:17 UTC (permalink / raw) To: Gregory Haskins Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: > Anthony Liguori wrote: > >> Gregory Haskins wrote: >> >>> Today, there is no equivelent of a platform agnostic "iowrite32()" for >>> hypercalls so the driver would look like the pseudocode above except >>> substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal >>> is to allow the hypervisor to assign a dynamic vector to resources in >>> the backend and convey this vector to the guest (such as in PCI >>> config-space as mentioned in my example use-case). The provides the >>> "address negotiation" function that would normally be done for something >>> like a pio port-address. The hypervisor agnostic driver can then use >>> this globally recognized address-token coupled with other device-private >>> ABI parameters to communicate with the device. This can all occur >>> without the core hypervisor needing to understand the details beyond the >>> addressing. >>> >>> >> PCI already provide a hypervisor agnostic interface (via IO regions). >> You have a mechanism for devices to discover which regions they have >> allocated and to request remappings. It's supported by Linux and >> Windows. It works on the vast majority of architectures out there today. >> >> Why reinvent the wheel? >> > > I suspect the current wheel is square. And the air is out. Plus its > pulling to the left when I accelerate, but to be fair that may be my > alignment.... No, your wheel is slightly faster on the highway, but doesn't work at all off-road. Consider nested virtualization where the host (H) runs a guest (G1) which is itself a hypervisor, running a guest (G2). 
The host exposes a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than creating a new virtio device and bridging it to one of V1..Vn, assigns virtio device V1 to guest G2, and prays. Now guest G2 issues a hypercall. Host H traps the hypercall, sees it originated in G1 while in guest mode, so it injects it into G1. G1 examines the parameters but can't make any sense of them, so it returns an error to G2. If this were done using mmio or pio, it would have just worked. With pio, H would have reflected the pio into G1, G1 would have done the conversion from G2's port number into G1's port number and reissued the pio, finally trapped by H and used to issue the I/O. With mmio, G1 would have set up G2's page tables to point directly at the addresses set up by H, so we would actually have a direct G2->H path. Of course we'd need an emulated iommu so all the memory references actually resolve to G2's context. So the upshot is that hypercalls for devices must not be the primary method of communications; they're fine as an optimization, but we should always be able to fall back on something else. We also need to figure out how G1 can stop V1 from advertising hypercall support. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
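[Editorial note: Avi's reflection scenario boils down to a routing decision in H when a hypercall traps. A toy sketch of just that branch follows — the types and names are invented, and real nested-virt state is of course far richer than one flag.]

```c
enum hc_route {
	ROUTE_HANDLE_IN_H,   /* hypercall came straight from G1 */
	ROUTE_INJECT_TO_G1   /* the vcpu (G1) was running G2: reflect it */
};

/* When H traps a hypercall, it checks whether the trapping vcpu was
 * itself running a nested guest.  If so, the exit belongs to the
 * intermediate hypervisor G1, because only G1 can interpret (or fail to
 * interpret) G2's vector -- which is exactly the failure Avi describes. */
enum hc_route route_hypercall(int vcpu_was_in_guest_mode)
{
	if (vcpu_was_in_guest_mode)
		return ROUTE_INJECT_TO_G1;
	return ROUTE_HANDLE_IN_H;
}
```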
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 8:17 ` Avi Kivity @ 2009-05-08 15:20 ` Gregory Haskins 2009-05-08 17:00 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 15:20 UTC (permalink / raw) To: Avi Kivity Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 4855 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >> Anthony Liguori wrote: >> >>> Gregory Haskins wrote: >>> >>>> Today, there is no equivelent of a platform agnostic "iowrite32()" for >>>> hypercalls so the driver would look like the pseudocode above except >>>> substitute with kvm_hypercall(), lguest_hypercall(), etc. The >>>> proposal >>>> is to allow the hypervisor to assign a dynamic vector to resources in >>>> the backend and convey this vector to the guest (such as in PCI >>>> config-space as mentioned in my example use-case). The provides the >>>> "address negotiation" function that would normally be done for >>>> something >>>> like a pio port-address. The hypervisor agnostic driver can then use >>>> this globally recognized address-token coupled with other >>>> device-private >>>> ABI parameters to communicate with the device. This can all occur >>>> without the core hypervisor needing to understand the details >>>> beyond the >>>> addressing. >>>> >>> PCI already provide a hypervisor agnostic interface (via IO >>> regions). You have a mechanism for devices to discover which regions >>> they have >>> allocated and to request remappings. It's supported by Linux and >>> Windows. It works on the vast majority of architectures out there >>> today. >>> >>> Why reinvent the wheel? >>> >> >> I suspect the current wheel is square. And the air is out. Plus its >> pulling to the left when I accelerate, but to be fair that may be my >> alignment.... > > No, your wheel is slightly faster on the highway, but doesn't work at > all off-road. Heh.. 
> > Consider nested virtualization where the host (H) runs a guest (G1) > which is itself a hypervisor, running a guest (G2). The host exposes > a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than > creating a new virtio devices and bridging it to one of V1..Vn, > assigns virtio device V1 to guest G2, and prays. > > Now guest G2 issues a hypercall. Host H traps the hypercall, sees it > originated in G1 while in guest mode, so it injects it into G1. G1 > examines the parameters but can't make any sense of them, so it > returns an error to G2. > > If this were done using mmio or pio, it would have just worked. With > pio, H would have reflected the pio into G1, G1 would have done the > conversion from G2's port number into G1's port number and reissued > the pio, finally trapped by H and used to issue the I/O. I might be missing something, but I am not seeing the difference here. We have an "address" (in this case the HC-id) and a context (in this case G1 running in non-root mode). Whether the trap to H is a HC or a PIO, the context tells us that it needs to re-inject the same trap to G1 for proper handling. So the "address" is re-injected from H to G1 as an emulated trap to G1s root-mode, and we continue (just like the PIO). And likewise, in both cases, G1 would (should?) know what to do with that "address" as it relates to G2, just as it would need to know what the PIO address is for. Typically this would result in some kind of translation of that "address", but I suppose even this is completely arbitrary and only G1 knows for sure. E.g. it might translate from hypercall vector X to Y similar to your PIO example, it might completely change transports, or it might terminate locally (e.g. emulated device in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might be using MMIO to talk to H. I don't think it matters from a topology perspective (though it might from a performance perspective). 
> With mmio, G1 would have set up G2's page tables to point directly at > the addresses set up by H, so we would actually have a direct G2->H > path. Of course we'd need an emulated iommu so all the memory > references actually resolve to G2's context. /me head explodes > > So the upshot is that hypercalls for devices must not be the primary > method of communications; they're fine as an optimization, but we > should always be able to fall back on something else. We also need to > figure out how G1 can stop V1 from advertising hypercall support. I agree it would be desirable to be able to control this exposure. However, I am not currently convinced it's strictly necessary because of the reason you mentioned above. And also note that I am not currently convinced it's even possible to control it. For instance, what if G1 is an old KVM, or (dare I say) a completely different hypervisor? You could control things like whether G1 can see the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who is to say what G1 will expose to G2? G1 may very well advertise an HC feature bit to G2 which may allow G2 to try to make a VMCALL. How do you stop that? -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 266 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 15:20 ` Gregory Haskins @ 2009-05-08 17:00 ` Avi Kivity 2009-05-08 18:55 ` Gregory Haskins 0 siblings, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-08 17:00 UTC (permalink / raw) To: Gregory Haskins Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: >> Consider nested virtualization where the host (H) runs a guest (G1) >> which is itself a hypervisor, running a guest (G2). The host exposes >> a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than >> creating a new virtio devices and bridging it to one of V1..Vn, >> assigns virtio device V1 to guest G2, and prays. >> >> Now guest G2 issues a hypercall. Host H traps the hypercall, sees it >> originated in G1 while in guest mode, so it injects it into G1. G1 >> examines the parameters but can't make any sense of them, so it >> returns an error to G2. >> >> If this were done using mmio or pio, it would have just worked. With >> pio, H would have reflected the pio into G1, G1 would have done the >> conversion from G2's port number into G1's port number and reissued >> the pio, finally trapped by H and used to issue the I/O. >> > > I might be missing something, but I am not seeing the difference here. > We have an "address" (in this case the HC-id) and a context (in this > case G1 running in non-root mode). Whether the trap to H is a HC or a > PIO, the context tells us that it needs to re-inject the same trap to G1 > for proper handling. So the "address" is re-injected from H to G1 as an > emulated trap to G1s root-mode, and we continue (just like the PIO). > So far, so good (though in fact mmio can short-circuit G2->H directly). > And likewise, in both cases, G1 would (should?) know what to do with > that "address" as it relates to G2, just as it would need to know what > the PIO address is for. 
Typically this would result in some kind of > translation of that "address", but I suppose even this is completely > arbitrary and only G1 knows for sure. E.g. it might translate from > hypercall vector X to Y similar to your PIO example, it might completely > change transports, or it might terminate locally (e.g. emulated device > in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might > be using MMIO to talk to H. I don't think it matters from a topology > perspective (though it might from a performance perspective). > How can you translate a hypercall? G1's and H's hypercall mechanisms can be completely different. >> So the upshoot is that hypercalls for devices must not be the primary >> method of communications; they're fine as an optimization, but we >> should always be able to fall back on something else. We also need to >> figure out how G1 can stop V1 from advertising hypercall support. >> > I agree it would be desirable to be able to control this exposure. > However, I am not currently convinced its strictly necessary because of > the reason you mentioned above. And also note that I am not currently > convinced its even possible to control it. > > For instance, what if G1 is an old KVM, or (dare I say) a completely > different hypervisor? You could control things like whether G1 can see > the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who > is to say what G1 will expose to G2? G1 may very well advertise a HC > feature bit to G2 which may allow G2 to try to make a VMCALL. How do > you stop that? > I don't see any way. If, instead of a hypercall we go through the pio hypercall route, then it all resolves itself. G2 issues a pio hypercall, H bounces it to G1, G1 either issues a pio or a pio hypercall depending on what the H and G1 negotiated. Of course mmio is faster in this case since it traps directly. btw, what's the hypercall rate you're seeing? 
at 10K hypercalls/sec, a 0.4us difference will buy us 0.4% reduction in cpu load, so let's see what's the potential gain here. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 115+ messages in thread
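Avi's arithmetic above is worth spelling out: the CPU load attributable to a per-exit cost difference is just exit rate times per-exit delta. A trivial helper, with the discussion's own numbers as the worked example (values are purely illustrative):

```c
/* Percentage of one CPU consumed (or saved) by a fixed per-exit delta:
 * e.g. 10,000 exits/sec at a 0.4us (400ns) HC-vs-PIO difference is
 * 10000 * 400e-9 = 0.004, i.e. 0.4% of a core. */
double exit_delta_cpu_pct(double exits_per_sec, double delta_ns)
{
	return exits_per_sec * delta_ns * 1e-9 * 100.0;
}
```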
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 17:00 ` Avi Kivity @ 2009-05-08 18:55 ` Gregory Haskins 2009-05-08 19:05 ` Anthony Liguori 2009-05-08 19:12 ` Avi Kivity 0 siblings, 2 replies; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 18:55 UTC (permalink / raw) To: Avi Kivity Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 6004 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >>> Consider nested virtualization where the host (H) runs a guest (G1) >>> which is itself a hypervisor, running a guest (G2). The host exposes >>> a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than >>> creating a new virtio devices and bridging it to one of V1..Vn, >>> assigns virtio device V1 to guest G2, and prays. >>> >>> Now guest G2 issues a hypercall. Host H traps the hypercall, sees it >>> originated in G1 while in guest mode, so it injects it into G1. G1 >>> examines the parameters but can't make any sense of them, so it >>> returns an error to G2. >>> >>> If this were done using mmio or pio, it would have just worked. With >>> pio, H would have reflected the pio into G1, G1 would have done the >>> conversion from G2's port number into G1's port number and reissued >>> the pio, finally trapped by H and used to issue the I/O. >> >> I might be missing something, but I am not seeing the difference >> here. We have an "address" (in this case the HC-id) and a context (in >> this >> case G1 running in non-root mode). Whether the trap to H is a HC or a >> PIO, the context tells us that it needs to re-inject the same trap to G1 >> for proper handling. So the "address" is re-injected from H to G1 as an >> emulated trap to G1s root-mode, and we continue (just like the PIO). >> > > So far, so good (though in fact mmio can short-circuit G2->H directly). Yeah, that is a nice trick. 
Despite the fact that MMIOs have about 50% degradation over an equivalent PIO/HC trap, you would be hard-pressed to make that up again with all the nested reinjection going on on the PIO/HC side of the coin. I think MMIO would be a fairly easy win with one level of nesting, and absolutely trounce anything that happens to be deeper. > >> And likewise, in both cases, G1 would (should?) know what to do with >> that "address" as it relates to G2, just as it would need to know what >> the PIO address is for. Typically this would result in some kind of >> translation of that "address", but I suppose even this is completely >> arbitrary and only G1 knows for sure. E.g. it might translate from >> hypercall vector X to Y similar to your PIO example, it might completely >> change transports, or it might terminate locally (e.g. emulated device >> in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might >> be using MMIO to talk to H. I don't think it matters from a topology >> perspective (though it might from a performance perspective). >> > > How can you translate a hypercall? G1's and H's hypercall mechanisms > can be completely different. Well, what I mean is that the hypercall ABI is specific to G2->G1, but the path really looks like G2->(H)->G1 transparently since H gets all the initial exits coming from G2. But all H has to do is blindly reinject the exit with all the same parameters (e.g. registers, primarily) to the G1-root context. So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, and does its thing according to the ABI. Perhaps the ABI for that particular HC-id is a PIOoHC, so it turns around and does a ioread/iowrite PIO, trapping us back to H. So this transform of the HC-id "X" to PIO("Y") is the translation I was referring to. It could really be anything, though (e.g. 
HC "X" to HC "Z", if thats what G1s handler for X told it to do) > > > >>> So the upshoot is that hypercalls for devices must not be the primary >>> method of communications; they're fine as an optimization, but we >>> should always be able to fall back on something else. We also need to >>> figure out how G1 can stop V1 from advertising hypercall support. >>> >> I agree it would be desirable to be able to control this exposure. >> However, I am not currently convinced its strictly necessary because of >> the reason you mentioned above. And also note that I am not currently >> convinced its even possible to control it. >> >> For instance, what if G1 is an old KVM, or (dare I say) a completely >> different hypervisor? You could control things like whether G1 can see >> the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who >> is to say what G1 will expose to G2? G1 may very well advertise a HC >> feature bit to G2 which may allow G2 to try to make a VMCALL. How do >> you stop that? >> > > I don't see any way. > > If, instead of a hypercall we go through the pio hypercall route, then > it all resolves itself. G2 issues a pio hypercall, H bounces it to > G1, G1 either issues a pio or a pio hypercall depending on what the H > and G1 negotiated. Actually I don't even think it matters what the HC payload is. Its governed by the ABI between G1 and G2. H will simply reflect the trap, so the HC could be of any type, really. > Of course mmio is faster in this case since it traps directly. > > btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a > 0.4us difference will buy us 0.4% reduction in cpu load, so let's see > what's the potential gain here. Its more of an issue of execution latency (which translates to IO latency, since "execution" is usually for the specific goal of doing some IO). In fact, per my own design claims, I try to avoid exits like the plague and generally succeed at making very few of them. 
;) So it's not really the 0.4% reduction of cpu use that allures me. It's the 16% reduction in latency. Time/discussion will tell if it's worth the trouble to use HC or just try to shave more off of PIO. If we went that route, I am concerned about falling back to MMIO, but Anthony seems to think this is not a real issue. From what we've discussed here, it seems the best-case scenario would be if the Intel/AMD folks came up with some really good hardware-accelerated MMIO-EXIT so we could avoid all the decode/walk crap in the first place. ;) -Greg ^ permalink raw reply [flat|nested] 115+ messages in thread
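The "hypercall as an optimization, with PIO/MMIO as the fallback" requirement that keeps coming up in this thread can be captured with a single indirection, much like pv_ops: hypervisor-agnostic driver code calls one entry point, and hypervisor-specific guest code registers a binding at boot if one exists. This is only a shape sketch -- the names, the sentinel return, and the stub fallback are invented here for illustration and are not the actual API from the patch series:

```c
#include <stddef.h>

typedef long (*hypercall_fn)(unsigned long nr, unsigned long a0,
			     unsigned long a1);

/* Set by hypervisor-specific guest code (e.g. a KVM binding wrapping
 * VMCALL); NULL means "no hypercall transport, use the fallback". */
static hypercall_fn platform_hypercall = NULL;

void register_hypercall(hypercall_fn fn)
{
	platform_hypercall = fn;
}

/* Stand-in for the PIO/MMIO path a real driver would fall back to
 * (an iowrite to the device's BAR); returns a sentinel here. */
static long mmio_fallback(unsigned long nr, unsigned long a0,
			  unsigned long a1)
{
	return -38; /* -ENOSYS-style "no fast path available" */
}

/* The single entry point a hypervisor-agnostic driver would call. */
long do_hypercall(unsigned long nr, unsigned long a0, unsigned long a1)
{
	if (platform_hypercall)
		return platform_hypercall(nr, a0, a1);
	return mmio_fallback(nr, a0, a1);
}

/* Demo binding: pretends to be a working transport for the example. */
long fake_vmcall(unsigned long nr, unsigned long a0, unsigned long a1)
{
	return (long)(nr + a0 + a1);
}
```

In the nested case, whatever transport G1 registered on behalf of G2 can terminate locally in G1 or be re-issued over G1's own transport to H, which is the "translation" step debated above.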
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 18:55 ` Gregory Haskins @ 2009-05-08 19:05 ` Anthony Liguori 2009-05-08 19:12 ` Avi Kivity 1 sibling, 0 replies; 115+ messages in thread From: Anthony Liguori @ 2009-05-08 19:05 UTC (permalink / raw) To: Gregory Haskins Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: > Its more of an issue of execution latency (which translates to IO > latency, since "execution" is usually for the specific goal of doing > some IO). In fact, per my own design claims, I try to avoid exits like > the plague and generally succeed at making very few of them. ;) > > So its not really the .4% reduction of cpu use that allures me. Its the > 16% reduction in latency. Time/discussion will tell if its worth the > trouble to use HC or just try to shave more off of PIO. If we went that > route, I am concerned about falling back to MMIO, but Anthony seems to > think this is not a real issue. > It's only a 16% reduction in latency if your workload is entirely dependent on the latency of a hypercall. What is that workload? I don't think it exists. For a network driver, I have a hard time believing that anyone cares that much about 210ns of latency. We're getting close to the cost of a few dozen instructions here. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 18:55 ` Gregory Haskins 2009-05-08 19:05 ` Anthony Liguori @ 2009-05-08 19:12 ` Avi Kivity 2009-05-08 19:59 ` Gregory Haskins 1 sibling, 1 reply; 115+ messages in thread From: Avi Kivity @ 2009-05-08 19:12 UTC (permalink / raw) To: Gregory Haskins Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: >>> And likewise, in both cases, G1 would (should?) know what to do with >>> that "address" as it relates to G2, just as it would need to know what >>> the PIO address is for. Typically this would result in some kind of >>> translation of that "address", but I suppose even this is completely >>> arbitrary and only G1 knows for sure. E.g. it might translate from >>> hypercall vector X to Y similar to your PIO example, it might completely >>> change transports, or it might terminate locally (e.g. emulated device >>> in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might >>> be using MMIO to talk to H. I don't think it matters from a topology >>> perspective (though it might from a performance perspective). >>> >>> >> How can you translate a hypercall? G1's and H's hypercall mechanisms >> can be completely different. >> > > Well, what I mean is that the hypercall ABI is specific to G2->G1, but > the path really looks like G2->(H)->G1 transparently since H gets all > the initial exits coming from G2. But all H has to do is blindly > reinject the exit with all the same parameters (e.g. registers, > primarily) to the G1-root context. > > So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, > and does its thing according to the ABI. Perhaps the ABI for that > particular HC-id is a PIOoHC, so it turns around and does a > ioread/iowrite PIO, trapping us back to H. > > So this transform of the HC-id "X" to PIO("Y") is the translation I was > referring to. It could really be anything, though (e.g. 
HC "X" to HC > "Z", if thats what G1s handler for X told it to do) > That only works if the device exposes a pio port, and the hypervisor exposes HC_PIO. If the device exposes the hypercall, things break once you assign it. >> Of course mmio is faster in this case since it traps directly. >> >> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a >> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see >> what's the potential gain here. >> > > Its more of an issue of execution latency (which translates to IO > latency, since "execution" is usually for the specific goal of doing > some IO). In fact, per my own design claims, I try to avoid exits like > the plague and generally succeed at making very few of them. ;) > > So its not really the .4% reduction of cpu use that allures me. Its the > 16% reduction in latency. Time/discussion will tell if its worth the > trouble to use HC or just try to shave more off of PIO. If we went that > route, I am concerned about falling back to MMIO, but Anthony seems to > think this is not a real issue. > You need to use absolute numbers, not percentages off the smallest component. If you want to reduce latency, keep things on the same core (IPIs, cache bounces are more expensive than the 200ns we're seeing here). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 19:12 ` Avi Kivity @ 2009-05-08 19:59 ` Gregory Haskins 2009-05-10 9:59 ` Avi Kivity 0 siblings, 1 reply; 115+ messages in thread From: Gregory Haskins @ 2009-05-08 19:59 UTC (permalink / raw) To: Avi Kivity Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm [-- Attachment #1: Type: text/plain, Size: 4225 bytes --] Avi Kivity wrote: > Gregory Haskins wrote: >>>> And likewise, in both cases, G1 would (should?) know what to do with >>>> that "address" as it relates to G2, just as it would need to know what >>>> the PIO address is for. Typically this would result in some kind of >>>> translation of that "address", but I suppose even this is completely >>>> arbitrary and only G1 knows for sure. E.g. it might translate from >>>> hypercall vector X to Y similar to your PIO example, it might >>>> completely >>>> change transports, or it might terminate locally (e.g. emulated device >>>> in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 >>>> might >>>> be using MMIO to talk to H. I don't think it matters from a topology >>>> perspective (though it might from a performance perspective). >>>> >>> How can you translate a hypercall? G1's and H's hypercall mechanisms >>> can be completely different. >>> >> >> Well, what I mean is that the hypercall ABI is specific to G2->G1, but >> the path really looks like G2->(H)->G1 transparently since H gets all >> the initial exits coming from G2. But all H has to do is blindly >> reinject the exit with all the same parameters (e.g. registers, >> primarily) to the G1-root context. >> >> So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, >> and does its thing according to the ABI. Perhaps the ABI for that >> particular HC-id is a PIOoHC, so it turns around and does a >> ioread/iowrite PIO, trapping us back to H. >> >> So this transform of the HC-id "X" to PIO("Y") is the translation I was >> referring to. 
It could really be anything, though (e.g. HC "X" to HC >> "Z", if thats what G1s handler for X told it to do) >> > > That only works if the device exposes a pio port, and the hypervisor > exposes HC_PIO. If the device exposes the hypercall, things break > once you assign it. Well, true. But normally I would think you would resurface the device from G1 to G2 anyway, so any relevant transform would also be reflected in the resurfaced device config. I suppose if you had a hard requirement that, say, even the pci-config space was pass-through, this would be a problem. I am not sure if that is a realistic environment, though. > >>> Of course mmio is faster in this case since it traps directly. >>> >>> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a >>> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see >>> what's the potential gain here. >>> >> >> Its more of an issue of execution latency (which translates to IO >> latency, since "execution" is usually for the specific goal of doing >> some IO). In fact, per my own design claims, I try to avoid exits like >> the plague and generally succeed at making very few of them. ;) >> >> So its not really the .4% reduction of cpu use that allures me. Its the >> 16% reduction in latency. Time/discussion will tell if its worth the >> trouble to use HC or just try to shave more off of PIO. If we went that >> route, I am concerned about falling back to MMIO, but Anthony seems to >> think this is not a real issue. >> > > You need to use absolute numbers, not percentages off the smallest > component. If you want to reduce latency, keep things on the same > core (IPIs, cache bounces are more expensive than the 200ns we're > seeing here). > > Ok, so there are no shortages of IO cards that can perform operations in the order of 10us-15us. Therefore a 350ns latency (the delta between PIO and HC) turns into a 2%-3.5% overhead when compared to bare-metal. 
I am not really at liberty to talk about most of the kinds of applications that might care. A trivial example might be PTPd clock distribution. But this is going nowhere. Based on Anthony's assertion that the MMIO fallback worry is unfounded, and all the controversy this is causing, perhaps we should just move on and forget the whole thing. If I have to I will patch the HC code in my own tree. For now, I will submit a few patches to clean up the locking on the io_bus. That may help narrow the gap without all this stuff, anyway. -Greg ^ permalink raw reply [flat|nested] 115+ messages in thread
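For the absolute-numbers framing Avi asks for: a fixed per-exit delta only matters relative to the whole operation it rides on. Against the 10us-15us device operations Gregory cites, 350ns is the 2%-3.5% he quotes; against a millisecond-scale operation it vanishes into the noise. An illustrative helper (numbers from the discussion, not measurements):

```c
/* Percentage overhead a fixed per-exit delta adds to one operation:
 * 350ns on a 10us op is 3.5%; on a 15us op, roughly 2.3%. */
double delta_overhead_pct(double delta_ns, double op_us)
{
	return delta_ns / (op_us * 1000.0) * 100.0;
}
```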
* Re: [RFC PATCH 0/3] generic hypercall support 2009-05-08 19:59 ` Gregory Haskins @ 2009-05-10 9:59 ` Avi Kivity 0 siblings, 0 replies; 115+ messages in thread From: Avi Kivity @ 2009-05-10 9:59 UTC (permalink / raw) To: Gregory Haskins Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm Gregory Haskins wrote: >> >> That only works if the device exposes a pio port, and the hypervisor >> exposes HC_PIO. If the device exposes the hypercall, things break >> once you assign it. >> > > Well, true. But normally I would think you would resurface the device > from G1 to G2 anyway, so any relevant transform would also be reflected > in the resurfaced device config. We do, but the G1 hypervisor cannot be expected to understand the config option that exposes the hypercall. > I suppose if you had a hard > requirement that, say, even the pci-config space was pass-through, this > would be a problem. I am not sure if that is a realistic environment, > though. > You must pass through the config space, as some of it is device specific. The hypervisor will trap config space accesses, but unless it understands them, it cannot modify them. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 115+ messages in thread
end of thread, other threads:[~2009-05-13 14:45 UTC | newest] Thread overview: 115+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins 2009-05-05 17:03 ` Hollis Blanchard 2009-05-06 13:52 ` Anthony Liguori 2009-05-06 15:16 ` Gregory Haskins 2009-05-06 16:52 ` Arnd Bergmann 2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins 2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins 2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity 2009-05-05 13:40 ` Gregory Haskins 2009-05-05 14:00 ` Avi Kivity 2009-05-05 14:14 ` Gregory Haskins 2009-05-05 14:21 ` Gregory Haskins 2009-05-05 15:02 ` Avi Kivity 2009-05-05 23:17 ` Chris Wright 2009-05-06 3:51 ` Gregory Haskins 2009-05-06 7:22 ` Chris Wright 2009-05-06 13:17 ` Gregory Haskins 2009-05-06 16:07 ` Chris Wright 2009-05-07 17:03 ` Gregory Haskins 2009-05-07 18:05 ` Avi Kivity 2009-05-07 18:08 ` Gregory Haskins 2009-05-07 18:12 ` Avi Kivity 2009-05-07 18:16 ` Gregory Haskins 2009-05-07 18:24 ` Avi Kivity 2009-05-07 18:37 ` Gregory Haskins 2009-05-07 19:00 ` Avi Kivity 2009-05-07 19:05 ` Gregory Haskins 2009-05-07 19:43 ` Avi Kivity 2009-05-07 20:07 ` Gregory Haskins 2009-05-07 20:15 ` Avi Kivity 2009-05-07 20:26 ` Gregory Haskins 2009-05-08 8:35 ` Avi Kivity 2009-05-08 11:29 ` Gregory Haskins 2009-05-07 19:07 ` Chris Wright 2009-05-07 19:12 ` Gregory Haskins 2009-05-07 19:21 ` Chris Wright 2009-05-07 19:26 ` Avi Kivity 2009-05-07 19:44 ` Avi Kivity 2009-05-07 19:29 ` Gregory Haskins 2009-05-07 20:25 ` Chris Wright 2009-05-07 20:34 ` Gregory Haskins 2009-05-07 20:54 ` Arnd Bergmann 2009-05-07 21:13 ` Gregory Haskins 2009-05-07 21:57 ` Chris Wright 2009-05-07 22:11 ` Arnd Bergmann 2009-05-08 22:33 ` Benjamin Herrenschmidt 2009-05-11 13:01 ` Arnd Bergmann 2009-05-11 13:04 ` 
Gregory Haskins 2009-05-07 20:00 ` Arnd Bergmann 2009-05-07 20:31 ` Gregory Haskins 2009-05-07 20:42 ` Arnd Bergmann 2009-05-07 20:47 ` Arnd Bergmann 2009-05-07 20:50 ` Chris Wright 2009-05-07 23:35 ` Marcelo Tosatti 2009-05-07 23:43 ` Marcelo Tosatti 2009-05-08 3:17 ` Gregory Haskins 2009-05-08 7:55 ` Avi Kivity [not found] ` <20090508103253.GC3011@amt.cnet> 2009-05-08 11:37 ` Avi Kivity 2009-05-08 14:35 ` Marcelo Tosatti 2009-05-08 14:45 ` Gregory Haskins 2009-05-08 15:51 ` Marcelo Tosatti 2009-05-08 19:56 ` David S. Ahern 2009-05-08 20:01 ` Gregory Haskins 2009-05-08 23:23 ` David S. Ahern 2009-05-09 8:45 ` Avi Kivity 2009-05-09 11:27 ` Gregory Haskins 2009-05-10 4:27 ` David S. Ahern 2009-05-10 5:24 ` Avi Kivity 2009-05-10 4:24 ` David S. Ahern 2009-05-08 3:13 ` Gregory Haskins 2009-05-08 7:59 ` Avi Kivity 2009-05-08 11:09 ` Gregory Haskins [not found] ` <20090508104228.GD3011@amt.cnet> 2009-05-08 12:43 ` Gregory Haskins 2009-05-08 15:33 ` Marcelo Tosatti 2009-05-08 19:02 ` Avi Kivity 2009-05-08 16:48 ` Paul E. 
McKenney 2009-05-08 19:55 ` Gregory Haskins 2009-05-08 14:15 ` Gregory Haskins 2009-05-08 14:53 ` Anthony Liguori 2009-05-08 18:50 ` Avi Kivity 2009-05-08 19:02 ` Anthony Liguori 2009-05-08 19:06 ` Avi Kivity 2009-05-11 16:37 ` Jeremy Fitzhardinge 2009-05-07 12:29 ` Avi Kivity 2009-05-08 14:59 ` Anthony Liguori 2009-05-09 12:01 ` Gregory Haskins 2009-05-10 18:38 ` Anthony Liguori 2009-05-11 13:14 ` Gregory Haskins 2009-05-11 16:35 ` Hollis Blanchard 2009-05-11 17:06 ` Avi Kivity 2009-05-11 17:29 ` Gregory Haskins 2009-05-11 17:53 ` Avi Kivity 2009-05-11 17:51 ` Anthony Liguori 2009-05-11 18:02 ` Avi Kivity 2009-05-11 18:18 ` Anthony Liguori 2009-05-11 17:31 ` Anthony Liguori 2009-05-13 10:53 ` Gregory Haskins 2009-05-13 14:45 ` Gregory Haskins 2009-05-11 16:44 ` Hollis Blanchard 2009-05-11 17:54 ` Anthony Liguori 2009-05-11 19:24 ` PowerPC page faults Hollis Blanchard 2009-05-11 22:17 ` Anthony Liguori 2009-05-12 5:46 ` Liu Yu-B13201 2009-05-12 14:50 ` Hollis Blanchard 2009-05-06 13:56 ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori 2009-05-06 16:03 ` Gregory Haskins 2009-05-08 8:17 ` Avi Kivity 2009-05-08 15:20 ` Gregory Haskins 2009-05-08 17:00 ` Avi Kivity 2009-05-08 18:55 ` Gregory Haskins 2009-05-08 19:05 ` Anthony Liguori 2009-05-08 19:12 ` Avi Kivity 2009-05-08 19:59 ` Gregory Haskins 2009-05-10 9:59 ` Avi Kivity