* [RFC PATCH 0/3] generic hypercall support
@ 2009-05-05 13:24 Gregory Haskins
  2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
                   ` (3 more replies)
  0 siblings, 4 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, avi

(Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)

Please see patch 1/3 for a description.  This has been tested with a KVM
guest on x86_64 and appears to work properly.  Comments, please.

-Greg

---

Gregory Haskins (3):
      kvm: add pv_cpu_ops.hypercall support to the guest
      x86: add generic hypercall support
      add generic hypercall support


 arch/Kconfig                     |    3 +
 arch/x86/Kconfig                 |    1 
 arch/x86/include/asm/paravirt.h  |   13 ++++++
 arch/x86/include/asm/processor.h |    6 +++
 arch/x86/kernel/kvm.c            |   22 ++++++++++
 include/linux/hypercall.h        |   83 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 128 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/hypercall.h

-- 
Signature


* [RFC PATCH 1/3] add generic hypercall support
  2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins
@ 2009-05-05 13:24 ` Gregory Haskins
  2009-05-05 17:03   ` Hollis Blanchard
  2009-05-06 13:52   ` Anthony Liguori
  2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, avi

We add a generic hypercall() mechanism for use by IO code that is
compatible with a variety of hypervisors but prefers hypercalls over
other types of hypervisor traps for performance and/or feature reasons.

For instance, consider an emulated PCI device in KVM.  Today we can choose
to do IO over the MMIO or PIO infrastructure, but each has its own
distinct disadvantages:

*) MMIO causes a page-fault, which must be decoded by the hypervisor and is
   therefore fairly expensive.

*) PIO is more direct than MMIO, but it poses other problems such as:
      a) it can have a small, limited address space (2^16 on x86)
      b) it is a narrow-band interface (one 8-, 16-, 32-, or 64-bit word
         at a time)
      c) it is not available on all archs (the PCI spec mentions ppc as
         problematic) and its use is therefore discouraged.

Hypercalls, on the other hand, offer a direct access path like PIO, yet
do not suffer the same drawbacks, such as a limited address space or a
narrow-band interface.  Hypercalls are much more friendly to
software-to-software interaction, since we can pack multiple registers in
a way that is natural and simple for software to utilize.

The problem with hypercalls today is that there is no generic support.
There is support for various hypervisor-specific implementations (for
instance, see kvm_hypercall0() in arch/x86/include/asm/kvm_para.h).  This
makes it difficult to implement a device that is hypervisor agnostic, since
it would need to know not only the hypercall ABI but also which
platform-specific function call to make.

If we can convey a dynamic binding to a specific hypercall vector in a
generic way (out of the scope of this patch series), then an IO driver
could utilize that dynamic binding to communicate without requiring
hypervisor-specific knowledge.  Therefore, we implement a system-wide
hypercall() interface based on a variable-length list of unsigned longs
(representing registers to pack) and expect that the various arch/hypervisor
implementations can fill in the details, if supported.  This is expected
to be done as part of the pv_ops infrastructure, which is the natural
hook-point for hypervisor-specific code.  Note, however, that the
generic hypercall() interface does not require the implementation to use
pv_ops.

Example use case:
------------------

Consider a PCI device "X".  It can already advertise MMIO/PIO regions via
its BAR infrastructure.  With this new model it could also advertise a
hypercall vector in its device-specific upper configuration space.  (The
allocation and assignment of this vector on the backend are beyond the
scope of this series.)  The guest-side driver for device "X" would sense
(via something like a feature bit) whether the hypercall is available and
valid, read the value with a configuration cycle, and proceed to ignore
the BARs in favor of using the hypercall() interface.
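
As a rough sketch of how such a guest-side driver might consume this (the
config-space offsets, feature bit, and device structure below are
hypothetical and for illustration only):

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/hypercall.h>

/* Hypothetical device-specific config-space layout for device "X" */
#define X_CFG_FEATURES   0x40
#define X_CFG_HC_VECTOR  0x44
#define X_FEATURE_HC     (1 << 0)

struct x_device {
	struct pci_dev *pdev;
	void __iomem   *mmio;   /* BAR mapping, kept only as a fallback  */
	u32             hc_nr;  /* dynamically assigned hypercall vector */
	bool            use_hc;
};

static void x_detect_hypercall(struct x_device *x)
{
	u32 features = 0;

	pci_read_config_dword(x->pdev, X_CFG_FEATURES, &features);
	if (features & X_FEATURE_HC) {
		pci_read_config_dword(x->pdev, X_CFG_HC_VECTOR, &x->hc_nr);
		x->use_hc = true;       /* ignore the BARs from here on */
	}
}

static long x_kick(struct x_device *x, unsigned long queue)
{
	if (x->use_hc)
		return hypercall1(x->hc_nr, queue);

	iowrite32(queue, x->mmio);      /* MMIO/PIO fallback path */
	return 0;
}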

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/hypercall.h |   83 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 83 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/hypercall.h

diff --git a/include/linux/hypercall.h b/include/linux/hypercall.h
new file mode 100644
index 0000000..c8a1492
--- /dev/null
+++ b/include/linux/hypercall.h
@@ -0,0 +1,83 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_HYPERCALL_H
+#define _LINUX_HYPERCALL_H
+
+#ifdef CONFIG_HAVE_HYPERCALL
+
+long hypercall(unsigned long nr, unsigned long *args, size_t count);
+
+#else
+
+static inline long
+hypercall(unsigned long nr, unsigned long *args, size_t count)
+{
+	return -EINVAL;
+}
+
+#endif /* CONFIG_HAVE_HYPERCALL */
+
+#define hypercall0(nr) hypercall(nr, NULL, 0)
+#define hypercall1(nr, a1)                                              \
+	({						                \
+		unsigned long __args[] = { a1, };	                \
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+#define hypercall2(nr, a1, a2)						\
+	({						                \
+		unsigned long __args[] = { a1, a2, };	                \
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+#define hypercall3(nr, a1, a2, a3)			       		\
+	({						                \
+		unsigned long __args[] = { a1, a2, a3, };		\
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+#define hypercall4(nr, a1, a2, a3, a4)					\
+	({						                \
+		unsigned long __args[] = { a1, a2, a3, a4, };		\
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+#define hypercall5(nr, a1, a2, a3, a4, a5)				\
+	({						                \
+		unsigned long __args[] = { a1, a2, a3, a4, a5, };	\
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+#define hypercall6(nr, a1, a2, a3, a4, a5, a6)				\
+	({						                \
+		unsigned long __args[] = { a1, a2, a3, a4, a5, a6, };	\
+		long __ret;				                \
+		__ret = hypercall(nr, __args, ARRAY_SIZE(__args));      \
+		__ret;                                                  \
+	})
+
+
+#endif /* _LINUX_HYPERCALL_H */



* [RFC PATCH 2/3] x86: add generic hypercall support
  2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins
  2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
@ 2009-05-05 13:24 ` Gregory Haskins
  2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins
  2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity
  3 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, avi

This adds a hypercall() vector to the x86 pv_cpu_ops to be optionally filled
in by a hypervisor driver as it loads its other pv_ops components.  We also
select HAVE_HYPERCALL on x86 so that the generic hypercall code is enabled
whenever the kernel is built for x86.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/Kconfig                     |    3 +++
 arch/x86/Kconfig                 |    1 +
 arch/x86/include/asm/paravirt.h  |   13 +++++++++++++
 arch/x86/include/asm/processor.h |    6 ++++++
 4 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 78a35e9..239b658 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -112,3 +112,6 @@ config HAVE_DMA_API_DEBUG
 
 config HAVE_DEFAULT_NO_SPIN_MUTEXES
 	bool
+
+config HAVE_HYPERCALL
+        bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df9e885..3c609cf 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -46,6 +46,7 @@ config X86
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_BZIP2
 	select HAVE_KERNEL_LZMA
+	select HAVE_HYPERCALL
 
 config ARCH_DEFCONFIG
 	string
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 378e369..ed22c84 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -6,6 +6,7 @@
 #ifdef CONFIG_PARAVIRT
 #include <asm/pgtable_types.h>
 #include <asm/asm.h>
+#include <asm/errno.h>
 
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_NONE 0
@@ -203,6 +204,8 @@ struct pv_cpu_ops {
 
 	void (*swapgs)(void);
 
+	long (*hypercall)(unsigned long nr, unsigned long *args, size_t count);
+
 	struct pv_lazy_ops lazy_mode;
 };
 
@@ -723,6 +726,16 @@ static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
 	PVOP_VCALL4(pv_cpu_ops.cpuid, eax, ebx, ecx, edx);
 }
 
+static inline long hypercall(unsigned long nr,
+			     unsigned long *args,
+			     size_t count)
+{
+	if (!pv_cpu_ops.hypercall)
+		return -EINVAL;
+
+	return pv_cpu_ops.hypercall(nr, args, count);
+}
+
 /*
  * These special macros can be used to get or set a debugging register
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c2cceae..8fa988d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -570,6 +570,12 @@ static inline void native_swapgs(void)
 #define __cpuid			native_cpuid
 #define paravirt_enabled()	0
 
+static inline long
+hypercall(unsigned long nr, unsigned long *args, size_t count)
+{
+	return -EINVAL;
+}
+
 /*
  * These special macros can be used to get or set a debugging register
  */



* [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest
  2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins
  2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
  2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins
@ 2009-05-05 13:24 ` Gregory Haskins
  2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity
  3 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, avi

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/kernel/kvm.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33019dd..d299ed5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -50,6 +50,26 @@ static void kvm_io_delay(void)
 {
 }
 
+static long _kvm_hypercall(unsigned long nr,
+			   unsigned long *args,
+			   size_t count)
+{
+	switch (count) {
+	case 0:
+		return kvm_hypercall0(nr);
+	case 1:
+		return kvm_hypercall1(nr, args[0]);
+	case 2:
+		return kvm_hypercall2(nr, args[0], args[1]);
+	case 3:
+		return kvm_hypercall3(nr, args[0], args[1], args[2]);
+	case 4:
+		return kvm_hypercall4(nr, args[0], args[1], args[2], args[3]);
+	default:
+		return -EINVAL;
+	}
+}
+
 static void kvm_mmu_op(void *buffer, unsigned len)
 {
 	int r;
@@ -207,6 +227,8 @@ static void paravirt_ops_setup(void)
 	if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
 		pv_cpu_ops.io_delay = kvm_io_delay;
 
+	pv_cpu_ops.hypercall = _kvm_hypercall;
+
 	if (kvm_para_has_feature(KVM_FEATURE_MMU_OP)) {
 		pv_mmu_ops.set_pte = kvm_set_pte;
 		pv_mmu_ops.set_pte_at = kvm_set_pte_at;



* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins
@ 2009-05-05 13:36 ` Avi Kivity
  2009-05-05 13:40   ` Gregory Haskins
  3 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-05 13:36 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, kvm

Gregory Haskins wrote:
> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>
> Please see patch 1/3 for a description.  This has been tested with a KVM
> guest on x86_64 and appears to work properly.  Comments, please.
>   

What about the hypercalls in include/asm/kvm_para.h?

In general, hypercalls cannot be generic since each hypervisor 
implements its own ABI.  The abstraction needs to be at a higher level 
(pv_ops is such a level).


-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity
@ 2009-05-05 13:40   ` Gregory Haskins
  2009-05-05 14:00     ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 13:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm


Avi Kivity wrote:
> Gregory Haskins wrote:
>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>>
>> Please see patch 1/3 for a description.  This has been tested with a KVM
>> guest on x86_64 and appears to work properly.  Comments, please.
>>   
>
> What about the hypercalls in include/asm/kvm_para.h?
>
> In general, hypercalls cannot be generic since each hypervisor
> implements its own ABI.
Please see the prologue to 1/3.  It's all described there, including a
use case which I think answers your questions.  If there is still
ambiguity, let me know.

>   The abstraction needs to be at a higher level (pv_ops is such a level).
Yep, agreed.  That's exactly what this series is doing, actually.


-Greg






* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 13:40   ` Gregory Haskins
@ 2009-05-05 14:00     ` Avi Kivity
  2009-05-05 14:14       ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-05 14:00 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>>>
>>> Please see patch 1/3 for a description.  This has been tested with a KVM
>>> guest on x86_64 and appears to work properly.  Comments, please.
>>>   
>>>       
>> What about the hypercalls in include/asm/kvm_para.h?
>>
>> In general, hypercalls cannot be generic since each hypervisor
>> implements its own ABI.
>>     
> Please see the prologue to 1/3.  Its all described there, including a
> use case which I think answers your questions.  If there is still
> ambiguity, let me know.
>
>   

Yeah, sorry.

>>   The abstraction needs to be at a higher level (pv_ops is such a level).
>>     
> Yep, agreed.  Thats exactly what this series is doing, actually.
>   

No, it doesn't.  It makes "making hypercalls" a pv_op, but hypervisors 
don't implement the same ABI.

pv_ops all _use_ hypercalls to implement higher level operations, like 
set_pte (probably the only place set_pte can be considered a high level 
operation).

In this case, the higher level event could be 
hypervisor_dynamic_event(number); each pv_ops implementation would use 
its own hypercalls to implement that.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 14:00     ` Avi Kivity
@ 2009-05-05 14:14       ` Gregory Haskins
  2009-05-05 14:21         ` Gregory Haskins
                           ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 14:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm


Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>>>>
>>>> Please see patch 1/3 for a description.  This has been tested with
>>>> a KVM
>>>> guest on x86_64 and appears to work properly.  Comments, please.
>>>>         
>>> What about the hypercalls in include/asm/kvm_para.h?
>>>
>>> In general, hypercalls cannot be generic since each hypervisor
>>> implements its own ABI.
>>>     
>> Please see the prologue to 1/3.  Its all described there, including a
>> use case which I think answers your questions.  If there is still
>> ambiguity, let me know.
>>
>>   
>
> Yeah, sorry.
>
>>>   The abstraction needs to be at a higher level (pv_ops is such a
>>> level).
>>>     
>> Yep, agreed.  Thats exactly what this series is doing, actually.
>>   
>
> No, it doesn't.  It makes "making hypercalls" a pv_op, but hypervisors
> don't implement the same ABI.
Yes, that is true, but I think the issue right now is more a matter of
semantics.  I think we are on the same page.

So you would never have someone making a generic
hypercall(KVM_HC_MMU_OP).  I agree.  What I am proposing here is more
akin to PIO-BAR + iowrite()/ioread().  E.g. the infrastructure sets up
the "addressing" (where in PIO this is literally an address, and for
hypercalls this is a vector), but the "device" defines the ABI at that
address.  So it's really the "device end-point" that defines the ABI
here, not the hypervisor (per se), and that's why I thought it was ok to
declare these "generic".  But to your point below...

>
> pv_ops all _use_ hypercalls to implement higher level operations, like
> set_pte (probably the only place set_pte can be considered a high
> level operation).
>
> In this case, the higher level event could be
> hypervisor_dynamic_event(number); each pv_ops implementation would use
> its own hypercalls to implement that.

I see.  I had designed it slightly differently, where KVM could assign any
top-level vector it wanted, and that drove the guest-side interface
you see here to be a more "generic hypercall".  However, I think your
proposal is perfectly fine too, and it makes sense to more narrowly focus
these calls as specifically "dynamic"...as those are the only vectors that
we could technically use like this anyway.

So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
to kvm_para.h, and I will change the interface to follow suit (something
like s/hypercall/dynhc).  Sound good?

Thanks, Avi,
-Greg





* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 14:14       ` Gregory Haskins
@ 2009-05-05 14:21         ` Gregory Haskins
  2009-05-05 15:02         ` Avi Kivity
  2009-05-05 23:17         ` Chris Wright
  2 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-05 14:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gregory Haskins, linux-kernel, kvm


Gregory Haskins wrote:
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit (something
> like s/hypercall/dynhc).  Sound good?
>   

A small ramification of this change will be that I will need to do
something like add a feature bit to cpuid for detecting whether HC_DYNAMIC
is supported on the backend or not.  The current v1 design doesn't suffer
from this requirement because the presence of the dynamic vector itself
is enough to know it's supported.  I like Avi's proposal enough to say
that it's worth this minor inconvenience, but FYI I will have to
additionally submit a userspace patch for v2 if we go this route.
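
For reference, on the guest side that check would presumably reduce to
something like the following (KVM_FEATURE_DYNHC is a hypothetical feature
bit; nothing has been allocated yet):

#include <linux/types.h>
#include <asm/kvm_para.h>

#define KVM_FEATURE_DYNHC	4	/* hypothetical, not yet allocated */

static bool dynhc_available(void)
{
	return kvm_para_available() &&
	       kvm_para_has_feature(KVM_FEATURE_DYNHC);
}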

-Greg






* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 14:14       ` Gregory Haskins
  2009-05-05 14:21         ` Gregory Haskins
@ 2009-05-05 15:02         ` Avi Kivity
  2009-05-05 23:17         ` Chris Wright
  2 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-05 15:02 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
> I see.  I had designed it slightly different where KVM could assign any
> top level vector it wanted and thus that drove the guest-side interface
> you see here to be more "generic hypercall".  However, I think your
> proposal is perfectly fine too and it makes sense to more narrowly focus
> these calls as specifically "dynamic"...as thats the only vectors that
> we could technically use like this anyway.
>
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit (something
> like s/hypercall/dynhc).  Sound good?
>   

Yeah.

Another couple of points:

- on the host side, we'd rig this to hit an eventfd (a sketch follows
below).  Nothing stops us from rigging pio to hit an eventfd as well,
giving us kernel handling for pio trigger points.
- pio actually has an advantage over hypercalls with nested guests.
Since hypercalls don't have an associated port number, the lowermost
hypervisor must interpret a hypercall as going to a guest's hypervisor,
and not to any lower-level hypervisor.  What it boils down to is that you
cannot use device assignment to give a guest access to a virtio/vbus
device from a lower-level hypervisor.

(Bah, that's totally unreadable.  What I want is

instead of

   hypervisor[eth0/virtio-server]
      ----> intermediate[virtio-driver/virtio-server]
      ----> guest[virtio-driver]

do

   hypervisor[eth0/virtio-server]
      ----> intermediate[assign virtio device]
      ----> guest[virtio-driver]

well, it's probably still unreadable)
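
A minimal userspace sketch of the eventfd side of that (how the
hypercall/pio exit gets wired to the eventfd is hypothetical glue and is
not shown here):

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = eventfd(0, 0);
	uint64_t hits;

	if (fd < 0)
		return 1;

	/* 'fd' would be handed to the (hypothetical) in-kernel trigger
	 * point; each guest doorbell then shows up as a counter increment. */
	while (read(fd, &hits, sizeof(hits)) == sizeof(hits))
		printf("doorbell fired %llu time(s)\n",
		       (unsigned long long)hits);

	return 0;
}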

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH 1/3] add generic hypercall support
  2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
@ 2009-05-05 17:03   ` Hollis Blanchard
  2009-05-06 13:52   ` Anthony Liguori
  1 sibling, 0 replies; 115+ messages in thread
From: Hollis Blanchard @ 2009-05-05 17:03 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, kvm, avi

On Tue, 2009-05-05 at 09:24 -0400, Gregory Haskins wrote:
> 
> *) PIO is more direct than MMIO, but it poses other problems such as:
>       a) can have a small limited address space (x86 is 2^16)
>       b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a time)
>       c) not available on all archs (PCI mentions ppc as problematic) and
>          is therefore recommended to avoid.

Side note: I don't know what PCI has to do with this, and "problematic"
isn't the word I would use. ;) As far as I know, x86 is the only
still-alive architecture that implements instructions for a separate IO
space (not even ia64 does).

-- 
Hollis Blanchard
IBM Linux Technology Center



* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 14:14       ` Gregory Haskins
  2009-05-05 14:21         ` Gregory Haskins
  2009-05-05 15:02         ` Avi Kivity
@ 2009-05-05 23:17         ` Chris Wright
  2009-05-06  3:51           ` Gregory Haskins
  2 siblings, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-05 23:17 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Avi Kivity, Gregory Haskins, linux-kernel, kvm

* Gregory Haskins (gregory.haskins@gmail.com) wrote:
> So you would never have someone making a generic
> hypercall(KVM_HC_MMU_OP).  I agree.

Which is why I think the interface proposal you've made is wrong.  There
are already hypercall interfaces w/ specific ABI and semantic meaning (which
are typically called directly/indirectly from an existing pv op hook).

But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
means the hypercall number and arg list must be the same in order for code
to call hypercall() in a hypervisor-agnostic way.

The pv_ops level needs to have semantic meaning, not be a free-form
hypercall multiplexor.

thanks,
-chris


* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-05 23:17         ` Chris Wright
@ 2009-05-06  3:51           ` Gregory Haskins
  2009-05-06  7:22             ` Chris Wright
  2009-05-06 13:56             ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-06  3:51 UTC (permalink / raw)
  To: Chris Wright; +Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm


Chris Wright wrote:
> * Gregory Haskins (gregory.haskins@gmail.com) wrote:
>   
>> So you would never have someone making a generic
>> hypercall(KVM_HC_MMU_OP).  I agree.
>>     
>
> Which is why I think the interface proposal you've made is wrong.

I respectfully disagree.  It's only wrong in that the name chosen for the
interface was perhaps too broad/vague.  I still believe the concept is
sound, and the general layering is appropriate.

>   There's
> already hypercall interfaces w/ specific ABI and semantic meaning (which
> are typically called directly/indirectly from an existing pv op hook).
>   

Yes, these are different, thus the new interface.

> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
> means hypercall number and arg list must be the same in order for code
> to call hypercall() in a hypervisor agnostic way.
>   

Yes, and that is exactly the intention.  I think it's perhaps the point
you are missing.

I am well aware that historically the things we do over a hypercall
interface would inherently have meaning only to a specific hypervisor
(e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
doesn't in any way imply that it is the only use for the general
concept.  It's just the only way they have been exploited to date.

While I acknowledge that the hypervisor certainly must be coordinated
with their use, in their essence hypercalls are just another form of IO,
joining the ranks of things like MMIO and PIO.  This is an attempt to
bring them out of the bowels of CONFIG_PARAVIRT to make them a
first-class citizen.

The thing I am building here is really not a general hypercall in the
broad sense.  Rather, it's a subset of the hypercall vector namespace.
It is designed specifically for dynamically binding a synchronous call()
interface to things like virtual devices, and it is therefore these
virtual device models that define the particular ABI within that
namespace.  Thus the ABI in question is explicitly independent of the
underlying hypervisor.  I therefore stand by the proposed design to have
this interface described above the hypervisor support layer (i.e.
pv_ops), albeit with perhaps a better name like "dynamic hypercall", as
per my later discussion with Avi.

Consider PIO: the hypervisor (or hardware) and OS negotiate a port
address, but the two end-points are the driver and the device-model (or
real device).  The driver doesn't have to say:

if (kvm)
   kvm_iowrite32(addr, ..);
else if (lguest)
   lguest_iowrite32(addr, ...);
else
   native_iowrite32(addr, ...);

Instead, it just says "iowrite32(addr, ...);" and the address is used to
route the message appropriately by the platform.  The ABI of that
message, however, is specific to the driver/device and is not
interpreted by kvm/lguest/native-hw infrastructure on the way.

Today, there is no equivalent of a platform-agnostic "iowrite32()" for
hypercalls, so the driver would look like the pseudocode above, except
substituting kvm_hypercall(), lguest_hypercall(), etc.  The proposal
is to allow the hypervisor to assign a dynamic vector to resources in
the backend and convey this vector to the guest (such as in PCI
config-space, as mentioned in my example use-case).  This provides the
"address negotiation" function that would normally be done for something
like a pio port-address.  The hypervisor-agnostic driver can then use
this globally recognized address-token, coupled with other device-private
ABI parameters, to communicate with the device.  This can all occur
without the core hypervisor needing to understand the details beyond the
addressing.

What this means for our interface design is that the only thing the
hypervisor really cares about is the first "nr" parameter.  This acts as
our address-token.  The optional/variable list of args is just payload
as far as the core infrastructure is concerned and is coupled only to
our device ABI.  The args were chosen to be an array of ulongs (vs.
something like varargs) to reflect the fact that hypercall arguments are
typically passed by packing registers.
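
In code, that split might look roughly like this (the verbs below are a
hypothetical device-private ABI; only "nr" is visible to the hypervisor):

#include <linux/hypercall.h>

/* Device-private verbs: opaque payload as far as the hypervisor goes */
enum x_verbs {
	X_GET_FEATURES = 1,
	X_KICK_QUEUE   = 2,
};

static long x_call(unsigned long hc_nr, unsigned long verb, unsigned long arg)
{
	/* hc_nr is the negotiated address-token; verb/arg are device ABI */
	return hypercall2(hc_nr, verb, arg);
}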

Hope this helps,
-Greg





* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06  3:51           ` Gregory Haskins
@ 2009-05-06  7:22             ` Chris Wright
  2009-05-06 13:17               ` Gregory Haskins
  2009-05-06 13:56             ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori
  1 sibling, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-06  7:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:
> Chris Wright wrote:
> > But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
> > means hypercall number and arg list must be the same in order for code
> > to call hypercall() in a hypervisor agnostic way.
> 
> Yes, and that is exactly the intention.  I think its perhaps the point
> you are missing.

Yes, I was reading this as purely any hypercall, but it seems a bit
more like:
 pv_io_ops->iomap()
 pv_io_ops->ioread()
 pv_io_ops->iowrite()
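
i.e. something shaped roughly like the following (purely hypothetical; no
such pv_io_ops exists today):

#include <linux/types.h>
#include <linux/compiler.h>

struct pv_io_ops {
	void __iomem *(*iomap)(unsigned long token, unsigned long len);
	u32  (*ioread)(const volatile void __iomem *addr);
	void (*iowrite)(u32 val, volatile void __iomem *addr);
};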

<snip>
> Today, there is no equivelent of a platform agnostic "iowrite32()" for
> hypercalls so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space as mentioned in my example use-case).  The provides the
> "address negotiation" function that would normally be done for something
> like a pio port-address.   The hypervisor agnostic driver can then use
> this globally recognized address-token coupled with other device-private
> ABI parameters to communicate with the device.  This can all occur
> without the core hypervisor needing to understand the details beyond the
> addressing.

VF drivers can also have this issue (and typically use mmio).
I at least have a better idea of what your proposal is now, thanks for
the explanation.  Are you able to demonstrate a concrete benefit with it yet
(improved latency numbers, for example)?

thanks,
-chris


* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06  7:22             ` Chris Wright
@ 2009-05-06 13:17               ` Gregory Haskins
  2009-05-06 16:07                 ` Chris Wright
  2009-05-07 12:29                 ` Avi Kivity
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-06 13:17 UTC (permalink / raw)
  To: Chris Wright; +Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm


Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Chris Wright wrote:
>>     
>>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
>>> means hypercall number and arg list must be the same in order for code
>>> to call hypercall() in a hypervisor agnostic way.
>>>       
>> Yes, and that is exactly the intention.  I think its perhaps the point
>> you are missing.
>>     
>
> Yes, I was reading this as purely any hypercall, but it seems a bit
> more like:
>  pv_io_ops->iomap()
>  pv_io_ops->ioread()
>  pv_io_ops->iowrite()
>   

Right.

> <snip>
>   
>> Today, there is no equivelent of a platform agnostic "iowrite32()" for
>> hypercalls so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
>> is to allow the hypervisor to assign a dynamic vector to resources in
>> the backend and convey this vector to the guest (such as in PCI
>> config-space as mentioned in my example use-case).  The provides the
>> "address negotiation" function that would normally be done for something
>> like a pio port-address.   The hypervisor agnostic driver can then use
>> this globally recognized address-token coupled with other device-private
>> ABI parameters to communicate with the device.  This can all occur
>> without the core hypervisor needing to understand the details beyond the
>> addressing.
>>     
>
> VF drivers can also have this issue (and typically use mmio).
> I at least have a better idea what your proposal is, thanks for
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers for example)?
>   

I had a test-harness/numbers for this kind of thing, but it's a bit
crufty since it's from ~1.5 years ago.  I will dig it up, update it, and
generate/post new numbers.

Thanks Chris,
-Greg





* Re: [RFC PATCH 1/3] add generic hypercall support
  2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
  2009-05-05 17:03   ` Hollis Blanchard
@ 2009-05-06 13:52   ` Anthony Liguori
  2009-05-06 15:16     ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-06 13:52 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, kvm, avi

Gregory Haskins wrote:
> We add a generic hypercall() mechanism for use by IO code which is
> compatible with a variety of hypervisors, but which prefers to use
> hypercalls over other types of hypervisor traps for performance and/or
> feature reasons.
>
> For instance, consider an emulated PCI device in KVM.  Today we can chose
> to do IO over MMIO or PIO infrastructure, but they each have their own
> distinct disadvantages:
>
> *) MMIO causes a page-fault, which must be decoded by the hypervisor and is
>    therefore fairly expensive.
>
> *) PIO is more direct than MMIO, but it poses other problems such as:
>       a) can have a small limited address space (x86 is 2^16)
>       b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a time)
>       c) not available on all archs (PCI mentions ppc as problematic) and
>          is therefore recommended to avoid.
>
> Hypercalls, on the other hand, offer a direct access path like PIOs, yet
> do not suffer the same drawbacks such as a limited address space or a
> narrow-band interface.  Hypercalls are much more friendly to software
> to software interaction since we can pack multiple registers in a way
> that is natural and simple for software to utilize.
>   

No way.  Hypercalls are just about awful because they cannot be 
implemented sanely with VT/SVM as Intel/AMD couldn't agree on a common 
instruction for it.  This means you either need a hypercall page, which 
I'm pretty sure makes transparent migration impossible, or you need to 
do hypercall patching which is going to throw off attestation.

If anything, I'd argue that we shouldn't use hypercalls for anything in 
KVM because it will break run-time attestation.

Hypercalls cannot pass any data either.  We pass data with hypercalls by 
relying on external state (like register state).  It's just as easy to 
do this with PIO. VMware does this with vmport, for instance.  However, 
in general, you do not want to pass that much data with a notification.  
It's better to rely on some external state (like a ring queue) and have 
the "hypercall" act simply as a notification mechanism.

Regards,

Anthony Liguori


* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06  3:51           ` Gregory Haskins
  2009-05-06  7:22             ` Chris Wright
@ 2009-05-06 13:56             ` Anthony Liguori
  2009-05-06 16:03               ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-06 13:56 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm

Gregory Haskins wrote:
> Chris Wright wrote:
>   
>> * Gregory Haskins (gregory.haskins@gmail.com) wrote:
>>   
>>     
>>> So you would never have someone making a generic
>>> hypercall(KVM_HC_MMU_OP).  I agree.
>>>     
>>>       
>> Which is why I think the interface proposal you've made is wrong.
>>     
>
> I respectfully disagree.  Its only wrong in that the name chosen for the
> interface was perhaps too broad/vague.  I still believe the concept is
> sound, and the general layering is appropriate. 
>
>   
>>   There's
>> already hypercall interfaces w/ specific ABI and semantic meaning (which
>> are typically called directly/indirectly from an existing pv op hook).
>>   
>>     
>
> Yes, these are different, thus the new interface.
>
>   
>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
>> means hypercall number and arg list must be the same in order for code
>> to call hypercall() in a hypervisor agnostic way.
>>   
>>     
>
> Yes, and that is exactly the intention.  I think its perhaps the point
> you are missing.
>
> I am well aware that historically the things we do over a hypercall
> interface would inherently have meaning only to a specific hypervisor
> (e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
> doesn't in any way infer that it is the only use for the general
> concept.  Its just the only way they have been exploited to date.
>
> While I acknowledge that the hypervisor certainly must be coordinated
> with their use, in their essence hypercalls are just another form of IO
> joining the ranks of things like MMIO and PIO.  This is an attempt to
> bring them out of the bowels of CONFIG_PARAVIRT to make them a first
> class citizen. 
>
> The thing I am building here is really not a general hypercall in the
> broad sense.  Rather, its a subset of the hypercall vector namespace. 
> It is designed specifically for dynamic binding a synchronous call()
> interface to things like virtual devices, and it is therefore these
> virtual device models that define the particular ABI within that
> namespace.  Thus the ABI in question is explicitly independent of the
> underlying hypervisor.  I therefore stand by the proposed design to have
> this interface described above the hypervisor support layer (i.e.
> pv_ops) (albeit with perhaps a better name like "dynamic hypercall" as
> per my later discussion with Avi).
>
> Consider PIO: The hypervisor (or hardware) and OS negotiate a port
> address, but the two end-points are the driver and the device-model (or
> real device).  The driver doesnt have to say:
>
> if (kvm)
>    kvm_iowrite32(addr, ..);
> else if (lguest)
>    lguest_iowrite32(addr, ...);
> else
>    native_iowrite32(addr, ...);
>
> Instead, it just says "iowrite32(addr, ...);" and the address is used to
> route the message appropriately by the platform.  The ABI of that
> message, however, is specific to the driver/device and is not
> interpreted by kvm/lguest/native-hw infrastructure on the way.
>
> Today, there is no equivelent of a platform agnostic "iowrite32()" for
> hypercalls so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space as mentioned in my example use-case).  The provides the
> "address negotiation" function that would normally be done for something
> like a pio port-address.   The hypervisor agnostic driver can then use
> this globally recognized address-token coupled with other device-private
> ABI parameters to communicate with the device.  This can all occur
> without the core hypervisor needing to understand the details beyond the
> addressing.
>   

PCI already provides a hypervisor-agnostic interface (via IO regions).
You have a mechanism for devices to discover which regions they have
allocated and to request remappings.  It's supported by Linux and
Windows.  It works on the vast majority of architectures out there today.

Why reinvent the wheel?

Regards,

Anthony Liguori


* Re: [RFC PATCH 1/3] add generic hypercall support
  2009-05-06 13:52   ` Anthony Liguori
@ 2009-05-06 15:16     ` Gregory Haskins
  2009-05-06 16:52       ` Arnd Bergmann
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-06 15:16 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: linux-kernel, kvm, avi


Anthony Liguori wrote:
> Gregory Haskins wrote:
>> We add a generic hypercall() mechanism for use by IO code which is
>> compatible with a variety of hypervisors, but which prefers to use
>> hypercalls over other types of hypervisor traps for performance and/or
>> feature reasons.
>>
>> For instance, consider an emulated PCI device in KVM.  Today we can
>> chose
>> to do IO over MMIO or PIO infrastructure, but they each have their own
>> distinct disadvantages:
>>
>> *) MMIO causes a page-fault, which must be decoded by the hypervisor
>> and is
>>    therefore fairly expensive.
>>
>> *) PIO is more direct than MMIO, but it poses other problems such as:
>>       a) can have a small limited address space (x86 is 2^16)
>>       b) is a narrow-band interface (one 8, 16, 32, 64 bit word at a
>> time)
>>       c) not available on all archs (PCI mentions ppc as problematic)
>> and
>>          is therefore recommended to avoid.
>>
>> Hypercalls, on the other hand, offer a direct access path like PIOs, yet
>> do not suffer the same drawbacks such as a limited address space or a
>> narrow-band interface.  Hypercalls are much more friendly to software
>> to software interaction since we can pack multiple registers in a way
>> that is natural and simple for software to utilize.
>>   
>
> No way.  Hypercalls are just about awful because they cannot be
> implemented sanely with VT/SVM as Intel/AMD couldn't agree on a common
> instruction for it.  This means you either need a hypercall page,
> which I'm pretty sure makes transparent migration impossible, or you
> need to do hypercall patching which is going to throw off attestation.

This is irrelevant, since KVM already does this patching today and we
therefore have to deal with this fact.  This new work rides on the
existing infrastructure, so it is of negligible cost to add new vectors,
at least w.r.t. your concerns above.

>
> If anything, I'd argue that we shouldn't use hypercalls for anything
> in KVM because it will break run-time attestation.

The cat's already out of the bag on that one.  Besides, even if the
hypercall infrastructure does break attestation (which I don't think is
proven as fact), not everyone cares about migration.  At some point,
someone may want to make a decision to trade performance for the ability
to migrate (or vice versa).  I am ok with that.

>
>
> Hypercalls cannot pass any data either.  We pass data with hypercalls
> by relying on external state (like register state).  

This is a silly argument.  The CALL or SYSENTER instructions do not pass
data either.  Rather, they branch the IP and/or execution context,
predicated on the notion that the new IP/context can still access the
relevant machine state prior to the call.  VMCALL-type instructions are
just one more logical extension of that same construct.

PIO, on the other hand, is different.  The architectural assumption is
that the target endpoint does not have access to the machine state.
Sure, we can "cheat" in virtualization by knowing that it really will
have access, but then we are still constrained by all the things I have
already mentioned that are disadvantageous to PIOs, plus...

> It's just as easy to do this with PIO.

...you are glossing over the fact that we already have the infrastructure
to do proper register setup in kvm_hypercallX().  We would need to come
up with a similar (arch-specific) "pio_callX()" as well.  Oh wait, no
we wouldn't, since PIOs apparently only work on x86. ;)  Ok, so we would
need to come up with these pio_calls for x86, and no other arch can use
the infrastructure (but wait, PPC can use PCI too, so how does that
work?  It must be either MMIO emulation or it's not supported?  That puts
us back to square one).

In addition, we can have at most 8k unique vectors on x86_64 (2^16 / 8
bytes per port).  Even this is a gross overestimation, because it
assumes the entire address space is available for pio_call, which it
isn't, and it assumes all allocations are the precise width of the
address space (8 bytes), which they won't be.  It doesn't exactly sound
like a winner to me.

> VMware does this with vmport, for instance.  However, in general, you
> do not want to pass that much data with a notification.  It's better
> to rely on some external state (like a ring queue) and have the
> "hypercall" act simply as a notification mechanism.

Disagree completely.  Things like shared-memory rings provide a really
nice asynchronous mechanism, and for the majority of your IO and for the
fast path that's probably exactly what we should do.  However, there are
plenty of patterns that fit better with a synchronous model.  Case in
point: see the VIRTIO_VBUS_FUNC_GET_FEATURES, SET_STATUS, and RESET
functions in the virtio-vbus driver I posted.  And as an aside, the ring
notification rides on the same fundamental synchronous transport.

So sure, you could argue that you could also make an extra "call" ring,
place your request in the ring, and use a flush/block primitive to wait
for that item to execute.  But that sounds a bit complicated when all I
want is the ability to invoke simple synchronous calls.  Therefore, let's
just get this "right" and build the primitives to express both patterns
easily.

In the end, I don't specifically care whether the "call()" vehicle is a PIO
or a hypercall per se, as long as it meets the following:
   a) it is feasible at least anywhere PCI works
   b) it provides a robust and uniform interface so drivers do not need to
      care about the implementation
   c) it has equivalent performance (e.g. if PPC maps PIO as MMIO, that is
      no good)

Regards,
-Greg





* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06 13:56             ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori
@ 2009-05-06 16:03               ` Gregory Haskins
  2009-05-08  8:17                 ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-06 16:03 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm


Anthony Liguori wrote:
> Gregory Haskins wrote:
>>
>> Today, there is no equivelent of a platform agnostic "iowrite32()" for
>> hypercalls so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
>> is to allow the hypervisor to assign a dynamic vector to resources in
>> the backend and convey this vector to the guest (such as in PCI
>> config-space as mentioned in my example use-case).  The provides the
>> "address negotiation" function that would normally be done for something
>> like a pio port-address.   The hypervisor agnostic driver can then use
>> this globally recognized address-token coupled with other device-private
>> ABI parameters to communicate with the device.  This can all occur
>> without the core hypervisor needing to understand the details beyond the
>> addressing.
>>   
>
> PCI already provide a hypervisor agnostic interface (via IO regions). 
> You have a mechanism for devices to discover which regions they have
> allocated and to request remappings.  It's supported by Linux and
> Windows.  It works on the vast majority of architectures out there today.
>
> Why reinvent the wheel?

I suspect the current wheel is square.  And the air is out.  Plus it's
pulling to the left when I accelerate, but to be fair that may be my
alignment...

:) But I digress.  See: http://patchwork.kernel.org/patch/21865/

To give PCI proper respect, I think its greatest value add here is the
inherent IRQ routing (which is a huge/difficult component, as I
experienced with dynirq in vbus v1).  Beyond that, however, I think we
can do better.

HTH
-Greg





* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06 13:17               ` Gregory Haskins
@ 2009-05-06 16:07                 ` Chris Wright
  2009-05-07 17:03                   ` Gregory Haskins
  2009-05-07 12:29                 ` Avi Kivity
  1 sibling, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-06 16:07 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:
> Chris Wright wrote:
> > VF drivers can also have this issue (and typically use mmio).
> > I at least have a better idea what your proposal is, thanks for
> > explanation.  Are you able to demonstrate concrete benefit with it yet
> > (improved latency numbers for example)?
> 
> I had a test-harness/numbers for this kind of thing, but its a bit
> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
> generate/post new numbers.

That would be useful, because I keep coming back to pio and shared
page(s) when I think of why not to do this.  Seems I'm not alone in that.

thanks,
-chris


* Re: [RFC PATCH 1/3] add generic hypercall support
  2009-05-06 15:16     ` Gregory Haskins
@ 2009-05-06 16:52       ` Arnd Bergmann
  0 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-06 16:52 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Anthony Liguori, linux-kernel, kvm, avi

On Wednesday 06 May 2009, Gregory Haskins wrote:
 
> Ok, so we would
> need to come up with these pio_calls for x86, and no other arch can use
> the infrastructure (but wait, PPC can use PCI too, so how does that
> work? It must be either MMIO emulation or its not supported?  That puts
> us back to square one).

PowerPC already has an abstraction for PIO and MMIO because certain
broken hardware chips don't do what they should, see
arch/powerpc/platforms/cell/io-workarounds.c for the only current user.
If you need to, you could do the same on x86 (or generalize the code),
but please don't add another level of indirection on top of this.

	Arnd <><


* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06 13:17               ` Gregory Haskins
  2009-05-06 16:07                 ` Chris Wright
@ 2009-05-07 12:29                 ` Avi Kivity
  2009-05-08 14:59                   ` Anthony Liguori
  1 sibling, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 12:29 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
> Chris Wright wrote:
>   
>> * Gregory Haskins (ghaskins@novell.com) wrote:
>>   
>>     
>>> Chris Wright wrote:
>>>     
>>>       
>>>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
>>>> means hypercall number and arg list must be the same in order for code
>>>> to call hypercall() in a hypervisor agnostic way.
>>>>       
>>>>         
>>> Yes, and that is exactly the intention.  I think its perhaps the point
>>> you are missing.
>>>     
>>>       
>> Yes, I was reading this as purely any hypercall, but it seems a bit
>> more like:
>>  pv_io_ops->iomap()
>>  pv_io_ops->ioread()
>>  pv_io_ops->iowrite()
>>   
>>     
>
> Right.
>   

Hmm, reminds me of something I thought of a while back.

We could implement an 'mmio hypercall' that does mmio reads/writes via a 
hypercall instead of an mmio operation.  That will speed up mmio for 
emulated devices (say, e1000).  It's easy to hook into Linux 
(readl/writel), is pci-friendly, non-x86 friendly, etc.

It also makes the device work when hypercall support is not available 
(qemu/tcg); you simply fall back on mmio.
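
Roughly (the pv hook below is hypothetical; nothing registers it today):

#include <linux/io.h>
#include <linux/types.h>

/* Hypothetical fast-path hook; a hypervisor driver would fill this in.
 * Returns 0 if it handled the store, nonzero to request the fallback. */
static long (*pv_mmio_write32)(volatile void __iomem *addr, u32 val);

static inline void fast_writel(u32 val, volatile void __iomem *addr)
{
	if (pv_mmio_write32 && pv_mmio_write32(addr, val) == 0)
		return;

	writel(val, addr);	/* ordinary trapped MMIO, e.g. under qemu/tcg */
}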

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06 16:07                 ` Chris Wright
@ 2009-05-07 17:03                   ` Gregory Haskins
  2009-05-07 18:05                     ` Avi Kivity
  2009-05-07 23:35                     ` Marcelo Tosatti
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 17:03 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori


Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Chris Wright wrote:
>>     
>>> VF drivers can also have this issue (and typically use mmio).
>>> I at least have a better idea what your proposal is, thanks for
>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>> (improved latency numbers for example)?
>>>       
>> I had a test-harness/numbers for this kind of thing, but its a bit
>> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
>> generate/post new numbers.
>>     
>
> That would be useful, because I keep coming back to pio and shared
> page(s) when think of why not to do this.  Seems I'm not alone in that.
>
> thanks,
> -chris
>   

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
reinvent the wheel?" questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.

-Greg



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 17:03                   ` Gregory Haskins
@ 2009-05-07 18:05                     ` Avi Kivity
  2009-05-07 18:08                       ` Gregory Haskins
  2009-05-07 23:35                     ` Marcelo Tosatti
  1 sibling, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 18:05 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
>
> http://developer.novell.com/wiki/index.php/WhyHypercalls
>
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
>
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
>
> Let me know if you have any questions.
>   

Well, 420 ns is not to be sneezed at.

What do you think of my mmio hypercall?  That will speed up all mmio to 
be as fast as a hypercall, and then we can use ordinary mmio/pio writes 
to trigger things.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:05                     ` Avi Kivity
@ 2009-05-07 18:08                       ` Gregory Haskins
  2009-05-07 18:12                         ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 18:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 856 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>>   
>
> Well, 420 ns is not to be sneezed at.
>
> What do you think of my mmio hypercall?  That will speed up all mmio
> to be as fast as a hypercall, and then we can use ordinary mmio/pio
> writes to trigger things.
>
I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:08                       ` Gregory Haskins
@ 2009-05-07 18:12                         ` Avi Kivity
  2009-05-07 18:16                           ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 18:12 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
>> What do you think of my mmio hypercall?  That will speed up all mmio
>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>> writes to trigger things.
>>
>>     
> I like it!
>
> Bigger question is what kind of work goes into making mmio a pv_op (or
> is this already done)?
>
>   

Looks like it isn't there.  But it isn't any different than set_pte - 
convert a write into a hypercall.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:12                         ` Avi Kivity
@ 2009-05-07 18:16                           ` Gregory Haskins
  2009-05-07 18:24                             ` Avi Kivity
  2009-05-07 20:00                             ` Arnd Bergmann
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 18:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> What do you think of my mmio hypercall?  That will speed up all mmio
>>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>>> writes to trigger things.
>>>
>>>     
>> I like it!
>>
>> Bigger question is what kind of work goes into making mmio a pv_op (or
>> is this already done)?
>>
>>   
>
> Looks like it isn't there.  But it isn't any different than set_pte -
> convert a write into a hypercall.
>
>

I guess technically mmio can just be a simple access of the page, which
would be problematic to trap locally without a PF.  However, it seems
that most mmio passes through an ioread()/iowrite() call, so this is
perhaps the hook point.  If we set the stake in the ground that mmios
which go through some other mechanism (like PFs) simply hit the "slow
path" and are an acceptable casualty, I think we can make that work.

Thoughts?

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:16                           ` Gregory Haskins
@ 2009-05-07 18:24                             ` Avi Kivity
  2009-05-07 18:37                               ` Gregory Haskins
  2009-05-07 20:00                             ` Arnd Bergmann
  1 sibling, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 18:24 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> that go through some other mechanism like PFs can just hit the "slow
> path" are an acceptable casualty, I think we can make that work.
>   

That's my thinking exactly.

Note we can cheat further.  kvm already has a "coalesced mmio" feature 
where side-effect-free mmios are collected in the kernel and passed to 
userspace only when some other significant event happens.  We could pass 
those addresses to the guest and let it queue those writes itself, 
avoiding the hypercall completely.

Though it's probably pointless: if the guest is paravirtualized enough 
to have the mmio hypercall, then it shouldn't be using e1000.
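For reference, the guest-side queueing hinted at above could look roughly like the sketch below: side-effect-free writes go into a small shared ring instead of exiting one at a time, and are replayed when the next real exit happens anyway.  All names here are illustrative; today's coalesced-mmio ring lives entirely on the host side.

#include <linux/types.h>

struct deferred_mmio {
	u64 phys_addr;
	u32 data;
	u32 len;
};

#define DEFERRED_RING_SIZE 64

struct deferred_mmio_ring {
	u32 head;				/* written by the guest */
	u32 tail;				/* consumed on the next exit */
	struct deferred_mmio slot[DEFERRED_RING_SIZE];
};

static struct deferred_mmio_ring *ring;		/* page shared with the host */

static bool defer_writel(u32 val, u64 phys_addr)
{
	u32 next = (ring->head + 1) % DEFERRED_RING_SIZE;

	if (next == ring->tail)
		return false;		/* ring full: caller does a normal MMIO */

	ring->slot[ring->head] = (struct deferred_mmio) {
		.phys_addr	= phys_addr,
		.data		= val,
		.len		= 4,
	};
	smp_wmb();			/* publish the slot before moving head */
	ring->head = next;
	return true;
}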

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:24                             ` Avi Kivity
@ 2009-05-07 18:37                               ` Gregory Haskins
  2009-05-07 19:00                                 ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 18:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1593 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>   
>
> That's my thinking exactly.

Cool, I will code this up and submit it.  While I'm at it, I'll run it
through the "nullio" wringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.

>
> Note we can cheat further.  kvm already has a "coalesced mmio" feature
> where side-effect-free mmios are collected in the kernel and passed to
> userspace only when some other significant event happens.  We could
> pass those addresses to the guest and let it queue those writes
> itself, avoiding the hypercall completely.
>
> Though it's probably pointless: if the guest is paravirtualized enough
> to have the mmio hypercall, then it shouldn't be using e1000.

Yeah...plus, at least for my vbus purposes, all my guest->host
transitions are explicitly there to cause side-effects, or I wouldn't be
doing them in the first place ;)  I suspect virtio-pci is exactly the
same, i.e. the coalescing has already been done at a higher layer for
platforms running "PV" code.

Still a cool feature, tho.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:37                               ` Gregory Haskins
@ 2009-05-07 19:00                                 ` Avi Kivity
  2009-05-07 19:05                                   ` Gregory Haskins
  2009-05-07 19:07                                   ` Chris Wright
  0 siblings, 2 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 19:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> I guess technically mmio can just be a simple access of the page which
>>> would be problematic to trap locally without a PF.  However it seems
>>> that most mmio always passes through a ioread()/iowrite() call so this
>>> is perhaps the hook point.  If we set the stake in the ground that mmios
>>> that go through some other mechanism like PFs can just hit the "slow
>>> path" are an acceptable casualty, I think we can make that work.
>>>   
>>>       
>> That's my thinking exactly.
>>     
>
> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> through the "nullio" ringer, too. ;)  It would be cool to see the
> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> will not be the case.
>   

Don't - it's broken.  It will also catch device assignment mmio and 
hypercall them.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:00                                 ` Avi Kivity
@ 2009-05-07 19:05                                   ` Gregory Haskins
  2009-05-07 19:43                                     ` Avi Kivity
  2009-05-07 19:07                                   ` Chris Wright
  1 sibling, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 19:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> I guess technically mmio can just be a simple access of the page which
>>>> would be problematic to trap locally without a PF.  However it seems
>>>> that most mmio always passes through a ioread()/iowrite() call so this
>>>> is perhaps the hook point.  If we set the stake in the ground that
>>>> mmios
>>>> that go through some other mechanism like PFs can just hit the "slow
>>>> path" are an acceptable casualty, I think we can make that work.
>>>>         
>>> That's my thinking exactly.
>>>     
>>
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>>   
>
> Don't - it's broken.  It will also catch device assignment mmio and
> hypercall them.
>
Ah.  Crap.

Would you be amenable to my continuing along with the dynhc() approach then?

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:00                                 ` Avi Kivity
  2009-05-07 19:05                                   ` Gregory Haskins
@ 2009-05-07 19:07                                   ` Chris Wright
  2009-05-07 19:12                                     ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-07 19:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Gregory Haskins, Chris Wright, linux-kernel,
	kvm, Anthony Liguori

* Avi Kivity (avi@redhat.com) wrote:
> Gregory Haskins wrote:
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>
> Don't - it's broken.  It will also catch device assignment mmio and  
> hypercall them.

Not necessarily.  It just needs to be creative w/ IO_COND

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:07                                   ` Chris Wright
@ 2009-05-07 19:12                                     ` Gregory Haskins
  2009-05-07 19:21                                       ` Chris Wright
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 19:12 UTC (permalink / raw)
  To: Chris Wright
  Cc: Avi Kivity, Gregory Haskins, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

Chris Wright wrote:
> * Avi Kivity (avi@redhat.com) wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>>> through the "nullio" ringer, too. ;)  It would be cool to see the
>>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>>> will not be the case.
>>>       
>> Don't - it's broken.  It will also catch device assignment mmio and  
>> hypercall them.
>>     
>
> Not necessarily.  It just needs to be creative w/ IO_COND
>   

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:12                                     ` Gregory Haskins
@ 2009-05-07 19:21                                       ` Chris Wright
  2009-05-07 19:26                                         ` Avi Kivity
  2009-05-07 19:29                                         ` Gregory Haskins
  0 siblings, 2 replies; 115+ messages in thread
From: Chris Wright @ 2009-05-07 19:21 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Avi Kivity, Gregory Haskins, linux-kernel, kvm,
	Anthony Liguori

* Gregory Haskins (ghaskins@novell.com) wrote:
> Chris Wright wrote:
> > * Avi Kivity (avi@redhat.com) wrote:
> >> Gregory Haskins wrote:
> >>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> >>> through the "nullio" ringer, too. ;)  It would be cool to see the
> >>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> >>> will not be the case.
> >>>       
> >> Don't - it's broken.  It will also catch device assignment mmio and  
> >> hypercall them.
> >
> > Not necessarily.  It just needs to be creative w/ IO_COND
> 
> Hi Chris,
>    Could you elaborate?  How would you know which pages to hypercall and
> which to let PF?

Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:21                                       ` Chris Wright
@ 2009-05-07 19:26                                         ` Avi Kivity
  2009-05-07 19:44                                           ` Avi Kivity
  2009-05-07 19:29                                         ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 19:26 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Gregory Haskins, linux-kernel, kvm, Anthony Liguori

Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Chris Wright wrote:
>>     
>>> * Avi Kivity (avi@redhat.com) wrote:
>>>       
>>>> Gregory Haskins wrote:
>>>>         
>>>>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>>>>> through the "nullio" ringer, too. ;)  It would be cool to see the
>>>>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>>>>> will not be the case.
>>>>>       
>>>>>           
>>>> Don't - it's broken.  It will also catch device assignment mmio and  
>>>> hypercall them.
>>>>         
>>> Not necessarily.  It just needs to be creative w/ IO_COND
>>>       
>> Hi Chris,
>>    Could you elaborate?  How would you know which pages to hypercall and
>> which to let PF?
>>     
>
> Was just thinking of some ugly mangling of the addr (I'm not entirely
> sure what would work best).
>   

I think we just past the "too complicated" threshold.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:21                                       ` Chris Wright
  2009-05-07 19:26                                         ` Avi Kivity
@ 2009-05-07 19:29                                         ` Gregory Haskins
  2009-05-07 20:25                                           ` Chris Wright
  2009-05-07 20:54                                           ` Arnd Bergmann
  1 sibling, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 19:29 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]

Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Chris Wright wrote:
>>     
>>> * Avi Kivity (avi@redhat.com) wrote:
>>>       
>>>> Gregory Haskins wrote:
>>>>         
>>>>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>>>>> through the "nullio" ringer, too. ;)  It would be cool to see the
>>>>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>>>>> will not be the case.
>>>>>       
>>>>>           
>>>> Don't - it's broken.  It will also catch device assignment mmio and  
>>>> hypercall them.
>>>>         
>>> Not necessarily.  It just needs to be creative w/ IO_COND
>>>       
>> Hi Chris,
>>    Could you elaborate?  How would you know which pages to hypercall and
>> which to let PF?
>>     
>
> Was just thinking of some ugly mangling of the addr (I'm not entirely
> sure what would work best).
>   
Right, I get the part about flagging the address and then keying off
that flag in IO_COND (like we do for PIO vs MMIO).

What I am not clear on is how you would know to flag the address to
begin with.

Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
would never be a real device).  This means we route any io requests from
virtio-pci through pv_io_ops->mmio(), but not unflagged addresses.  This
is not as slick as boosting *everyone's* mmio speed as Avi's original
idea would have, but it is perhaps a good tradeoff between the entirely
new namespace created by my original dynhc() proposal and leaving them
all PF based.

This way, it's just like using my dynhc() proposal except the mmio-addr
is the substitute address-token (instead of the dynhc-vector).
Additionally, if you do not PV the kernel, the IO_COND/pv_io_op is
ignored and it just slow-paths through the PF as it does today.  Dynhc()
would be dependent on pv_ops.
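A rough sketch of that dispatch, under the assumption that the iomap cookie handed to a PV driver simply carries a flag bit.  PVIO_FLAG, the pv_io_ops structure and its mmio_write32 hook are all hypothetical names; the real IO_COND today only distinguishes PIO tokens from MMIO pointers.

#include <linux/io.h>
#include <linux/types.h>
#include <linux/bitops.h>

#define PVIO_FLAG	(1UL << (BITS_PER_LONG - 1))	/* illustrative encoding */

struct pv_io_ops {
	void (*mmio_write32)(u32 val, void __iomem *addr);
};

extern struct pv_io_ops pv_io_ops;	/* filled in only on a PV-aware host */

static void pv_iowrite32(u32 val, void __iomem *addr)
{
	unsigned long token = (unsigned long)addr;

	if ((token & PVIO_FLAG) && pv_io_ops.mmio_write32)
		/* flagged by the PV driver at iomap time: hypercall path */
		pv_io_ops.mmio_write32(val,
				(void __iomem *)(token & ~PVIO_FLAG));
	else
		/* unflagged (possibly pass-through): trap-and-emulate as today */
		iowrite32(val, addr);
}

Whether the flag lives in the cookie or the driver calls a distinct pv_iowrite*() entry point directly is a detail; the point is that only addresses a PV driver explicitly opted in ever take the hypercall path.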

Thoughts?

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:05                                   ` Gregory Haskins
@ 2009-05-07 19:43                                     ` Avi Kivity
  2009-05-07 20:07                                       ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 19:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
>> Don't - it's broken.  It will also catch device assignment mmio and
>> hypercall them.
>>
>>     
> Ah.  Crap.
>
> Would you be conducive if I continue along with the dynhc() approach then?
>   

Oh yes.  But don't call it dynhc - like Chris says it's the wrong semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT 
or something along these lines.  You won't be able to pass any data, but 
that's fine.  Registers are saved to memory anyway.

And btw, given that eventfd and the underlying infrastructure are so 
flexible, it's probably better to go back to your original "irqfd gets 
fd from userspace" just to be consistent everywhere.

(no, I'm not deliberately making you rewrite that patch again and 
again... it's going to be a key piece of infrastructure so I want to get 
it right)

Just to make sure we have everything plumbed down, here's how I see 
things working out (using qemu and virtio, use sed to taste):

1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx 
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available, 
data-not-available, and kick ports of virtio-net-server
5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space 
since this is complicated (shared level triggered interrupts)
7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the 
notify fds as notifyfd
10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the interrupt.  
One day the kvm developers get their act together and change it to 
inject the interrupt directly when the irqfd is signalled (which could 
be from the net softirq or somewhere similarly nasty).

virtio-net-server wires notifyfd according to its liking.  It may 
schedule a thread, or it may execute directly.

And they all lived happily ever after.
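Since every arrow in that flow is an eventfd, a quick userspace refresher may help.  This much is plain eventfd(2) and nothing KVM-specific; the irqfd/notifyfd hooks that attach such an fd to a VM are exactly what is being designed in this thread.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	int fd = eventfd(0, 0);
	uint64_t val = 1;

	/* producer side: what the guest kick (notifyfd) or kvm (irqfd) does */
	write(fd, &val, sizeof(val));
	write(fd, &val, sizeof(val));

	/* consumer side: what virtio-net-server does when it gets woken up */
	read(fd, &val, sizeof(val));	/* returns the count (2) and clears it */
	printf("pending kicks: %llu\n", (unsigned long long)val);

	close(fd);
	return 0;
}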

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:26                                         ` Avi Kivity
@ 2009-05-07 19:44                                           ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 19:44 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Gregory Haskins, linux-kernel, kvm, Anthony Liguori

Avi Kivity wrote:
>
> I think we just past the "too complicated" threshold.
>

And the "can't spel" threshold in the same sentence.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 18:16                           ` Gregory Haskins
  2009-05-07 18:24                             ` Avi Kivity
@ 2009-05-07 20:00                             ` Arnd Bergmann
  2009-05-07 20:31                               ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-07 20:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm,
	Anthony Liguori

On Thursday 07 May 2009, Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> that go through some other mechanism like PFs can just hit the "slow
> path" are an acceptable casualty, I think we can make that work.
> 
> Thoughts?

An mmio that goes through a PF is a bug, it's certainly broken on
a number of platforms, so performance should not be an issue there.

Note that there are four commonly used interface classes for PIO/MMIO:

1. readl/writel: little-endian MMIO
2. inl/outl: little-endian PIO
3. ioread32/iowrite32: converged little-endian PIO/MMIO
4. __raw_readl/__raw_writel: native-endian MMIO without checks

You don't need to worry about the __raw_* stuff, as this should never
be used in device drivers.

As a simplification, you could mandate that all drivers that want to
use this get converted to the ioread/iowrite class of interfaces and
leave the others slow.
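As a concrete example of that third class: a driver that sticks to pci_iomap() plus ioread32()/iowrite32() never needs to know whether BAR 0 is PIO or MMIO, which is what makes it such a convenient single hook point.  REG_CTRL and REG_STATUS below are registers of an imaginary device.

#include <linux/pci.h>
#include <linux/io.h>

#define REG_CTRL	0x00
#define REG_STATUS	0x04

static int toy_start_device(struct pci_dev *pdev)
{
	void __iomem *base = pci_iomap(pdev, 0, 0);	/* BAR 0, full length */
	u32 status;

	if (!base)
		return -ENOMEM;

	iowrite32(0x1, base + REG_CTRL);	/* works for PIO and MMIO BARs */
	status = ioread32(base + REG_STATUS);

	pci_iounmap(pdev, base);
	return status ? 0 : -EIO;
}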

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:43                                     ` Avi Kivity
@ 2009-05-07 20:07                                       ` Gregory Haskins
  2009-05-07 20:15                                         ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 20:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Don't - it's broken.  It will also catch device assignment mmio and
>>> hypercall them.
>>>
>>>     
>> Ah.  Crap.
>>
>> Would you be conducive if I continue along with the dynhc() approach
>> then?
>>   
>
> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
> semantic.
>
> Since we want to connect it to an eventfd, call it HC_NOTIFY or
> HC_EVENT or something along these lines.  You won't be able to pass
> any data, but that's fine.  Registers are saved to memory anyway.
Ok, but how would you access the registers, since you would presumably
only be getting a waitq::func callback on the eventfd?  Or were you
saying that more data, if required, is saved in a side-band memory
location?  I can see the latter working.  I can't wrap my head around
the former.

>
> And btw, given that eventfd and the underlying infrastructure are so
> flexible, it's probably better to go back to your original "irqfd gets
> fd from userspace" just to be consistent everywhere.
>
> (no, I'm not deliberately making you rewrite that patch again and
> again... it's going to be a key piece of infrastructure so I want to
> get it right)

Ok, np.  Actually, now that Davide showed me the waitq::func trick, the
fd technically doesn't even need to be an eventfd per se.  We can just
plain-old "fget()" it and attach via the f_ops->poll() as I do in v5.
I'll submit this later today.

>
>
> Just to make sure we have everything plumbed down, here's how I see
> things working out (using qemu and virtio, use sed to taste):
>
> 1. qemu starts up, sets up the VM
> 2. qemu creates virtio-net-server
> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
> ring, one set for rx ring)
> 4. qemu connects the six eventfd to the data-available,
> data-not-available, and kick ports of virtio-net-server
> 5. the guest starts up and configures virtio-net in pci pin mode
> 6. qemu notices and decides it will manage interrupts in user space
> since this is complicated (shared level triggered interrupts)
> 7. the guest OS boots, loads device driver
> 8. device driver switches virtio-net to msix mode
> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
> notify fds as notifyfd
> 10. look ma, no hands.
>
> Under the hood, the following takes place.
>
> kvm wires the irqfds to schedule a work item which fires the
> interrupt.  One day the kvm developers get their act together and
> change it to inject the interrupt directly when the irqfd is signalled
> (which could be from the net softirq or somewhere similarly nasty).
>
> virtio-net-server wires notifyfd according to its liking.  It may
> schedule a thread, or it may execute directly.
>
> And they all lived happily ever after.

Ack.  I hope when it's all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)  But
even if not, this should provide enough plumbing that we can all coexist
together peacefully.

Thanks,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:07                                       ` Gregory Haskins
@ 2009-05-07 20:15                                         ` Avi Kivity
  2009-05-07 20:26                                           ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-07 20:15 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:
>> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
>> semantic.
>>
>> Since we want to connect it to an eventfd, call it HC_NOTIFY or
>> HC_EVENT or something along these lines.  You won't be able to pass
>> any data, but that's fine.  Registers are saved to memory anyway.
>>     
> Ok, but how would you access the registers since you would presumably
> only be getting a waitq::func callback on the eventfd.  Or were you
> saying that more data, if required, is saved in a side-band memory
> location?  I can see the latter working. 

Yeah.  You basically have that side-band in vbus shmem (or the virtio ring).

>  I can't wrap my head around
> the former.
>   

I only meant that registers aren't faster than memory, since they are 
just another memory location.

In fact registers are accessed through a function call (not that that 
takes any time these days).


>> Just to make sure we have everything plumbed down, here's how I see
>> things working out (using qemu and virtio, use sed to taste):
>>
>> 1. qemu starts up, sets up the VM
>> 2. qemu creates virtio-net-server
>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
>> ring, one set for rx ring)
>> 4. qemu connects the six eventfd to the data-available,
>> data-not-available, and kick ports of virtio-net-server
>> 5. the guest starts up and configures virtio-net in pci pin mode
>> 6. qemu notices and decides it will manage interrupts in user space
>> since this is complicated (shared level triggered interrupts)
>> 7. the guest OS boots, loads device driver
>> 8. device driver switches virtio-net to msix mode
>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
>> notify fds as notifyfd
>> 10. look ma, no hands.
>>
>> Under the hood, the following takes place.
>>
>> kvm wires the irqfds to schedule a work item which fires the
>> interrupt.  One day the kvm developers get their act together and
>> change it to inject the interrupt directly when the irqfd is signalled
>> (which could be from the net softirq or somewhere similarly nasty).
>>
>> virtio-net-server wires notifyfd according to its liking.  It may
>> schedule a thread, or it may execute directly.
>>
>> And they all lived happily ever after.
>>     
>
> Ack.  I hope when its all said and done I can convince you that the
> framework to code up those virtio backends in the kernel is vbus ;)

If vbus doesn't bring significant performance advantages, I'll prefer 
virtio because of existing investment.

>   But
> even if not, this should provide enough plumbing that we can all coexist
> together peacefully.
>   

Yes, vbus and virtio can compete on their merits without bias from some 
maintainer getting in the way.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:29                                         ` Gregory Haskins
@ 2009-05-07 20:25                                           ` Chris Wright
  2009-05-07 20:34                                             ` Gregory Haskins
  2009-05-07 20:54                                           ` Arnd Bergmann
  1 sibling, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-07 20:25 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

* Gregory Haskins (gregory.haskins@gmail.com) wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
would get called on IORESOURCE_PVIO type.  This isn't really transparent
though (only virtio devices basically), kind of like you're saying below.

> Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
> would never be a real device).  This means we route any io requests from
> virtio-pci though pv_io_ops->mmio(), but not unflagged addresses.  This
> is not as slick as boosting *everyones* mmio speed as Avi's original
> idea would have, but it is perhaps a good tradeoff between the entirely
> new namespace created by my original dynhc() proposal and leaving them
> all PF based.
>
> This way, its just like using my dynhc() proposal except the mmio-addr
> is the substitute address-token (instead of the dynhc-vector). 
> Additionally, if you do not PV the kernel the IO_COND/pv_io_op is
> ignored and it just slow-paths through the PF as it does today.  Dynhc()
> would be dependent  on pv_ops.
> 
> Thoughts?

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:15                                         ` Avi Kivity
@ 2009-05-07 20:26                                           ` Gregory Haskins
  2009-05-08  8:35                                             ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 20:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 3230 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
>>> semantic.
>>>
>>> Since we want to connect it to an eventfd, call it HC_NOTIFY or
>>> HC_EVENT or something along these lines.  You won't be able to pass
>>> any data, but that's fine.  Registers are saved to memory anyway.
>>>     
>> Ok, but how would you access the registers since you would presumably
>> only be getting a waitq::func callback on the eventfd.  Or were you
>> saying that more data, if required, is saved in a side-band memory
>> location?  I can see the latter working. 
>
> Yeah.  You basically have that side-band in vbus shmem (or the virtio
> ring).

Ok, got it.
>
>>  I can't wrap my head around
>> the former.
>>   
>
> I only meant that registers aren't faster than memory, since they are
> just another memory location.
>
> In fact registers are accessed through a function call (not that that
> takes any time these days).
>
>
>>> Just to make sure we have everything plumbed down, here's how I see
>>> things working out (using qemu and virtio, use sed to taste):
>>>
>>> 1. qemu starts up, sets up the VM
>>> 2. qemu creates virtio-net-server
>>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
>>> ring, one set for rx ring)
>>> 4. qemu connects the six eventfd to the data-available,
>>> data-not-available, and kick ports of virtio-net-server
>>> 5. the guest starts up and configures virtio-net in pci pin mode
>>> 6. qemu notices and decides it will manage interrupts in user space
>>> since this is complicated (shared level triggered interrupts)
>>> 7. the guest OS boots, loads device driver
>>> 8. device driver switches virtio-net to msix mode
>>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
>>> notify fds as notifyfd
>>> 10. look ma, no hands.
>>>
>>> Under the hood, the following takes place.
>>>
>>> kvm wires the irqfds to schedule a work item which fires the
>>> interrupt.  One day the kvm developers get their act together and
>>> change it to inject the interrupt directly when the irqfd is signalled
>>> (which could be from the net softirq or somewhere similarly nasty).
>>>
>>> virtio-net-server wires notifyfd according to its liking.  It may
>>> schedule a thread, or it may execute directly.
>>>
>>> And they all lived happily ever after.
>>>     
>>
>> Ack.  I hope when its all said and done I can convince you that the
>> framework to code up those virtio backends in the kernel is vbus ;)
>
> If vbus doesn't bring significant performance advantages, I'll prefer
> virtio because of existing investment.

Just to clarify: vbus is just the container/framework for the in-kernel
models.  You can implement and deploy virtio devices inside the
container (though I haven't had a chance to sit down and implement one
yet).  Note that I did publish a virtio transport in the last few series
to demonstrate how that might work, so it's just ripe for the picking if
someone is so inclined.

So really the question is whether you implement the in-kernel virtio
backend in vbus, in some other framework, or just do it standalone.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:00                             ` Arnd Bergmann
@ 2009-05-07 20:31                               ` Gregory Haskins
  2009-05-07 20:42                                 ` Arnd Bergmann
  2009-05-07 20:50                                 ` Chris Wright
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 20:31 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 2369 bytes --]

Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>
>> Thoughts?
>>     
>
> An mmio that goes through a PF is a bug, it's certainly broken on
> a number of platforms, so performance should not be an issue there.
>   

This may be my own ignorance, but I thought a VMEXIT of type "PF" was
how MMIO worked in VT/SVM.  I didn't mean to imply that either the guest
or the host took a traditional PF exception in their respective IDT, if
that is what you thought I meant here.  Rather, the mmio region is
unmapped in the guest MMU, access causes a VMEXIT to host-side KVM of
type PF, and the host-side code then consults the guest page-table to
see if it's an MMIO or not.  I could very well be mistaken, as I have
only a cursory understanding of what happens in KVM today with this path.

After posting my numbers today, what I *can* tell you definitively is
that it's significantly slower to VMEXIT via MMIO.  I guess I do not
really know the reason for sure. :)
> Note that are four commonly used interface classes for PIO/MMIO:
>
> 1. readl/writel: little-endian MMIO
> 2. inl/outl: little-endian PIO
> 3. ioread32/iowrite32: converged little-endian PIO/MMIO
> 4. __raw_readl/__raw_writel: native-endian MMIO without checks
>
> You don't need to worry about the __raw_* stuff, as this should never
> be used in device drivers.
>
> As a simplification, you could mandate that all drivers that want to
> use this get converted to the ioread/iowrite class of interfaces and
> leave the others slow.
>   

I guess the problem that was later pointed out is that we cannot discern
which devices might be pass-through and therefore should not be
revectored through a HC.  But I am even less knowledgeable about how
pass-through works than I am about the MMIO traps, so I might be
completely off here.

In any case, thank you kindly for the suggestions.

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:25                                           ` Chris Wright
@ 2009-05-07 20:34                                             ` Gregory Haskins
  0 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 20:34 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Avi Kivity, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 412 bytes --]

Chris Wright wrote:
> * Gregory Haskins (gregory.haskins@gmail.com) wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>>     
>
> That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
> would get called on IORESOURCE_PVIO type.

Yeah, this wasn't clear at the time, but I totally get what you meant
now in retrospect.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:31                               ` Gregory Haskins
@ 2009-05-07 20:42                                 ` Arnd Bergmann
  2009-05-07 20:47                                   ` Arnd Bergmann
  2009-05-07 20:50                                 ` Chris Wright
  1 sibling, 1 reply; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-07 20:42 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm,
	Anthony Liguori

On Thursday 07 May 2009, Gregory Haskins wrote:
> Arnd Bergmann wrote:
> > An mmio that goes through a PF is a bug, it's certainly broken on
> > a number of platforms, so performance should not be an issue there.
> >   
> 
> This may be my own ignorance, but I thought a VMEXIT of type "PF" was
> how MMIO worked in VT/SVM. 

You are right that all MMIOs (and PIO on most non-x86 architectures)
are handled this way in the end. What I meant was that an MMIO that
traps because of a simple pointer dereference as in __raw_writel
is a bug, while any actual writel() call could be diverted to
do an hcall and therefore not cause a PF once the infrastructure
is there.

> I guess the problem that was later pointed out is that we cannot discern
> which devices might be pass-through and therefore should not be
> revectored through a HC.  But I am even less knowledgeable about how
> pass-through works than I am about the MMIO traps, so I might be
> completely off here.

An easy way to deal with the pass-through case might be to actually use
__raw_writel there. In guest-to-guest communication, the two sides are
known to have the same endianness (I assume) and you can still add the
appropriate smp_mb() and such into the code.

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:42                                 ` Arnd Bergmann
@ 2009-05-07 20:47                                   ` Arnd Bergmann
  0 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-07 20:47 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Gregory Haskins, Chris Wright, linux-kernel, kvm,
	Anthony Liguori

On Thursday 07 May 2009, Arnd Bergmann wrote:
> An easy way to deal with the pass-through case might be to actually use
> __raw_writel there. In guest-to-guest communication, the two sides are
> known to have the same endianess (I assume) and you can still add the
> appropriate smp_mb() and such into the code.

Ok, that was nonsense. I thought you meant pass-through to a memory range
on the host that is potentially shared with other processes or guests.
For pass-through to a real device, it obviously would not work.

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:31                               ` Gregory Haskins
  2009-05-07 20:42                                 ` Arnd Bergmann
@ 2009-05-07 20:50                                 ` Chris Wright
  1 sibling, 0 replies; 115+ messages in thread
From: Chris Wright @ 2009-05-07 20:50 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Arnd Bergmann, Avi Kivity, Gregory Haskins, Chris Wright,
	linux-kernel, kvm, Anthony Liguori

* Gregory Haskins (gregory.haskins@gmail.com) wrote:
> After posting my numbers today, what I *can* tell you definitively that
> its significantly slower to VMEXIT via MMIO.  I guess I do not really
> know the reason for sure. :)

there's certainly more work, including insn decoding

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 19:29                                         ` Gregory Haskins
  2009-05-07 20:25                                           ` Chris Wright
@ 2009-05-07 20:54                                           ` Arnd Bergmann
  2009-05-07 21:13                                             ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-07 20:54 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

On Thursday 07 May 2009, Gregory Haskins wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

pci_iomap could look at the bus device that the PCI function sits on.
If it detects a PCI bridge that has a certain property (config space
setting, vendor/device ID, ...), it assumes that the device itself
will be emulated and it should set the address flag for IO_COND.

This implies that all pass-through devices need to be on a different
PCI bridge from the emulated devices, which should be fairly
straightforward to enforce.
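A minimal sketch of that detection, assuming the emulated devices sit behind a bridge with a recognizable ID.  The vendor/device values and the PVIO_FLAG tagging (same illustrative flag as in the earlier sketch) are placeholders; a config-space capability on the bridge would serve equally well.

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/bitops.h>

#define PVIO_FLAG		(1UL << (BITS_PER_LONG - 1))	/* illustrative */
#define PV_BRIDGE_VENDOR	0x1af4		/* placeholder vendor ID */
#define PV_BRIDGE_DEVICE	0xffff		/* placeholder device ID */

static bool behind_pv_bridge(struct pci_dev *dev)
{
	struct pci_dev *bridge = dev->bus->self;	/* NULL on the root bus */

	return bridge &&
	       bridge->vendor == PV_BRIDGE_VENDOR &&
	       bridge->device == PV_BRIDGE_DEVICE;
}

static void __iomem *pv_pci_iomap(struct pci_dev *dev, int bar,
				  unsigned long maxlen)
{
	void __iomem *addr = pci_iomap(dev, bar, maxlen);

	if (addr && behind_pv_bridge(dev))
		addr = (void __iomem *)((unsigned long)addr | PVIO_FLAG);

	return addr;
}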

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:54                                           ` Arnd Bergmann
@ 2009-05-07 21:13                                             ` Gregory Haskins
  2009-05-07 21:57                                               ` Chris Wright
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-07 21:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 902 bytes --]

Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>>     
>
> pci_iomap could look at the bus device that the PCI function sits on.
> If it detects a PCI bridge that has a certain property (config space
> setting, vendor/device ID, ...), it assumes that the device itself
> will be emulated and it should set the address flag for IO_COND.
>
> This implies that all pass-through devices need to be on a different
> PCI bridge from the emulated devices, which should be fairly
> straightforward to enforce.
>   

That's actually a pretty good idea.

Chris, is the issue with non-ioread/iowrite access of a mangled pointer
still a problem here?  I would think so, but I am a bit fuzzy on whether
there is still an issue of non-wrapped MMIO ever occurring.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 21:13                                             ` Gregory Haskins
@ 2009-05-07 21:57                                               ` Chris Wright
  2009-05-07 22:11                                                 ` Arnd Bergmann
  0 siblings, 1 reply; 115+ messages in thread
From: Chris Wright @ 2009-05-07 21:57 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Arnd Bergmann, Chris Wright, Gregory Haskins, Avi Kivity,
	linux-kernel, kvm, Anthony Liguori

* Gregory Haskins (gregory.haskins@gmail.com) wrote:
> Arnd Bergmann wrote:
> > pci_iomap could look at the bus device that the PCI function sits on.
> > If it detects a PCI bridge that has a certain property (config space
> > setting, vendor/device ID, ...), it assumes that the device itself
> > will be emulated and it should set the address flag for IO_COND.
> >
> > This implies that all pass-through devices need to be on a different
> > PCI bridge from the emulated devices, which should be fairly
> > straightforward to enforce.

Hmm, this gets to the grey area of the ABI.  I think this would mean an
upgrade of the host would suddenly break when the mgmt tool does:

(qemu) pci_add pci_addr=0:6 host host=01:10.0

> Thats actually a pretty good idea.
> 
> Chris, is that issue with the non ioread/iowrite access of a mangled
> pointer still an issue here?  I would think so, but I am a bit fuzzy on
> whether there is still an issue of non-wrapped MMIO ever occuring.

Arnd was saying it's a bug for other reasons, so perhaps it would work
out fine.

thanks,
-chris

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 21:57                                               ` Chris Wright
@ 2009-05-07 22:11                                                 ` Arnd Bergmann
  2009-05-08 22:33                                                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-07 22:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: Gregory Haskins, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori, Benjamin Herrenschmidt

On Thursday 07 May 2009, Chris Wright wrote:
> 
> > Chris, is that issue with the non ioread/iowrite access of a mangled
> > pointer still an issue here?  I would think so, but I am a bit fuzzy on
> > whether there is still an issue of non-wrapped MMIO ever occuring.
> 
> Arnd was saying it's a bug for other reasons, so perhaps it would work
> out fine.

Well, maybe. I only said that __raw_writel and pointer dereference is
bad, but not writel.

IIRC when we had that discussion about io-workarounds on powerpc,
the outcome was that passing an IORESOURCE_MEM resource into pci_iomap
must still result in something that can be passed into writel in addition
to iowrite32, while an IORESOURCE_IO resource may or may not be valid for
writel and/or outl.

Unfortunately, this means that either readl/writel needs to be adapted
in some way (e.g. the address also ioremapped to the mangled pointer)
or the mechanism will be limited to I/O space accesses.

Maybe BenH remembers the details better than me.

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 17:03                   ` Gregory Haskins
  2009-05-07 18:05                     ` Avi Kivity
@ 2009-05-07 23:35                     ` Marcelo Tosatti
  2009-05-07 23:43                       ` Marcelo Tosatti
                                         ` (3 more replies)
  1 sibling, 4 replies; 115+ messages in thread
From: Marcelo Tosatti @ 2009-05-07 23:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
> Chris Wright wrote:
> > * Gregory Haskins (ghaskins@novell.com) wrote:
> >   
> >> Chris Wright wrote:
> >>     
> >>> VF drivers can also have this issue (and typically use mmio).
> >>> I at least have a better idea what your proposal is, thanks for
> >>> explanation.  Are you able to demonstrate concrete benefit with it yet
> >>> (improved latency numbers for example)?
> >>>       
> >> I had a test-harness/numbers for this kind of thing, but its a bit
> >> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
> >> generate/post new numbers.
> >>     
> >
> > That would be useful, because I keep coming back to pio and shared
> > page(s) when think of why not to do this.  Seems I'm not alone in that.
> >
> > thanks,
> > -chris
> >   
> 
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
> 
> http://developer.novell.com/wiki/index.php/WhyHypercalls
> 
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
> 
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
> 
> Let me know if you have any questions.

Greg,

I think the comparison is not entirely fair. You're using
KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
(on Intel) to only one register read:

        nr = kvm_register_read(vcpu, VCPU_REGS_RAX);

Whereas in a real hypercall for (say) PIO you would need the address,
size, direction and data.

Also for PIO/MMIO you're adding this unoptimized lookup to the 
measurement:

        pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
        if (pio_dev) {
                kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
                complete_pio(vcpu); 
                return 1;
        }

Whereas for the hypercall measurement you don't. I believe a fair comparison
would be to have a shared guest/host memory area where you store guest/host
TSC values and then do, on the guest:

	rdtscll(&shared_area->guest_tsc);
	pio/mmio/hypercall
	... back to host
	rdtscll(&shared_area->host_tsc);

And then calculate the difference (minus the guest's TSC_OFFSET, of course)?
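Something like the following guest-side sketch would implement that.  The shared_area layout, the nullio port number and the host-side stamping are all assumed for the example; the idea is that the host handler writes host_tsc as the very first thing it does on the exit path.

#include <linux/types.h>
#include <asm/io.h>

struct exit_timing {
	u64 guest_tsc;		/* stamped by the guest, below */
	u64 host_tsc;		/* stamped by the host exit handler */
};

static struct exit_timing *shared_area;	/* one page registered with the host */

static inline u64 read_tsc(void)
{
	u32 lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((u64)hi << 32) | lo;
}

static u64 time_one_pio_exit(void)
{
	shared_area->guest_tsc = read_tsc();
	outl(0, 0xff00);		/* hypothetical nullio port */

	/* the host handler has filled in host_tsc by the time we are back */
	return shared_area->host_tsc - shared_area->guest_tsc;
	/* caller still corrects for the guest's TSC_OFFSET */
}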


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:35                     ` Marcelo Tosatti
@ 2009-05-07 23:43                       ` Marcelo Tosatti
  2009-05-08  3:17                         ` Gregory Haskins
  2009-05-08  7:55                         ` Avi Kivity
  2009-05-08  3:13                       ` Gregory Haskins
                                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 115+ messages in thread
From: Marcelo Tosatti @ 2009-05-07 23:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
> 
>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>         if (pio_dev) {
>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>                 complete_pio(vcpu); 
>                 return 1;
>         }
> 
> Whereas for hypercall measurement you don't. I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
> 
> 	rdtscll(&shared_area->guest_tsc);
> 	pio/mmio/hypercall
> 	... back to host
> 	rdtscll(&shared_area->host_tsc);
> 
> And then calculate the difference (minus guests TSC_OFFSET of course)?

Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
Core2 Xeon 5130 @2.00Ghz, 4GB RAM.

Also it would be interesting to see the MMIO comparison with EPT/NPT,
it probably sucks much less than what you're seeing.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:35                     ` Marcelo Tosatti
  2009-05-07 23:43                       ` Marcelo Tosatti
@ 2009-05-08  3:13                       ` Gregory Haskins
  2009-05-08  7:59                       ` Avi Kivity
  2009-05-08 14:15                       ` Gregory Haskins
  3 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08  3:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 4197 bytes --]

Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
>   
>> Chris Wright wrote:
>>     
>>> * Gregory Haskins (ghaskins@novell.com) wrote:
>>>   
>>>       
>>>> Chris Wright wrote:
>>>>     
>>>>         
>>>>> VF drivers can also have this issue (and typically use mmio).
>>>>> I at least have a better idea what your proposal is, thanks for
>>>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>>>> (improved latency numbers for example)?
>>>>>       
>>>>>           
>>>> I had a test-harness/numbers for this kind of thing, but its a bit
>>>> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
>>>> generate/post new numbers.
>>>>     
>>>>         
>>> That would be useful, because I keep coming back to pio and shared
>>> page(s) when think of why not to do this.  Seems I'm not alone in that.
>>>
>>> thanks,
>>> -chris
>>>   
>>>       
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>>     
>
> Greg,
>
> I think comparison is not entirely fair. You're using
> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> (on Intel) to only one register read:
>
>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>
> Whereas in a real hypercall for (say) PIO you would need the address,
> size, direction and data.
>   

Hi Marcelo,
   I'll have to respectfully disagree with you here.  What you are
proposing is actually a different test: a 4th type I would call  "PIO
over HC".  It is distinctly different than the existing MMIO, PIO, and
HC tests already present.

I assert that the current HC test remains valid because for pure
hypercalls, the "nr" *is* the address.  It identifies the function to be
executed (e.g. VAPIC_POLL_IRQ = null), just like the PIO address of my
nullio device identifies the function to be executed (i.e.
nullio_write() = null)

My argument is that the HC test emulates the "dynhc()" concept I have
been talking about, whereas the PIOoHC is more like the
pv_io_ops->iowrite approach.

That said, your 4th test type would actually be a very interesting
data-point to add to the suite (especially since we are still kicking
around the notion of doing something like this).  I will update the
patches.


> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
>
>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>         if (pio_dev) {
>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>                 complete_pio(vcpu); 
>                 return 1;
>         }
>
> Whereas for hypercall measurement you don't.

In theory they should both share about the same algorithmic complexity
in the decode-stage, but due to the possible optimization you mention
you may have a point.  I need to take some steps to ensure the HC path
isn't artificially simplified by GCC (like making the execute stage do
some trivial work like you mention below).

>  I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
>
> 	rdtscll(&shared_area->guest_tsc);
> 	pio/mmio/hypercall
> 	... back to host
> 	rdtscll(&shared_area->host_tsc);
>
> And then calculate the difference (minus guests TSC_OFFSET of course)?
>
>   
I'm not sure I need that much complexity.  I can probably just change
the test harness to generate an ioread32(), and have the functions
return the TSC value as a return parameter for all test types.  The
important thing is that we pick something extremely cheap (yet dynamic)
to compute so the execution time doesn't invalidate the measurement
granularity with a large constant.

Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:43                       ` Marcelo Tosatti
@ 2009-05-08  3:17                         ` Gregory Haskins
  2009-05-08  7:55                         ` Avi Kivity
  1 sibling, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08  3:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]

Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
>   
>> Also for PIO/MMIO you're adding this unoptimized lookup to the 
>> measurement:
>>
>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>         if (pio_dev) {
>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>                 complete_pio(vcpu); 
>>                 return 1;
>>         }
>>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>> 	rdtscll(&shared_area->guest_tsc);
>> 	pio/mmio/hypercall
>> 	... back to host
>> 	rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>>     
>
> Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
> Core2 Xeon 5130 @2.00Ghz, 4GB RAM.
>
> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> it probably sucks much less than what you're seeing.
>
>   

Agreed.  If you or someone on this thread has such a beast, please fire
up my test and post the numbers.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:43                       ` Marcelo Tosatti
  2009-05-08  3:17                         ` Gregory Haskins
@ 2009-05-08  7:55                         ` Avi Kivity
       [not found]                           ` <20090508103253.GC3011@amt.cnet>
  2009-05-08 14:35                           ` Marcelo Tosatti
  1 sibling, 2 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-08  7:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel,
	kvm, Anthony Liguori

Marcelo Tosatti wrote:
> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> it probably sucks much less than what you're seeing.
>   

Why would NPT improve mmio?  If anything, it would be worse, since the 
processor has to do the nested walk.

Of course, these are newer machines, so the absolute results as well as 
the difference will be smaller.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:35                     ` Marcelo Tosatti
  2009-05-07 23:43                       ` Marcelo Tosatti
  2009-05-08  3:13                       ` Gregory Haskins
@ 2009-05-08  7:59                       ` Avi Kivity
  2009-05-08 11:09                         ` Gregory Haskins
       [not found]                         ` <20090508104228.GD3011@amt.cnet>
  2009-05-08 14:15                       ` Gregory Haskins
  3 siblings, 2 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-08  7:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel,
	kvm, Anthony Liguori

Marcelo Tosatti wrote:
> I think comparison is not entirely fair. You're using
> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> (on Intel) to only one register read:
>
>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>
> Whereas in a real hypercall for (say) PIO you would need the address,
> size, direction and data.
>   

Well, that's probably one of the reasons pio is slower, as the cpu has 
to set these up, and the kernel has to read them.

> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
>
>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>         if (pio_dev) {
>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>                 complete_pio(vcpu); 
>                 return 1;
>         }
>   

Since there are only one or two elements in the list, I don't see how it 
could be optimized.

> Whereas for hypercall measurement you don't. I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
>
> 	rdtscll(&shared_area->guest_tsc);
> 	pio/mmio/hypercall
> 	... back to host
> 	rdtscll(&shared_area->host_tsc);
>
> And then calculate the difference (minus guests TSC_OFFSET of course)?
>   

I don't understand why you want host tsc?  We're interested in 
round-trip latency, so you want guest tsc all the time.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-06 16:03               ` Gregory Haskins
@ 2009-05-08  8:17                 ` Avi Kivity
  2009-05-08 15:20                   ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-08  8:17 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
> Anthony Liguori wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Today, there is no equivelent of a platform agnostic "iowrite32()" for
>>> hypercalls so the driver would look like the pseudocode above except
>>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
>>> is to allow the hypervisor to assign a dynamic vector to resources in
>>> the backend and convey this vector to the guest (such as in PCI
>>> config-space as mentioned in my example use-case).  The provides the
>>> "address negotiation" function that would normally be done for something
>>> like a pio port-address.   The hypervisor agnostic driver can then use
>>> this globally recognized address-token coupled with other device-private
>>> ABI parameters to communicate with the device.  This can all occur
>>> without the core hypervisor needing to understand the details beyond the
>>> addressing.
>>>   
>>>       
>> PCI already provide a hypervisor agnostic interface (via IO regions). 
>> You have a mechanism for devices to discover which regions they have
>> allocated and to request remappings.  It's supported by Linux and
>> Windows.  It works on the vast majority of architectures out there today.
>>
>> Why reinvent the wheel?
>>     
>
> I suspect the current wheel is square.  And the air is out.  Plus its
> pulling to the left when I accelerate, but to be fair that may be my
> alignment....

No, your wheel is slightly faster on the highway, but doesn't work at 
all off-road.

Consider nested virtualization where the host (H) runs a guest (G1) 
which is itself a hypervisor, running a guest (G2).  The host exposes a 
set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than 
creating a new virtio device and bridging it to one of V1..Vn, assigns 
virtio device V1 to guest G2, and prays.

Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it 
originated in G1 while in guest mode, so it injects it into G1.  G1 
examines the parameters but can't make any sense of them, so it returns 
an error to G2.

If this were done using mmio or pio, it would have just worked.  With 
pio, H would have reflected the pio into G1, G1 would have done the 
conversion from G2's port number into G1's port number and reissued the 
pio, finally trapped by H and used to issue the I/O.  With mmio, G1 
would have set up G2's page tables to point directly at the addresses 
set up by H, so we would actually have a direct G2->H path.  Of course 
we'd need an emulated iommu so all the memory references actually 
resolve to G2's context.

So the upshot is that hypercalls for devices must not be the primary 
method of communications; they're fine as an optimization, but we should 
always be able to fall back on something else.  We also need to figure 
out how G1 can stop V1 from advertising hypercall support.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 20:26                                           ` Gregory Haskins
@ 2009-05-08  8:35                                             ` Avi Kivity
  2009-05-08 11:29                                               ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-08  8:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

Gregory Haskins wrote:

    

>>> Ack.  I hope when its all said and done I can convince you that the
>>> framework to code up those virtio backends in the kernel is vbus ;)
>>>       
>> If vbus doesn't bring significant performance advantages, I'll prefer
>> virtio because of existing investment.
>>     
>
> Just to clarify: vbus is just the container/framework for the in-kernel
> models.  You can implement and deploy virtio devices inside the
> container (tho I haven't had a chance to sit down and implement one
> yet).  Note that I did publish a virtio transport in the last few series
> to demonstrate how that might work, so its just ripe for the picking if
> someone is so inclined.
>
>   

Yeah I keep getting confused over this.

> So really the question is whether you implement the in-kernel virtio
> backend in vbus, in some other framework, or just do it standalone.
>   

I prefer the standalone model.  Keep the glue in userspace.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08  7:59                       ` Avi Kivity
@ 2009-05-08 11:09                         ` Gregory Haskins
       [not found]                         ` <20090508104228.GD3011@amt.cnet>
  1 sibling, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 11:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Marcelo Tosatti, Chris Wright, Gregory Haskins, linux-kernel,
	kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 2899 bytes --]

Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> I think comparison is not entirely fair. You're using
>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>> (on Intel) to only one register read:
>>
>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>
>> Whereas in a real hypercall for (say) PIO you would need the address,
>> size, direction and data.
>>   
>
> Well, that's probably one of the reasons pio is slower, as the cpu has
> to set these up, and the kernel has to read them.

Right, that was the point I was trying to make.  It's real-world overhead
to measure how long it takes KVM to go round-trip in each of the
respective trap types.

>
>> Also for PIO/MMIO you're adding this unoptimized lookup to the
>> measurement:
>>
>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>         if (pio_dev) {
>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>                 complete_pio(vcpu);                 return 1;
>>         }
>>   
>
> Since there are only one or two elements in the list, I don't see how
> it could be optimized.

To Marcelo's point, I think he was more taking exception to the fact
that the HC path was potentially completely optimized out if GCC was
super-intuitive about the switch(nr) statement hitting the null vector. 
In theory, both the io_bus and the select(nr) are about equivalent in
algorithmic complexity (and depth, I should say) which is why I think in
general the test is "fair".  IOW it represents the real-world decode
cycle function for each transport.

However, if one side was artificially optimized simply due to the
triviality of my NULLIO test, that is not fair, and that is the point I
believe he was making.  In any case, I just wrote a new version of the
test which hopefully addresses this by forcing GCC to leave it as a more
real-world decode.  (FYI: I saw no difference.)  I will update the
tarball/wiki shortly.

>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>>     rdtscll(&shared_area->guest_tsc);
>>     pio/mmio/hypercall
>>     ... back to host
>>     rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>>   
>
> I don't understand why you want host tsc?  We're interested in
> round-trip latency, so you want guest tsc all the time.

Yeah, I agree.  My take is he was just trying to introduce a real
workload so GCC wouldn't do that potential "cheater decode" in the HC
path.  After thinking about it, however, I realized we could do that
with a simple "state++" operation, so the new test does this in each of
the various tests' "execute" cycles.  The timing calculation remains
unchanged.
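
As a rough sketch of that change (purely illustrative; the nullio_state
and nullio_execute names are made up, and only the "state++" idea comes
from the discussion):

        /* hypothetical state shared by all of the test types */
        static unsigned long nullio_state;

        /*
         * Trivial but dynamic "execute" stage used by the PIO, MMIO and
         * HC paths alike: enough work that GCC cannot fold the handler
         * away, yet cheap enough not to swamp the trap-cost measurement.
         */
        static unsigned long nullio_execute(void)
        {
                return ++nullio_state;
        }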

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08  8:35                                             ` Avi Kivity
@ 2009-05-08 11:29                                               ` Gregory Haskins
  0 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 11:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 2304 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>   
>>>> Ack.  I hope when its all said and done I can convince you that the
>>>> framework to code up those virtio backends in the kernel is vbus ;)
>>>>       
>>> If vbus doesn't bring significant performance advantages, I'll prefer
>>> virtio because of existing investment.
>>>     
>>
>> Just to clarify: vbus is just the container/framework for the in-kernel
>> models.  You can implement and deploy virtio devices inside the
>> container (tho I haven't had a chance to sit down and implement one
>> yet).  Note that I did publish a virtio transport in the last few series
>> to demonstrate how that might work, so its just ripe for the picking if
>> someone is so inclined.
>>
>>   
>
> Yeah I keep getting confused over this.
>
>> So really the question is whether you implement the in-kernel virtio
>> backend in vbus, in some other framework, or just do it standalone.
>>   
>
> I prefer the standalone model.  Keep the glue in userspace.

Just to keep the facts straight: "glue in userspace" and the standalone
model are independent variables.  E.g. you can have the glue in
userspace for vbus, too.  It's not written that way today for KVM, but
it's moving in that direction as we work through these subtopics like
irqfd, dynhc, etc.

What vbus buys you as a core technology is that you can write one
backend that works "everywhere" (you only need a glue layer for each
environment you want to support).  You might say "I can make my backends
work everywhere too", and to that I would say "by the time you get it to
work, you will have duplicated almost my exact effort on vbus" ;).  Of
course, you may also say "I don't care if it works anywhere else but
KVM", which is a perfectly valid (if not unfortunate) position to take.

I think the confusion point is possibly a result of the name "vbus". 
The vbus core isn't really a true bus in the traditional sense.  It's just
a host-side kernel-based container for these device models.  That is all
I am talking about here.  There is, of course, also an LDM "bus" for
rendering vbus devices in the guest as a function of the current
kvm-specific glue layer I've written.  Note that this glue layer could
render them as PCI in the future, TBD.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
       [not found]                           ` <20090508103253.GC3011@amt.cnet>
@ 2009-05-08 11:37                             ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 11:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel,
	kvm, Anthony Liguori

Marcelo Tosatti wrote:
>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>> it probably sucks much less than what you're seeing.
>>>   
>>>       
>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>> processor has to do the nested walk.
>>     
>
> I suppose the hardware is much more efficient than walk_addr? There's
> all this kmalloc, spinlock, etc overhead in the fault path.
>   

mmio still has to do a walk_addr, even with npt.  We don't take the mmu 
lock during walk_addr.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
       [not found]                         ` <20090508104228.GD3011@amt.cnet>
@ 2009-05-08 12:43                           ` Gregory Haskins
  2009-05-08 15:33                             ` Marcelo Tosatti
  2009-05-08 16:48                             ` Paul E. McKenney
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 12:43 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm,
	Anthony Liguori, paulmck

[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]

Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> I think comparison is not entirely fair. You're using
>>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>>> (on Intel) to only one register read:
>>>
>>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>>
>>> Whereas in a real hypercall for (say) PIO you would need the address,
>>> size, direction and data.
>>>   
>>>       
>> Well, that's probably one of the reasons pio is slower, as the cpu has  
>> to set these up, and the kernel has to read them.
>>
>>     
>>> Also for PIO/MMIO you're adding this unoptimized lookup to the  
>>> measurement:
>>>
>>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>>         if (pio_dev) {
>>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>>                 complete_pio(vcpu);                 return 1;
>>>         }
>>>   
>>>       
>> Since there are only one or two elements in the list, I don't see how it  
>> could be optimized.
>>     
>
> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
> is probably the last in the io_bus list.
>
> Not sure if this one matters very much. Point is you should measure the
> exit time only, not the pio path vs hypercall path in kvm. 
>   

The problem is the exit time in and of itself isn't all that interesting to
me.  What I am interested in measuring is how long it takes KVM to
process the request and realize that I want to execute function "X". 
Ultimately that is what matters in terms of execution latency and is
thus the more interesting data.  I think the exit time is possibly an
interesting 5th data point, but it's more of a side-bar IMO.   In any
case, I suspect that both exits will be approximately the same at the
VT/SVM level.

OTOH: If there is a patch out there to improve KVM's code (say,
specifically the PIO handling logic), that is fair game here and we
should benchmark it.  For instance, if you have ideas on ways to improve
the find_pio_dev performance, etc....   One item may be to replace the
kvm->lock on the bus scan with RCU or something.... (though PIOs are
very frequent and the constant re-entry to an RCU read-side CS may
effectively cause a perpetual grace-period and may be too prohibitive). 
CC'ing pmck.
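
For the sake of discussion, a minimal sketch of what such an RCU-protected
scan could look like (entirely hypothetical: the io_bus_list field, the
in_range/write ops and the function name are assumptions for illustration,
not the current kvm code):

        /* hypothetical RCU conversion of the PIO write dispatch */
        static int pio_dispatch_rcu(struct kvm_vcpu *vcpu, gpa_t addr,
                                    int len, const void *val)
        {
                struct kvm_io_device *dev;
                int ret = -EOPNOTSUPP;

                rcu_read_lock();        /* instead of taking kvm->lock */
                list_for_each_entry_rcu(dev, &vcpu->kvm->io_bus_list, list) {
                        if (dev->in_range(dev, addr, len, 1)) {
                                /* dispatch inside the read-side CS so the
                                 * device cannot be freed underneath us */
                                dev->write(dev, addr, len, val);
                                ret = 0;
                                break;
                        }
                }
                rcu_read_unlock();

                return ret;
        }

Registration/removal would then publish updates with list_add_rcu() and
list_del_rcu(), freeing devices only after synchronize_rcu() (or via
call_rcu()), which is where the grace-period concern above comes in.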

FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
140 can possibly be recouped.  I currently suspect the lock acquisition
in the iobus-scan is the bulk of that time, but that is admittedly a
guess.  The remaining 200-250ns is elsewhere in the PIO decode.

-Greg





[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 23:35                     ` Marcelo Tosatti
                                         ` (2 preceding siblings ...)
  2009-05-08  7:59                       ` Avi Kivity
@ 2009-05-08 14:15                       ` Gregory Haskins
  2009-05-08 14:53                         ` Anthony Liguori
  3 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 14:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Chris Wright, Gregory Haskins, Avi Kivity, linux-kernel, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
>   
>> Chris Wright wrote:
>>     
>>> * Gregory Haskins (ghaskins@novell.com) wrote:
>>>   
>>>       
>>>> Chris Wright wrote:
>>>>     
>>>>         
>>>>> VF drivers can also have this issue (and typically use mmio).
>>>>> I at least have a better idea what your proposal is, thanks for
>>>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>>>> (improved latency numbers for example)?
>>>>>       
>>>>>           
>>>> I had a test-harness/numbers for this kind of thing, but its a bit
>>>> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
>>>> generate/post new numbers.
>>>>     
>>>>         
>>> That would be useful, because I keep coming back to pio and shared
>>> page(s) when think of why not to do this.  Seems I'm not alone in that.
>>>
>>> thanks,
>>> -chris
>>>   
>>>       
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>>     
>
> Greg,
>
> I think comparison is not entirely fair.

<snip>

FYI: I've update the test/wiki to (hopefully) address your concerns.

http://developer.novell.com/wiki/index.php/WhyHypercalls

Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08  7:55                         ` Avi Kivity
       [not found]                           ` <20090508103253.GC3011@amt.cnet>
@ 2009-05-08 14:35                           ` Marcelo Tosatti
  2009-05-08 14:45                             ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Marcelo Tosatti @ 2009-05-08 14:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, kvm, Anthony Liguori

On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>> it probably sucks much less than what you're seeing.
>>   
>
> Why would NPT improve mmio?  If anything, it would be worse, since the  
> processor has to do the nested walk.
>
> Of course, these are newer machines, so the absolute results as well as  
> the difference will be smaller.

Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:

NPT enabled:
test 0: 3088633284634 - 3059375712321 = 29257572313
test 1: 3121754636397 - 3088633419760 = 33121216637
test 2: 3204666462763 - 3121754668573 = 82911794190

NPT disabled:
test 0: 3638061646250 - 3609416811687 = 28644834563
test 1: 3669413430258 - 3638061771291 = 31351658967
test 2: 3736287253287 - 3669413463506 = 66873789781


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 14:35                           ` Marcelo Tosatti
@ 2009-05-08 14:45                             ` Gregory Haskins
  2009-05-08 15:51                               ` Marcelo Tosatti
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 14:45 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1440 bytes --]

Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>> it probably sucks much less than what you're seeing.
>>>   
>>>       
>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>> processor has to do the nested walk.
>>
>> Of course, these are newer machines, so the absolute results as well as  
>> the difference will be smaller.
>>     
>
> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>
> NPT enabled:
> test 0: 3088633284634 - 3059375712321 = 29257572313
> test 1: 3121754636397 - 3088633419760 = 33121216637
> test 2: 3204666462763 - 3121754668573 = 82911794190
>
> NPT disabled:
> test 0: 3638061646250 - 3609416811687 = 28644834563
> test 1: 3669413430258 - 3638061771291 = 31351658967
> test 2: 3736287253287 - 3669413463506 = 66873789781
>
>   
Thanks for running that.  It's interesting to see that NPT was in fact
worse, as Avi predicted.

Would you mind if I graphed the result and added this data to my wiki? 
If so, could you adjust the tsc result into IOPS using the proper
time-base and the test_count you ran with?   I can show a graph with the
data as is and the relative differences will properly surface, but it
would be nice to have apples to apples in terms of IOPS units with my
other run.
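
For reference, the conversion being asked for is simply
iops = test_count / (tsc_delta / tsc_freq).  A trivial sketch (the helper
name is made up for illustration):

        /* convert a whole-run TSC delta into I/O operations per second */
        static unsigned long long tsc_delta_to_iops(unsigned long long delta,
                                                    unsigned long long tsc_hz,
                                                    unsigned long long count)
        {
                return count * tsc_hz / delta;
        }

It would be called with the TSC frequency and the test_count used for the
run.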

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 14:15                       ` Gregory Haskins
@ 2009-05-08 14:53                         ` Anthony Liguori
  2009-05-08 18:50                           ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-08 14:53 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Marcelo Tosatti, Chris Wright, Gregory Haskins, Avi Kivity,
	linux-kernel, kvm

Gregory Haskins wrote:
>> Greg,
>>
>> I think comparison is not entirely fair.
>>     
>
> <snip>
>
> FYI: I've update the test/wiki to (hopefully) address your concerns.
>
> http://developer.novell.com/wiki/index.php/WhyHypercalls
>   

And we're now getting close to the point where the difference is 
virtually meaningless.

At .14us, in order to see 1% CPU overhead added from PIO vs HC, you need 
71429 exits per second (1% of a second is 10ms, and 10ms / 0.14us is 
roughly 71429).

If you have this many exits, the sheer cost of the base vmexit overhead 
is going to result in about 15% CPU overhead.  To put this another way, 
if your workload were entirely bound by vmexits (which is virtually 
impossible), then when you were saturating your CPU at 100%, only 7% of 
that is the cost of PIO exits vs. HC.

In real life workloads, if you're paying 15% overhead just to the cost 
of exits (not including the cost of heavy weight or post-exit 
processing), you're toast.  I think it's going to be very difficult to 
construct a real scenario where you'll have a measurable (i.e. > 1%) 
performance overhead from using PIO vs. HC.

And in the absence of that, I don't see the justification for adding 
additional infrastructure to Linux to support this.

The non-x86 architecture argument isn't valid because other 
architectures either 1) don't use PCI at all (s390) and are already 
using hypercalls 2) use PCI, but do not have a dedicated hypercall 
instruction (PPC emb) or 3) have PIO (ia64).

Regards,

Anthony Liguori

> Regards,
> -Greg
>
>
>   


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 12:29                 ` Avi Kivity
@ 2009-05-08 14:59                   ` Anthony Liguori
  2009-05-09 12:01                     ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-08 14:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel, kvm

Avi Kivity wrote:
>
> Hmm, reminds me of something I thought of a while back.
>
> We could implement an 'mmio hypercall' that does mmio reads/writes via 
> a hypercall instead of an mmio operation.  That will speed up mmio for 
> emulated devices (say, e1000).  It's easy to hook into Linux 
> (readl/writel), is pci-friendly, non-x86 friendly, etc.

By the time you get down to userspace for an emulated device, that 2us 
difference between mmio and hypercalls is simply not going to make a 
difference.  I'm surprised so much effort is going into this, is there 
any indication that this is even close to a bottleneck in any circumstance?

We have much, much lower hanging fruit to attack.  The basic fact that 
we still copy data multiple times in the networking drivers is clearly 
more significant than a few hundred nanoseconds that should occur less 
than once per packet.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08  8:17                 ` Avi Kivity
@ 2009-05-08 15:20                   ` Gregory Haskins
  2009-05-08 17:00                     ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 15:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

[-- Attachment #1: Type: text/plain, Size: 4855 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> Today, there is no equivelent of a platform agnostic "iowrite32()" for
>>>> hypercalls so the driver would look like the pseudocode above except
>>>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The
>>>> proposal
>>>> is to allow the hypervisor to assign a dynamic vector to resources in
>>>> the backend and convey this vector to the guest (such as in PCI
>>>> config-space as mentioned in my example use-case).  The provides the
>>>> "address negotiation" function that would normally be done for
>>>> something
>>>> like a pio port-address.   The hypervisor agnostic driver can then use
>>>> this globally recognized address-token coupled with other
>>>> device-private
>>>> ABI parameters to communicate with the device.  This can all occur
>>>> without the core hypervisor needing to understand the details
>>>> beyond the
>>>> addressing.
>>>>         
>>> PCI already provide a hypervisor agnostic interface (via IO
>>> regions). You have a mechanism for devices to discover which regions
>>> they have
>>> allocated and to request remappings.  It's supported by Linux and
>>> Windows.  It works on the vast majority of architectures out there
>>> today.
>>>
>>> Why reinvent the wheel?
>>>     
>>
>> I suspect the current wheel is square.  And the air is out.  Plus its
>> pulling to the left when I accelerate, but to be fair that may be my
>> alignment....
>
> No, your wheel is slightly faster on the highway, but doesn't work at
> all off-road.

Heh..

>
> Consider nested virtualization where the host (H) runs a guest (G1)
> which is itself a hypervisor, running a guest (G2).  The host exposes
> a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
> creating a new virtio devices and bridging it to one of V1..Vn,
> assigns virtio device V1 to guest G2, and prays.
>
> Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
> originated in G1 while in guest mode, so it injects it into G1.  G1
> examines the parameters but can't make any sense of them, so it
> returns an error to G2.
>
> If this were done using mmio or pio, it would have just worked.  With
> pio, H would have reflected the pio into G1, G1 would have done the
> conversion from G2's port number into G1's port number and reissued
> the pio, finally trapped by H and used to issue the I/O. 

I might be missing something, but I am not seeing the difference here. 
We have an "address" (in this case the HC-id) and a context (in this
case G1 running in non-root mode).   Whether the  trap to H is a HC or a
PIO, the context tells us that it needs to re-inject the same trap to G1
for proper handling.  So the "address" is re-injected from H to G1 as an
emulated trap to G1s root-mode, and we continue (just like the PIO).

And likewise, in both cases, G1 would (should?) know what to do with
that "address" as it relates to G2, just as it would need to know what
the PIO address is for.  Typically this would result in some kind of
translation of that "address", but I suppose even this is completely
arbitrary and only G1 knows for sure.  E.g. it might translate from
hypercall vector X to Y similar to your PIO example, it might completely
change transports, or it might terminate locally (e.g. emulated device
in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
be using MMIO to talk to H.  I don't think it matters from a topology
perspective (though it might from a performance perspective).

> With mmio, G1 would have set up G2's page tables to point directly at
> the addresses set up by H, so we would actually have a direct G2->H
> path.  Of course we'd need an emulated iommu so all the memory
> references actually resolve to G2's context.

/me head explodes

>
> So the upshoot is that hypercalls for devices must not be the primary
> method of communications; they're fine as an optimization, but we
> should always be able to fall back on something else.  We also need to
> figure out how G1 can stop V1 from advertising hypercall support.
I agree it would be desirable to be able to control this exposure. 
However, I am not currently convinced it's strictly necessary because of
the reason you mentioned above.  And also note that I am not currently
convinced it's even possible to control it.

For instance, what if G1 is an old KVM, or (dare I say) a completely
different hypervisor?  You could control things like whether G1 can see
the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
is to say what G1 will expose to G2?  G1 may very well advertise a HC
feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
you stop that?

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 12:43                           ` Gregory Haskins
@ 2009-05-08 15:33                             ` Marcelo Tosatti
  2009-05-08 19:02                               ` Avi Kivity
  2009-05-08 16:48                             ` Paul E. McKenney
  1 sibling, 1 reply; 115+ messages in thread
From: Marcelo Tosatti @ 2009-05-08 15:33 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm,
	Anthony Liguori, paulmck

On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
> The problem is the exit time in of itself isnt all that interesting to
> me.  What I am interested in measuring is how long it takes KVM to
> process the request and realize that I want to execute function "X". 
> Ultimately that is what matters in terms of execution latency and is
> thus the more interesting data.  I think the exit time is possibly an
> interesting 5th data point, but its more of a side-bar IMO.   In any
> case, I suspect that both exits will be approximately the same at the
> VT/SVM level.
> 
> OTOH: If there is a patch out there to improve KVMs code (say
> specifically the PIO handling logic), that is fair-game here and we
> should benchmark it.  For instance, if you have ideas on ways to improve
> the find_pio_dev performance, etc....   

<guess mode on>

One easy thing to try is to cache the last successful lookup in a
pointer, to improve patterns where there's "device locality" (like the
nullio test).

<guess mode off>
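
A minimal sketch of that idea (illustrative only; the last_pio_dev field
and the in_range op are assumptions layered on top of the
vcpu_find_pio_dev() lookup quoted earlier):

        /* hypothetical one-entry cache in front of the io_bus scan */
        static struct kvm_io_device *find_pio_dev_cached(struct kvm_vcpu *vcpu,
                                                         int port, int size,
                                                         int is_write)
        {
                struct kvm_io_device *dev = vcpu->arch.last_pio_dev;

                /* fast path: same device as on the previous exit */
                if (dev && dev->in_range(dev, port, size, is_write))
                        return dev;

                dev = vcpu_find_pio_dev(vcpu, port, size, is_write);
                if (dev)
                        vcpu->arch.last_pio_dev = dev;  /* remember it */

                return dev;
        }

A real version would also need to invalidate the cached pointer when
devices are registered or torn down, which circles back to the locking
discussion elsewhere in this thread.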

> One item may be to replace the kvm->lock on the bus scan with an RCU
> or something.... (though PIOs are very frequent and the constant
> re-entry to an an RCU read-side CS may effectively cause a perpetual
> grace-period and may be too prohibitive). CC'ing pmck.

Yes, locking improvements are needed there badly (think, for example, of
the cache bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on
4-way SMP guests).

> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
> 140 can possibly be recouped.  I currently suspect the lock acquisition
> in the iobus-scan is the bulk of that time, but that is admittedly a
> guess.  The remaining 200-250ns is elsewhere in the PIO decode.

vmcs_read is significantly expensive
(http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html,
likely that my measurements were foobar, Avi mentioned 50 cycles for
vmcs_write).

See, for example, how vmx.c reads VM_EXIT_INTR_INFO twice on every exit.

Also this one looks pretty bad for a 32-bit PAE guest (and you can 
get away with the unconditional GUEST_CR3 read too).

        /* Access CR3 don't cause VMExit in paging mode, so we need
         * to sync with guest real CR3. */
        if (enable_ept && is_paging(vcpu)) {
                vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
                ept_load_pdptrs(vcpu);
        }


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 14:45                             ` Gregory Haskins
@ 2009-05-08 15:51                               ` Marcelo Tosatti
  2009-05-08 19:56                                 ` David S. Ahern
  0 siblings, 1 reply; 115+ messages in thread
From: Marcelo Tosatti @ 2009-05-08 15:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, kvm, Anthony Liguori

On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
> Marcelo Tosatti wrote:
> > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
> >   
> >> Marcelo Tosatti wrote:
> >>     
> >>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> >>> it probably sucks much less than what you're seeing.
> >>>   
> >>>       
> >> Why would NPT improve mmio?  If anything, it would be worse, since the  
> >> processor has to do the nested walk.
> >>
> >> Of course, these are newer machines, so the absolute results as well as  
> >> the difference will be smaller.
> >>     
> >
> > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
> >
> > NPT enabled:
> > test 0: 3088633284634 - 3059375712321 = 29257572313
> > test 1: 3121754636397 - 3088633419760 = 33121216637
> > test 2: 3204666462763 - 3121754668573 = 82911794190
> >
> > NPT disabled:
> > test 0: 3638061646250 - 3609416811687 = 28644834563
> > test 1: 3669413430258 - 3638061771291 = 31351658967
> > test 2: 3736287253287 - 3669413463506 = 66873789781
> >
> >   
> Thanks for running that.  Its interesting to see that NPT was in fact
> worse as Avi predicted.
> 
> Would you mind if I graphed the result and added this data to my wiki? 
> If so, could you adjust the tsc result into IOPs using the proper
> time-base and the test_count you ran with?   I can show a graph with the
> data as is and the relative differences will properly surface..but it
> would be nice to have apples to apples in terms of IOPS units with my
> other run.
> 
> -Greg

Please, that'll be nice.

Quad-Core AMD Opteron(tm) Processor 2358 SE

host: 2.6.30-rc2
guest: 2.6.29.1-102.fc11.x86_64

test_count=1000000, tsc freq=2402882804 Hz

NPT disabled:

test 0 = 2771200766
test 1 = 3018726738
test 2 = 6414705418
test 3 = 2890332864

NPT enabled:

test 0 = 2908604045
test 1 = 3174687394
test 2 = 7912464804
test 3 = 3046085805


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 12:43                           ` Gregory Haskins
  2009-05-08 15:33                             ` Marcelo Tosatti
@ 2009-05-08 16:48                             ` Paul E. McKenney
  2009-05-08 19:55                               ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Paul E. McKenney @ 2009-05-08 16:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm, Anthony Liguori

On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
> Marcelo Tosatti wrote:
> > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
> >   
> >> Marcelo Tosatti wrote:
> >>     
> >>> I think comparison is not entirely fair. You're using
> >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> >>> (on Intel) to only one register read:
> >>>
> >>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
> >>>
> >>> Whereas in a real hypercall for (say) PIO you would need the address,
> >>> size, direction and data.
> >>>   
> >>>       
> >> Well, that's probably one of the reasons pio is slower, as the cpu has  
> >> to set these up, and the kernel has to read them.
> >>
> >>     
> >>> Also for PIO/MMIO you're adding this unoptimized lookup to the  
> >>> measurement:
> >>>
> >>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> >>>         if (pio_dev) {
> >>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> >>>                 complete_pio(vcpu);                 return 1;
> >>>         }
> >>>   
> >>>       
> >> Since there are only one or two elements in the list, I don't see how it  
> >> could be optimized.
> >>     
> >
> > speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
> > is probably the last in the io_bus list.
> >
> > Not sure if this one matters very much. Point is you should measure the
> > exit time only, not the pio path vs hypercall path in kvm. 
> >   
> 
> The problem is the exit time in of itself isnt all that interesting to
> me.  What I am interested in measuring is how long it takes KVM to
> process the request and realize that I want to execute function "X". 
> Ultimately that is what matters in terms of execution latency and is
> thus the more interesting data.  I think the exit time is possibly an
> interesting 5th data point, but its more of a side-bar IMO.   In any
> case, I suspect that both exits will be approximately the same at the
> VT/SVM level.
> 
> OTOH: If there is a patch out there to improve KVMs code (say
> specifically the PIO handling logic), that is fair-game here and we
> should benchmark it.  For instance, if you have ideas on ways to improve
> the find_pio_dev performance, etc....   One item may be to replace the
> kvm->lock on the bus scan with an RCU or something.... (though PIOs are
> very frequent and the constant re-entry to an an RCU read-side CS may
> effectively cause a perpetual grace-period and may be too prohibitive). 
> CC'ing pmck.

Hello, Greg!

Not a problem.  ;-)

A grace period only needs to wait on RCU read-side critical sections that
started before the grace period started.  As soon as those pre-existing
RCU read-side critical sections get done, the grace period can end, regardless
of how many RCU read-side critical sections might have started after
the grace period started.

If you find a situation where huge numbers of RCU read-side critical
sections do indefinitely delay a grace period, then that is a bug in
RCU that I need to fix.

Of course, if you have a single RCU read-side critical section that
runs for a very long time, that -will- delay a grace period.  As long
as you don't do it too often, this is not a problem, though running
a single RCU read-side critical section for more than a few milliseconds
is probably not a good thing.  Not as bad as holding a heavily contended
spinlock for a few milliseconds, but still not a good thing.

							Thanx, Paul

> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
> 140 can possibly be recouped.  I currently suspect the lock acquisition
> in the iobus-scan is the bulk of that time, but that is admittedly a
> guess.  The remaining 200-250ns is elsewhere in the PIO decode.
> 
> -Greg
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 15:20                   ` Gregory Haskins
@ 2009-05-08 17:00                     ` Avi Kivity
  2009-05-08 18:55                       ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 17:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
>> Consider nested virtualization where the host (H) runs a guest (G1)
>> which is itself a hypervisor, running a guest (G2).  The host exposes
>> a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
>> creating a new virtio devices and bridging it to one of V1..Vn,
>> assigns virtio device V1 to guest G2, and prays.
>>
>> Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
>> originated in G1 while in guest mode, so it injects it into G1.  G1
>> examines the parameters but can't make any sense of them, so it
>> returns an error to G2.
>>
>> If this were done using mmio or pio, it would have just worked.  With
>> pio, H would have reflected the pio into G1, G1 would have done the
>> conversion from G2's port number into G1's port number and reissued
>> the pio, finally trapped by H and used to issue the I/O. 
>>     
>
> I might be missing something, but I am not seeing the difference here. 
> We have an "address" (in this case the HC-id) and a context (in this
> case G1 running in non-root mode).   Whether the  trap to H is a HC or a
> PIO, the context tells us that it needs to re-inject the same trap to G1
> for proper handling.  So the "address" is re-injected from H to G1 as an
> emulated trap to G1s root-mode, and we continue (just like the PIO).
>   

So far, so good (though in fact mmio can short-circuit G2->H directly).

> And likewise, in both cases, G1 would (should?) know what to do with
> that "address" as it relates to G2, just as it would need to know what
> the PIO address is for.  Typically this would result in some kind of
> translation of that "address", but I suppose even this is completely
> arbitrary and only G1 knows for sure.  E.g. it might translate from
> hypercall vector X to Y similar to your PIO example, it might completely
> change transports, or it might terminate locally (e.g. emulated device
> in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
> be using MMIO to talk to H.  I don't think it matters from a topology
> perspective (though it might from a performance perspective).
>   

How can you translate a hypercall?  G1's and H's hypercall mechanisms 
can be completely different.


>> So the upshoot is that hypercalls for devices must not be the primary
>> method of communications; they're fine as an optimization, but we
>> should always be able to fall back on something else.  We also need to
>> figure out how G1 can stop V1 from advertising hypercall support.
>>     
> I agree it would be desirable to be able to control this exposure. 
> However, I am not currently convinced its strictly necessary because of
> the reason you mentioned above.  And also note that I am not currently
> convinced its even possible to control it.
>
> For instance, what if G1 is an old KVM, or (dare I say) a completely
> different hypervisor?  You could control things like whether G1 can see
> the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
> is to say what G1 will expose to G2?  G1 may very well advertise a HC
> feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
> you stop that?
>   

I don't see any way.

If, instead of a hypercall we go through the pio hypercall route, then 
it all resolves itself.  G2 issues a pio hypercall, H bounces it to G1, 
G1 either issues a pio or a pio hypercall depending on what the H and G1 
negotiated.  Of course mmio is faster in this case since it traps directly.

btw, what's the hypercall rate you're seeing?  At 10K hypercalls/sec, a 
0.4us difference will buy us a 0.4% reduction in cpu load, so let's see 
what the potential gain here is.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 14:53                         ` Anthony Liguori
@ 2009-05-08 18:50                           ` Avi Kivity
  2009-05-08 19:02                             ` Anthony Liguori
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 18:50 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

Anthony Liguori wrote:
>
> And we're now getting close to the point where the difference is 
> virtually meaningless.
>
> At .14us, in order to see 1% CPU overhead added from PIO vs HC, you 
> need 71429 exits.
>

If I read things correctly, you want the difference between PIO and 
PIOoHC, which is 210ns.  But your point stands, 50,000 exits/sec will 
add 1% cpu overhead.

>
> The non-x86 architecture argument isn't valid because other 
> architectures either 1) don't use PCI at all (s390) and are already 
> using hypercalls 2) use PCI, but do not have a dedicated hypercall 
> instruction (PPC emb) or 3) have PIO (ia64).

ia64 uses mmio to emulate pio, so the cost may be different.  I agree on 
x86 it's almost negligible.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 17:00                     ` Avi Kivity
@ 2009-05-08 18:55                       ` Gregory Haskins
  2009-05-08 19:05                         ` Anthony Liguori
  2009-05-08 19:12                         ` Avi Kivity
  0 siblings, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 18:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

[-- Attachment #1: Type: text/plain, Size: 6004 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Consider nested virtualization where the host (H) runs a guest (G1)
>>> which is itself a hypervisor, running a guest (G2).  The host exposes
>>> a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
>>> creating a new virtio devices and bridging it to one of V1..Vn,
>>> assigns virtio device V1 to guest G2, and prays.
>>>
>>> Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
>>> originated in G1 while in guest mode, so it injects it into G1.  G1
>>> examines the parameters but can't make any sense of them, so it
>>> returns an error to G2.
>>>
>>> If this were done using mmio or pio, it would have just worked.  With
>>> pio, H would have reflected the pio into G1, G1 would have done the
>>> conversion from G2's port number into G1's port number and reissued
>>> the pio, finally trapped by H and used to issue the I/O.     
>>
>> I might be missing something, but I am not seeing the difference
>> here. We have an "address" (in this case the HC-id) and a context (in
>> this
>> case G1 running in non-root mode).   Whether the  trap to H is a HC or a
>> PIO, the context tells us that it needs to re-inject the same trap to G1
>> for proper handling.  So the "address" is re-injected from H to G1 as an
>> emulated trap to G1s root-mode, and we continue (just like the PIO).
>>   
>
> So far, so good (though in fact mmio can short-circuit G2->H directly).

Yeah, that is a nice trick.  Despite the fact that MMIOs see about a 50%
degradation versus an equivalent PIO/HC trap, you would be hard-pressed to
make that up again with all the nested reinjection going on on the
PIO/HC side.  I think MMIO would be a fairly easy win with
one level of nesting, and would absolutely trounce anything that happens
to be deeper.

>
>> And likewise, in both cases, G1 would (should?) know what to do with
>> that "address" as it relates to G2, just as it would need to know what
>> the PIO address is for.  Typically this would result in some kind of
>> translation of that "address", but I suppose even this is completely
>> arbitrary and only G1 knows for sure.  E.g. it might translate from
>> hypercall vector X to Y similar to your PIO example, it might completely
>> change transports, or it might terminate locally (e.g. emulated device
>> in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
>> be using MMIO to talk to H.  I don't think it matters from a topology
>> perspective (though it might from a performance perspective).
>>   
>
> How can you translate a hypercall?  G1's and H's hypercall mechanisms
> can be completely different.

Well, what I mean is that the hypercall ABI is specific to G2->G1, but
the path really looks like G2->(H)->G1 transparently since H gets all
the initial exits coming from G2.  But all H has to do is blindly
reinject the exit with all the same parameters (e.g. registers,
primarily) to the G1-root context.

So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
and does its thing according to the ABI.  Perhaps the ABI for that
particular HC-id is a PIOoHC, so it turns around and does an
ioread/iowrite PIO, trapping us back to H.

So this transform of the HC-id "X" to PIO("Y") is the translation I was
referring to.  It could really be anything, though (e.g. HC "X" to HC
"Z", if that's what G1's handler for X told it to do).

>
>
>
>>> So the upshoot is that hypercalls for devices must not be the primary
>>> method of communications; they're fine as an optimization, but we
>>> should always be able to fall back on something else.  We also need to
>>> figure out how G1 can stop V1 from advertising hypercall support.
>>>     
>> I agree it would be desirable to be able to control this exposure.
>> However, I am not currently convinced its strictly necessary because of
>> the reason you mentioned above.  And also note that I am not currently
>> convinced its even possible to control it.
>>
>> For instance, what if G1 is an old KVM, or (dare I say) a completely
>> different hypervisor?  You could control things like whether G1 can see
>> the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
>> is to say what G1 will expose to G2?  G1 may very well advertise a HC
>> feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
>> you stop that?
>>   
>
> I don't see any way.
>
> If, instead of a hypercall we go through the pio hypercall route, then
> it all resolves itself.  G2 issues a pio hypercall, H bounces it to
> G1, G1 either issues a pio or a pio hypercall depending on what the H
> and G1 negotiated.

Actually, I don't even think it matters what the HC payload is.  It's
governed by the ABI between G1 and G2.  H will simply reflect the trap,
so the HC could be of any type, really.

>   Of course mmio is faster in this case since it traps directly.
>
> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see
> what's the potential gain here.

It's more of an issue of execution latency (which translates to IO
latency, since "execution" is usually for the specific goal of doing
some IO).  In fact, per my own design claims, I try to avoid exits like
the plague and generally succeed at making very few of them. ;)

So it's not really the .4% reduction of cpu use that appeals to me.  It's
the 16% reduction in latency.  Time/discussion will tell if it's worth the
trouble to use HC or just try to shave more off of PIO.  If we went that
route, I am concerned about falling back to MMIO, but Anthony seems to
think this is not a real issue.

From what we've discussed here, it seems the best-case scenario would be
if the Intel/AMD folks came up with some really good hardware-accelerated
MMIO-EXIT so we could avoid all the decode/walk crap in the
first place.  ;)

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 18:50                           ` Avi Kivity
@ 2009-05-08 19:02                             ` Anthony Liguori
  2009-05-08 19:06                               ` Avi Kivity
  2009-05-11 16:37                               ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-08 19:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> And we're now getting close to the point where the difference is 
>> virtually meaningless.
>>
>> At .14us, in order to see 1% CPU overhead added from PIO vs HC, you 
>> need 71429 exits.
>>
>
> If I read things correctly, you want the difference between PIO and 
> PIOoHC, which is 210ns.  But your point stands, 50,000 exits/sec will 
> add 1% cpu overhead.

Right, the basic math still stands.

>>
>> The non-x86 architecture argument isn't valid because other 
>> architectures either 1) don't use PCI at all (s390) and are already 
>> using hypercalls 2) use PCI, but do not have a dedicated hypercall 
>> instruction (PPC emb) or 3) have PIO (ia64).
>
> ia64 uses mmio to emulate pio, so the cost may be different.  I agree 
> on x86 it's almost negligible.

Yes, I misunderstood that they actually emulated it like that.  However, 
ia64 has no paravirtualization support today, so surely we aren't going 
to be justifying this via ia64, right?

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 15:33                             ` Marcelo Tosatti
@ 2009-05-08 19:02                               ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 19:02 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gregory Haskins, Chris Wright, Gregory Haskins, linux-kernel,
	kvm, Anthony Liguori, paulmck

Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
>   
>> The problem is the exit time in of itself isnt all that interesting to
>> me.  What I am interested in measuring is how long it takes KVM to
>> process the request and realize that I want to execute function "X". 
>> Ultimately that is what matters in terms of execution latency and is
>> thus the more interesting data.  I think the exit time is possibly an
>> interesting 5th data point, but its more of a side-bar IMO.   In any
>> case, I suspect that both exits will be approximately the same at the
>> VT/SVM level.
>>
>> OTOH: If there is a patch out there to improve KVMs code (say
>> specifically the PIO handling logic), that is fair-game here and we
>> should benchmark it.  For instance, if you have ideas on ways to improve
>> the find_pio_dev performance, etc....   
>>     
>
> <guess mode on>
>
> One easy thing to try is to cache the last successful lookup on a
> pointer, to improve patterns where there's "device locality" (like
> nullio test).
>   

We should do that everywhere: memory slots, pio slots, etc.  Or even 
keep statistics on accesses and sort by them.
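
Concretely, the one-entry cache could look something like this (purely
illustrative; the bus/device layout and the in_range() hook below are
stand-ins, not the actual kvm structures):

#include <linux/types.h>

/* Stand-in device descriptor; only the range check matters for the sketch. */
struct kvm_io_device {
	int (*in_range)(struct kvm_io_device *dev, u64 addr, int len,
			int is_write);
};

struct kvm_io_bus {
	int                   dev_count;
	struct kvm_io_device *devs[16];	/* size arbitrary for the example */
	struct kvm_io_device *last_hit;	/* one-entry lookup cache */
};

static struct kvm_io_device *io_bus_find_dev(struct kvm_io_bus *bus,
					     u64 addr, int len, int is_write)
{
	struct kvm_io_device *dev = bus->last_hit;
	int i;

	/* Fast path: workloads like nullio hit the same port over and over. */
	if (dev && dev->in_range(dev, addr, len, is_write))
		return dev;

	for (i = 0; i < bus->dev_count; i++) {
		dev = bus->devs[i];
		if (dev->in_range(dev, addr, len, is_write)) {
			bus->last_hit = dev;	/* remember for the next lookup */
			return dev;
		}
	}
	return NULL;
}

The cache costs one pointer per bus and one extra compare on a miss, so it
only helps patterns with "device locality", which is exactly the nullio case.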

> <guess mode off>
>
>   

I'd leave it on if I were you.

>> One item may be to replace the kvm->lock on the bus scan with an RCU
>> or something.... (though PIOs are very frequent and the constant
>> re-entry to an an RCU read-side CS may effectively cause a perpetual
>> grace-period and may be too prohibitive). CC'ing pmck.
>>     
>
> Yes, locking improvements are needed there badly (think for eg the cache
> bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP
> guests).
>   

There's no reason for kvm->lock on pio.  We should push the locking to 
devices.

I'm going to rename slots_lock as 
slots_lock_please_reimplement_me_using_rcu; this keeps coming up.

>> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
>> 140 can possibly be recouped.  I currently suspect the lock acquisition
>> in the iobus-scan is the bulk of that time, but that is admittedly a
>> guess.  The remaining 200-250ns is elsewhere in the PIO decode.
>>     
>
> vmcs_read is significantly expensive
> (http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html,
> likely that my measurements were foobar, Avi mentioned 50 cycles for
> vmcs_write).
>   

IIRC vmcs reads are pretty fast, and are being improved.

> See for eg how vmx.c reads VM_EXIT_INTR_INFO twice on every exit.
>   

Ugh.

> Also this one looks pretty bad for a 32-bit PAE guest (and you can 
> get away with the unconditional GUEST_CR3 read too).
>
>         /* Access CR3 don't cause VMExit in paging mode, so we need
>          * to sync with guest real CR3. */
>         if (enable_ept && is_paging(vcpu)) {
>                 vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
>                 ept_load_pdptrs(vcpu);
>         }
>
>   

We should use an accessor here just like with registers and segment 
registers.
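
As a sketch of what that accessor could look like (cr3_avail is a
hypothetical per-vcpu flag used here for illustration, not an existing
field; the point is just to defer the vmcs_readl() until someone asks):

static unsigned long vmx_guest_cr3(struct kvm_vcpu *vcpu)
{
	/* cr3_avail would be cleared on every vmexit, like the register cache. */
	if (!vcpu->arch.cr3_avail) {
		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
		vcpu->arch.cr3_avail = true;
	}
	return vcpu->arch.cr3;
}

The exit path then stops doing an unconditional vmcs_readl() and only
reads GUEST_CR3 on first use after an exit.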

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 18:55                       ` Gregory Haskins
@ 2009-05-08 19:05                         ` Anthony Liguori
  2009-05-08 19:12                         ` Avi Kivity
  1 sibling, 0 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-08 19:05 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
> Its more of an issue of execution latency (which translates to IO
> latency, since "execution" is usually for the specific goal of doing
> some IO).  In fact, per my own design claims, I try to avoid exits like
> the plague and generally succeed at making very few of them. ;)
>
> So its not really the .4% reduction of cpu use that allures me.  Its the
> 16% reduction in latency.  Time/discussion will tell if its worth the
> trouble to use HC or just try to shave more off of PIO.  If we went that
> route, I am concerned about falling back to MMIO, but Anthony seems to
> think this is not a real issue.
>   

It's only a 16% reduction in latency if your workload is entirely 
dependent on the latency of a hypercall.  What is that workload?  I 
don't think it exists.

For a network driver, I have a hard time believing that anyone cares 
that much about 210ns of latency.  We're getting close to the cost of a 
few dozen instructions here.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 19:02                             ` Anthony Liguori
@ 2009-05-08 19:06                               ` Avi Kivity
  2009-05-11 16:37                               ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 19:06 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

Anthony Liguori wrote:
>> ia64 uses mmio to emulate pio, so the cost may be different.  I agree 
>> on x86 it's almost negligible.
>
> Yes, I misunderstood that they actually emulated it like that.  
> However, ia64 has no paravirtualization support today so surely, we 
> aren't going to be justifying this via ia64, right?
>

Right.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 18:55                       ` Gregory Haskins
  2009-05-08 19:05                         ` Anthony Liguori
@ 2009-05-08 19:12                         ` Avi Kivity
  2009-05-08 19:59                           ` Gregory Haskins
  1 sibling, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-08 19:12 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
>>> And likewise, in both cases, G1 would (should?) know what to do with
>>> that "address" as it relates to G2, just as it would need to know what
>>> the PIO address is for.  Typically this would result in some kind of
>>> translation of that "address", but I suppose even this is completely
>>> arbitrary and only G1 knows for sure.  E.g. it might translate from
>>> hypercall vector X to Y similar to your PIO example, it might completely
>>> change transports, or it might terminate locally (e.g. emulated device
>>> in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
>>> be using MMIO to talk to H.  I don't think it matters from a topology
>>> perspective (though it might from a performance perspective).
>>>   
>>>       
>> How can you translate a hypercall?  G1's and H's hypercall mechanisms
>> can be completely different.
>>     
>
> Well, what I mean is that the hypercall ABI is specific to G2->G1, but
> the path really looks like G2->(H)->G1 transparently since H gets all
> the initial exits coming from G2.  But all H has to do is blindly
> reinject the exit with all the same parameters (e.g. registers,
> primarily) to the G1-root context.
>
> So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
> and does its thing according to the ABI.  Perhaps the ABI for that
> particular HC-id is a PIOoHC, so it turns around and does a
> ioread/iowrite PIO, trapping us back to H.
>
> So this transform of the HC-id "X" to PIO("Y") is the translation I was
> referring to.  It could really be anything, though (e.g. HC "X" to HC
> "Z", if thats what G1s handler for X told it to do)
>   

That only works if the device exposes a pio port, and the hypervisor 
exposes HC_PIO.  If the device exposes the hypercall, things break once 
you assign it.

>>   Of course mmio is faster in this case since it traps directly.
>>
>> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
>> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see
>> what's the potential gain here.
>>     
>
> Its more of an issue of execution latency (which translates to IO
> latency, since "execution" is usually for the specific goal of doing
> some IO).  In fact, per my own design claims, I try to avoid exits like
> the plague and generally succeed at making very few of them. ;)
>
> So its not really the .4% reduction of cpu use that allures me.  Its the
> 16% reduction in latency.  Time/discussion will tell if its worth the
> trouble to use HC or just try to shave more off of PIO.  If we went that
> route, I am concerned about falling back to MMIO, but Anthony seems to
> think this is not a real issue.
>   

You need to use absolute numbers, not percentages off the smallest 
component.  If you want to reduce latency, keep things on the same core 
(IPIs, cache bounces are more expensive than the 200ns we're seeing here).



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 16:48                             ` Paul E. McKenney
@ 2009-05-08 19:55                               ` Gregory Haskins
  0 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 19:55 UTC (permalink / raw)
  To: paulmck
  Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 4007 bytes --]

Paul E. McKenney wrote:
> On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
>>>   
>>>       
>>>> Marcelo Tosatti wrote:
>>>>     
>>>>         
>>>>> I think comparison is not entirely fair. You're using
>>>>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>>>>> (on Intel) to only one register read:
>>>>>
>>>>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>>>>
>>>>> Whereas in a real hypercall for (say) PIO you would need the address,
>>>>> size, direction and data.
>>>>>   
>>>>>       
>>>>>           
>>>> Well, that's probably one of the reasons pio is slower, as the cpu has  
>>>> to set these up, and the kernel has to read them.
>>>>
>>>>     
>>>>         
>>>>> Also for PIO/MMIO you're adding this unoptimized lookup to the  
>>>>> measurement:
>>>>>
>>>>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>>>>         if (pio_dev) {
>>>>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>>>>                 complete_pio(vcpu);                 return 1;
>>>>>         }
>>>>>   
>>>>>       
>>>>>           
>>>> Since there are only one or two elements in the list, I don't see how it  
>>>> could be optimized.
>>>>     
>>>>         
>>> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
>>> is probably the last in the io_bus list.
>>>
>>> Not sure if this one matters very much. Point is you should measure the
>>> exit time only, not the pio path vs hypercall path in kvm. 
>>>   
>>>       
>> The problem is the exit time in of itself isnt all that interesting to
>> me.  What I am interested in measuring is how long it takes KVM to
>> process the request and realize that I want to execute function "X". 
>> Ultimately that is what matters in terms of execution latency and is
>> thus the more interesting data.  I think the exit time is possibly an
>> interesting 5th data point, but its more of a side-bar IMO.   In any
>> case, I suspect that both exits will be approximately the same at the
>> VT/SVM level.
>>
>> OTOH: If there is a patch out there to improve KVMs code (say
>> specifically the PIO handling logic), that is fair-game here and we
>> should benchmark it.  For instance, if you have ideas on ways to improve
>> the find_pio_dev performance, etc....   One item may be to replace the
>> kvm->lock on the bus scan with an RCU or something.... (though PIOs are
>> very frequent and the constant re-entry to an an RCU read-side CS may
>> effectively cause a perpetual grace-period and may be too prohibitive). 
>> CC'ing pmck.
>>     
>
> Hello, Greg!
>
> Not a problem.  ;-)
>
> A grace period only needs to wait on RCU read-side critical sections that
> started before the grace period started.  As soon as those pre-existing
> RCU read-side critical get done, the grace period can end, regardless
> of how many RCU read-side critical sections might have started after
> the grace period started.
>
> If you find a situation where huge numbers of RCU read-side critical
> sections do indefinitely delay a grace period, then that is a bug in
> RCU that I need to fix.
>
> Of course, if you have a single RCU read-side critical section that
> runs for a very long time, that -will- delay a grace period.  As long
> as you don't do it too often, this is not a problem, though if running
> a single RCU read-side critical section for more than a few milliseconds
> is probably not a good thing.  Not as bad as holding a heavily contended
> spinlock for a few milliseconds, but still not a good thing.
>   

Hey Paul,
  This makes sense, and it clears up a misconception I had about RCU. 
So thanks for that.

Based on what Paul said, I think we can get some gains in the PIO and
PIOoHC stats from converting to RCU.  I will do this next.
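
For reference, the shape I have in mind is the usual copy/update/publish
pattern.  This is a sketch only: the pio_bus-as-pointer layout and the
kvm_iodevice_in_range() helper are placeholders, writers are assumed to be
serialized by an existing mutex, and allocation error handling is omitted.

/* Readers scan a snapshot of the bus under rcu_read_lock(). */
static struct kvm_io_device *io_bus_find_dev_rcu(struct kvm *kvm, u64 addr,
						 int len, int is_write)
{
	struct kvm_io_bus *bus;
	struct kvm_io_device *dev = NULL;
	int i;

	rcu_read_lock();
	bus = rcu_dereference(kvm->pio_bus);
	for (i = 0; i < bus->dev_count; i++) {
		if (kvm_iodevice_in_range(bus->devs[i], addr, len, is_write)) {
			dev = bus->devs[i];
			break;
		}
	}
	rcu_read_unlock();
	return dev;	/* devices must outlive the bus snapshot they came from */
}

/* Writers copy, update, publish, then wait a grace period before freeing. */
static void io_bus_register_dev(struct kvm *kvm, struct kvm_io_device *dev)
{
	struct kvm_io_bus *old = kvm->pio_bus;
	struct kvm_io_bus *new = kmalloc(sizeof(*new), GFP_KERNEL);

	*new = *old;
	new->devs[new->dev_count++] = dev;
	rcu_assign_pointer(kvm->pio_bus, new);
	synchronize_rcu();	/* no reader can still see 'old' after this */
	kfree(old);
}

Per Paul's point above, the hot PIO path only adds the
rcu_read_lock()/unlock() pair, which is nearly free; the grace period is
only paid on the rare register/unregister.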

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 15:51                               ` Marcelo Tosatti
@ 2009-05-08 19:56                                 ` David S. Ahern
  2009-05-08 20:01                                   ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: David S. Ahern @ 2009-05-08 19:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins, kvm,
	Anthony Liguori



Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>> Marcelo Tosatti wrote:
>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>>>   
>>>> Marcelo Tosatti wrote:
>>>>     
>>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>>>> it probably sucks much less than what you're seeing.
>>>>>   
>>>>>       
>>>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>>>> processor has to do the nested walk.
>>>>
>>>> Of course, these are newer machines, so the absolute results as well as  
>>>> the difference will be smaller.
>>>>     
>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>>>
>>> NPT enabled:
>>> test 0: 3088633284634 - 3059375712321 = 29257572313
>>> test 1: 3121754636397 - 3088633419760 = 33121216637
>>> test 2: 3204666462763 - 3121754668573 = 82911794190
>>>
>>> NPT disabled:
>>> test 0: 3638061646250 - 3609416811687 = 28644834563
>>> test 1: 3669413430258 - 3638061771291 = 31351658967
>>> test 2: 3736287253287 - 3669413463506 = 66873789781
>>>
>>>   
>> Thanks for running that.  Its interesting to see that NPT was in fact
>> worse as Avi predicted.
>>
>> Would you mind if I graphed the result and added this data to my wiki? 
>> If so, could you adjust the tsc result into IOPs using the proper
>> time-base and the test_count you ran with?   I can show a graph with the
>> data as is and the relative differences will properly surface..but it
>> would be nice to have apples to apples in terms of IOPS units with my
>> other run.
>>
>> -Greg
> 
> Please, that'll be nice.
> 
> Quad-Core AMD Opteron(tm) Processor 2358 SE
> 
> host: 2.6.30-rc2
> guest: 2.6.29.1-102.fc11.x86_64
> 
> test_count=1000000, tsc freq=2402882804 Hz
> 
> NPT disabled:
> 
> test 0 = 2771200766
> test 1 = 3018726738
> test 2 = 6414705418
> test 3 = 2890332864
> 
> NPT enabled:
> 
> test 0 = 2908604045
> test 1 = 3174687394
> test 2 = 7912464804
> test 3 = 3046085805
> 

DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
host: 2.6.30-rc3
guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64

with EPT
test 0: 543617607291 - 518146439877 = 25471167414
test 1: 572568176856 - 543617703004 = 28950473852
test 2: 630182158139 - 572568269792 = 57613888347


without EPT
test 0: 1383532195307 - 1358052032086 = 25480163221
test 1: 1411587055210 - 1383532318617 = 28054736593
test 2: 1471446356172 - 1411587194600 = 59859161572


david



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 19:12                         ` Avi Kivity
@ 2009-05-08 19:59                           ` Gregory Haskins
  2009-05-10  9:59                             ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 19:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

[-- Attachment #1: Type: text/plain, Size: 4225 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>>> And likewise, in both cases, G1 would (should?) know what to do with
>>>> that "address" as it relates to G2, just as it would need to know what
>>>> the PIO address is for.  Typically this would result in some kind of
>>>> translation of that "address", but I suppose even this is completely
>>>> arbitrary and only G1 knows for sure.  E.g. it might translate from
>>>> hypercall vector X to Y similar to your PIO example, it might
>>>> completely
>>>> change transports, or it might terminate locally (e.g. emulated device
>>>> in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1
>>>> might
>>>> be using MMIO to talk to H.  I don't think it matters from a topology
>>>> perspective (though it might from a performance perspective).
>>>>         
>>> How can you translate a hypercall?  G1's and H's hypercall mechanisms
>>> can be completely different.
>>>     
>>
>> Well, what I mean is that the hypercall ABI is specific to G2->G1, but
>> the path really looks like G2->(H)->G1 transparently since H gets all
>> the initial exits coming from G2.  But all H has to do is blindly
>> reinject the exit with all the same parameters (e.g. registers,
>> primarily) to the G1-root context.
>>
>> So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
>> and does its thing according to the ABI.  Perhaps the ABI for that
>> particular HC-id is a PIOoHC, so it turns around and does a
>> ioread/iowrite PIO, trapping us back to H.
>>
>> So this transform of the HC-id "X" to PIO("Y") is the translation I was
>> referring to.  It could really be anything, though (e.g. HC "X" to HC
>> "Z", if thats what G1s handler for X told it to do)
>>   
>
> That only works if the device exposes a pio port, and the hypervisor
> exposes HC_PIO.  If the device exposes the hypercall, things break
> once you assign it.

Well, true.  But normally I would think you would resurface the device
from G1 to G2 anyway, so any relevant transform would also be reflected
in the resurfaced device config.  I suppose if you had a hard
requirement that, say, even the pci-config space was pass-through, this
would be a problem.  I am not sure if that is a realistic environment,
though. 
>
>>>   Of course mmio is faster in this case since it traps directly.
>>>
>>> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
>>> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see
>>> what's the potential gain here.
>>>     
>>
>> Its more of an issue of execution latency (which translates to IO
>> latency, since "execution" is usually for the specific goal of doing
>> some IO).  In fact, per my own design claims, I try to avoid exits like
>> the plague and generally succeed at making very few of them. ;)
>>
>> So its not really the .4% reduction of cpu use that allures me.  Its the
>> 16% reduction in latency.  Time/discussion will tell if its worth the
>> trouble to use HC or just try to shave more off of PIO.  If we went that
>> route, I am concerned about falling back to MMIO, but Anthony seems to
>> think this is not a real issue.
>>   
>
> You need to use absolute numbers, not percentages off the smallest
> component.  If you want to reduce latency, keep things on the same
> core (IPIs, cache bounces are more expensive than the 200ns we're
> seeing here).
>
>
Ok, so there is no shortage of IO cards that can perform operations on
the order of 10us-15us.  Therefore a 350ns latency (the delta between
PIO and HC) turns into a 2%-3.5% overhead when compared to bare metal.
I am not really at liberty to talk about most of the kinds of
applications that might care.  A trivial example might be PTPd clock
distribution.

But this is going nowhere.  Based on Anthony's assertion that the MMIO
fallback worry is unfounded, and all the controversy this is causing,
perhaps we should just move on and forget the whole thing.  If I
have to, I will patch the HC code in my own tree.  For now, I will submit
a few patches to clean up the locking on the io_bus.  That may help
narrow the gap without all this stuff, anyway.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 19:56                                 ` David S. Ahern
@ 2009-05-08 20:01                                   ` Gregory Haskins
  2009-05-08 23:23                                     ` David S. Ahern
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-08 20:01 UTC (permalink / raw)
  To: David S. Ahern
  Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, kvm,
	Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 2785 bytes --]

David S. Ahern wrote:
> Marcelo Tosatti wrote:
>   
>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>>     
>>> Marcelo Tosatti wrote:
>>>       
>>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>>>>   
>>>>         
>>>>> Marcelo Tosatti wrote:
>>>>>     
>>>>>           
>>>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>>>>> it probably sucks much less than what you're seeing.
>>>>>>   
>>>>>>       
>>>>>>             
>>>>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>>>>> processor has to do the nested walk.
>>>>>
>>>>> Of course, these are newer machines, so the absolute results as well as  
>>>>> the difference will be smaller.
>>>>>     
>>>>>           
>>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>>>>
>>>> NPT enabled:
>>>> test 0: 3088633284634 - 3059375712321 = 29257572313
>>>> test 1: 3121754636397 - 3088633419760 = 33121216637
>>>> test 2: 3204666462763 - 3121754668573 = 82911794190
>>>>
>>>> NPT disabled:
>>>> test 0: 3638061646250 - 3609416811687 = 28644834563
>>>> test 1: 3669413430258 - 3638061771291 = 31351658967
>>>> test 2: 3736287253287 - 3669413463506 = 66873789781
>>>>
>>>>   
>>>>         
>>> Thanks for running that.  Its interesting to see that NPT was in fact
>>> worse as Avi predicted.
>>>
>>> Would you mind if I graphed the result and added this data to my wiki? 
>>> If so, could you adjust the tsc result into IOPs using the proper
>>> time-base and the test_count you ran with?   I can show a graph with the
>>> data as is and the relative differences will properly surface..but it
>>> would be nice to have apples to apples in terms of IOPS units with my
>>> other run.
>>>
>>> -Greg
>>>       
>> Please, that'll be nice.
>>
>> Quad-Core AMD Opteron(tm) Processor 2358 SE
>>
>> host: 2.6.30-rc2
>> guest: 2.6.29.1-102.fc11.x86_64
>>
>> test_count=1000000, tsc freq=2402882804 Hz
>>
>> NPT disabled:
>>
>> test 0 = 2771200766
>> test 1 = 3018726738
>> test 2 = 6414705418
>> test 3 = 2890332864
>>
>> NPT enabled:
>>
>> test 0 = 2908604045
>> test 1 = 3174687394
>> test 2 = 7912464804
>> test 3 = 3046085805
>>
>>     
>
> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
> host: 2.6.30-rc3
> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64
>
> with EPT
> test 0: 543617607291 - 518146439877 = 25471167414
> test 1: 572568176856 - 543617703004 = 28950473852
> test 2: 630182158139 - 572568269792 = 57613888347
>
>
> without EPT
> test 0: 1383532195307 - 1358052032086 = 25480163221
> test 1: 1411587055210 - 1383532318617 = 28054736593
> test 2: 1471446356172 - 1411587194600 = 59859161572
>
>
>   

Thank you kindly, David.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-07 22:11                                                 ` Arnd Bergmann
@ 2009-05-08 22:33                                                   ` Benjamin Herrenschmidt
  2009-05-11 13:01                                                     ` Arnd Bergmann
  0 siblings, 1 reply; 115+ messages in thread
From: Benjamin Herrenschmidt @ 2009-05-08 22:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Chris Wright, Gregory Haskins, Gregory Haskins, Avi Kivity,
	linux-kernel, kvm, Anthony Liguori

On Fri, 2009-05-08 at 00:11 +0200, Arnd Bergmann wrote:
> On Thursday 07 May 2009, Chris Wright wrote:
> > 
> > > Chris, is that issue with the non ioread/iowrite access of a mangled
> > > pointer still an issue here?  I would think so, but I am a bit fuzzy on
> > > whether there is still an issue of non-wrapped MMIO ever occuring.
> > 
> > Arnd was saying it's a bug for other reasons, so perhaps it would work
> > out fine.
> 
> Well, maybe. I only said that __raw_writel and pointer dereference is
> bad, but not writel.

I have only a vague recollection of that stuff, but basically, it boiled
down to me attempting to use annotations to differentiate old-style
"ioremap" vs. new-style iomap and effectively forbid mixing ioremap with
iomap in either direction.

This was shot down by a vast majority of people, with the outcome being
an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
something that is strictly interchangeable with what ioremap would have
returned.

That means that readl and writel must work on the output of pci_iomap()
and similar, but I don't see why __raw_writel would be excluded there; I
think it's in there too.

Direct dereference is illegal in all cases though.

The token returned by pci_iomap for other types of resources (IO for
example) is also only supported for use by the iomap access functions
(ioreadXX/iowriteXX), and IO ports cannot be passed directly to those
either.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 20:01                                   ` Gregory Haskins
@ 2009-05-08 23:23                                     ` David S. Ahern
  2009-05-09  8:45                                       ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: David S. Ahern @ 2009-05-08 23:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Marcelo Tosatti, Avi Kivity, Chris Wright, Gregory Haskins, kvm,
	Anthony Liguori



Gregory Haskins wrote:
> David S. Ahern wrote:
>> Marcelo Tosatti wrote:
>>   
>>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>>>     
>>>> Marcelo Tosatti wrote:
>>>>       
>>>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>>>>>   
>>>>>         
>>>>>> Marcelo Tosatti wrote:
>>>>>>     
>>>>>>           
>>>>>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>>>>>> it probably sucks much less than what you're seeing.
>>>>>>>   
>>>>>>>       
>>>>>>>             
>>>>>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>>>>>> processor has to do the nested walk.
>>>>>>
>>>>>> Of course, these are newer machines, so the absolute results as well as  
>>>>>> the difference will be smaller.
>>>>>>     
>>>>>>           
>>>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>>>>>
>>>>> NPT enabled:
>>>>> test 0: 3088633284634 - 3059375712321 = 29257572313
>>>>> test 1: 3121754636397 - 3088633419760 = 33121216637
>>>>> test 2: 3204666462763 - 3121754668573 = 82911794190
>>>>>
>>>>> NPT disabled:
>>>>> test 0: 3638061646250 - 3609416811687 = 28644834563
>>>>> test 1: 3669413430258 - 3638061771291 = 31351658967
>>>>> test 2: 3736287253287 - 3669413463506 = 66873789781
>>>>>
>>>>>   
>>>>>         
>>>> Thanks for running that.  Its interesting to see that NPT was in fact
>>>> worse as Avi predicted.
>>>>
>>>> Would you mind if I graphed the result and added this data to my wiki? 
>>>> If so, could you adjust the tsc result into IOPs using the proper
>>>> time-base and the test_count you ran with?   I can show a graph with the
>>>> data as is and the relative differences will properly surface..but it
>>>> would be nice to have apples to apples in terms of IOPS units with my
>>>> other run.
>>>>
>>>> -Greg
>>>>       
>>> Please, that'll be nice.
>>>
>>> Quad-Core AMD Opteron(tm) Processor 2358 SE
>>>
>>> host: 2.6.30-rc2
>>> guest: 2.6.29.1-102.fc11.x86_64
>>>
>>> test_count=1000000, tsc freq=2402882804 Hz
>>>
>>> NPT disabled:
>>>
>>> test 0 = 2771200766
>>> test 1 = 3018726738
>>> test 2 = 6414705418
>>> test 3 = 2890332864
>>>
>>> NPT enabled:
>>>
>>> test 0 = 2908604045
>>> test 1 = 3174687394
>>> test 2 = 7912464804
>>> test 3 = 3046085805
>>>
>>>     
>> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
>> host: 2.6.30-rc3
>> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64
>>
>> with EPT
>> test 0: 543617607291 - 518146439877 = 25471167414
>> test 1: 572568176856 - 543617703004 = 28950473852
>> test 2: 630182158139 - 572568269792 = 57613888347
>>
>>
>> without EPT
>> test 0: 1383532195307 - 1358052032086 = 25480163221
>> test 1: 1411587055210 - 1383532318617 = 28054736593
>> test 2: 1471446356172 - 1411587194600 = 59859161572
>>
>>
>>   
> 
> Thank you kindly, David.
> 
> -Greg

I ran another test case with SMT disabled, and while I was at it
converted the TSC delta to operations/sec. The results without SMT are
confusing -- to me anyway. I'm hoping someone can explain them.
Basically, using a count of 10,000,000 (per your web page) with SMT
disabled, the guest detected a soft lockup on the CPU, so I dropped the
count down to 1,000,000. For 1e6 iterations:

without SMT, with EPT:
    HC:   259,455 ops/sec
    PIO:  226,937 ops/sec
    MMIO: 113,180 ops/sec

without SMT, without EPT:
    HC:   274,825 ops/sec
    PIO:  247,910 ops/sec
    MMIO: 111,535 ops/sec

Converting the prior TSC deltas:

with SMT, with EPT:
    HC:    994,655 ops/sec
    PIO:   875,116 ops/sec
    MMIO:  439,738 ops/sec

with SMT, without EPT:
    HC:    994,304 ops/sec
    PIO:   903,057 ops/sec
    MMIO:  423,244 ops/sec

Running the tests repeatedly, I did notice fair variability (as much as
10% down from these numbers).

Also, just to be clear about how I converted the delta to ops/sec, the
formula I used was cpu_freq / dTSC * count = operations/sec.
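
In helper form, that conversion could be written like this (using the
kernel's div64_u64(); at these magnitudes cpu_freq * count still fits
comfortably in a u64):

#include <linux/math64.h>

/* ops/sec = cpu_freq / dTSC * count, computed as (cpu_freq * count) / dTSC */
static u64 tsc_delta_to_ops(u64 cpu_hz, u64 dtsc, u64 count)
{
	return div64_u64(cpu_hz * count, dtsc);
}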


david

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 23:23                                     ` David S. Ahern
@ 2009-05-09  8:45                                       ` Avi Kivity
  2009-05-09 11:27                                         ` Gregory Haskins
  2009-05-10  4:24                                         ` David S. Ahern
  0 siblings, 2 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-09  8:45 UTC (permalink / raw)
  To: David S. Ahern
  Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins,
	kvm, Anthony Liguori

David S. Ahern wrote:
> I ran another test case with SMT disabled, and while I was at it
> converted TSC delta to operations/sec. The results without SMT are
> confusing -- to me anyways. I'm hoping someone can explain it.
> Basically, using a count of 10,000,000 (per your web page) with SMT
> disabled the guest detected a soft lockup on the CPU. So, I dropped the
> count down to 1,000,000. So, for 1e6 iterations:
>
> without SMT, with EPT:
>     HC:   259,455 ops/sec
>     PIO:  226,937 ops/sec
>     MMIO: 113,180 ops/sec
>
> without SMT, without EPT:
>     HC:   274,825 ops/sec
>     PIO:  247,910 ops/sec
>     MMIO: 111,535 ops/sec
>
> Converting the prior TSC deltas:
>
> with SMT, with EPT:
>     HC:    994,655 ops/sec
>     PIO:   875,116 ops/sec
>     MMIO:  439,738 ops/sec
>
> with SMT, without EPT:
>     HC:    994,304 ops/sec
>     PIO:   903,057 ops/sec
>     MMIO:  423,244 ops/sec
>
> Running the tests repeatedly I did notice a fair variability (as much as
> -10% down from these numbers).
>
> Also, just to make sure I converted the delta to ops/sec, the formula I
> used was cpu_freq / dTSC * count = operations/sec
>
>   

The only thing I can think of is cpu frequency scaling lying about the
cpu frequency.  Really, the test needs to use time and not the time stamp
counter.

Are the results expressed in cycles/op more reasonable?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-09  8:45                                       ` Avi Kivity
@ 2009-05-09 11:27                                         ` Gregory Haskins
  2009-05-10  4:27                                           ` David S. Ahern
  2009-05-10  4:24                                         ` David S. Ahern
  1 sibling, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-09 11:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David S. Ahern, Gregory Haskins, Marcelo Tosatti, Chris Wright,
	kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1532 bytes --]

Avi Kivity wrote:
> David S. Ahern wrote:
>> I ran another test case with SMT disabled, and while I was at it
>> converted TSC delta to operations/sec. The results without SMT are
>> confusing -- to me anyways. I'm hoping someone can explain it.
>> Basically, using a count of 10,000,000 (per your web page) with SMT
>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>> count down to 1,000,000. So, for 1e6 iterations:
>>
>> without SMT, with EPT:
>>     HC:   259,455 ops/sec
>>     PIO:  226,937 ops/sec
>>     MMIO: 113,180 ops/sec
>>
>> without SMT, without EPT:
>>     HC:   274,825 ops/sec
>>     PIO:  247,910 ops/sec
>>     MMIO: 111,535 ops/sec
>>
>> Converting the prior TSC deltas:
>>
>> with SMT, with EPT:
>>     HC:    994,655 ops/sec
>>     PIO:   875,116 ops/sec
>>     MMIO:  439,738 ops/sec
>>
>> with SMT, without EPT:
>>     HC:    994,304 ops/sec
>>     PIO:   903,057 ops/sec
>>     MMIO:  423,244 ops/sec
>>
>> Running the tests repeatedly I did notice a fair variability (as much as
>> -10% down from these numbers).
>>
>> Also, just to make sure I converted the delta to ops/sec, the formula I
>> used was cpu_freq / dTSC * count = operations/sec
>>
>>   
>
> The only think I can think of is cpu frequency scaling lying about the
> cpu frequency.  Really the test needs to use time and not the time
> stamp counter.
>
> Are the results expressed in cycles/op more reasonable?

FWIW: I always used kvm_stat instead of my tsc printk



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 14:59                   ` Anthony Liguori
@ 2009-05-09 12:01                     ` Gregory Haskins
  2009-05-10 18:38                       ` Anthony Liguori
  0 siblings, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-09 12:01 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

Anthony Liguori wrote:
> Avi Kivity wrote:
>>
>> Hmm, reminds me of something I thought of a while back.
>>
>> We could implement an 'mmio hypercall' that does mmio reads/writes
>> via a hypercall instead of an mmio operation.  That will speed up
>> mmio for emulated devices (say, e1000).  It's easy to hook into Linux
>> (readl/writel), is pci-friendly, non-x86 friendly, etc.
>
> By the time you get down to userspace for an emulated device, that 2us
> difference between mmio and hypercalls is simply not going to make a
> difference.

I don't care about this path for emulated devices.  I am interested in
in-kernel vbus devices.

>   I'm surprised so much effort is going into this, is there any
> indication that this is even close to a bottleneck in any circumstance?

Yes.  Each 1us of overhead is a 4% regression in something as trivial as
a 25us UDP/ICMP rtt "ping".
>
>
> We have much, much lower hanging fruit to attack.  The basic fact that
> we still copy data multiple times in the networking drivers is clearly
> more significant than a few hundred nanoseconds that should occur less
> than once per packet.

For request-response, this is generally true for *every* packet, since you
cannot exploit buffering/deferring.

Can you back up your claim that PPC has no difference in performance
between an MMIO exit and a "hypercall"?  (Yes, I understand PPC has no
"VT"-like instructions, but clearly there are ways to cause a trap, so
presumably we can measure the difference between a PF exit and something
more explicit.)

We need numbers before we can really decide to abandon this
optimization.  If PPC mmio has no penalty over hypercall, I am not sure
the 350ns on x86 is worth this effort (especially if I can shrink this
with some RCU fixes).  Otherwise, the margin is quite a bit larger.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-09  8:45                                       ` Avi Kivity
  2009-05-09 11:27                                         ` Gregory Haskins
@ 2009-05-10  4:24                                         ` David S. Ahern
  1 sibling, 0 replies; 115+ messages in thread
From: David S. Ahern @ 2009-05-10  4:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Marcelo Tosatti, Chris Wright, Gregory Haskins,
	kvm, Anthony Liguori



Avi Kivity wrote:
> David S. Ahern wrote:
>> I ran another test case with SMT disabled, and while I was at it
>> converted TSC delta to operations/sec. The results without SMT are
>> confusing -- to me anyways. I'm hoping someone can explain it.
>> Basically, using a count of 10,000,000 (per your web page) with SMT
>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>> count down to 1,000,000. So, for 1e6 iterations:
>>
>> without SMT, with EPT:
>>     HC:   259,455 ops/sec
>>     PIO:  226,937 ops/sec
>>     MMIO: 113,180 ops/sec
>>
>> without SMT, without EPT:
>>     HC:   274,825 ops/sec
>>     PIO:  247,910 ops/sec
>>     MMIO: 111,535 ops/sec
>>
>> Converting the prior TSC deltas:
>>
>> with SMT, with EPT:
>>     HC:    994,655 ops/sec
>>     PIO:   875,116 ops/sec
>>     MMIO:  439,738 ops/sec
>>
>> with SMT, without EPT:
>>     HC:    994,304 ops/sec
>>     PIO:   903,057 ops/sec
>>     MMIO:  423,244 ops/sec
>>
>> Running the tests repeatedly I did notice a fair variability (as much as
>> -10% down from these numbers).
>>
>> Also, just to make sure I converted the delta to ops/sec, the formula I
>> used was cpu_freq / dTSC * count = operations/sec
>>
>>   
> 
> The only think I can think of is cpu frequency scaling lying about the
> cpu frequency.  Really the test needs to use time and not the time stamp
> counter.
> 
> Are the results expressed in cycles/op more reasonable?
> 

Power settings seem to be the root cause. With this HP server, the SMT
mode must be disabling or overriding a power setting that is enabled in
the BIOS. I found one power-related knob that gets non-SMT performance
close to the SMT numbers. It is not very intuitive that SMT/non-SMT can
differ so dramatically.

david

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-09 11:27                                         ` Gregory Haskins
@ 2009-05-10  4:27                                           ` David S. Ahern
  2009-05-10  5:24                                             ` Avi Kivity
  0 siblings, 1 reply; 115+ messages in thread
From: David S. Ahern @ 2009-05-10  4:27 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Gregory Haskins, Marcelo Tosatti, Chris Wright, kvm,
	Anthony Liguori



Gregory Haskins wrote:
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> I ran another test case with SMT disabled, and while I was at it
>>> converted TSC delta to operations/sec. The results without SMT are
>>> confusing -- to me anyways. I'm hoping someone can explain it.
>>> Basically, using a count of 10,000,000 (per your web page) with SMT
>>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>>> count down to 1,000,000. So, for 1e6 iterations:
>>>
>>> without SMT, with EPT:
>>>     HC:   259,455 ops/sec
>>>     PIO:  226,937 ops/sec
>>>     MMIO: 113,180 ops/sec
>>>
>>> without SMT, without EPT:
>>>     HC:   274,825 ops/sec
>>>     PIO:  247,910 ops/sec
>>>     MMIO: 111,535 ops/sec
>>>
>>> Converting the prior TSC deltas:
>>>
>>> with SMT, with EPT:
>>>     HC:    994,655 ops/sec
>>>     PIO:   875,116 ops/sec
>>>     MMIO:  439,738 ops/sec
>>>
>>> with SMT, without EPT:
>>>     HC:    994,304 ops/sec
>>>     PIO:   903,057 ops/sec
>>>     MMIO:  423,244 ops/sec
>>>
>>> Running the tests repeatedly I did notice a fair variability (as much as
>>> -10% down from these numbers).
>>>
>>> Also, just to make sure I converted the delta to ops/sec, the formula I
>>> used was cpu_freq / dTSC * count = operations/sec
>>>
>>>   
>> The only think I can think of is cpu frequency scaling lying about the
>> cpu frequency.  Really the test needs to use time and not the time
>> stamp counter.
>>
>> Are the results expressed in cycles/op more reasonable?
> 
> FWIW: I always used kvm_stat instead of my tsc printk
> 

kvm_stat shows approximately the same numbers as the TSC-->ops/sec
conversions. Interestingly, MMIO writes are not showing up as mmio_exits
in kvm_stat; they are showing up as insn_emulation.

david
> 

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-10  4:27                                           ` David S. Ahern
@ 2009-05-10  5:24                                             ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-10  5:24 UTC (permalink / raw)
  To: David S. Ahern
  Cc: Gregory Haskins, Gregory Haskins, Marcelo Tosatti, Chris Wright,
	kvm, Anthony Liguori

David S. Ahern wrote:
> kvm_stat shows same approximate numbers as with the TSC-->ops/sec
> conversions. Interestingly, MMIO writes are not showing up as mmio_exits
> in kvm_stat; they are showing up as insn_emulation.
>   

That's a bug; mmio_exits ignores mmios that are handled in the kernel.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 19:59                           ` Gregory Haskins
@ 2009-05-10  9:59                             ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-10  9:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Chris Wright, Gregory Haskins, linux-kernel, kvm

Gregory Haskins wrote:
>>
>> That only works if the device exposes a pio port, and the hypervisor
>> exposes HC_PIO.  If the device exposes the hypercall, things break
>> once you assign it.
>>     
>
> Well, true.  But normally I would think you would resurface the device
> from G1 to G2 anyway, so any relevant transform would also be reflected
> in the resurfaced device config.  

We do, but the G1 hypervisor cannot be expected to understand the config 
option that exposes the hypercall.

> I suppose if you had a hard
> requirement that, say, even the pci-config space was pass-through, this
> would be a problem.  I am not sure if that is a realistic environment,
> though. 
>   

You must pass through the config space, as some of it is device 
specific.  The hypervisor will trap config space accesses, but unless it 
understands them, it cannot modify them.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-09 12:01                     ` Gregory Haskins
@ 2009-05-10 18:38                       ` Anthony Liguori
  2009-05-11 13:14                         ` Gregory Haskins
  2009-05-11 16:44                         ` Hollis Blanchard
  0 siblings, 2 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-10 18:38 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Chris Wright, Gregory Haskins, linux-kernel, kvm,
	Hollis Blanchard

Gregory Haskins wrote:
> Anthony Liguori wrote:
>   
>
>>   I'm surprised so much effort is going into this, is there any
>> indication that this is even close to a bottleneck in any circumstance?
>>     
>
> Yes.  Each 1us of overhead is a 4% regression in something as trivial as
> a 25us UDP/ICMP rtt "ping".m 
>   

It wasn't 1us; it was 350ns or thereabouts (i.e., ~1%).

> for request-response, this is generally for *every* packet since you
> cannot exploit buffering/deferring.
>
> Can you back up your claim that PPC has no difference in performance
> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
> like instructions, but clearly there are ways to cause a trap, so
> presumably we can measure the difference between a PF exit and something
> more explicit).
>   

First, the PPC that KVM supports performs very poorly, relatively
speaking, because it receives no hardware assistance; this is not the
right place to focus wrt optimizations.

And because there's no hardware assistance, there simply isn't a
hypercall instruction.  Are PFs the fastest type of exit?  Probably not,
but I honestly have no idea.  I'm sure Hollis does, though.

Page faults are going to have tremendously different performance
characteristics on PPC too, because it's a software-managed TLB.  There's
no page table lookup like there is on x86.

As a more general observation, we need numbers to justify an
optimization, not to justify not including an optimization.

In other words, the burden is on you to present a scenario where this
optimization would result in a measurable improvement in a real-world
workload.


Regards,

Anthony Liguori
> We need numbers before we can really decide to abandon this
> optimization.  If PPC mmio has no penalty over hypercall, I am not sure
> the 350ns on x86 is worth this effort (especially if I can shrink this
> with some RCU fixes).  Otherwise, the margin is quite a bit larger.
>
> -Greg
>
>
>
>   


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 22:33                                                   ` Benjamin Herrenschmidt
@ 2009-05-11 13:01                                                     ` Arnd Bergmann
  2009-05-11 13:04                                                       ` Gregory Haskins
  0 siblings, 1 reply; 115+ messages in thread
From: Arnd Bergmann @ 2009-05-11 13:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Chris Wright, Gregory Haskins, Gregory Haskins, Avi Kivity,
	linux-kernel, kvm, Anthony Liguori

On Saturday 09 May 2009, Benjamin Herrenschmidt wrote:
> This was shot down by a vast majority of people, with the outcome being
> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
> something that is strictly interchangeable with what ioremap would have
> returned.
> 
> That means that readl and writel must work on the output of pci_iomap()
> and similar, but I don't see why __raw_writel would be excluded there, I
> think it's in there too.

One of the ideas was to change pci_iomap to return a special token
in case of virtual devices that causes iowrite32() to do an hcall,
and to just define writel() to do iowrite32().

Unfortunately, there is no __raw_iowrite32(), although I guess we
could add this generically if necessary.
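
As a rough sketch of the dispatch (the PVIO_* token range, hypercall4()
and mmio_or_pio_write32() below are invented for illustration; a real
implementation would sit next to the existing PIO-token handling in
lib/iomap.c):

/* Hypothetical token range that pci_iomap() would hand back for PV devices. */
#define PVIO_BASE	0xffffc00000000000UL
#define PVIO_SIZE	(1UL << 20)

static inline bool addr_is_pvio(const volatile void __iomem *addr)
{
	unsigned long a = (unsigned long __force)addr;

	return a >= PVIO_BASE && a < PVIO_BASE + PVIO_SIZE;
}

void iowrite32(u32 val, void __iomem *addr)
{
	if (addr_is_pvio(addr)) {
		/* hypercall4() stands in for whatever hcall ABI was negotiated */
		hypercall4(HC_PIO, (unsigned long __force)addr - PVIO_BASE,
			   4 /* size */, 1 /* write */, val);
		return;
	}
	/* stand-in for the current iowrite32() body (MMIO and PIO tokens) */
	mmio_or_pio_write32(val, addr);
}

With writel() defined to do iowrite32(), as suggested above, drivers that
already go through the iomap accessors would pick up the hcall path
without source changes.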

> Direct dereference is illegal in all cases though.

right.
 
> The token returned by pci_iomap for other type of resources (IO for
> example) is also only supported for use by iomap access functions
> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those
> neither.

That still leaves the option to let drivers pass the IORESOURCE_PVIO
for its own resources under some conditions, meaning that we will
only use hcalls for I/O on these drivers but not on others, as Chris
explained earlier.

	Arnd <><

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 13:01                                                     ` Arnd Bergmann
@ 2009-05-11 13:04                                                       ` Gregory Haskins
  0 siblings, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-11 13:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Chris Wright, Gregory Haskins,
	Avi Kivity, linux-kernel, kvm, Anthony Liguori

[-- Attachment #1: Type: text/plain, Size: 1519 bytes --]

Arnd Bergmann wrote:
> On Saturday 09 May 2009, Benjamin Herrenschmidt wrote:
>   
>> This was shot down by a vast majority of people, with the outcome being
>> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
>> something that is strictly interchangeable with what ioremap would have
>> returned.
>>
>> That means that readl and writel must work on the output of pci_iomap()
>> and similar, but I don't see why __raw_writel would be excluded there, I
>> think it's in there too.
>>     
>
> One of the ideas was to change pci_iomap to return a special token
> in case of virtual devices that causes iowrite32() to do an hcall,
> and to just define writel() to do iowrite32().
>
> Unfortunately, there is no __raw_iowrite32(), although I guess we
> could add this generically if necessary.
>
>   
>> Direct dereference is illegal in all cases though.
>>     
>
> right.
>  
>   
>> The token returned by pci_iomap for other type of resources (IO for
>> example) is also only supported for use by iomap access functions
>> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those
>> neither.
>>     
>
> That still leaves the option to let drivers pass the IORESOURCE_PVIO
> for its own resources under some conditions, meaning that we will
> only use hcalls for I/O on these drivers but not on others, as Chris
> explained earlier.
>   

Between this, and Avi's "nesting" point, this is the direction I am
leaning in right now.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-10 18:38                       ` Anthony Liguori
@ 2009-05-11 13:14                         ` Gregory Haskins
  2009-05-11 16:35                           ` Hollis Blanchard
  2009-05-11 17:31                           ` Anthony Liguori
  2009-05-11 16:44                         ` Hollis Blanchard
  1 sibling, 2 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-11 13:14 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm,
	Hollis Blanchard

[-- Attachment #1: Type: text/plain, Size: 8778 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>  
>>>   I'm surprised so much effort is going into this, is there any
>>> indication that this is even close to a bottleneck in any circumstance?
>>>     
>>
>> Yes.  Each 1us of overhead is a 4% regression in something as trivial as
>> a 25us UDP/ICMP rtt "ping".
>
> It wasn't 1us, it was 350ns or something around there (i.e. ~1%).

I wasn't referring to "it".  I chose my words carefully.

Let me rephrase for your clarity: *each* 1us of overhead introduced into
the signaling path is a ~4% latency regression for a round trip on a
high speed network (note that this can also affect throughput at some
level, too).  I believe this point has been lost on you from the very
beginning of the vbus discussions.

I specifically generalized my statement above because #1 I assume
everyone here is smart enough to convert that nice round unit into the
relevant figure.  And #2, there are multiple potential latency sources
at play which we need to factor in when looking at the big picture.  For
instance, the difference between a PF exit and an IO exit (2.58us on x86,
to be precise).  Or whether you need to take a heavy-weight exit.  Or a
context switch to qemu, then the kernel, back to qemu, and back to the
vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO. 
I know you wish that this whole discussion would just go away, but these
little "300ns here, 1600ns there" really add up in aggregate despite
your dismissive attitude towards them.  And it doesn't take much to
affect the results in a measurable way.  As stated, each 1us costs ~4%. 
My motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
improvement.  So what?  It's still an improvement.  If that improvement
were for free, would you object?  And we all know that this change isn't
"free" because we have to change some code (+128/-0, to be exact).  But
what is it specifically you are objecting to in the first place?  Adding
hypercall support as a pv_ops primitive isn't exactly hard or complex,
or even very much code.
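
For reference, the arithmetic behind the percentages being argued over,
taking the ~25us round trip cited above as the baseline (and the 66us
figure that appears later in the thread for comparison):

\[ \frac{1\,\mu s}{25\,\mu s} = 4\%, \qquad \frac{0.35\,\mu s}{25\,\mu s} = 1.4\%, \qquad \frac{0.35\,\mu s}{66\,\mu s} \approx 0.5\% \]

A fixed per-exit saving is a larger fraction of a shorter round trip,
which is why the same 350ns reads as ~1.4% against 25us but only ~0.5%
against 66us.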

Besides, I've already clearly stated multiple times (including in this
very thread) that I agree that I am not yet sure if the 350ns/1.4%
improvement alone is enough to justify a change.  So if you are somehow
trying to make me feel silly by pointing out the "~1%" above, you are
being ridiculous.

Rather, I was simply answering your question as to whether these latency
sources are a real issue.  The answer is "yes" (assuming you care about
latency) and I gave you a specific example and a method to quantify the
impact.

It is duly noted that you do not care about this type of performance,
but you also need to realize that your "blessing" or
acknowledgment/denial of the problem domain has _zero_ bearing on
whether the domain exists, or if there are others out there that do care
about it.  Sorry.

>
>> for request-response, this is generally for *every* packet since you
>> cannot exploit buffering/deferring.
>>
>> Can you back up your claim that PPC has no difference in performance
>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
>> like instructions, but clearly there are ways to cause a trap, so
>> presumably we can measure the difference between a PF exit and something
>> more explicit).
>>   
>
> First, the PPC that KVM supports performs very poorly relatively
> speaking because it receives no hardware assistance

So wouldn't that be making the case that it could use as much help as
possible?

>   this is not the right place to focus wrt optimizations.

Odd choice of words.  I am advocating the opposite (broad solution to
many arches and many platforms (i.e. hypervisors)) and therefore I am
not "focused" on it (or really any one arch) at all per se.  I am
_worried_ however, that we could be overlooking PPC (as an example) if
we ignore the disparity between MMIO and HC, since other higher
performance options like PIO are not available.  The goal on this
particular thread is to come up with an IO interface that works
reasonably well across as many hypervisors as possible.  MMIO/PIO do not
appear to fit that bill (at least not without tunneling them over HCs).

If I am guilty of focusing anywhere too much it would be x86 since that
is the only development platform I have readily available.


>
>
> And because there's no hardware assistance, there simply isn't a
> hypercall instruction.  Are PFs the fastest type of exits?  Probably
> not but I honestly have no idea.  I'm sure Hollis does though.
>
> Page faults are going to have tremendously different performance
> characteristics on PPC too because it's a software managed TLB.
> There's no page table lookup like there is on x86.

The difference between MMIO and "HC", and whether it is cause for
concern will continue to be pure speculation until we can find someone
with a PPC box willing to run some numbers.  I will point out that we
both seem to theorize that PFs will perform worse than the alternatives,
so it would seem you are actually making my point for me.

>
> As a more general observation, we need numbers to justify an
> optimization, not to justify not including an optimization.
>
> In other words, the burden is on you to present a scenario where this
> optimization would result in a measurable improvement in a real world
> work load.

I have already done this.  You seem to have chosen to ignore my
statements and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of
the various bottlenecks in comparison to a native host.  Throughput was
already pretty decent, but latency was pretty bad (and recently got
*really* bad, but I know you already have a handle on what's causing
that).  I digress...one of the conclusions of the research was that I
wanted to focus on building an IO subsystem designed to minimize the
quantity of exits, minimize the cost of each exit, and shorten the
end-to-end signaling path to achieve optimal performance.  I also wanted
to build a system that was extensible enough to work with a variety of
client types, on a variety of architectures, etc, so we would only need
to solve these problems "once".  The end result was vbus, and the first
working example was venet.  The measured performance data of this work
was as follows:

802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
with Chelsio T3 10GE via crossover.

Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details:  http://lkml.org/lkml/2009/4/21/408

You can download this today and run it, review it, compare it.  Whatever
you want.

As part of that work, I measured IO performance in KVM and found HCs to
be the superior performer.  You can find these results here: 
http://developer.novell.com/wiki/index.php/WhyHypercalls.  Without
having access to platforms other than x86, but with an understanding of
computer architecture, I speculate that the difference should be even
more profound everywhere else in the absence of a PIO primitive.  And
even on the platform which should yield the least benefit (x86), the
gain (~1.4%) is not huge, but it's not zero either.  Therefore, my data
and findings suggest that this is not a bad optimization to consider IMO. 
My final results above do not indicate to me that I was completely wrong
in my analysis.

Now I know you have been quick in the past to dismiss my efforts, and to
claim you can get the same results without needing the various tricks
and optimizations I uncovered.  But quite frankly, until you post some
patches for community review and comparison (as I have done), it's just
meaningless talk.
 
Perhaps you are truly unimpressed with my results and will continue to
insist that my work, including my final results, is "virtually
meaningless".  Or perhaps you have an agenda.  You can keep working
against me and try to block anything I suggest by coming up with what
appears to be any excuse you can find, making rude replies on email
threads and snide comments on IRC, etc.  It's simply not necessary.

Alternatively, you can work _with_ me to help try to improve KVM and
Linux (e.g. I still need someone to implement a virtio-net backend, and
who knows it better than you).  The choice is yours.  But let's cut the
BS because it's counterproductive, and frankly, getting old.

Regards,
-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 13:14                         ` Gregory Haskins
@ 2009-05-11 16:35                           ` Hollis Blanchard
  2009-05-11 17:06                             ` Avi Kivity
  2009-05-11 17:31                           ` Anthony Liguori
  1 sibling, 1 reply; 115+ messages in thread
From: Hollis Blanchard @ 2009-05-11 16:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Gregory Haskins, Avi Kivity, Chris Wright,
	linux-kernel, kvm

On Mon, 2009-05-11 at 09:14 -0400, Gregory Haskins wrote:
> 
> >> for request-response, this is generally for *every* packet since you
> >> cannot exploit buffering/deferring.
> >>
> >> Can you back up your claim that PPC has no difference in performance
> >> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
> >> like instructions, but clearly there are ways to cause a trap, so
> >> presumably we can measure the difference between a PF exit and something
> >> more explicit).
> >>   
> >
> > First, the PPC that KVM supports performs very poorly relatively
> > speaking because it receives no hardware assistance
> 
> So wouldn't that be making the case that it could use as much help as
> possible?

I think he's referencing Amdahl's Law here. While I'd agree, this is
relevant only for the current KVM PowerPC implementations. I think it
would be short-sighted to design an IO architecture around that.

> >   this is not the right place to focus wrt optimizations.
> 
> Odd choice of words.  I am advocating the opposite (broad solution to
> many arches and many platforms (i.e. hypervisors)) and therefore I am
> not "focused" on it (or really any one arch) at all per se.  I am
> _worried_ however, that we could be overlooking PPC (as an example) if
> we ignore the disparity between MMIO and HC since other higher
> performance options are not available like PIO.  The goal on this
> particular thread is to come up with an IO interface that works
> reasonably well across as many hypervisors as possible.  MMIO/PIO do
> not appear to fit that bill (at least not without tunneling them over
> HCs)

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you
don't need the additional memory load (which could even cause a page
fault in some circumstances) and instruction decode. That said, I'm
willing to agree that this overhead is probably negligible compared to
the IOp itself... Amdahl's Law again.

-- 
Hollis Blanchard
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-08 19:02                             ` Anthony Liguori
  2009-05-08 19:06                               ` Avi Kivity
@ 2009-05-11 16:37                               ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 115+ messages in thread
From: Jeremy Fitzhardinge @ 2009-05-11 16:37 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Gregory Haskins, Marcelo Tosatti, Chris Wright,
	Gregory Haskins, linux-kernel, kvm

Anthony Liguori wrote:
> Yes, I misunderstood that they actually emulated it like that.  
> However, ia64 has no paravirtualization support today so surely, we 
> aren't going to be justifying this via ia64, right?
>

Someone is actively putting a pvops infrastructure into the ia64 port, 
along with a Xen port.  I think pieces of it got merged this last window.

    J

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-10 18:38                       ` Anthony Liguori
  2009-05-11 13:14                         ` Gregory Haskins
@ 2009-05-11 16:44                         ` Hollis Blanchard
  2009-05-11 17:54                           ` Anthony Liguori
  1 sibling, 1 reply; 115+ messages in thread
From: Hollis Blanchard @ 2009-05-11 16:44 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
> Gregory Haskins wrote:
> >
> > Can you back up your claim that PPC has no difference in performance
> > with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
> > like instructions, but clearly there are ways to cause a trap, so
> > presumably we can measure the difference between a PF exit and something
> > more explicit).
> 
> First, the PPC that KVM supports performs very poorly relatively 
> speaking because it receives no hardware assistance; this is not the 
> right place to focus wrt optimizations.
> 
> And because there's no hardware assistance, there simply isn't a 
> hypercall instruction.  Are PFs the fastest type of exits?  Probably not 
> but I honestly have no idea.  I'm sure Hollis does though.

Memory load from the guest context (for instruction decoding) is a
*very* poorly performing path on most PowerPC, even considering server
PowerPC with hardware virtualization support. No, I don't have any data
for you, but switching the hardware MMU contexts requires some
heavyweight synchronization instructions.

> Page faults are going to have tremendously different performance 
> characteristics on PPC too because it's a software managed TLB. There's 
> no page table lookup like there is on x86.

To clarify, software-managed TLBs are only found in embedded PowerPC.
Server and classic PowerPC use hash tables, which are a third MMU type.

-- 
Hollis Blanchard
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 16:35                           ` Hollis Blanchard
@ 2009-05-11 17:06                             ` Avi Kivity
  2009-05-11 17:29                               ` Gregory Haskins
  2009-05-11 17:51                               ` Anthony Liguori
  0 siblings, 2 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-11 17:06 UTC (permalink / raw)
  To: Hollis Blanchard
  Cc: Gregory Haskins, Anthony Liguori, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

Hollis Blanchard wrote:
> I haven't been following this conversation at all. With that in mind...
>
> AFAICS, a hypercall is clearly the higher-performing option, since you
> don't need the additional memory load (which could even cause a page
> fault in some circumstances) and instruction decode. That said, I'm
> willing to agree that this overhead is probably negligible compared to
> the IOp itself... Amdahl's Law again.
>   

It's a question of cost vs. benefit.  It's clear the benefit is low (but 
that doesn't mean it's not worth having).  The cost initially appeared 
to be very low, until the nested virtualization wrench was thrown into 
the works.  Not that nested virtualization is a reality -- even on svm 
where it is implemented it is not yet production quality and is disabled 
by default.

Now nested virtualization is beginning to look interesting, with Windows 
7's XP mode requiring virtualization extensions.  Desktop virtualization 
is also something likely to use device assignment (though you probably 
won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be 
workable if we find a way to let the guest know if it should use the 
hypercall or not for a given memory range.

mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get 
a nice speed win (or rather, less speed loss)
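
One way to picture the "preexisting semantics" and fallback points: the
hypercall would carry the same (gpa, data, len) triple that a trapped MMIO
write produces today, so the host side could simply funnel it into the
existing emulation path.  A purely illustrative sketch; HC_MMIO_WRITE and
emulate_mmio_write() are invented names, not existing KVM interfaces:

/* Illustrative only: an MMIO-over-hypercall entry point that reuses the
 * existing MMIO emulation path, so device models need no changes and
 * plain MMIO remains a working fallback.  Assumes <linux/errno.h> and
 * the usual kvm headers. */
static int handle_guest_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
				  unsigned long a0, unsigned long a1,
				  unsigned long a2)
{
	switch (nr) {
	case HC_MMIO_WRITE:
		/* a0 = guest-physical address, a1 = data, a2 = length */
		return emulate_mmio_write(vcpu, a0, a1, a2);
	default:
		return -ENOSYS;		/* unknown hypercall number */
	}
}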

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:06                             ` Avi Kivity
@ 2009-05-11 17:29                               ` Gregory Haskins
  2009-05-11 17:53                                 ` Avi Kivity
  2009-05-11 17:51                               ` Anthony Liguori
  1 sibling, 1 reply; 115+ messages in thread
From: Gregory Haskins @ 2009-05-11 17:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Hollis Blanchard, Anthony Liguori, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

[-- Attachment #1: Type: text/plain, Size: 2038 bytes --]

Avi Kivity wrote:
> Hollis Blanchard wrote:
>> I haven't been following this conversation at all. With that in mind...
>>
>> AFAICS, a hypercall is clearly the higher-performing option, since you
>> don't need the additional memory load (which could even cause a page
>> fault in some circumstances) and instruction decode. That said, I'm
>> willing to agree that this overhead is probably negligible compared to
>> the IOp itself... Amdahl's Law again.
>>   
>
> It's a question of cost vs. benefit.  It's clear the benefit is low
> (but that doesn't mean it's not worth having).  The cost initially
> appeared to be very low, until the nested virtualization wrench was
> thrown into the works.  Not that nested virtualization is a reality --
> even on svm where it is implemented it is not yet production quality
> and is disabled by default.
>
> Now nested virtualization is beginning to look interesting, with
> Windows 7's XP mode requiring virtualization extensions.  Desktop
> virtualization is also something likely to use device assignment
> (though you probably won't assign a virtio device to the XP instance
> inside Windows 7).
>
> Maybe we should revisit the mmio hypercall idea again, it might be
> workable if we find a way to let the guest know if it should use the
> hypercall or not for a given memory range.
>
> mmio hypercall is nice because
> - it falls back nicely to pure mmio
> - it optimizes an existing slow path, not just new device models
> - it has preexisting semantics, so we have less ABI to screw up
> - for nested virtualization + device assignment, we can drop it and
> get a nice speed win (or rather, less speed loss)
>
Yeah, I agree with all this.  I am still wrestling with how to deal with
the device-assignment problem w.r.t. shunting io requests into a
hypercall vs letting them PF.  Are you saying we could simply ignore
this case by disabling "MMIOoHC" when assignment is enabled?  That would
certainly make the problem much easier to solve.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 13:14                         ` Gregory Haskins
  2009-05-11 16:35                           ` Hollis Blanchard
@ 2009-05-11 17:31                           ` Anthony Liguori
  2009-05-13 10:53                             ` Gregory Haskins
  2009-05-13 14:45                             ` Gregory Haskins
  1 sibling, 2 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 17:31 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm,
	Hollis Blanchard

Gregory Haskins wrote:
> I specifically generalized my statement above because #1 I assume
> everyone here is smart enough to convert that nice round unit into the
> relevant figure.  And #2, there are multiple potential latency sources
> at play which we need to factor in when looking at the big picture.  For
> instance, the difference between a PF exit and an IO exit (2.58us on x86,
> to be precise).  Or whether you need to take a heavy-weight exit.  Or a
> context switch to qemu, then the kernel, back to qemu, and back to the
> vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO. 
> I know you wish that this whole discussion would just go away, but these
> little "300ns here, 1600ns there" really add up in aggregate despite
> your dismissive attitude towards them.  And it doesn't take much to
> affect the results in a measurable way.  As stated, each 1us costs ~4%. 
> My motivation is to reduce as many of these sources as possible.
>
> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
> improvement.  So what?  It's still an improvement.  If that improvement
> were for free, would you object?  And we all know that this change isn't
> "free" because we have to change some code (+128/-0, to be exact).  But
> what is it specifically you are objecting to in the first place?  Adding
> hypercall support as a pv_ops primitive isn't exactly hard or complex,
> or even very much code.
>   

Where does 25us come from?  The numbers you post below are 33us and 
66us.  This is part of what's frustrating me in this thread.  Things are 
way too theoretical.  Saying that "if packet latency was 25us, then it 
would be a 1.4% improvement" is close to misleading.  The numbers you've 
posted are also measuring on-box speeds.  What really matters are 
off-box latencies and that's just going to exaggerate.

IIUC, if you switched vbus to using PIO today, you would go from 66us to 
65.65us, which you'd round to 66us for on-box latencies.  Even if you 
didn't round, it's a 0.5% improvement in latency.

Adding hypercall support as a pv_ops primitive is adding a fair bit of 
complexity.  You need a hypercall fd mechanism to plumb this down to 
userspace; otherwise, you can't support migration from an in-kernel backend 
to a non in-kernel backend.  You need some way to allocate hypercalls to 
particular devices, which so far has been completely ignored.  I've 
already mentioned why hypercalls are also unfortunate from a guest 
perspective.  They require kernel patching and this is almost certainly 
going to break at least Vista as a guest.  Certainly Windows 7.

So it's not at all fair to trivialize the complexity introduced here.  
I'm simply asking for justification to introduce this complexity.  I 
don't see why this is unfair for me to ask.

>> As a more general observation, we need numbers to justify an
>> optimization, not to justify not including an optimization.
>>
>> In other words, the burden is on you to present a scenario where this
>> optimization would result in a measurable improvement in a real world
>> work load.
>>     
>
> I have already done this.  You seem to have chosen to ignore my
> statements and results, but if you insist on rehashing:
>
> I started this project by analyzing system traces and finding some of
> the various bottlenecks in comparison to a native host.  Throughput was
> already pretty decent, but latency was pretty bad (and recently got
> *really* bad, but I know you already have a handle on whats causing
> that).  I digress...one of the conclusions of the research was that  I
> wanted to focus on building an IO subsystem designed to minimize the
> quantity of exits, minimize the cost of each exit, and shorten the
> end-to-end signaling path to achieve optimal performance.  I also wanted
> to build a system that was extensible enough to work with a variety of
> client types, on a variety of architectures, etc, so we would only need
> to solve these problems "once".  The end result was vbus, and the first
> working example was venet.  The measured performance data of this work
> was as follows:
>
> 802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
> with Chelsio T3 10GE via crossover.
>
> Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
> Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
> Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)
>
> For more details:  http://lkml.org/lkml/2009/4/21/408
>   

Sending out a massive infrastructure change that does things wildly 
differently from how they're done today without any indication of why 
those changes were necessary is disruptive.

If you could characterize all of the changes that vbus makes that are 
different from virtio, demonstrating at each stage why the change 
mattered and what benefit it brought, then we'd be having a completely 
different discussion.  I have no problem throwing away virtio today if 
there's something else better.

That's not what you've done though.  You wrote a bunch of code without 
understanding why virtio does things the way it does and then dropped it 
all on the list.  This isn't necessarily a bad exercise, but there's a 
ton of work necessary to determine which things vbus does differently 
actually matter.  I'm not saying that you shouldn't have done vbus, but 
I'm saying there's a bunch of analysis work that you haven't done that 
needs to be done before we start making any changes in upstream code.

I've been trying to argue why I don't think hypercalls are an important 
part of vbus from a performance perspective.   I've tried to demonstrate 
why I don't think this is an important part of vbus.  The frustration I 
have with this series is that you seem unwilling to compromise any 
aspect of vbus design.  I understand you've made your decisions in vbus 
for your own reasons and you think the way you've done things is better, but 
that's not enough.  We have virtio today, it provides greater 
functionality than vbus does, it supports multiple guest types, and it's 
gotten quite a lot of testing.  It has its warts, but most things that 
have been around for some time do.

> Now I know you have been quick in the past to dismiss my efforts, and to
> claim you can get the same results without needing the various tricks
> and optimizations I uncovered.  But quite frankly, until you post some
> patches for community review and comparison (as I have done), it's just
> meaningless talk.

I can just as easily say that until you post a full series that covers 
all of the functionality that virtio has today, vbus is just meaningless 
talk.  But I'm trying not to be dismissive in all of this because I do 
want to see you contribute to the KVM paravirtual IO infrastructure.  
Clearly, you have useful ideas.

We can't just go rewriting things without a clear understanding of why 
something's better.  What's missing is a detailed analysis of what 
virtio-net does today and what vbus does so that it's possible to draw 
some conclusions.

For instance, this could look like:

For a single packet delivery:

150ns are spent from PIO operation
320ns are spent in heavy-weight exit handler
150ns are spent transitioning to userspace
5us are spent contending on qemu_mutex
30us are spent copying data in tun/tap driver
40us are spent waiting for RX
...

For vbus, it would look like:

130ns are spent from HC instruction
100ns are spent signaling TX thread
...

But single packet delivery is just one part of the puzzle.  Bulk 
transfers are also important.  CPU consumption is important.  How we 
address things like live migration, non-privileged user initialization, 
and userspace plumbing are all also important.

Right now, the whole discussion around this series is wildly speculative 
and quite frankly, counterproductive.  A few RTT benchmarks are not 
sufficient to make any kind of forward progress here.  I certainly like 
rewriting things as much as anyone else, but you need a substantial 
amount of justification for it that so far hasn't been presented.

Do you understand what my concerns are and why I don't want to just 
switch to a new large infrastructure?

Do you feel like you understand what sort of data I'm looking for to 
justify the changes vbus is proposing to make?  Is this something you're 
willing to do?  Because IMHO this is a prerequisite for any sort of merge 
consideration.  The analysis of the virtio-net side of things is just as 
important as the vbus side of things.

I've tried to explain this to you a number of times now and so far it 
doesn't seem like I've been successful.  If it isn't clear, please let 
me know.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:06                             ` Avi Kivity
  2009-05-11 17:29                               ` Gregory Haskins
@ 2009-05-11 17:51                               ` Anthony Liguori
  2009-05-11 18:02                                 ` Avi Kivity
  1 sibling, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 17:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

Avi Kivity wrote:
> Hollis Blanchard wrote:
>> I haven't been following this conversation at all. With that in mind...
>>
>> AFAICS, a hypercall is clearly the higher-performing option, since you
>> don't need the additional memory load (which could even cause a page
>> fault in some circumstances) and instruction decode. That said, I'm
>> willing to agree that this overhead is probably negligible compared to
>> the IOp itself... Amdahl's Law again.
>>   
>
> It's a question of cost vs. benefit.  It's clear the benefit is low 
> (but that doesn't mean it's not worth having).  The cost initially 
> appeared to be very low, until the nested virtualization wrench was 
> thrown into the works.  Not that nested virtualization is a reality -- 
> even on svm where it is implemented it is not yet production quality 
> and is disabled by default.
>
> Now nested virtualization is beginning to look interesting, with 
> Windows 7's XP mode requiring virtualization extensions.  Desktop 
> virtualization is also something likely to use device assignment 
> (though you probably won't assign a virtio device to the XP instance 
> inside Windows 7).
>
> Maybe we should revisit the mmio hypercall idea again, it might be 
> workable if we find a way to let the guest know if it should use the 
> hypercall or not for a given memory range.
>
> mmio hypercall is nice because
> - it falls back nicely to pure mmio
> - it optimizes an existing slow path, not just new device models
> - it has preexisting semantics, so we have less ABI to screw up
> - for nested virtualization + device assignment, we can drop it and 
> get a nice speed win (or rather, less speed loss)

If it's a PCI device, then we can also have an interrupt which we 
currently lack with vmcall-based hypercalls.  This would give us 
guestcalls, upcalls, or whatever we've previously decided to call them.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:29                               ` Gregory Haskins
@ 2009-05-11 17:53                                 ` Avi Kivity
  0 siblings, 0 replies; 115+ messages in thread
From: Avi Kivity @ 2009-05-11 17:53 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Hollis Blanchard, Anthony Liguori, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Hollis Blanchard wrote:
>>     
>>> I haven't been following this conversation at all. With that in mind...
>>>
>>> AFAICS, a hypercall is clearly the higher-performing option, since you
>>> don't need the additional memory load (which could even cause a page
>>> fault in some circumstances) and instruction decode. That said, I'm
>>> willing to agree that this overhead is probably negligible compared to
>>> the IOp itself... Amdahl's Law again.
>>>   
>>>       
>> It's a question of cost vs. benefit.  It's clear the benefit is low
>> (but that doesn't mean it's not worth having).  The cost initially
>> appeared to be very low, until the nested virtualization wrench was
>> thrown into the works.  Not that nested virtualization is a reality --
>> even on svm where it is implemented it is not yet production quality
>> and is disabled by default.
>>
>> Now nested virtualization is beginning to look interesting, with
>> Windows 7's XP mode requiring virtualization extensions.  Desktop
>> virtualization is also something likely to use device assignment
>> (though you probably won't assign a virtio device to the XP instance
>> inside Windows 7).
>>
>> Maybe we should revisit the mmio hypercall idea again, it might be
>> workable if we find a way to let the guest know if it should use the
>> hypercall or not for a given memory range.
>>
>> mmio hypercall is nice because
>> - it falls back nicely to pure mmio
>> - it optimizes an existing slow path, not just new device models
>> - it has preexisting semantics, so we have less ABI to screw up
>> - for nested virtualization + device assignment, we can drop it and
>> get a nice speed win (or rather, less speed loss)
>>
>>     
> Yeah, I agree with all this.  I am still wrestling with how to deal with
> the device-assignment problem w.r.t. shunting io requests into a
> hypercall vs letting them PF.  Are you saying we could simply ignore
> this case by disabling "MMIOoHC" when assignment is enabled?  That would
> certainly make the problem much easier to solve.
>   

No, we need to deal with hotplug.  Something like IO_COND that Chris 
mentioned, but how to avoid turning this into a doctoral thesis.

(On the other hand, device assignment requires the iommu, and I think 
you have to specify that up front?)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 16:44                         ` Hollis Blanchard
@ 2009-05-11 17:54                           ` Anthony Liguori
  2009-05-11 19:24                             ` PowerPC page faults Hollis Blanchard
  0 siblings, 1 reply; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 17:54 UTC (permalink / raw)
  To: Hollis Blanchard
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

Hollis Blanchard wrote:
> On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Can you back up your claim that PPC has no difference in performance
>>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
>>> like instructions, but clearly there are ways to cause a trap, so
>>> presumably we can measure the difference between a PF exit and something
>>> more explicit).
>>>       
>> First, the PPC that KVM supports performs very poorly relatively 
>> speaking because it receives no hardware assistance; this is not the 
>> right place to focus wrt optimizations.
>>
>> And because there's no hardware assistance, there simply isn't a 
>> hypercall instruction.  Are PFs the fastest type of exits?  Probably not 
>> but I honestly have no idea.  I'm sure Hollis does though.
>>     
>
> Memory load from the guest context (for instruction decoding) is a
> *very* poorly performing path on most PowerPC, even considering server
> PowerPC with hardware virtualization support. No, I don't have any data
> for you, but switching the hardware MMU contexts requires some
> heavyweight synchronization instructions.
>   

For current ppcemb, you would have to do a memory load no matter what, 
right?  I guess you could have a dedicated interrupt vector or something...

For future ppcemb's, do you know if there is an equivalent of a PF exit 
type?  Does the hardware squirrel away the faulting address somewhere 
and set PC to the start of the instruction?  If so, no guest memory load 
should be required.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:51                               ` Anthony Liguori
@ 2009-05-11 18:02                                 ` Avi Kivity
  2009-05-11 18:18                                   ` Anthony Liguori
  0 siblings, 1 reply; 115+ messages in thread
From: Avi Kivity @ 2009-05-11 18:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

Anthony Liguori wrote: 
>>
>> It's a question of cost vs. benefit.  It's clear the benefit is low 
>> (but that doesn't mean it's not worth having).  The cost initially 
>> appeared to be very low, until the nested virtualization wrench was 
>> thrown into the works.  Not that nested virtualization is a reality 
>> -- even on svm where it is implemented it is not yet production 
>> quality and is disabled by default.
>>
>> Now nested virtualization is beginning to look interesting, with 
>> Windows 7's XP mode requiring virtualization extensions.  Desktop 
>> virtualization is also something likely to use device assignment 
>> (though you probably won't assign a virtio device to the XP instance 
>> inside Windows 7).
>>
>> Maybe we should revisit the mmio hypercall idea again, it might be 
>> workable if we find a way to let the guest know if it should use the 
>> hypercall or not for a given memory range.
>>
>> mmio hypercall is nice because
>> - it falls back nicely to pure mmio
>> - it optimizes an existing slow path, not just new device models
>> - it has preexisting semantics, so we have less ABI to screw up
>> - for nested virtualization + device assignment, we can drop it and 
>> get a nice speed win (or rather, less speed loss)
>
> If it's a PCI device, then we can also have an interrupt which we 
> currently lack with vmcall-based hypercalls.  This would give us 
> guestcalls, upcalls, or whatever we've previously decided to call them.

Sorry, I totally failed to understand this.  Please explain.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 18:02                                 ` Avi Kivity
@ 2009-05-11 18:18                                   ` Anthony Liguori
  0 siblings, 0 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Hollis Blanchard, Gregory Haskins, Gregory Haskins, Chris Wright,
	linux-kernel, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>>
>>> It's a question of cost vs. benefit.  It's clear the benefit is low 
>>> (but that doesn't mean it's not worth having).  The cost initially 
>>> appeared to be very low, until the nested virtualization wrench was 
>>> thrown into the works.  Not that nested virtualization is a reality 
>>> -- even on svm where it is implemented it is not yet production 
>>> quality and is disabled by default.
>>>
>>> Now nested virtualization is beginning to look interesting, with 
>>> Windows 7's XP mode requiring virtualization extensions.  Desktop 
>>> virtualization is also something likely to use device assignment 
>>> (though you probably won't assign a virtio device to the XP instance 
>>> inside Windows 7).
>>>
>>> Maybe we should revisit the mmio hypercall idea again, it might be 
>>> workable if we find a way to let the guest know if it should use the 
>>> hypercall or not for a given memory range.
>>>
>>> mmio hypercall is nice because
>>> - it falls back nicely to pure mmio
>>> - it optimizes an existing slow path, not just new device models
>>> - it has preexisting semantics, so we have less ABI to screw up
>>> - for nested virtualization + device assignment, we can drop it and 
>>> get a nice speed win (or rather, less speed loss)
>>
>> If it's a PCI device, then we can also have an interrupt which we 
>> currently lack with vmcall-based hypercalls.  This would give us 
>> guestcalls, upcalls, or whatever we've previously decided to call them.
>
> Sorry, I totally failed to understand this.  Please explain.

I totally missed what you meant by MMIO hypercall.

In what cases do you think MMIO hypercall would result in a net benefit?

I think the difference in MMIO vs hcall will be overshadowed by the 
heavy-weight transition to userspace.  The only thing I can think of 
where it may matter is for in-kernel devices like the APIC but that's a 
totally different path in Linux.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: PowerPC page faults
  2009-05-11 17:54                           ` Anthony Liguori
@ 2009-05-11 19:24                             ` Hollis Blanchard
  2009-05-11 22:17                               ` Anthony Liguori
  0 siblings, 1 reply; 115+ messages in thread
From: Hollis Blanchard @ 2009-05-11 19:24 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote:
> For future ppcemb's, do you know if there is an equivalent of a PF exit 
> type?  Does the hardware squirrel away the faulting address somewhere 
> and set PC to the start of the instruction?  If so, no guest memory load 
> should be required.

Ahhh... you're saying that the address itself (or offset within a page)
is the hypercall token, totally separate from IO emulation, and so we
could ignore the access size. I guess it looks like this:

page fault vector:
        if (faulting_address & PAGE_MASK) == vcpu->hcall_page
                handle_hcall(faulting_address & ~PAGE_MASK)
        else
                if (faulting_address is IO)
                        emulate_io(faulting_address)
                else
                        handle_pagefault(faulting_address)

Testing for hypercalls in the page fault handler path would add some
overhead, and on processors with software-managed TLBs, the page fault
path is *very* hot. Implementing the above pseudocode wouldn't be ideal,
especially because Power processors with hardware virtualization support
have a separate vector for hypercalls. However, I suspect it wouldn't be
a show-stopper from a performance point of view.
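
Rendered as C for concreteness (purely illustrative; the helpers and the
vcpu->hcall_page field follow the pseudocode above and are not existing
kvm/powerpc code):

/* Illustrative C version of the pseudocode above.  All names
 * (vcpu->hcall_page, handle_hcall(), address_is_io(), ...) are
 * invented for this sketch. */
static void kvmppc_handle_pagefault(struct kvm_vcpu *vcpu,
				    unsigned long faulting_address)
{
	if ((faulting_address & PAGE_MASK) == vcpu->hcall_page)
		/* the offset within the magic page selects the hypercall */
		handle_hcall(vcpu, faulting_address & ~PAGE_MASK);
	else if (address_is_io(vcpu, faulting_address))
		emulate_io(vcpu, faulting_address);	/* normal MMIO emulation */
	else
		handle_pagefault(vcpu, faulting_address); /* ordinary guest fault */
}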

Note that other Power virtualization solutions (hypervisors from IBM,
Sony, and Toshiba) use the dedicated hypercall instruction and interrupt
vector, which after all is how the hardware was designed. To my
knowledge, they also don't do IO emulation, so they avoid both
conditionals in the above pseudocode.

-- 
Hollis Blanchard
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: PowerPC page faults
  2009-05-11 19:24                             ` PowerPC page faults Hollis Blanchard
@ 2009-05-11 22:17                               ` Anthony Liguori
  2009-05-12  5:46                                 ` Liu Yu-B13201
  2009-05-12 14:50                                 ` Hollis Blanchard
  0 siblings, 2 replies; 115+ messages in thread
From: Anthony Liguori @ 2009-05-11 22:17 UTC (permalink / raw)
  To: Hollis Blanchard
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

Hollis Blanchard wrote:
> On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote:
>   
>> For future ppcemb's, do you know if there is an equivalent of a PF exit 
>> type?  Does the hardware squirrel away the faulting address somewhere 
>> and set PC to the start of the instruction?  If so, no guest memory load 
>> should be required.
>>     
>
> Ahhh... you're saying that the address itself (or offset within a page)
> is the hypercall token, totally separate from IO emulation, and so we
> could ignore the access size.

No, I'm not being nearly that clever.

I was suggesting that hardware virtualization support in future PPC 
systems might contain a mechanism to intercept a guest-mode TLB miss.  
If it did, it would be useful if that guest-mode TLB miss "exit" 
contained extra information somewhere that included the PC of the 
faulting instruction, the address responsible for the fault, and enough 
information to handle the fault without instruction decoding.

I assume all MMIO comes from the same set of instructions in PPC?  
Something like ld/st instructions?  Presumably all you need to know from 
instruction decoding is the destination register and whether it was a 
read or write?

Or am I oversimplifying?  Does the current hardware spec contain this 
sort of assistance?

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 115+ messages in thread

* RE: PowerPC page faults
  2009-05-11 22:17                               ` Anthony Liguori
@ 2009-05-12  5:46                                 ` Liu Yu-B13201
  2009-05-12 14:50                                 ` Hollis Blanchard
  1 sibling, 0 replies; 115+ messages in thread
From: Liu Yu-B13201 @ 2009-05-12  5:46 UTC (permalink / raw)
  To: Anthony Liguori, Hollis Blanchard
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

 

> -----Original Message-----
> From: kvm-owner@vger.kernel.org 
> [mailto:kvm-owner@vger.kernel.org] On Behalf Of Anthony Liguori
> Sent: Tuesday, May 12, 2009 6:18 AM
> To: Hollis Blanchard
> Cc: Gregory Haskins; Avi Kivity; Chris Wright; Gregory 
> Haskins; linux-kernel@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: PowerPC page faults
> 
> Hollis Blanchard wrote:
> > On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote:
> >   
> >> For future ppcemb's, do you know if there is an equivalent of a PF exit 
> >> type?  Does the hardware squirrel away the faulting address somewhere 
> >> and set PC to the start of the instruction?  If so, no guest memory load 
> >> should be required.
> >>     
> >
> > Ahhh... you're saying that the address itself (or offset within a page)
> > is the hypercall token, totally separate from IO emulation, and so we
> > could ignore the access size.
> 
> No, I'm not being nearly that clever.
> 
> I was suggesting that hardware virtualization support in future PPC 
> systems might contain a mechanism to intercept a guest-mode TLB miss.  
> If it did, it would be useful if that guest-mode TLB miss "exit" 
> contained extra information somewhere that included the PC of the 
> faulting instruction, the address responsible for the fault, and enough 
> information to handle the fault without instruction decoding.

I don't think this can come true with a software-managed TLB, since the
guest will never know the contents of its TLB without the help of software.

> 
> I assume all MMIO comes from the same set of instructions in PPC?  
> Something like ld/st instructions?  Presumably all you need to know from 
> instruction decoding is the destination register and whether it was a 
> read or write?

Yes, but currently there is also a search of the guest TLB to see whether it is a guest TLB miss.
That is a big overhead, especially for a large TLB that is not organized by something like an rbtree.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: PowerPC page faults
  2009-05-11 22:17                               ` Anthony Liguori
  2009-05-12  5:46                                 ` Liu Yu-B13201
@ 2009-05-12 14:50                                 ` Hollis Blanchard
  1 sibling, 0 replies; 115+ messages in thread
From: Hollis Blanchard @ 2009-05-12 14:50 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, Gregory Haskins,
	linux-kernel, kvm

On Monday 11 May 2009 17:17:53 Anthony Liguori wrote:
> Hollis Blanchard wrote:
> > On Mon, 2009-05-11 at 12:54 -0500, Anthony Liguori wrote:
> >   
> >> For future ppcemb's, do you know if there is an equivalent of a PF exit 
> >> type?  Does the hardware squirrel away the faulting address somewhere 
> >> and set PC to the start of the instruction?  If so, no guest memory load 
> >> should be required.
> >>     
> >
> > Ahhh... you're saying that the address itself (or offset within a page)
> > is the hypercall token, totally separate from IO emulation, and so we
> > could ignore the access size.
> 
> No, I'm not being nearly that clever.

I started thinking you weren't, but then I realized you must be working 
several levels above me and just forgot to explain... :)

> I was suggesting that hardware virtualization support in future PPC 
> systems might contain a mechanism to intercept a guest-mode TLB miss.  
> If it did, it would be useful if that guest-mode TLB miss "exit" 
> contained extra information somewhere that included the PC of the 
> faulting instruction, the address responsible for the fault, and enough 
> information to handle the fault without instruction decoding.
> 
> I assume all MMIO comes from the same set of instructions in PPC?  
> Something like ld/st instructions?  Presumably all you need to know from 
> instruction decoding is the destination register and whether it was a 
> read or write?

In addition to register source/target, we also need to know the access size of 
the memory reference. That information isn't stuffed into registers for us by 
hardware, and it's not in published specifications for future hardware.

Now, if you wanted to define a hypercall as a byte access within a particular 
4K page, where the register source/target is ignored, that could be 
interesting, but I don't know if that's relevant to this "hypercall vs MMIO" 
discussion.

-- 
Hollis Blanchard
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:31                           ` Anthony Liguori
@ 2009-05-13 10:53                             ` Gregory Haskins
  2009-05-13 14:45                             ` Gregory Haskins
  1 sibling, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-13 10:53 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Avi Kivity, Chris Wright, linux-kernel, kvm,
	Hollis Blanchard

[-- Attachment #1: Type: text/plain, Size: 1285 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>>
>> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
>> improvement.  So what?  It's still an improvement.  If that improvement
>> were for free, would you object?  And we all know that this change isn't
>> "free" because we have to change some code (+128/-0, to be exact).  But
>> what is it specifically you are objecting to in the first place?  Adding
>> hypercall support as a pv_ops primitive isn't exactly hard or complex,
>> or even very much code.
>>   
>
> Where does 25us come from?  The numbers you post below are 33us and 66us.

<snip>

The 25us is approximately the max from an in-kernel harness strapped
directly to the driver gathered informally during testing.  The 33us is
from formally averaging multiple runs of a userspace socket app in
preparation for publishing.  I consider the 25us the "target goal" since
there is obviously overhead that a socket application deals with that
theoretically a guest bypasses with the tap-device.  Note that the
socket application itself often sees < 30us itself...this was just a
particularly "slow" set of runs that day.

Note that this is why I express the impact as "approximately" (e.g.
"~4%").  Sorry for the confusion.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [RFC PATCH 0/3] generic hypercall support
  2009-05-11 17:31                           ` Anthony Liguori
  2009-05-13 10:53                             ` Gregory Haskins
@ 2009-05-13 14:45                             ` Gregory Haskins
  1 sibling, 0 replies; 115+ messages in thread
From: Gregory Haskins @ 2009-05-13 14:45 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Chris Wright, linux-kernel, kvm, Hollis Blanchard

[-- Attachment #1: Type: text/plain, Size: 17947 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> I specifically generalized my statement above because #1 I assume
>> everyone here is smart enough to convert that nice round unit into the
>> relevant figure.  And #2, there are multiple potential latency sources
>> at play which we need to factor in when looking at the big picture.  For
> instance, the difference between a PF exit and an IO exit (2.58us on x86,
> to be precise).  Or whether you need to take a heavy-weight exit.  Or a
> context switch to qemu, then the kernel, back to qemu, and back to the
> vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's
>> IO. I know you wish that this whole discussion would just go away,
>> but these
>> little "300ns here, 1600ns there" really add up in aggregate despite
>> your dismissive attitude towards them.  And it doesn't take much to
>> affect the results in a measurable way.  As stated, each 1us costs
>> ~4%. My motivation is to reduce as many of these sources as possible.
>>
>> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
>> improvement.  So what?  It's still an improvement.  If that improvement
>> were for free, would you object?  And we all know that this change isn't
>> "free" because we have to change some code (+128/-0, to be exact).  But
>> what is it specifically you are objecting to in the first place?  Adding
>> hypercall support as an pv_ops primitive isn't exactly hard or complex,
>> or even very much code.
>>   
>
> Where does 25us come from?  The number you post below are 33us and
> 66us.  This is part of what's frustrating me in this thread.  Things
> are way too theoretical.  Saying that "if packet latency was 25us,
> then it would be a 1.4% improvement" is close to misleading.
[ answered in the last reply ]

> The numbers you've posted are also measuring on-box speeds.  What
> really matters are off-box latencies and that's just going to exaggerate.

I'm not 100% clear on what you mean by on-box vs off-box.  These
figures were gathered between two real machines connected via 10GE
cross-over cable.  The 5.8Gb/s and 33us (25us) values were gathered
sending real data between these hosts.  This sounds "off-box" to me, but
I am not sure I truly understand your assertion.

>
>
> IIUC, if you switched vbus to using PIO today, you would go from 66us
> to to 65.65, which you'd round to 66us for on-box latencies.  Even if
> you didn't round, it's a 0.5% improvement in latency.

I think part of what you are missing is that in order to create vbus, I
needed to _create_ an in-kernel hook from scratch since there were no
existing methods. Since I measured HC to be superior in performance (if
by only a little), I wasn't going to chose the slower way if there
wasn't a reason, and at the time I didn't see one.  Now after community
review, perhaps we do have a reason, but that is the point of the review
process.  So now we can push something like iofd as a PIO hook instead. 
But either way, something needed to be created.
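
For concreteness, the sort of in-kernel hook I'm talking about looks
roughly like the sketch below.  This is illustrative only (it is not the
iofd patch I posted, and the structure/function names are hypothetical),
but it shows the idea: bind an eventfd to a guest IO address so that a
matching exit terminates in the kernel instead of bouncing out to
userspace.

#include <linux/types.h>
#include <linux/eventfd.h>
#include <linux/errno.h>

/* Hypothetical binding of one guest IO address to an in-kernel backend. */
struct io_binding {
	u64                 addr;   /* PIO port (or MMIO gpa) to match */
	int                 len;    /* access width to match           */
	struct eventfd_ctx *ctx;    /* signalled on a matching write   */
};

/*
 * Hypothetical hook called from the IO-exit path before punting to
 * userspace: if the access matches a binding, signal the backend and
 * complete the exit entirely in the kernel.
 */
static int io_binding_write(struct io_binding *b, u64 addr, int len)
{
	if (addr != b->addr || len != b->len)
		return -ENOENT;     /* no match: take the normal exit path */

	eventfd_signal(b->ctx, 1);  /* wake the in-kernel device backend   */
	return 0;                   /* handled without leaving the kernel  */
}

Whether the trigger is a PIO write, an MMIO write, or a hypercall vector
only changes the matching key; the in-kernel termination part is the
same either way.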

>
>
> Adding hypercall support as a pv_ops primitive is adding a fair bit of
> complexity.  You need a hypercall fd mechanism to plumb this down to
> userspace otherwise, you can't support migration from in-kernel
> backend to non in-kernel backend.

I respectfully disagree.  This is orthogonal to the simple issue of the
IO type for the exit.  Where you *do* have a point is that the bigger
benefit comes from in-kernel termination (like the iofd stuff I posted
yesterday).  However, in-kernel termination is not strictly necessary to
get some reduction in IO latency overhead.  In either case we know we
can shave about 2.56us off an MMIO.

I formally measured MMIO rtt to userspace yesterday, so we now know
that we can do qemu-mmio at about 110k IOPS (9.09us rtt).  Switching to
pv_io_ops->mmio() alone would be a boost to approximately 153k IOPS
(6.53us rtt).  This would be a tangible benefit to all models, without
any hypercall plumbing screwing up migration.  Therefore I still stand
by the assertion that the hypercall discussion alone doesn't add very
much complexity.
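
To be explicit about what a pv_io_ops->mmio() indirection could look
like, here is a sketch in the spirit of the pv_cpu_ops.hypercall patch
in this series.  None of this is in the posted patches; the names are
hypothetical.

#include <linux/io.h>
#include <linux/types.h>

/* Hypothetical paravirt indirection for MMIO writes. */
struct pv_io_ops {
	void (*mmio_write32)(u32 val, void __iomem *addr);
};

/*
 * Native default: an ordinary MMIO store, which in a guest will fault
 * to the hypervisor and get decoded the slow way.
 */
static void native_mmio_write32(u32 val, void __iomem *addr)
{
	writel(val, addr);
}

struct pv_io_ops pv_io_ops = {
	.mmio_write32 = native_mmio_write32,
};

/*
 * A KVM guest could override .mmio_write32 with a variant that hands
 * the (address, value) pair to the hypervisor via a single lightweight
 * exit, avoiding the page-fault + decode path entirely.
 */

(For reference, the IOPS figures are just the reciprocal of the rtt:
1/9.09us is roughly 110k, and 1/(9.09us - 2.56us) = 1/6.53us is roughly
153k.)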

>   You need some way to allocate hypercalls to particular devices which
> so far, has been completely ignored.

I'm sorry, but that's not true.  Vbus already handles this mapping.
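
Just to illustrate one way such a mapping can look (a sketch of the
concept only; this is not the actual vbus code), each device could be
handed a vector when it is set up, and the exit handler would simply
dispatch on it:

#include <linux/errno.h>

#define MAX_IO_VECTORS 256

/* Hypothetical per-vector binding, purely for illustration. */
struct io_vector {
	int  (*call)(void *priv, unsigned long arg);  /* device entry point */
	void  *priv;                                  /* device instance    */
};

static struct io_vector vectors[MAX_IO_VECTORS];

/* Dispatch a hypercall exit to whichever device owns the vector. */
static int dispatch_vector(unsigned long nr, unsigned long arg)
{
	struct io_vector *v;

	if (nr >= MAX_IO_VECTORS)
		return -EINVAL;

	v = &vectors[nr];
	if (!v->call)
		return -ENOENT;     /* nothing bound to this vector */

	return v->call(v->priv, arg);
}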


>   I've already mentioned why hypercalls are also unfortunate from a
> guest perspective.  They require kernel patching and this is almost
> certainly going to break at least Vista as a guest.  Certainly Windows 7.

Yes, you have a point here.

>
> So it's not at all fair to trivialize the complexity introduce here. 
> I'm simply asking for justification to introduce this complexity.  I
> don't see why this is unfair for me to ask.

In summary, I don't think there is really much complexity being added,
because this stuff doesn't depend on the hypercallfd (iofd) interface
in order to have some benefit, despite what you assert above.  The
hypercall page is a good point regarding attestation, but that issue
already exists today and is not newly created by this proposal.

>
>>> As a more general observation, we need numbers to justify an
>>> optimization, not to justify not including an optimization.
>>>
>>> In other words, the burden is on you to present a scenario where this
>>> optimization would result in a measurable improvement in a real world
>>> work load.
>>>     
>>
>> I have already done this.  You seem to have chosen to ignore my
>> statements and results, but if you insist on rehashing:
>>
>> I started this project by analyzing system traces and finding some of
>> the various bottlenecks in comparison to a native host.  Throughput was
>> already pretty decent, but latency was pretty bad (and recently got
>> *really* bad, but I know you already have a handle on whats causing
>> that).  I digress...one of the conclusions of the research was that  I
>> wanted to focus on building an IO subsystem designed to minimize the
>> quantity of exits, minimize the cost of each exit, and shorten the
>> end-to-end signaling path to achieve optimal performance.  I also wanted
>> to build a system that was extensible enough to work with a variety of
>> client types, on a variety of architectures, etc, so we would only need
>> to solve these problems "once".  The end result was vbus, and the first
>> working example was venet.  The measured performance data of this work
>> was as follows:
>>
>> 802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
>> with Chelsio T3 10GE via crossover.
>>
>> Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
>> Virtio-net (PCI) : tput = 4578Mb/s, round-trip = 249pps   (4016us rtt)
>> Venet      (VBUS): tput = 5802Mb/s, round-trip = 15127pps (66us rtt)
>>
>> For more details:  http://lkml.org/lkml/2009/4/21/408
>>   
>
> Sending out a massive infrastructure change that does things wildly
> differently from how they're done today without any indication of why
> those changes were necessary is disruptive.

Well, this is an unfortunate situation.  I had to build my ideas before
I could even know if they worked and therefore if they were worth
discussing.  So we are where we are because of that fact.  Anyway, this
is somewhat irrelevant here because the topic at hand in this thread is
purely the hypercall interface.  I only mentioned the backstory of vbus
so we are focused/aware of the end-goal.

>
> If you could characterize all of the changes that vbus makes that are
> different from virtio, demonstrating at each stage why the change
> mattered and what benefit it brought, then we'd be having a completely
> different discussion.

Splitting this up into small changes is already work in progress.
That's what we are discussing in this thread, after all.

>   I have no problem throwing away virtio today if there's something
> else better.

I am not advocating this.  Virtio can (and does, though it needs testing
and backends written) run happily on top of vbus.  I am not even
advocating getting rid of the userspace virtio models.  They have a
place and could simply be run-time switched on or off, depending on the
user configuration.

>
> That's not what you've done though.  You wrote a bunch of code without
> understanding why virtio does things the way it does and then dropped
> it all on the list.

False.  I studied the system for the bottlenecks before I began.

>   This isn't necessarily a bad exercise, but there's a ton of work
> necessary to determine which things vbus does differently actually
> matter.  I'm not saying that you shouldn't have done vbus, but I'm
> saying there's a bunch of analysis work that you haven't done that
> needs to be done before we start making any changes in upstream code.
>
> I've been trying to argue why I don't think hypercalls are an
> important part of vbus from a performance perspective.   I've tried to
> demonstrate why I don't think this is an important part of vbus.

Yes, let's just drop them.  I will use PIO and not worry about non-x86.
I generally don't like to ignore issues like this where at least
conjecture indicates I should try to think ahead.  However, it's not
worth this headache to continue the conversation.  If anything, we can
do the IOoHC idea, but let's just backburner this whole HC thing for now.

>   The frustration I have with this series is that you seem unwilling
> to compromise any aspect of vbus design.

That's an unfair assertion.  As an example, I already started migrating
to PCI even though I was pretty against it in the beginning.  I also
conceded that the 350ns is possibly a non-issue if we don't care about
non-x86.

>   I understand you've made your decisions  in vbus for some reasons
> and you think the way you've done things is better, but that's not
> enough.  We have virtio today, it provides greater functionality than
> vbus does, it supports multiple guest types, and it's gotten quite a
> lot of testing.
Yes, and I admit that is a compelling argument against changing.

> It has its warts, but most things that have been around for some time do.
>
>> Now I know you have been quick in the past to dismiss my efforts, and to
>> claim you can get the same results without needing the various tricks
>> and optimizations I uncovered.  But quite frankly, until you post some
>> patches for community review and comparison (as I have done), it's just
>> meaningless talk.
>
> I can just as easily say that until you post a full series that covers
> all of the functionality that virtio has today, vbus is just
> meaningless talk.

That's a stretch.  It's a series that offers a new way to do a
device-model that we can re-use in multiple environments, and has
performance considerations specifically for software-to-software
interaction (such as guest/hypervisor) built into it.  It will also
hopefully support some features that we don't have today, like RDMA
interfaces, but that is in the future.

To your point, it is missing some features like live-migration, but IIUC
that's how FOSS works.  Start with an idea...If others think it's good,
you polish it up together.  I don't believe KVM was 100% when it was
released either, or we wouldn't be having this conversation.

>   But I'm trying not to be dismissive in all of this because I do want
> to see you contribute to the KVM paravirtual IO infrastructure. 
> Clearly, you have useful ideas.
>
> We can't just go rewriting things without a clear understanding of why
> something's better.  What's missing is a detailed analysis of what
> virtio-net does today and what vbus does so that it's possible to draw
> some conclusions.

This is ongoing.  For one, virtio-net today hops through userspace,
which, as I've already demonstrated, has a cost (one that I, at least,
care about).

>
> For instance, this could look like:
>
> For a single packet delivery:
>
> 150ns are spent from PIO operation
> 320ns are spent in heavy-weight exit handler
> 150ns are spent transitioning to userspace
> 5us are spent contending on qemu_mutex
> 30us are spent copying data in tun/tap driver
> 40us are spent waiting for RX
> ...
>
> For vbus, it would look like:
>
> 130ns are spent from HC instruction
> 100ns are spent signaling TX thread
> ...
>
> But single packet delivery is just one part of the puzzle.  Bulk
> transfers are also important.

Both of those are covered today (and, I believe, superior)...see the
throughput numbers in my post.

>   CPU consumption is important.

Yes, this is a good point.  I need to quantify this.  I fully expect it
to be worse, since one of the design points is that we should try to use
as much parallelism as possible in the face of these multi-core boxes.

Note that this particular design decision is an attribute of venettap
specifically.  vbus itself doesn't force a threading policy.

>   How we address things like live migration
Yes, this feature is not started.  Looking for community help on that.

> , non-privileged user initialization

Nor is this.  I think I need to write some patches for configfs to get
there.


> , and userspace plumbing are all also important.

Basic userspace plumbing is complete, but I need to hook qemu into it.

>
> Right now, the whole discussion around this series is wildly
> speculative and quite frankly, counter productive.

How?  I've posted tangible patches you can test today that show a
measurable improvement. If someone wants to investigate/propose an
improvement in general, how else would you do it so it is productive and
not speculative? 

>   A few RTT benchmarks are not sufficient to make any kind of forward
> progress here.  I certainly like rewriting things as much as anyone
> else, but you need a substantial amount of justification for it that
> so far hasn't been presented.

Well, unfortunately for me the really interesting stuff hasn't been
completed yet, nor articulated very well, so I can't fault you for not
seeing the complete picture.  In fact, as of right now we don't even
know if the "really interesting" stuff will work yet.  It's still being
hashed out.  What we do know is that the basic performance elements of
the design (in-kernel, etc) seemed to have made a substantial
improvement.  And more tweaks are coming to widen the gap.

>
>
> Do you understand what my concerns are and why I don't want to just
> switch to a new large infrastructure?

Yes.  However, to reiterate: it's not a switch.  It's just another option
in the mix, to have in-kernel virtio or not.  Today you have
emulated+virtio; tomorrow you could have emulated+virtio+kern_virtio.

>
>
> Do you feel like you understand what sort of data I'm looking for to
> justify the changes vbus is proposing to make?  Is this something your
> willing to do because IMHO this is a prerequisite for any sort of
> merge consideration.  The analysis of the virtio-net side of things is
> just as important as the vbus side of things.

Work in progress.

>
>
> I've tried to explain this to you a number of times now and so far it
> doesn't seem like I've been successful.  If it isn't clear, please let
> me know.

I don't think this is accurate, and I don't really appreciate the
patronizing tone.

I've been gathering the figures for the kernel side of things, and last
we spoke, *you* were gathering all the performance data and issues in
qemu and were working on improvements.  In fact, last I heard, you had
reduced the 4000us rtt to 90us as part of that effort.  While I have yet
to see your patches, I was certainly not going to duplicate your work
here.  So to imply that I am somehow simply being obtuse w.r.t. your
request/guidance is a bit disingenuous.  That wasn't really what we
discussed from my perspective.

That said, I will certainly continue to analyze the gross path through
userspace, as I have done with "doorbell", because that is directly
relevant to my points.  Figuring out something like the details of any
qemu_mutex bottlenecks, on the other hand, is not.  I already contend
that the userspace path, at any cost, is just a waste.  Ultimately we
need to come back to the kernel anyway to do the IO, so there is no real
use in digging deeper there, IMO.  Just skip it and inject the IO
directly.

If you look, venettap is about the same complexity as tuntap.  Userspace
is just acting as a marginal shim, so the arguments about the MMU
protection in userspace, etc, are of diminished value.  That said, if
you want to submit your patches to me that improve the userspace
overhead, I will include them in my benchmarks so we can gather the most
accurate picture possible.

The big question is: does the KVM community believe in in-kernel devices
or not?  If we don't, then the kvm+vbus conversation is done and I will
just be forced to maintain the kvm-connector in my own tree, or (perhaps
your preferred outcome?) let it die.  If we *do* believe in in-kernel,
then I hope that I can convince the relevant maintainers that vbus is a
good infrastructure to provide a framework for it.  Perhaps I need to
generate more numbers before I can convince everyone of this, and that's
fine.  Everything else beyond that is just work to fill in the missing
details.

If you and others want to join me in implementing things like a
virtio-net backend and adding live-migration support, that would be
awesome.  The choice is yours.  For now, the plan that Avi and I have
tentatively laid out is to work on making KVM more extensible via some
of this plumbing (e.g. irqfd) so that these types of enhancements are
possible without requiring "pollution" (my word, though Avi would likely
concur) of the KVM core.

Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

end of thread, other threads:[~2009-05-13 14:45 UTC | newest]

Thread overview: 115+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-05 13:24 [RFC PATCH 0/3] generic hypercall support Gregory Haskins
2009-05-05 13:24 ` [RFC PATCH 1/3] add " Gregory Haskins
2009-05-05 17:03   ` Hollis Blanchard
2009-05-06 13:52   ` Anthony Liguori
2009-05-06 15:16     ` Gregory Haskins
2009-05-06 16:52       ` Arnd Bergmann
2009-05-05 13:24 ` [RFC PATCH 2/3] x86: " Gregory Haskins
2009-05-05 13:24 ` [RFC PATCH 3/3] kvm: add pv_cpu_ops.hypercall support to the guest Gregory Haskins
2009-05-05 13:36 ` [RFC PATCH 0/3] generic hypercall support Avi Kivity
2009-05-05 13:40   ` Gregory Haskins
2009-05-05 14:00     ` Avi Kivity
2009-05-05 14:14       ` Gregory Haskins
2009-05-05 14:21         ` Gregory Haskins
2009-05-05 15:02         ` Avi Kivity
2009-05-05 23:17         ` Chris Wright
2009-05-06  3:51           ` Gregory Haskins
2009-05-06  7:22             ` Chris Wright
2009-05-06 13:17               ` Gregory Haskins
2009-05-06 16:07                 ` Chris Wright
2009-05-07 17:03                   ` Gregory Haskins
2009-05-07 18:05                     ` Avi Kivity
2009-05-07 18:08                       ` Gregory Haskins
2009-05-07 18:12                         ` Avi Kivity
2009-05-07 18:16                           ` Gregory Haskins
2009-05-07 18:24                             ` Avi Kivity
2009-05-07 18:37                               ` Gregory Haskins
2009-05-07 19:00                                 ` Avi Kivity
2009-05-07 19:05                                   ` Gregory Haskins
2009-05-07 19:43                                     ` Avi Kivity
2009-05-07 20:07                                       ` Gregory Haskins
2009-05-07 20:15                                         ` Avi Kivity
2009-05-07 20:26                                           ` Gregory Haskins
2009-05-08  8:35                                             ` Avi Kivity
2009-05-08 11:29                                               ` Gregory Haskins
2009-05-07 19:07                                   ` Chris Wright
2009-05-07 19:12                                     ` Gregory Haskins
2009-05-07 19:21                                       ` Chris Wright
2009-05-07 19:26                                         ` Avi Kivity
2009-05-07 19:44                                           ` Avi Kivity
2009-05-07 19:29                                         ` Gregory Haskins
2009-05-07 20:25                                           ` Chris Wright
2009-05-07 20:34                                             ` Gregory Haskins
2009-05-07 20:54                                           ` Arnd Bergmann
2009-05-07 21:13                                             ` Gregory Haskins
2009-05-07 21:57                                               ` Chris Wright
2009-05-07 22:11                                                 ` Arnd Bergmann
2009-05-08 22:33                                                   ` Benjamin Herrenschmidt
2009-05-11 13:01                                                     ` Arnd Bergmann
2009-05-11 13:04                                                       ` Gregory Haskins
2009-05-07 20:00                             ` Arnd Bergmann
2009-05-07 20:31                               ` Gregory Haskins
2009-05-07 20:42                                 ` Arnd Bergmann
2009-05-07 20:47                                   ` Arnd Bergmann
2009-05-07 20:50                                 ` Chris Wright
2009-05-07 23:35                     ` Marcelo Tosatti
2009-05-07 23:43                       ` Marcelo Tosatti
2009-05-08  3:17                         ` Gregory Haskins
2009-05-08  7:55                         ` Avi Kivity
     [not found]                           ` <20090508103253.GC3011@amt.cnet>
2009-05-08 11:37                             ` Avi Kivity
2009-05-08 14:35                           ` Marcelo Tosatti
2009-05-08 14:45                             ` Gregory Haskins
2009-05-08 15:51                               ` Marcelo Tosatti
2009-05-08 19:56                                 ` David S. Ahern
2009-05-08 20:01                                   ` Gregory Haskins
2009-05-08 23:23                                     ` David S. Ahern
2009-05-09  8:45                                       ` Avi Kivity
2009-05-09 11:27                                         ` Gregory Haskins
2009-05-10  4:27                                           ` David S. Ahern
2009-05-10  5:24                                             ` Avi Kivity
2009-05-10  4:24                                         ` David S. Ahern
2009-05-08  3:13                       ` Gregory Haskins
2009-05-08  7:59                       ` Avi Kivity
2009-05-08 11:09                         ` Gregory Haskins
     [not found]                         ` <20090508104228.GD3011@amt.cnet>
2009-05-08 12:43                           ` Gregory Haskins
2009-05-08 15:33                             ` Marcelo Tosatti
2009-05-08 19:02                               ` Avi Kivity
2009-05-08 16:48                             ` Paul E. McKenney
2009-05-08 19:55                               ` Gregory Haskins
2009-05-08 14:15                       ` Gregory Haskins
2009-05-08 14:53                         ` Anthony Liguori
2009-05-08 18:50                           ` Avi Kivity
2009-05-08 19:02                             ` Anthony Liguori
2009-05-08 19:06                               ` Avi Kivity
2009-05-11 16:37                               ` Jeremy Fitzhardinge
2009-05-07 12:29                 ` Avi Kivity
2009-05-08 14:59                   ` Anthony Liguori
2009-05-09 12:01                     ` Gregory Haskins
2009-05-10 18:38                       ` Anthony Liguori
2009-05-11 13:14                         ` Gregory Haskins
2009-05-11 16:35                           ` Hollis Blanchard
2009-05-11 17:06                             ` Avi Kivity
2009-05-11 17:29                               ` Gregory Haskins
2009-05-11 17:53                                 ` Avi Kivity
2009-05-11 17:51                               ` Anthony Liguori
2009-05-11 18:02                                 ` Avi Kivity
2009-05-11 18:18                                   ` Anthony Liguori
2009-05-11 17:31                           ` Anthony Liguori
2009-05-13 10:53                             ` Gregory Haskins
2009-05-13 14:45                             ` Gregory Haskins
2009-05-11 16:44                         ` Hollis Blanchard
2009-05-11 17:54                           ` Anthony Liguori
2009-05-11 19:24                             ` PowerPC page faults Hollis Blanchard
2009-05-11 22:17                               ` Anthony Liguori
2009-05-12  5:46                                 ` Liu Yu-B13201
2009-05-12 14:50                                 ` Hollis Blanchard
2009-05-06 13:56             ` [RFC PATCH 0/3] generic hypercall support Anthony Liguori
2009-05-06 16:03               ` Gregory Haskins
2009-05-08  8:17                 ` Avi Kivity
2009-05-08 15:20                   ` Gregory Haskins
2009-05-08 17:00                     ` Avi Kivity
2009-05-08 18:55                       ` Gregory Haskins
2009-05-08 19:05                         ` Anthony Liguori
2009-05-08 19:12                         ` Avi Kivity
2009-05-08 19:59                           ` Gregory Haskins
2009-05-10  9:59                             ` Avi Kivity
