linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [announce] [patch] KVM paravirtualization for Linux
@ 2007-01-05 21:52 Ingo Molnar
  2007-01-05 22:15 ` Zachary Amsden
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Ingo Molnar @ 2007-01-05 21:52 UTC (permalink / raw)
  To: kvm-devel; +Cc: linux-kernel, Avi Kivity


i'm pleased to announce the first release of paravirtualized KVM (Linux 
under Linux), which includes support for the hardware cr3-cache feature 
of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

   http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the 
code is stable enough in my testing and i'd like to have some feedback. 

Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

 native:                       1.11
 ----------------------------------
 Qemu:                        61.18
 KVM upstream:                53.01
 KVM trunk:                    6.36
 KVM trunk+paravirt/cr3:       1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and 
is now quite close to native speed!

"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

 native:                       0.25
 ----------------------------------
 Qemu:                         7.8
 KVM upstream:                 2.8
 KVM trunk:                    0.55
 KVM paravirt/cr3:             0.36

almost twice as fast.

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

 native:                       0.9
 ----------------------------------
 Qemu:                        35.2
 KVM upstream:                 9.4
 KVM trunk:                    2.8
 KVM paravirt/cr3:             2.2

still a 30% improvement - which isnt too bad considering that 200 tasks 
are context-switching in this workload and the cr3 cache in current CPUs 
is only 4 entries.

the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux 
  guest and a Linux host. (this will be replaced with a proper
  hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in 
  Intel VMX CPUs, and extends KVM to make use of that cache. This 
  feature allows the avoidance of expensive VM exits into hypervisor 
  context. (The guest needs to be 'aware' and the cache has to be
  shared between the guest and the hypervisor. So fully emulated OSs
  wont benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO 
  port delays do not cause a VM exit anymore, the i8259A IRQ controller 
  code got simplified (this will be replaced with a proper, hypercall
  based and host-maintained IRQ controller implementation) and TLB 
  flushes are more efficient, because no cr3 reads happen which would 
  otherwise cause a VM exit. These changes have a visible effect
  already: they reduce qemu's CPU usage when a guest idles in HLT, by 
  about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
  HZ=1000)

Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
now, this too is ad-hoc) - if that is passed then the KVM guest will 
probe for paravirtualization availability on the hypervisor side - and 
will use it if found. (If the guest does not find KVM-paravirt support 
on the hypervisor side then it will continue as a fully emulated guest.)

Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
too many changes, the paravirt.c bits can be carried over to 64-bit 
almost as-is. But i didnt want to spread the code too wide.)

Comments, suggestions are welcome!

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
@ 2007-01-05 22:15 ` Zachary Amsden
  2007-01-05 22:30   ` Ingo Molnar
  2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Zachary Amsden @ 2007-01-05 22:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel, Avi Kivity

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
>   

Your code looks generally good.  I have some comments.

You can't do this, even though you want to:

-EXPORT_SYMBOL(paravirt_ops);
+EXPORT_SYMBOL_GPL(paravirt_ops);


The problem is it makes all modules GPL - or at least all modules that 
use any kind of locking, pull in the basic definitions to enable and 
disable interrupts, thus the paravirt_ops symbol, so basically all modules.

What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

But I'm not sure that is technically feasible yet.

The kvm code should probably go in kvm.c instead of paravirt.c.


Index: linux/drivers/serial/8250.c
===================================================================
--- linux.orig/drivers/serial/8250.c
+++ linux/drivers/serial/8250.c
@@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
 
 		l = l->next;
 
-		if (l == i->head && pass_counter++ > PASS_LIMIT) {
+		if (!kvm_paravirt 



Is this a bug that might happen under other virtualizations as well, not 
just kvm? Perhaps it deserves a disable feature instead of a kvm 
specific check.

Which also gets rid of the need for this unusually placed extern:

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1911,6 +1911,11 @@ static inline void set_task_cpu(struct t
 
 #endif /* CONFIG_SMP */
 
+/*
+ * Is paravirtualization active?
+ */
+extern int kvm_paravirt;

+




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 22:15 ` Zachary Amsden
@ 2007-01-05 22:30   ` Ingo Molnar
  2007-01-05 22:50     ` Zachary Amsden
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-01-05 22:30 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm-devel, linux-kernel, Avi Kivity


* Zachary Amsden <zach@vmware.com> wrote:

> What you really want is more like 
> EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

yep. Not a big issue - what is important is to put the paravirt ops into 
the read-only section so that it's somewhat harder for rootkits to 
modify. (Also, it needs to be made clear that this is fundamental, 
lowlevel system functionality written by people under the GPLv2, so that 
if you utilize it beyond its original purpose, using its internals, you 
likely create a work derived from the kernel. Something simple as irq 
disabling probably doesnt qualify, and that we exported to modules for a 
long time, but lots of other details do. So the existence of 
paravirt_ops isnt a free-for all.)

> But I'm not sure that is technically feasible yet.
> 
> The kvm code should probably go in kvm.c instead of paravirt.c.

no. This is fundamental architecture boot code, not module code. kvm.c 
should eventually go into kernel/ and arch/*/kernel, not the other way 
around.

> Index: linux/drivers/serial/8250.c
> ===================================================================
> --- linux.orig/drivers/serial/8250.c
> +++ linux/drivers/serial/8250.c
> @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
> 
> 		l = l->next;
> 
> -		if (l == i->head && pass_counter++ > PASS_LIMIT) {
> +		if (!kvm_paravirt 
> 
> Is this a bug that might happen under other virtualizations as well, 
> not just kvm? Perhaps it deserves a disable feature instead of a kvm 
> specific check.

yes - this limit is easily triggered via the KVM/Qemu virtual serial 
drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just 
a flag.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 22:30   ` Ingo Molnar
@ 2007-01-05 22:50     ` Zachary Amsden
  2007-01-05 23:28       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Zachary Amsden @ 2007-01-05 22:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel, Avi Kivity

Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>   
>> What you really want is more like 
>> EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
>>     
>
> yep. Not a big issue - what is important is to put the paravirt ops into 
> the read-only section so that it's somewhat harder for rootkits to 
> modify. (Also, it needs to be made clear that this is fundamental, 
> lowlevel system functionality written by people under the GPLv2, so that 
> if you utilize it beyond its original purpose, using its internals, you 
> likely create a work derived from the kernel. Something simple as irq 
> disabling probably doesnt qualify, and that we exported to modules for a 
> long time, but lots of other details do. So the existence of 
> paravirt_ops isnt a free-for all.)
>   

I agree completely.  It would be nice to have a way to make certain 
kernel structures available, but non-mutable to non-GPL modules.

>> But I'm not sure that is technically feasible yet.
>>
>> The kvm code should probably go in kvm.c instead of paravirt.c.
>>     
>
> no. This is fundamental architecture boot code, not module code. kvm.c 
> should eventually go into kernel/ and arch/*/kernel, not the other way 
> around.
>   

What I meant was kvm.c in arch/i386/kernel - as symmetric to the other 
paravirt-ops modules, which live in arch/i386/kernel/vmi.c / lhype.c, 
etc.  Either that, or we should move them to be symmetric, but I don't 
think paravirt.c is the proper place for kvm specific code.


>   
>> Index: linux/drivers/serial/8250.c
>> ===================================================================
>> --- linux.orig/drivers/serial/8250.c
>> +++ linux/drivers/serial/8250.c
>> @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
>>
>> 		l = l->next;
>>
>> -		if (l == i->head && pass_counter++ > PASS_LIMIT) {
>> +		if (!kvm_paravirt 
>>
>> Is this a bug that might happen under other virtualizations as well, 
>> not just kvm? Perhaps it deserves a disable feature instead of a kvm 
>> specific check.
>>     
>
> yes - this limit is easily triggered via the KVM/Qemu virtual serial 
> drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just 
> a flag.
>   

Can't you just test paravirt_enabled() in that case?


Zach

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
  2007-01-05 22:15 ` Zachary Amsden
@ 2007-01-05 23:02 ` Anthony Liguori
  2007-01-06 13:08 ` Pavel Machek
  2007-01-07 12:20 ` Avi Kivity
  3 siblings, 0 replies; 18+ messages in thread
From: Anthony Liguori @ 2007-01-05 23:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel

This is pretty cool.  I've read the VT spec a number of times and never 
understood why they included the CR3 caching :-)  I suspect that you may 
even be faster than Xen for context switches because of the hardware 
assistance here.  Any chance you can run your benchmarks against Xen?

You may already know this, but there isn't a CR3 cache in SVM AFAIK so a 
fair bit of the code will probably have to be enabled conditionally.  
Otherwise, I don't think SVM support would be that hard.  The only other 
odd bit I noticed was:

> +	magic_val = KVM_API_MAGIC;
> +	para_state->guest_version = KVM_PARA_API_VERSION;
> +	para_state->host_version = -1;
> +	para_state->size = sizeof(*para_state);
> +
> +	asm volatile ("movl %0, %%cr3"
> +			: "=&r" (ret)
> +			: "a" (para_state),
> +			  "0" (magic_val)
> +	);

If I read this correctly, you're using a CR3 write as a hypercall (and 
relying on registers being set/returned that aren't part of the actual 
CR3 move)?  Any reason for not just using vmcall?  It seems a bit less 
awkward.  Even a PIO operation would be a bit cleaner IMHO.  PIO exits 
tend to be fast in VT/SVM so that's an added benefit.

Regards,

Anthony Liguori

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux
> under Linux), which includes support for the hardware cr3-cache feature
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the
> code is stable enough in my testing and i'd like to have some feedback.
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
>  native:                       1.11
>  ----------------------------------
>  Qemu:                        61.18
>  KVM upstream:                53.01
>  KVM trunk:                    6.36
>  KVM trunk+paravirt/cr3:       1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and
> is now quite close to native speed!
>
> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
>  native:                       0.25
>  ----------------------------------
>  Qemu:                         7.8
>  KVM upstream:                 2.8
>  KVM trunk:                    0.55
>  KVM paravirt/cr3:             0.36
>
> almost twice as fast.
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
>  native:                       0.9
>  ----------------------------------
>  Qemu:                        35.2
>  KVM upstream:                 9.4
>  KVM trunk:                    2.8
>  KVM paravirt/cr3:             2.2
>
> still a 30% improvement - which isnt too bad considering that 200 tasks
> are context-switching in this workload and the cr3 cache in current CPUs
> is only 4 entries.
>
> the patchset does the following:
>
> - it provides an ad-hoc paravirtualization hypercall API between a Linux
>   guest and a Linux host. (this will be replaced with a proper
>   hypercall later on.)
>
> - using the hypercall API it utilizes the "cr3 target cache" feature in
>   Intel VMX CPUs, and extends KVM to make use of that cache. This
>   feature allows the avoidance of expensive VM exits into hypervisor
>   context. (The guest needs to be 'aware' and the cache has to be
>   shared between the guest and the hypervisor. So fully emulated OSs
>   wont benefit from this feature.)
>
> - a few simpler paravirtualization changes are done for Linux guests: IO
>   port delays do not cause a VM exit anymore, the i8259A IRQ controller
>   code got simplified (this will be replaced with a proper, hypercall
>   based and host-maintained IRQ controller implementation) and TLB
>   flushes are more efficient, because no cr3 reads happen which would
>   otherwise cause a VM exit. These changes have a visible effect
>   already: they reduce qemu's CPU usage when a guest idles in HLT, by
>   about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has
>   HZ=1000)
>
> Paravirtualization is triggered via the kvm_paravirt=1 boot option (for
> now, this too is ad-hoc) - if that is passed then the KVM guest will
> probe for paravirtualization availability on the hypervisor side - and
> will use it if found. (If the guest does not find KVM-paravirt support
> on the hypervisor side then it will continue as a fully emulated guest.)
>
> Issues: i only tested this on 32-bit VMX. (64-bit should work with not
> too many changes, the paravirt.c bits can be carried over to 64-bit
> almost as-is. But i didnt want to spread the code too wide.)
>
> Comments, suggestions are welcome!
>
> 	Ingo
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> kvm-devel mailing list
> kvm-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 22:50     ` Zachary Amsden
@ 2007-01-05 23:28       ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2007-01-05 23:28 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm-devel, linux-kernel, Avi Kivity


* Zachary Amsden <zach@vmware.com> wrote:

> >>What you really want is more like 
> >>EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
> >>    
> >
> >yep. Not a big issue - what is important is to put the paravirt ops 
> >into the read-only section so that it's somewhat harder for rootkits 
> >to modify. (Also, it needs to be made clear that this is fundamental, 
> >lowlevel system functionality written by people under the GPLv2, so 
> >that if you utilize it beyond its original purpose, using its 
> >internals, you likely create a work derived from the kernel. 
> >Something simple as irq disabling probably doesnt qualify, and that 
> >we exported to modules for a long time, but lots of other details do. 
> >So the existence of paravirt_ops isnt a free-for all.)
> 
> I agree completely.  It would be nice to have a way to make certain 
> kernel structures available, but non-mutable to non-GPL modules.

the thing is, we are not 'making these available to non-GPL modules'. 
The _GPL postfix does not turn the other normal exports from grey into 
white. The _GPL postfix turns exports into almost-black for non-GPL 
modules. The rest is still grey.

in this case, i'd only make local_irq_*() available to modules (of any 
type), not the other paravirt ops. Exporting the whole of paravirt_ops 
was a mistake, and this has to be fixed before 2.6.20. I'll cook up a 
patch.

> >yes - this limit is easily triggered via the KVM/Qemu virtual serial 
> >drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's 
> >just a flag.
> 
> Can't you just test paravirt_enabled() in that case?

yep - i've changed this in my tree.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
  2007-01-05 22:15 ` Zachary Amsden
  2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
@ 2007-01-06 13:08 ` Pavel Machek
  2007-01-07 18:29   ` Christoph Hellwig
  2007-01-08 18:18   ` Christoph Lameter
  2007-01-07 12:20 ` Avi Kivity
  3 siblings, 2 replies; 18+ messages in thread
From: Pavel Machek @ 2007-01-06 13:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel, Avi Kivity

Hi!

> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
> 
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
> 
>    http://redhat.com/~mingo/kvm-paravirt-patches/
> 
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
> 
> Firstly, here are some numbers:
> 
> 2-task context-switch performance (in microseconds, lower is better):
> 
>  native:                       1.11
>  ----------------------------------
>  Qemu:                        61.18
>  KVM upstream:                53.01
>  KVM trunk:                    6.36
>  KVM trunk+paravirt/cr3:       1.60

Does this make Xen obsolete? I mean... we have xen patches in suse
kernels, should we keep updating them, or just drop them in favour of
KVM?
							Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-01-06 13:08 ` Pavel Machek
@ 2007-01-07 12:20 ` Avi Kivity
  2007-01-07 17:42   ` [kvm-devel] " Hollis Blanchard
  2007-01-07 17:44   ` Ingo Molnar
  3 siblings, 2 replies; 18+ messages in thread
From: Avi Kivity @ 2007-01-07 12:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
>  native:                       1.11
>  ----------------------------------
>  Qemu:                        61.18
>  KVM upstream:                53.01
>  KVM trunk:                    6.36
>  KVM trunk+paravirt/cr3:       1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and 
> is now quite close to native speed!
>   

Very impressive!  The gain probably comes not only from avoiding the 
vmentry/vmexit, but also from avoiding the flushing of the global page 
tlb entries.

> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
>  native:                       0.25
>  ----------------------------------
>  Qemu:                         7.8
>  KVM upstream:                 2.8
>  KVM trunk:                    0.55
>  KVM paravirt/cr3:             0.36
>
> almost twice as fast.
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
>  native:                       0.9
>  ----------------------------------
>  Qemu:                        35.2
>  KVM upstream:                 9.4
>  KVM trunk:                    2.8
>  KVM paravirt/cr3:             2.2
>
> still a 30% improvement - which isnt too bad considering that 200 tasks 
> are context-switching in this workload and the cr3 cache in current CPUs 
> is only 4 entries.
>   

This is a little too good to be true.  Were both runs with the same 
KVM_NUM_MMU_PAGES?

I'm also concerned that at this point in time the cr3 optimizations will 
only show an improvement in microbenchmarks.  In real life workloads a 
context switch is usually preceded by an I/O, and with the current sorry 
state of kvm I/O the context switch time would be dominated by the I/O time.

> the patchset does the following:
>
> - it provides an ad-hoc paravirtualization hypercall API between a Linux 
>   guest and a Linux host. (this will be replaced with a proper
>   hypercall later on.)
>
> - using the hypercall API it utilizes the "cr3 target cache" feature in 
>   Intel VMX CPUs, and extends KVM to make use of that cache. This 
>   feature allows the avoidance of expensive VM exits into hypervisor 
>   context. (The guest needs to be 'aware' and the cache has to be
>   shared between the guest and the hypervisor. So fully emulated OSs
>   wont benefit from this feature.)
>
> - a few simpler paravirtualization changes are done for Linux guests: IO 
>   port delays do not cause a VM exit anymore, the i8259A IRQ controller 
>   code got simplified (this will be replaced with a proper, hypercall
>   based and host-maintained IRQ controller implementation) and TLB 
>   flushes are more efficient, because no cr3 reads happen which would 
>   otherwise cause a VM exit. These changes have a visible effect
>   already: they reduce qemu's CPU usage when a guest idles in HLT, by 
>   about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
>   HZ=1000)
>
> Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
> now, this too is ad-hoc) - if that is passed then the KVM guest will 
> probe for paravirtualization availability on the hypervisor side - and 
> will use it if found. (If the guest does not find KVM-paravirt support 
> on the hypervisor side then it will continue as a fully emulated guest.)
>
> Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
> too many changes, the paravirt.c bits can be carried over to 64-bit 
> almost as-is. But i didnt want to spread the code too wide.)
>
> Comments, suggestions are welcome!
>
> 	Ingo
>   

Some comments below on the code.

> +
> +/*
> + * Special, register-to-cr3 instruction based hypercall API
> + * variant to the KVM host. This utilizes the cr3 filter capability
> + * of the hardware - if this works out then no VM exit happens,
> + * if a VM exit happens then KVM will get the virtual address too.
> + */
> +static void kvm_write_cr3(unsigned long guest_cr3)
> +{
> +    struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
> +    struct kvm_cr3_cache *cache = &para_state->cr3_cache;
> +    int idx;
> +
> +    /*
> +     * Check the cache (maintained by the host) for a matching
> +     * guest_cr3 => host_cr3 mapping. Use it if found:
> +     */
> +    for (idx = 0; idx < cache->max_idx; idx++) {
> +        if (cache->entry[idx].guest_cr3 == guest_cr3) {
> +            /*
> +             * Cache-hit: we load the cached host-CR3 value.
> +             * This never causes any VM exit. (if it does then the
> +             * hypervisor could do nothing with this instruction
> +             * and the guest OS would be aborted)
> +             */
> +            asm volatile("movl %0, %%cr3"
> +                : : "r" (cache->entry[idx].host_cr3));
> +            goto out;
> +        }
> +    }
> +
> +    /*
> +     * Cache-miss. Load the guest-cr3 value into cr3, which will
> +     * cause a VM exit to the hypervisor, which then loads the
> +     * host cr3 value and updates the cr3_cache.
> +     */
> +    asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
> +out:
> +    put_cpu_var(para_state);
> +}

Well, you did say it was ad-hoc.  For reference, this is how I see the 
hypercall API:

- A virtual pci device exports a page through the pci rom interface.  
The page contains the hypercall code approriate for the current cpu.  
This allows migration to work across different cpu vendors.
- In case the pci rom is discovered too late in the boot process, the 
address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communications is by guest physical addressed, as the 
virtual->physical translation is much cheaper on the guest (__pa() vs a 
page table walk).


>
> +
> +/*
> + * Simplified i8259A controller handling:
> + */
> +static void mask_and_ack_kvm(unsigned int irq)
> +{
> +    unsigned int irqmask = 1 << irq;
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&i8259A_lock, flags);
> +    cached_irq_mask |= irqmask;
> +
> +    if (irq & 8) {
> +        outb(cached_slave_mask, PIC_SLAVE_IMR);
> +        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> +        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> to master-IRQ2 */
> +    } else {
> +        outb(cached_master_mask, PIC_MASTER_IMR);
> +        /* 'Specific EOI' to master: */
> +        outb(0x60+irq, PIC_MASTER_CMD);
> +    }
> +    spin_unlock_irqrestore(&i8259A_lock, flags);
> +}

Any reason this can't be applied to mainline?  There's probably no 
downside to native, and it would benefit all virtualization solutions 
equally.

> Index: linux/drivers/kvm/kvm.h
> ===================================================================
> --- linux.orig/drivers/kvm/kvm.h
> +++ linux/drivers/kvm/kvm.h
> @@ -165,7 +165,7 @@ struct kvm_mmu {
>      int root_level;
>      int shadow_root_level;
>  
> -    u64 *pae_root;
> +    u64 *pae_root[KVM_CR3_CACHE_SIZE];

hmm.  wouldn't it be simpler to have pae_root always point at the 
current root?

>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
>  
>  static void mmu_free_roots(struct kvm_vcpu *vcpu)
>  {
> -    int i;
> +    int i, j;
>      struct kvm_mmu_page *page;
>  
>  #ifdef CONFIG_X86_64
> @@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
>          return;
>      }
>  #endif
> -    for (i = 0; i < 4; ++i) {
> -        hpa_t root = vcpu->mmu.pae_root[i];
> +    /*
> +     * Skip to the next cr3 filter entry and free it (if it's occupied):
> +     */
> +    vcpu->cr3_cache_idx++;
> +    if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
> +        vcpu->cr3_cache_idx = 0;
>  
> -        ASSERT(VALID_PAGE(root));
> -        root &= PT64_BASE_ADDR_MASK;
> -        page = page_header(root);
> -        --page->root_count;
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    j = vcpu->cr3_cache_idx;
> +    /*
> +     * Clear the guest-visible entry:
> +     */
> +    if (vcpu->para_state) {
> +        vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
> +        vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
> +    }
> +    ASSERT(vcpu->mmu.pae_root[j]);
> +    if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
> +        vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
> +        for (i = 0; i < 4; ++i) {
> +            hpa_t root = vcpu->mmu.pae_root[j][i];
> +
> +            ASSERT(VALID_PAGE(root));
> +            root &= PT64_BASE_ADDR_MASK;
> +            page = page_header(root);
> +            --page->root_count;
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +        }
>      }
>      vcpu->mmu.root_hpa = INVALID_PAGE;
>  }

You keep the page directories pinned here.  This can be a problem if a 
guest frees a page directory, and then starts using it as a regular 
page.  kvm sometimes chooses not to emulate a write to a guest page 
table, but instead to zap it, which is impossible when the page is 
freed.  You need to either unpin the page when that happens, or add a 
hypercall to let kvm know when a page directory is freed.

There is also a minor problem that changes to the pgd aren't caught by 
kvm.  It doesn't hurt much as this is PV and we can relax the guest/host 
contract a little.


>
>  static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
>  {
>      struct page *page;
> -    int i;
> +    int i, j;
>  
>      ASSERT(vcpu);
>  
> @@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
>          ++vcpu->kvm->n_free_mmu_pages;
>      }
>  
> -    /*
> -     * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
> -     * Therefore we need to allocate shadow page tables in the first
> -     * 4GB of memory, which happens to fit the DMA32 zone.
> -     */
> -    page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> -    if (!page)
> -        goto error_1;
> -    vcpu->mmu.pae_root = page_address(page);
> -    for (i = 0; i < 4; ++i)
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> +        /*
> +         * When emulating 32-bit mode, cr3 is only 32 bits even on
> +         * x86_64. Therefore we need to allocate shadow page tables
> +         * in the first 4GB of memory, which happens to fit the DMA32
> +         * zone:
> +         */
> +        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> +        if (!page)
> +            goto error_1;
> +
> +        ASSERT(!vcpu->mmu.pae_root[j]);
> +        vcpu->mmu.pae_root[j] = page_address(page);
> +        for (i = 0; i < 4; ++i)
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +    }

Since a pae root uses just 32 bytes, you can store all cache entries in 
a single page.  Not that it matters much.

>  #define ENABLE_INTERRUPTS_SYSEXIT            \
> Index: linux/include/linux/kvm.h
> ===================================================================
> --- linux.orig/include/linux/kvm.h
> +++ linux/include/linux/kvm.h
> @@ -238,4 +238,44 @@ struct kvm_dirty_log {
>  
>  #define KVM_DUMP_VCPU             _IOW(KVMIO, 250, int /* vcpu_slot */)
>  
> +
> +#define KVM_CR3_CACHE_SIZE    4
> +
> +struct kvm_cr3_cache_entry {
> +    u64 guest_cr3;
> +    u64 host_cr3;
> +};
> +
> +struct kvm_cr3_cache {
> +    struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
> +    u32 max_idx;
> +};
> +
> +/*
> + * Per-VCPU descriptor area shared between guest and host. Writable to
> + * both guest and host. Registered with the host by the guest when
> + * a guest acknowledges paravirtual mode.
> + */
> +struct kvm_vcpu_para_state {
> +    /*
> +     * API version information for compatibility. If there's any support
> +     * mismatch (too old host trying to execute too new guest) then
> +     * the host will deny entry into paravirtual mode. Any other
> +     * combination (new host + old guest and new host + new guest)
> +     * is supposed to work - new host versions will support all old
> +     * guest API versions.
> +     */
> +    u32 guest_version;
> +    u32 host_version;
> +    u32 size;
> +    u32 __pad_00;
> +
> +    struct kvm_cr3_cache cr3_cache;
> +
> +} __attribute__ ((aligned(PAGE_SIZE)));
> +
> +#define KVM_PARA_API_VERSION 1
> +
> +#define KVM_API_MAGIC 0x87654321
> +

<linux/kvm.h> is the vmm userspace interface.  The guest/host interface 
should probably go somewhere else.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
  2007-01-07 12:20 ` Avi Kivity
@ 2007-01-07 17:42   ` Hollis Blanchard
  2007-01-07 17:44   ` Ingo Molnar
  1 sibling, 0 replies; 18+ messages in thread
From: Hollis Blanchard @ 2007-01-07 17:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, kvm-devel, linux-kernel

On Sun, 2007-01-07 at 14:20 +0200, Avi Kivity wrote:
> 
> 
> Well, you did say it was ad-hoc.  For reference, this is how I see the
> hypercall API:
[snip]
> - Guest/host communications is by guest physical addressed, as the 
> virtual->physical translation is much cheaper on the guest (__pa() vs
> a page table walk). 

Strongly agreed. One of the major problems we had with the PowerPC Xen
port was that Xen passes virtual addresses (even userspace virtual
addresses!) to the hypervisor. Performing a MMU search on PowerPC is far
more convoluted than x86's table walk and is not feasible in software.

I'm anxious to avoid the same mistake wherever possible.

Of course, even with physical addresses, data structures that straddle
page boundaries prevent the hypervisor from mapping contiguous physical
pages to discontiguous machine pages (or whatever terminology you want
to use).

IBM's Power Hypervisor passes hypercall data in registers most of the
time. In the places it communicates via memory, I believe the parameter
is actually a physical frame number, and the size is explicitly limited
to one page.

-Hollis


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-07 12:20 ` Avi Kivity
  2007-01-07 17:42   ` [kvm-devel] " Hollis Blanchard
@ 2007-01-07 17:44   ` Ingo Molnar
  2007-01-08  8:22     ` Avi Kivity
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-01-07 17:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, linux-kernel


* Avi Kivity <avi@qumranet.com> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native:                       1.11
> > ----------------------------------
> > Qemu:                        61.18
> > KVM upstream:                53.01
> > KVM trunk:                    6.36
> > KVM trunk+paravirt/cr3:       1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and 
> >is now quite close to native speed!
> >  
> 
> Very impressive!  The gain probably comes not only from avoiding the 
> vmentry/vmexit, but also from avoiding the flushing of the global page 
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify this 
more precisely i have added an artificial __flush_tlb_global() call to 
after switch_to(), just to see how much impact an extra global flush has 
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65 
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75, 
which means that the global flush component is at around 0.5 usecs.

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.25
> > ----------------------------------
> > Qemu:                         7.8
> > KVM upstream:                 2.8
> > KVM trunk:                    0.55
> > KVM paravirt/cr3:             0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.9
> > ----------------------------------
> > Qemu:                        35.2
> > KVM upstream:                 9.4
> > KVM trunk:                    2.8
> > KVM paravirt/cr3:             2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks 
> >are context-switching in this workload and the cr3 cache in current CPUs 
> >is only 4 entries.
> >  
> 
> This is a little too good to be true.  Were both runs with the same 
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
run should have been labeled as: 'cr3 tree with paravirt turned off'. 
That's not completely 'trunk' but close to it, and all other changes 
(like elimination of unnecessary TLB flushes) are fairly applied to 
both.

i also did a run with much less MMU cache pages of 256, and hackbench 1 
stayed the same, while hackbench 5 numbers started fluctuating badly (i 
think that workload if trashing the MMU cache badly).

> I'm also concerned that at this point in time the cr3 optimizations 
> will only show an improvement in microbenchmarks.  In real life 
> workloads a context switch is usually preceded by an I/O, and with the 
> current sorry state of kvm I/O the context switch time would be 
> dominated by the I/O time.

oh, i agreed completely - but in my opinion accelerating virtual I/O is 
really easy. Accelerating the context-switch path (and basic syscall 
overhead like KVM does) is /hard/. So i wanted to see whether KVM runs 
well in all the hard cases, before looking at the low hanging 
performance fruits in the I/O area =B-)

also note that there's lots of internal reasons why application 
workloads can be heavily context-switching - it's not just I/O that 
generates them. (pipes, critical sections / futexes, etc.) So having 
near-native performance for context-switches is very important.

> >+    if (irq & 8) {
> >+        outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> >to master-IRQ2 */
> >+    } else {
> >+        outb(cached_master_mask, PIC_MASTER_IMR);
> >+        /* 'Specific EOI' to master: */
> >+        outb(0x60+irq, PIC_MASTER_CMD);
> >+    }
> >+    spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
> 
> Any reason this can't be applied to mainline?  There's probably no 
> downside to native, and it would benefit all virtualization solutions 
> equally.

this is legacy stuff ...

> >-    u64 *pae_root;
> >+    u64 *pae_root[KVM_CR3_CACHE_SIZE];
> 
> hmm.  wouldn't it be simpler to have pae_root always point at the 
> current root?

does that guarantee that it's available? I wanted to 'pin' the root 
itself this way, to make sure that if a guest switches to it via the 
cache, that it's truly available and a valid root. cr3 addresses are 
non-virtual so this is the only mechanism available to guarantee that 
the host-side memory truly contains a root pagetable.

> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+        }
> >     }
> >     vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
> 
> You keep the page directories pinned here. [...]

yes.

> [...]  This can be a problem if a guest frees a page directory, and 
> then starts using it as a regular page.  kvm sometimes chooses not to 
> emulate a write to a guest page table, but instead to zap it, which is 
> impossible when the page is freed.  You need to either unpin the page 
> when that happens, or add a hypercall to let kvm know when a page 
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be 
possible. Which one would you prefer?

> >-    for (i = 0; i < 4; ++i)
> >-        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+        /*
> >+         * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+         * x86_64. Therefore we need to allocate shadow page tables
> >+         * in the first 4GB of memory, which happens to fit the DMA32
> >+         * zone:
> >+         */
> >+        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+        if (!page)
> >+            goto error_1;
> >+
> >+        ASSERT(!vcpu->mmu.pae_root[j]);
> >+        vcpu->mmu.pae_root[j] = page_address(page);
> >+        for (i = 0; i < 4; ++i)
> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+    }
> 
> Since a pae root uses just 32 bytes, you can store all cache entries 
> in a single page.  Not that it matters much.

yeah - i wanted to extend the current code in a safe way, before 
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
> 
> <linux/kvm.h> is the vmm userspace interface.  The guest/host 
> interface should probably go somewhere else.

yeah. kvm_para.h?

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-06 13:08 ` Pavel Machek
@ 2007-01-07 18:29   ` Christoph Hellwig
  2007-01-08 18:18   ` Christoph Lameter
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2007-01-07 18:29 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Ingo Molnar, kvm-devel, linux-kernel, Avi Kivity

On Sat, Jan 06, 2007 at 01:08:18PM +0000, Pavel Machek wrote:
> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?

After all the Novell Marketing Hype you'll probably have to keep Xen ;-)
Except for that I suspect a paravirt kvm or lhype might be the better
hypervisor choice in the long term.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-07 17:44   ` Ingo Molnar
@ 2007-01-08  8:22     ` Avi Kivity
  2007-01-08  8:39       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2007-01-08  8:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel

Ingo Molnar wrote:

 

>> This is a little too good to be true.  Were both runs with the same 
>> KVM_NUM_MMU_PAGES?
>>     
>
> yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
> run should have been labeled as: 'cr3 tree with paravirt turned off'. 
> That's not completely 'trunk' but close to it, and all other changes 
> (like elimination of unnecessary TLB flushes) are fairly applied to 
> both.
>   

Ok.  I guess there's a switch/switch back pattern in there.

> i also did a run with much less MMU cache pages of 256, and hackbench 1 
> stayed the same, while hackbench 5 numbers started fluctuating badly (i 
> think that workload if trashing the MMU cache badly).
>   

Yes, 256 is too low.

>   
>>> -    u64 *pae_root;
>>> +    u64 *pae_root[KVM_CR3_CACHE_SIZE];
>>>       
>> hmm.  wouldn't it be simpler to have pae_root always point at the 
>> current root?
>>     
>
> does that guarantee that it's available? I wanted to 'pin' the root 
> itself this way, to make sure that if a guest switches to it via the 
> cache, that it's truly available and a valid root. cr3 addresses are 
> non-virtual so this is the only mechanism available to guarantee that 
> the host-side memory truly contains a root pagetable.
>
>   

I meant

   u64 *pae_root_cache;
   u64 *pae_root; /* == pae_root_cache + 4*cache_index */

so that the rest of the code need not worry about the cache.

>>> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
>>> +        }
>>>     }
>>>     vcpu->mmu.root_hpa = INVALID_PAGE;
>>> }
>>>       
>> You keep the page directories pinned here. [...]
>>     
>
> yes.
>
>   
>> [...]  This can be a problem if a guest frees a page directory, and 
>> then starts using it as a regular page.  kvm sometimes chooses not to 
>> emulate a write to a guest page table, but instead to zap it, which is 
>> impossible when the page is freed.  You need to either unpin the page 
>> when that happens, or add a hypercall to let kvm know when a page 
>> directory is freed.
>>     
>
> the cache is zapped upon pagefaults anyway, so unpinning ought to be 
> possible. Which one would you prefer?
>   

It's zapped by the equivalent of mmu_free_roots(), right?  That's 
effectively unpinning it (by zeroing ->root_count).

However, kvm takes pagefaults even for silly things like setting (in 
hardware) or clearing (in software) the dirty bit.

>
>   
>>> +#define KVM_API_MAGIC 0x87654321
>>> +
>>>       
>> <linux/kvm.h> is the vmm userspace interface.  The guest/host 
>> interface should probably go somewhere else.
>>     
>
> yeah. kvm_para.h?
>
>   

Sounds good.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-08  8:22     ` Avi Kivity
@ 2007-01-08  8:39       ` Ingo Molnar
  2007-01-08  9:08         ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-01-08  8:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, linux-kernel


* Avi Kivity <avi@qumranet.com> wrote:

> > the cache is zapped upon pagefaults anyway, so unpinning ought to be 
> > possible. Which one would you prefer?
> 
> It's zapped by the equivalent of mmu_free_roots(), right?  That's 
> effectively unpinning it (by zeroing ->root_count).

no, right now only the guest-visible cache is zapped - the roots are 
zapped by natural rotation. I guess they should be zapped in 
kvm_cr3_cache_clear() - but i wanted to keep that function an invariant 
to the other MMU state, to make it easier to call it from whatever mmu 
codepath.

> However, kvm takes pagefaults even for silly things like setting (in 
> hardware) or clearing (in software) the dirty bit.

yeah. I think it also does some TLB flushes that are not needed. For 
example in rmap_write_protect() we do this:

                rmap_remove(vcpu, spte);
                kvm_arch_ops->tlb_flush(vcpu);

but AFAICS rmap_write_protect() is only ever called if we write a new 
cr3 - hence a TLB flush will happen anyway, because we do a 
vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? I didnt want to 
remove it as part of the cr3 patches (to keep things simpler), but that 
flush looks quite unnecessary to me. The patch below seems to work in 
light testing.

	Ingo

Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
 		BUG_ON(!(*spte & PT_WRITABLE_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
 		rmap_remove(vcpu, spte);
-		kvm_arch_ops->tlb_flush(vcpu);
+		/*
+		 * While we removed a mapping there's no need to explicitly
+		 * flush the TLB here, because this codepath only triggers
+		 * if we write a new cr3 - which will flush the TLB anyway.
+		 */
 		*spte &= ~(u64)PT_WRITABLE_MASK;
 	}
 }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-08  8:39       ` Ingo Molnar
@ 2007-01-08  9:08         ` Avi Kivity
  2007-01-08  9:18           ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2007-01-08  9:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel

Ingo Molnar wrote:
> * Avi Kivity <avi@qumranet.com> wrote:
>
>   
>>> the cache is zapped upon pagefaults anyway, so unpinning ought to be 
>>> possible. Which one would you prefer?
>>>       
>> It's zapped by the equivalent of mmu_free_roots(), right?  That's 
>> effectively unpinning it (by zeroing ->root_count).
>>     
>
> no, right now only the guest-visible cache is zapped - the roots are 
> zapped by natural rotation. I guess they should be zapped in 
> kvm_cr3_cache_clear() - but i wanted to keep that function an invariant 
> to the other MMU state, to make it easier to call it from whatever mmu 
> codepath.
>
>   

Ok.  Some mechanism is then needed to unpin the pages in case they are 
recycled by the guest.

>> However, kvm takes pagefaults even for silly things like setting (in 
>> hardware) or clearing (in software) the dirty bit.
>>     
>
> yeah. I think it also does some TLB flushes that are not needed. For 
> example in rmap_write_protect() we do this:
>
>                 rmap_remove(vcpu, spte);
>                 kvm_arch_ops->tlb_flush(vcpu);
>
> but AFAICS rmap_write_protect() is only ever called if we write a new 
> cr3 - hence a TLB flush will happen anyway, because we do a 
> vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? 

No, rmap_write_protect() is called whenever we shadow a new guest page 
table (the mechanism used to maintain coherency is write faults on page 
tables).

> I didnt want to 
> remove it as part of the cr3 patches (to keep things simpler), but that 
> flush looks quite unnecessary to me. The patch below seems to work in 
> light testing.
>
> 	Ingo
>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
>  		BUG_ON(!(*spte & PT_WRITABLE_MASK));
>  		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
>  		rmap_remove(vcpu, spte);
> -		kvm_arch_ops->tlb_flush(vcpu);
> +		/*
> +		 * While we removed a mapping there's no need to explicitly
> +		 * flush the TLB here, because this codepath only triggers
> +		 * if we write a new cr3 - which will flush the TLB anyway.
> +		 */
>  		*spte &= ~(u64)PT_WRITABLE_MASK;
>  	}
>  }
>   

This will kill svm.

I don't think the tlb flushes are really necessary on today's Intel 
machines, as they probably flush the tlb on each vmx switch.  But AMD 
keeps the TLB, and the flushes are critical there.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-08  9:08         ` Avi Kivity
@ 2007-01-08  9:18           ` Ingo Molnar
  2007-01-08  9:31             ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2007-01-08  9:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, linux-kernel


* Avi Kivity <avi@qumranet.com> wrote:

> >but AFAICS rmap_write_protect() is only ever called if we write a new 
> >cr3 - hence a TLB flush will happen anyway, because we do a 
> >vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
> 
> No, rmap_write_protect() is called whenever we shadow a new guest page 
> table (the mechanism used to maintain coherency is write faults on 
> page tables).

hm. Where? I only see rmap_write_protect() called from 
kvm_mmu_get_page(), which is only called from nonpaging_map() [but it 
doesnt call the rmap function in that case, due to metaphysical], and 
mmu_alloc_roots(). mmu_alloc_roots() is only called from context init 
(init-only thing) or from paging_new_cr3().

ahh ... i missed paging_tmpl.h.

how about the patch below then?

furthermore, shouldnt we pass in the flush area size from the pagefault 
handler? In most cases it would be 4096 bytes, that would mean an invlpg 
is enough, not a full cr3 flush. Although ... i dont know how to invlpg 
a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is 
there a way (or a need) to flush a single TLB of another context ID?

	Ingo

Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu 
 	}
 }
 
-static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct page *page;
@@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
 		BUG_ON(!(*spte & PT_WRITABLE_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
 		rmap_remove(vcpu, spte);
-		kvm_arch_ops->tlb_flush(vcpu);
+		/*
+		 * While we removed a mapping there's no need to explicitly
+		 * flush the TLB here if we write a new cr3. It is needed
+		 * from the pagefault path though.
+		 */
+		if (flush_tlb)
+			kvm_arch_ops->tlb_flush(vcpu);
 		*spte &= ~(u64)PT_WRITABLE_MASK;
 	}
 }
@@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
 					     gva_t gaddr,
 					     unsigned level,
 					     int metaphysical,
-					     u64 *parent_pte)
+					     u64 *parent_pte,
+					     int flush_tlb)
 {
 	union kvm_mmu_page_role role;
 	unsigned index;
@@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
 	page->role = role;
 	hlist_add_head(&page->hash_link, bucket);
 	if (!metaphysical)
-		rmap_write_protect(vcpu, gfn);
+		rmap_write_protect(vcpu, gfn, flush_tlb);
 	return page;
 }
 
@@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu
 				>> PAGE_SHIFT;
 			new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
 						     v, level - 1,
-						     1, &table[index]);
+						     1, &table[index], 0);
 			if (!new_table) {
 				pgprintk("nonpaging_map: ENOMEM\n");
 				return -ENOMEM;
@@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v
 
 		ASSERT(!VALID_PAGE(root));
 		page = kvm_mmu_get_page(vcpu, root_gfn, 0,
-					PT64_ROOT_LEVEL, 0, NULL);
+					PT64_ROOT_LEVEL, 0, NULL, 0);
 		root = page->page_hpa;
 		++page->root_count;
 		vcpu->mmu.root_hpa = root;
@@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
 			root_gfn = 0;
 		page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
 					PT32_ROOT_LEVEL, !is_paging(vcpu),
-					NULL);
+					NULL, 0);
 		root = page->page_hpa;
 		++page->root_count;
 		vcpu->mmu.pae_root[j][i] = root | PT_PRESENT_MASK;
Index: linux/drivers/kvm/paging_tmpl.h
===================================================================
--- linux.orig/drivers/kvm/paging_tmpl.h
+++ linux/drivers/kvm/paging_tmpl.h
@@ -245,7 +245,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 			table_gfn = walker->table_gfn[level - 2];
 		}
 		shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
-					       metaphysical, shadow_ent);
+					       metaphysical, shadow_ent, 1);
 		shadow_addr = shadow_page->page_hpa;
 		shadow_pte = shadow_addr | PT_PRESENT_MASK | PT_ACCESSED_MASK
 			| PT_WRITABLE_MASK | PT_USER_MASK;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-08  9:18           ` Ingo Molnar
@ 2007-01-08  9:31             ` Avi Kivity
  2007-01-08  9:43               ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2007-01-08  9:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel, linux-kernel

Ingo Molnar wrote:
> * Avi Kivity <avi@qumranet.com> wrote:
>
>   
>>> but AFAICS rmap_write_protect() is only ever called if we write a new 
>>> cr3 - hence a TLB flush will happen anyway, because we do a 
>>> vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
>>>       
>> No, rmap_write_protect() is called whenever we shadow a new guest page 
>> table (the mechanism used to maintain coherency is write faults on 
>> page tables).
>>     
>
> hm. Where? I only see rmap_write_protect() called from 
> kvm_mmu_get_page(), which is only called from nonpaging_map() [but it 
> doesnt call the rmap function in that case, due to metaphysical], and 
> mmu_alloc_roots(). mmu_alloc_roots() is only called from context init 
> (init-only thing) or from paging_new_cr3().
>
> ahh ... i missed paging_tmpl.h.
>
> how about the patch below then?
>   

Looks like a lot of complexity for very little gain.  I'm not sure what 
the vmwrite cost is, cut it can't be that high compared to vmexit.

I think there are two better options:

1.  Find out if tlb_flush() can be implemented as a no-op on intel -- 
I'm fairly sure it can.
2.  If not, implement tlb_flush() as a counter increment like on amd.  
Then, successive tlb flushes and context switches are folded into one.


> furthermore, shouldnt we pass in the flush area size from the pagefault 
> handler? In most cases it would be 4096 bytes, that would mean an invlpg 
> is enough, not a full cr3 flush. Although ... i dont know how to invlpg 
> a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is 
> there a way (or a need) to flush a single TLB of another context ID?
>   

Yes (and yes), invlpga.

A complication is that a single shadow pte can be used to map multiple 
guest virtual addresses (by mapping the page table using multiple pdes), 
so the kvm_mmu_page object has to keep track of the page table gva, and 
whether it is multiply mapped or not.  I plan to do that later.

> 	Ingo
>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu 
>  	}
>  }
>  
> -static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
> +static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	struct page *page;
> @@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
>  		BUG_ON(!(*spte & PT_WRITABLE_MASK));
>  		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
>  		rmap_remove(vcpu, spte);
> -		kvm_arch_ops->tlb_flush(vcpu);
> +		/*
> +		 * While we removed a mapping there's no need to explicitly
> +		 * flush the TLB here if we write a new cr3. It is needed
> +		 * from the pagefault path though.
> +		 */
> +		if (flush_tlb)
> +			kvm_arch_ops->tlb_flush(vcpu);
>  		*spte &= ~(u64)PT_WRITABLE_MASK;
>  	}
>  }
> @@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
>  					     gva_t gaddr,
>  					     unsigned level,
>  					     int metaphysical,
> -					     u64 *parent_pte)
> +					     u64 *parent_pte,
> +					     int flush_tlb)
>  {
>  	union kvm_mmu_page_role role;
>  	unsigned index;
> @@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
>  	page->role = role;
>  	hlist_add_head(&page->hash_link, bucket);
>  	if (!metaphysical)
> -		rmap_write_protect(vcpu, gfn);
> +		rmap_write_protect(vcpu, gfn, flush_tlb);
>  	return page;
>  }
>  
> @@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu
>  				>> PAGE_SHIFT;
>  			new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
>  						     v, level - 1,
> -						     1, &table[index]);
> +						     1, &table[index], 0);
>  			if (!new_table) {
>  				pgprintk("nonpaging_map: ENOMEM\n");
>  				return -ENOMEM;
> @@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v
>  
>  		ASSERT(!VALID_PAGE(root));
>  		page = kvm_mmu_get_page(vcpu, root_gfn, 0,
> -					PT64_ROOT_LEVEL, 0, NULL);
> +					PT64_ROOT_LEVEL, 0, NULL, 0);
>  		root = page->page_hpa;
>  		++page->root_count;
>  		vcpu->mmu.root_hpa = root;
> @@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
>  			root_gfn = 0;
>  		page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
>  					PT32_ROOT_LEVEL, !is_paging(vcpu),
> -					NULL);
> +					NULL, 0);
>  		root = page->page_hpa;
>  		++page->root_count;
>  		vcpu->mmu.pae_root[j][i] = root | PT_PRESENT_MASK;
> Index: linux/drivers/kvm/paging_tmpl.h
> ===================================================================
> --- linux.orig/drivers/kvm/paging_tmpl.h
> +++ linux/drivers/kvm/paging_tmpl.h
> @@ -245,7 +245,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>  			table_gfn = walker->table_gfn[level - 2];
>  		}
>  		shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
> -					       metaphysical, shadow_ent);
> +					       metaphysical, shadow_ent, 1);
>  		shadow_addr = shadow_page->page_hpa;
>  		shadow_pte = shadow_addr | PT_PRESENT_MASK | PT_ACCESSED_MASK
>  			| PT_WRITABLE_MASK | PT_USER_MASK;
>
>   


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-08  9:31             ` Avi Kivity
@ 2007-01-08  9:43               ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2007-01-08  9:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, linux-kernel


* Avi Kivity <avi@qumranet.com> wrote:

> Looks like a lot of complexity for very little gain.  I'm not sure 
> what the vmwrite cost is, cut it can't be that high compared to 
> vmexit.

while i disagree with characterising one extra parameter passed down 
plus one extra branch as 'a lot of complexity', i agree that making the 
flush a NOP on VMX is even better. (if it's possible without breaking 
future hardware) I guess you are right that the only true cost here is 
the vmwrite cost, and that there's no impact on CR3 flushing (because it 
happens unconditionally). Forget about this patch.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [announce] [patch] KVM paravirtualization for Linux
  2007-01-06 13:08 ` Pavel Machek
  2007-01-07 18:29   ` Christoph Hellwig
@ 2007-01-08 18:18   ` Christoph Lameter
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2007-01-08 18:18 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Ingo Molnar, kvm-devel, linux-kernel, Avi Kivity

On Sat, 6 Jan 2007, Pavel Machek wrote:

> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?
> 							Pavel

Xen is duplicating basic OS components like the scheduler etc. As 
a result its difficult to maintain and not well integrated with Linux. 
KVM looks like a better approach.



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-01-08 18:21 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
2007-01-05 22:15 ` Zachary Amsden
2007-01-05 22:30   ` Ingo Molnar
2007-01-05 22:50     ` Zachary Amsden
2007-01-05 23:28       ` Ingo Molnar
2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
2007-01-06 13:08 ` Pavel Machek
2007-01-07 18:29   ` Christoph Hellwig
2007-01-08 18:18   ` Christoph Lameter
2007-01-07 12:20 ` Avi Kivity
2007-01-07 17:42   ` [kvm-devel] " Hollis Blanchard
2007-01-07 17:44   ` Ingo Molnar
2007-01-08  8:22     ` Avi Kivity
2007-01-08  8:39       ` Ingo Molnar
2007-01-08  9:08         ` Avi Kivity
2007-01-08  9:18           ` Ingo Molnar
2007-01-08  9:31             ` Avi Kivity
2007-01-08  9:43               ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).