All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv3 0/8] apic: eoi optimization support
@ 2012-05-15 14:35 Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 1/8] apic: fix typo EIO_ACK -> EOI_ACK and document Michael S. Tsirkin
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:35 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

I'm looking at reducing the interrupt overhead for virtualized guests:
some workloads spend a large part of their time processing interrupts.
This patchset supplies infrastructure to reduce the IRQ ack overhead on
x86: the idea is to add an eoi_write callback that we can then optimize
without touching other apic functionality.

The main user is kvm: on kvm, an EOI write from the guest causes an
expensive exit to host; we can avoid this using shared memory as the
last patches in the series demonstrate.

But I also wrote a micro-optimized version for the regular x2apic: this
shaves off a branch and about 9 instructions from EOI when x2apic is
used, and a comment in ack_APIC_irq implies that someone counted
instructions there, at some point.

That's patch 4 in the series - if someone's unhappy with
this patch specifically this patch can be dropped
as nothing else in the series depends on it.

Also included in the patchset are a couple of trivial macro fixes.

The patches work fine on my boxes. See individual patches
for perf tests. You need to patch qemu to whitelist the kvm feature.
qemu patch will be sent as a reply to this mail.

The patches are against 3.4-rc7 - let me know if
I need to rebase.

Please review, and consider for linux 3.5.

Thanks,
MST

Changes from v2:
	Kill guest with GP on an illegal MSR value
	Add documentation

Changes from v1:
	Add host side patch to series
	Remove kvm-specific __test_and_clear_bit, document
	that x86 one does what we want already
	Clear msr on cpu unplug

Michael S. Tsirkin (8):
  apic: fix typo EIO_ACK -> EOI_ACK and document
  apic: use symbolic APIC_EOI_ACK
  x86: add apic->eoi_write callback
  x86: eoi micro-optimization
  kvm_para: guest side for eoi avoidance
  x86/bitops: note on __test_and_clear_bit atomicity
  kvm: host side for eoi optimization
  kvm: eoi msi documentation

 Documentation/virtual/kvm/msr.txt      |   32 +++++++++
 arch/x86/include/asm/apic.h            |   22 ++++++-
 arch/x86/include/asm/apicdef.h         |    2 +-
 arch/x86/include/asm/bitops.h          |   13 +++-
 arch/x86/include/asm/kvm_host.h        |    6 ++
 arch/x86/include/asm/kvm_para.h        |    2 +
 arch/x86/kernel/apic/apic_flat_64.c    |    2 +
 arch/x86/kernel/apic/apic_noop.c       |    1 +
 arch/x86/kernel/apic/apic_numachip.c   |    1 +
 arch/x86/kernel/apic/bigsmp_32.c       |    1 +
 arch/x86/kernel/apic/es7000_32.c       |    2 +
 arch/x86/kernel/apic/numaq_32.c        |    1 +
 arch/x86/kernel/apic/probe_32.c        |    1 +
 arch/x86/kernel/apic/summit_32.c       |    1 +
 arch/x86/kernel/apic/x2apic_cluster.c  |    1 +
 arch/x86/kernel/apic/x2apic_phys.c     |    1 +
 arch/x86/kernel/apic/x2apic_uv_x.c     |    1 +
 arch/x86/kernel/kvm.c                  |   50 ++++++++++++++-
 arch/x86/kvm/cpuid.c                   |    1 +
 arch/x86/kvm/irq.c                     |    2 +-
 arch/x86/kvm/lapic.c                   |  110 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/lapic.h                   |    2 +
 arch/x86/kvm/trace.h                   |   34 ++++++++++
 arch/x86/kvm/x86.c                     |    8 ++-
 arch/x86/platform/visws/visws_quirks.c |    2 +-
 25 files changed, 282 insertions(+), 17 deletions(-)

-- 
MST

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCHv3 1/8] apic: fix typo EIO_ACK -> EOI_ACK and document
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
@ 2012-05-15 14:35 ` Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 2/8] apic: use symbolic APIC_EOI_ACK Michael S. Tsirkin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:35 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

Fix typo in the macro name and document the
reason it has this value. Update users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/apicdef.h         |    2 +-
 arch/x86/platform/visws/visws_quirks.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index 134bba0..c46bb99 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -37,7 +37,7 @@
 #define		APIC_ARBPRI_MASK	0xFFu
 #define	APIC_PROCPRI	0xA0
 #define	APIC_EOI	0xB0
-#define		APIC_EIO_ACK		0x0
+#define		APIC_EOI_ACK		0x0 /* Docs say 0 for future compat. */
 #define	APIC_RRR	0xC0
 #define	APIC_LDR	0xD0
 #define		APIC_LDR_MASK		(0xFFu << 24)
diff --git a/arch/x86/platform/visws/visws_quirks.c b/arch/x86/platform/visws/visws_quirks.c
index c7abf13..94d8a39 100644
--- a/arch/x86/platform/visws/visws_quirks.c
+++ b/arch/x86/platform/visws/visws_quirks.c
@@ -445,7 +445,7 @@ static void ack_cobalt_irq(struct irq_data *data)
 
 	spin_lock_irqsave(&cobalt_lock, flags);
 	disable_cobalt_irq(data);
-	apic_write(APIC_EOI, APIC_EIO_ACK);
+	apic_write(APIC_EOI, APIC_EOI_ACK);
 	spin_unlock_irqrestore(&cobalt_lock, flags);
 }
 
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 2/8] apic: use symbolic APIC_EOI_ACK
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 1/8] apic: fix typo EIO_ACK -> EOI_ACK and document Michael S. Tsirkin
@ 2012-05-15 14:35 ` Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 3/8] x86: add apic->eoi_write callback Michael S. Tsirkin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:35 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

Use the symbol instead of hard-coded numbers,
now that the reason for the value is documented
where the constant is defined we don't need to
duplicate this explanation in code.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/apic.h |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index d854101..a09e9ab 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -463,9 +463,7 @@ static inline void ack_APIC_irq(void)
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-
-	/* Docs say use 0 for future compatibility */
-	apic_write(APIC_EOI, 0);
+	apic_write(APIC_EOI, APIC_EOI_ACK);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 3/8] x86: add apic->eoi_write callback
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 1/8] apic: fix typo EIO_ACK -> EOI_ACK and document Michael S. Tsirkin
  2012-05-15 14:35 ` [PATCHv3 2/8] apic: use symbolic APIC_EOI_ACK Michael S. Tsirkin
@ 2012-05-15 14:35 ` Michael S. Tsirkin
  2012-05-15 14:36 ` [PATCHv3 4/8] x86: eoi micro-optimization Michael S. Tsirkin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:35 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

Add eoi_write callback so that kvm can override
eoi accesses without touching the rest of the apic.
As a side-effect, this will enable a micro-optimization
for apics using msr.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/apic.h           |   15 ++++++++++++++-
 arch/x86/kernel/apic/apic_flat_64.c   |    2 ++
 arch/x86/kernel/apic/apic_noop.c      |    1 +
 arch/x86/kernel/apic/apic_numachip.c  |    1 +
 arch/x86/kernel/apic/bigsmp_32.c      |    1 +
 arch/x86/kernel/apic/es7000_32.c      |    2 ++
 arch/x86/kernel/apic/numaq_32.c       |    1 +
 arch/x86/kernel/apic/probe_32.c       |    1 +
 arch/x86/kernel/apic/summit_32.c      |    1 +
 arch/x86/kernel/apic/x2apic_cluster.c |    1 +
 arch/x86/kernel/apic/x2apic_phys.c    |    1 +
 arch/x86/kernel/apic/x2apic_uv_x.c    |    1 +
 12 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index a09e9ab..74efb8d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -351,6 +351,13 @@ struct apic {
 	/* apic ops */
 	u32 (*read)(u32 reg);
 	void (*write)(u32 reg, u32 v);
+        /*
+	 * eoi_write has the same signature as write.
+	 * Drivers can support both eoi_write and write by passing the same
+	 * callback value. Kernel can override eoi_write and fall back
+	 * on write for EOI.
+	 */
+	void (*eoi_write)(u32 reg, u32 v);
 	u64 (*icr_read)(void);
 	void (*icr_write)(u32 low, u32 high);
 	void (*wait_icr_idle)(void);
@@ -426,6 +433,11 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
+static inline void apic_eoi(void)
+{
+	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
+}
+
 static inline u64 apic_icr_read(void)
 {
 	return apic->icr_read();
@@ -450,6 +462,7 @@ static inline u32 safe_apic_wait_icr_idle(void)
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
+static inline void apic_eoi(void) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -463,7 +476,7 @@ static inline void ack_APIC_irq(void)
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-	apic_write(APIC_EOI, APIC_EOI_ACK);
+	apic_eoi();
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 359b689..0e881c4 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -227,6 +227,7 @@ static struct apic apic_flat =  {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
@@ -386,6 +387,7 @@ static struct apic apic_physflat =  {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/apic_noop.c b/arch/x86/kernel/apic/apic_noop.c
index 634ae6c..a6e4c6e 100644
--- a/arch/x86/kernel/apic/apic_noop.c
+++ b/arch/x86/kernel/apic/apic_noop.c
@@ -181,6 +181,7 @@ struct apic apic_noop = {
 
 	.read				= noop_apic_read,
 	.write				= noop_apic_write,
+	.eoi_write			= noop_apic_write,
 	.icr_read			= noop_apic_icr_read,
 	.icr_write			= noop_apic_icr_write,
 	.wait_icr_idle			= noop_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 23e7542..6ec6d5d 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -295,6 +295,7 @@ static struct apic apic_numachip __refconst = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
index 0cdec70..31fbdbf 100644
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ b/arch/x86/kernel/apic/bigsmp_32.c
@@ -248,6 +248,7 @@ static struct apic apic_bigsmp = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/es7000_32.c b/arch/x86/kernel/apic/es7000_32.c
index e42d1d3b9..db4ab1b 100644
--- a/arch/x86/kernel/apic/es7000_32.c
+++ b/arch/x86/kernel/apic/es7000_32.c
@@ -678,6 +678,7 @@ static struct apic __refdata apic_es7000_cluster = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
@@ -742,6 +743,7 @@ static struct apic __refdata apic_es7000 = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
index 00d2422..f00a68c 100644
--- a/arch/x86/kernel/apic/numaq_32.c
+++ b/arch/x86/kernel/apic/numaq_32.c
@@ -530,6 +530,7 @@ static struct apic __refdata apic_numaq = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index ff2c1b9..1b291da 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -142,6 +142,7 @@ static struct apic apic_default = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/summit_32.c b/arch/x86/kernel/apic/summit_32.c
index fea000b..659897c 100644
--- a/arch/x86/kernel/apic/summit_32.c
+++ b/arch/x86/kernel/apic/summit_32.c
@@ -546,6 +546,7 @@ static struct apic apic_summit = {
 
 	.read				= native_apic_mem_read,
 	.write				= native_apic_mem_write,
+	.eoi_write			= native_apic_mem_write,
 	.icr_read			= native_apic_icr_read,
 	.icr_write			= native_apic_icr_write,
 	.wait_icr_idle			= native_apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 48f3103..a5baa78 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -260,6 +260,7 @@ static struct apic apic_x2apic_cluster = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index 991e315..8340356 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -172,6 +172,7 @@ static struct apic apic_x2apic_phys = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
index 87bfa69..5b0e3d0 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -404,6 +404,7 @@ static struct apic __refdata apic_x2apic_uv_x = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 4/8] x86: eoi micro-optimization
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2012-05-15 14:35 ` [PATCHv3 3/8] x86: add apic->eoi_write callback Michael S. Tsirkin
@ 2012-05-15 14:36 ` Michael S. Tsirkin
  2012-05-15 14:36 ` [PATCHv3 6/8] x86/bitops: note on __test_and_clear_bit atomicity Michael S. Tsirkin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:36 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

We know both register and value for eoi beforehand,
so there's no need to check it and no need to do math
to calculate the msr. Saves instructions/branches
on each EOI when using x2apic.

I'm not sure what kind of tests should one run
to check whether this patch is good for performance.

Some data below - in case it's insufficient,
this patch can be dropped from the series for now:

I looked at the objdump output to verify that the generated code
looks right and actually is shorter.

Some benchmark results below show a tiny
but measureable improvement. The tests were run on
an AMD box with 24 cpus.

- A clean kernel build after reboot shows
a tiny but measureable improvement in system time
which means lower CPU overhead (though not measureable
in total time - that is dominated by user time and fluctuates
too much):
linux# reboot -f
...
linux# make clean
linux# time make -j 64 LOCALVERSION= 2>&1 > /dev/null

Before:

real    2m52.244s
user    35m53.833s
sys     6m7.194s

After:

real    2m52.827s
user    35m48.916s
sys     6m2.305s

- perf micro-benchmarks seem to consistently show
  a tiny improvement in total time as well but it's below
  the confidence level of 3 std deviations:

...
       0.414666797 seconds time elapsed ( +-  1.29% )

Performance counter stats for 'perf bench sched messaging' (100 runs):

       0.395370891 seconds time elapsed ( +-  1.04% )

       0.307019664 seconds time elapsed ( +-  0.10% )

       0.304738024 seconds time elapsed ( +-  0.08% )

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/apic.h           |    5 +++++
 arch/x86/kernel/apic/x2apic_cluster.c |    2 +-
 arch/x86/kernel/apic/x2apic_phys.c    |    2 +-
 arch/x86/kernel/apic/x2apic_uv_x.c    |    2 +-
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 74efb8d..5eb6d56 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -138,6 +138,11 @@ static inline void native_apic_msr_write(u32 reg, u32 v)
 	wrmsr(APIC_BASE_MSR + (reg >> 4), v, 0);
 }
 
+static inline void native_apic_msr_eoi_write(u32 reg, u32 v)
+{
+	wrmsr(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
+}
+
 static inline u32 native_apic_msr_read(u32 reg)
 {
 	u64 msr;
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index a5baa78..ff35cff 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -260,7 +260,7 @@ static struct apic apic_x2apic_cluster = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
-	.eoi_write			= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_eoi_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index 8340356..c17e982 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -172,7 +172,7 @@ static struct apic apic_x2apic_phys = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
-	.eoi_write			= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_eoi_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
index 5b0e3d0..c6d03f7 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -404,7 +404,7 @@ static struct apic __refdata apic_x2apic_uv_x = {
 
 	.read				= native_apic_msr_read,
 	.write				= native_apic_msr_write,
-	.eoi_write			= native_apic_msr_write,
+	.eoi_write			= native_apic_msr_eoi_write,
 	.icr_read			= native_x2apic_icr_read,
 	.icr_write			= native_x2apic_icr_write,
 	.wait_icr_idle			= native_x2apic_wait_icr_idle,
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 6/8] x86/bitops: note on __test_and_clear_bit atomicity
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2012-05-15 14:36 ` [PATCHv3 4/8] x86: eoi micro-optimization Michael S. Tsirkin
@ 2012-05-15 14:36 ` Michael S. Tsirkin
  2012-05-15 14:36 ` [PATCHv3 5/8] kvm_para: guest side for eoi avoidance Michael S. Tsirkin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:36 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

__test_and_clear_bit is actually atomic with respect
to the local CPU. Add a note saying that KVM on x86
relies on this behaviour so people don't accidentaly break it.
Also warn not to rely on this in portable code.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/bitops.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index c9c70ea..86f3a1e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -264,6 +264,13 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long *addr)
  * This operation is non-atomic and can be reordered.
  * If two examples of this operation race, one can appear to succeed
  * but actually fail.  You must protect multiple accesses with a lock.
+ *
+ * Note: the operation is performed atomically with respect to
+ * the local CPU, but not other CPUs. Portable code should not
+ * rely on this behaviour.
+ * KVM relies on this behaviour on x86 for modifying memory that is also
+ * accessed from a hypervisor on the same CPU if running in a VM: don't change
+ * this without also updating arch/x86/kernel/kvm.c
  */
 static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 {
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 5/8] kvm_para: guest side for eoi avoidance
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2012-05-15 14:36 ` [PATCHv3 6/8] x86/bitops: note on __test_and_clear_bit atomicity Michael S. Tsirkin
@ 2012-05-15 14:36 ` Michael S. Tsirkin
  2012-05-16  1:51   ` Marcelo Tosatti
  2012-05-15 14:36 ` [PATCHv3 7/8] kvm: host side for eoi optimization Michael S. Tsirkin
  2012-05-15 14:36 ` [PATCHv3 8/8] kvm: eoi msi documentation Michael S. Tsirkin
  7 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:36 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
Guest tests it using a single est and clear operation - this is
necessary so that host can detect interrupt nesting - and if set, it can
skip the EOI MSR.

I run a simple microbenchmark to show exit reduction
(note: for testing, need to apply patch 7 from the series + a qemu patch
 I posted separately, on host):

Before:

Performance counter stats for 'sleep 1s':

            47,357 kvm:kvm_entry                                                [99.98%]
                 0 kvm:kvm_hypercall                                            [99.98%]
                 0 kvm:kvm_hv_hypercall                                         [99.98%]
             5,001 kvm:kvm_pio                                                  [99.98%]
                 0 kvm:kvm_cpuid                                                [99.98%]
            22,124 kvm:kvm_apic                                                 [99.98%]
            49,849 kvm:kvm_exit                                                 [99.98%]
            21,115 kvm:kvm_inj_virq                                             [99.98%]
                 0 kvm:kvm_inj_exception                                        [99.98%]
                 0 kvm:kvm_page_fault                                           [99.98%]
            22,937 kvm:kvm_msr                                                  [99.98%]
                 0 kvm:kvm_cr                                                   [99.98%]
                 0 kvm:kvm_pic_set_irq                                          [99.98%]
                 0 kvm:kvm_apic_ipi                                             [99.98%]
            22,207 kvm:kvm_apic_accept_irq                                      [99.98%]
            22,421 kvm:kvm_eoi                                                  [99.98%]
                 0 kvm:kvm_pv_eoi                                               [99.99%]
                 0 kvm:kvm_nested_vmrun                                         [99.99%]
                 0 kvm:kvm_nested_intercepts                                    [99.99%]
                 0 kvm:kvm_nested_vmexit                                        [99.99%]
                 0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                 0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                 0 kvm:kvm_invlpga                                              [99.99%]
                 0 kvm:kvm_skinit                                               [99.99%]
                57 kvm:kvm_emulate_insn                                         [99.99%]
                 0 kvm:vcpu_match_mmio                                          [99.99%]
                 0 kvm:kvm_userspace_exit                                       [99.99%]
                 2 kvm:kvm_set_irq                                              [99.99%]
                 2 kvm:kvm_ioapic_set_irq                                       [99.99%]
            23,609 kvm:kvm_msi_set_irq                                          [99.99%]
                 1 kvm:kvm_ack_irq                                              [99.99%]
               131 kvm:kvm_mmio                                                 [99.99%]
               226 kvm:kvm_fpu                                                  [100.00%]
                 0 kvm:kvm_age_page                                             [100.00%]
                 0 kvm:kvm_try_async_get_page                                    [100.00%]
                 0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                 0 kvm:kvm_async_pf_not_present                                    [100.00%]
                 0 kvm:kvm_async_pf_ready                                       [100.00%]
                 0 kvm:kvm_async_pf_completed

       1.002100578 seconds time elapsed

After:

 Performance counter stats for 'sleep 1s':

            28,354 kvm:kvm_entry                                                [99.98%]
                 0 kvm:kvm_hypercall                                            [99.98%]
                 0 kvm:kvm_hv_hypercall                                         [99.98%]
             1,347 kvm:kvm_pio                                                  [99.98%]
                 0 kvm:kvm_cpuid                                                [99.98%]
             1,931 kvm:kvm_apic                                                 [99.98%]
            29,595 kvm:kvm_exit                                                 [99.98%]
            24,884 kvm:kvm_inj_virq                                             [99.98%]
                 0 kvm:kvm_inj_exception                                        [99.98%]
                 0 kvm:kvm_page_fault                                           [99.98%]
             1,986 kvm:kvm_msr                                                  [99.98%]
                 0 kvm:kvm_cr                                                   [99.98%]
                 0 kvm:kvm_pic_set_irq                                          [99.98%]
                 0 kvm:kvm_apic_ipi                                             [99.99%]
            25,953 kvm:kvm_apic_accept_irq                                      [99.99%]
            26,132 kvm:kvm_eoi                                                  [99.99%]
            26,593 kvm:kvm_pv_eoi                                               [99.99%]
                 0 kvm:kvm_nested_vmrun                                         [99.99%]
                 0 kvm:kvm_nested_intercepts                                    [99.99%]
                 0 kvm:kvm_nested_vmexit                                        [99.99%]
                 0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                 0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                 0 kvm:kvm_invlpga                                              [99.99%]
                 0 kvm:kvm_skinit                                               [99.99%]
               284 kvm:kvm_emulate_insn                                         [99.99%]
                68 kvm:vcpu_match_mmio                                          [99.99%]
                68 kvm:kvm_userspace_exit                                       [99.99%]
                 2 kvm:kvm_set_irq                                              [99.99%]
                 2 kvm:kvm_ioapic_set_irq                                       [99.99%]
            28,288 kvm:kvm_msi_set_irq                                          [99.99%]
                 1 kvm:kvm_ack_irq                                              [99.99%]
               131 kvm:kvm_mmio                                                 [100.00%]
               588 kvm:kvm_fpu                                                  [100.00%]
                 0 kvm:kvm_age_page                                             [100.00%]
                 0 kvm:kvm_try_async_get_page                                    [100.00%]
                 0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                 0 kvm:kvm_async_pf_not_present                                    [100.00%]
                 0 kvm:kvm_async_pf_ready                                       [100.00%]
                 0 kvm:kvm_async_pf_completed

       1.002039622 seconds time elapsed

We see that # of exits is almost halved.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/bitops.h |    6 +++-
 arch/x86/kernel/kvm.c         |   50 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..c9c70ea 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -26,11 +26,13 @@
 #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
 /* Technically wrong, but this avoids compilation errors on some gcc
    versions. */
-#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "=m"
 #else
-#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "+m"
 #endif
 
+#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
+
 #define ADDR				BITOP_ADDR(addr)
 
 /*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e554e5a..31f85cb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,8 @@
 #include <asm/desc.h>
 #include <asm/tlbflush.h>
 #include <asm/idle.h>
+#include <asm/apic.h>
+#include <asm/apicdef.h>
 
 static int kvmapf = 1;
 
@@ -283,6 +285,23 @@ static void kvm_register_steal_time(void)
 		cpu, __pa(st));
 }
 
+/* size alignment is implied but just to make it explicit. */
+static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) = 0;
+
+static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
+{
+	/**
+	 * This relies on __test_and_clear_bit to modify the memory
+	 * in a way that is atomic with respect to the local CPU.
+	 * The hypervisor only accesses this memory from the local CPU so
+	 * there's no need for lock or memory barriers.
+	 * An optimization barrier is implied in apic write.
+	 */
+	if (__test_and_clear_bit(0, &__get_cpu_var(kvm_apic_eoi)))
+		return;
+	apic->write(APIC_EOI, APIC_EOI_ACK);
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
@@ -300,11 +319,17 @@ void __cpuinit kvm_guest_cpu_init(void)
 		       smp_processor_id());
 	}
 
+	if (kvm_para_has_feature(KVM_FEATURE_EOI)) {
+		__get_cpu_var(kvm_apic_eoi) = 0;
+		wrmsrl(MSR_KVM_EOI_EN, __pa(&__get_cpu_var(kvm_apic_eoi)) |
+		       KVM_MSR_ENABLED);
+	}
+
 	if (has_steal_clock)
 		kvm_register_steal_time();
 }
 
-static void kvm_pv_disable_apf(void *unused)
+static void kvm_pv_disable_apf(void)
 {
 	if (!__get_cpu_var(apf_reason).enabled)
 		return;
@@ -316,11 +341,18 @@ static void kvm_pv_disable_apf(void *unused)
 	       smp_processor_id());
 }
 
+static void kvm_pv_guest_cpu_reboot(void *unused)
+{
+	if (kvm_para_has_feature(KVM_FEATURE_EOI))
+		wrmsrl(MSR_KVM_EOI_EN, 0);
+	kvm_pv_disable_apf();
+}
+
 static int kvm_pv_reboot_notify(struct notifier_block *nb,
 				unsigned long code, void *unused)
 {
 	if (code == SYS_RESTART)
-		on_each_cpu(kvm_pv_disable_apf, NULL, 1);
+		on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
 	return NOTIFY_DONE;
 }
 
@@ -371,7 +403,9 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 static void kvm_guest_cpu_offline(void *dummy)
 {
 	kvm_disable_steal_time();
-	kvm_pv_disable_apf(NULL);
+	if (kvm_para_has_feature(KVM_FEATURE_EOI))
+		wrmsrl(MSR_KVM_EOI_EN, 0);
+	kvm_pv_disable_apf();
 	apf_task_wake_all();
 }
 
@@ -424,6 +458,16 @@ void __init kvm_guest_init(void)
 		pv_time_ops.steal_clock = kvm_steal_clock;
 	}
 
+	if (kvm_para_has_feature(KVM_FEATURE_EOI)) {
+		struct apic **drv;
+
+		for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+			/* Should happen once for each apic */
+			WARN_ON((*drv)->eoi_write == kvm_guest_apic_eoi_write);
+			(*drv)->eoi_write = kvm_guest_apic_eoi_write;
+		}
+	}
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 7/8] kvm: host side for eoi optimization
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2012-05-15 14:36 ` [PATCHv3 5/8] kvm_para: guest side for eoi avoidance Michael S. Tsirkin
@ 2012-05-15 14:36 ` Michael S. Tsirkin
  2012-05-16  1:55   ` Marcelo Tosatti
  2012-05-15 14:36 ` [PATCHv3 8/8] kvm: eoi msi documentation Michael S. Tsirkin
  7 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:36 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes as much as by half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For tetsing results see description of patch 7 of the series.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    6 ++
 arch/x86/include/asm/kvm_para.h |    2 +
 arch/x86/kvm/cpuid.c            |    1 +
 arch/x86/kvm/irq.c              |    2 +-
 arch/x86/kvm/lapic.c            |  110 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/lapic.h            |    2 +
 arch/x86/kvm/trace.h            |   34 ++++++++++++
 arch/x86/kvm/x86.c              |    8 +++-
 8 files changed, 158 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32297f2..7673f4d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -174,6 +174,7 @@ enum {
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
+#define KVM_APIC_EOI_PENDING	1
 
 /*
  * We don't want allocation failures within the mmu code, so we preallocate
@@ -485,6 +486,11 @@ struct kvm_vcpu_arch {
 		u64 length;
 		u64 status;
 	} osvw;
+
+	struct {
+		u64 msr_val;
+		struct gfn_to_hva_cache data;
+	} eoi;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..195840d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -22,6 +22,7 @@
 #define KVM_FEATURE_CLOCKSOURCE2        3
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
+#define KVM_FEATURE_EOI			6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -37,6 +38,7 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_EOI_EN      0x4b564d04
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index eba8345..bda4877 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -408,6 +408,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
 			     (1 << KVM_FEATURE_ASYNC_PF) |
+			     (1 << KVM_FEATURE_EOI) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 
 		if (sched_info_on())
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..07bdfab 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -48,7 +48,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (!irqchip_in_kernel(v->kvm))
 		return v->arch.interrupt.pending;
 
-	if (kvm_apic_has_interrupt(v) == -1) {	/* LAPIC */
+	if (kvm_apic_has_interrupt(v) < 0) {	/* LAPIC */
 		if (kvm_apic_accept_pic_intr(v)) {
 			s = pic_irqchip(v->kvm);	/* PIC */
 			return s->output;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..c7e6ffb 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -270,6 +270,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
 			irq->level, irq->trig_mode);
 }
 
+static int eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
+{
+
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
+				      sizeof(val));
+}
+
+static int eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
+{
+
+	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
+				      sizeof(*val));
+}
+
+static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.eoi.msr_val & KVM_MSR_ENABLED;
+}
+
+static bool eoi_get_pending(struct kvm_vcpu *vcpu)
+{
+	u8 val;
+	if (eoi_get_user(vcpu, &val) < 0)
+		apic_debug("Can't read EOI MSR value: 0x%llx\n",
+			   (unsigned long long)vcpi->arch.eoi.msr_val);
+	return val & 0x1;
+}
+
+static void eoi_set_pending(struct kvm_vcpu *vcpu)
+{
+	if (eoi_put_user(vcpu, 0x1) < 0) {
+		apic_debug("Can't set EOI MSR value: 0x%llx\n",
+			   (unsigned long long)vcpi->arch.eoi.msr_val);
+		return;
+	}
+	__set_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
+static void eoi_clr_pending(struct kvm_vcpu *vcpu)
+{
+	if (eoi_put_user(vcpu, 0x0) < 0) {
+		apic_debug("Can't clear EOI MSR value: 0x%llx\n",
+			   (unsigned long long)vcpi->arch.eoi.msr_val);
+		return;
+	}
+	__clear_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
 static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 {
 	int result;
@@ -482,16 +530,21 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
-static void apic_set_eoi(struct kvm_lapic *apic)
+static int apic_set_eoi(struct kvm_lapic *apic)
 {
 	int vector = apic_find_highest_isr(apic);
+
+	trace_kvm_eoi(apic, vector);
+
 	/*
 	 * Not every write EOI will has corresponding ISR,
 	 * one example is when Kernel check timer on setup_IO_APIC
 	 */
 	if (vector == -1)
-		return;
+		return vector;
 
+	if (eoi_enabled(apic->vcpu))
+		eoi_clr_pending(apic->vcpu);
 	apic_clear_vector(vector, apic->regs + APIC_ISR);
 	apic_update_ppr(apic);
 
@@ -505,6 +558,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
 		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
 	}
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+	return vector;
 }
 
 static void apic_send_ipi(struct kvm_lapic *apic)
@@ -1085,6 +1139,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	atomic_set(&apic->lapic_timer.pending, 0);
 	if (kvm_vcpu_is_bsp(vcpu))
 		vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+	vcpu->arch.eoi.msr_val = 0;
 	apic_update_ppr(apic);
 
 	vcpu->arch.apic_arb_prio = 0;
@@ -1211,9 +1266,10 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 
 	apic_update_ppr(apic);
 	highest_irr = apic_find_highest_irr(apic);
-	if ((highest_irr == -1) ||
-	    ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
+	if (highest_irr == -1)
 		return -1;
+	if (((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
+		return -2;
 	return highest_irr;
 }
 
@@ -1245,9 +1301,20 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 	int vector = kvm_apic_has_interrupt(vcpu);
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	if (vector == -1)
+	/* Detect interrupt nesting and disable EOI optimization */
+	if (eoi_enabled(vcpu) && vector == -2)
+		eoi_clr_pending(vcpu);
+
+	if (vector < 0)
 		return -1;
 
+	if (eoi_enabled(vcpu)) {
+		if (kvm_ioapic_handles_vector(vcpu->kvm, vector))
+			eoi_clr_pending(vcpu);
+		else
+			eoi_set_pending(vcpu);
+	}
+
 	apic_set_vector(vector, apic->regs + APIC_ISR);
 	apic_update_ppr(apic);
 	apic_clear_irr(vector, apic);
@@ -1262,6 +1329,14 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
 			     MSR_IA32_APICBASE_BASE;
 	kvm_apic_set_version(vcpu);
 
+	if (eoi_enabled(apic->vcpu)) {
+		if (eoi_get_pending(apic->vcpu))
+			__set_bit(KVM_APIC_EOI_PENDING,
+				  &vcpu->arch.apic_attention);
+		else
+			__clear_bit(KVM_APIC_EOI_PENDING,
+				    &vcpu->arch.apic_attention);
+	}
 	apic_update_ppr(apic);
 	hrtimer_cancel(&apic->lapic_timer.timer);
 	update_divide_count(apic);
@@ -1283,11 +1358,25 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
 }
 
+static void apic_update_eoi(struct kvm_lapic *apic)
+{
+	int vector;
+	BUG_ON(!eoi_enabled(apic->vcpu));
+	if (eoi_get_pending(apic->vcpu))
+		return;
+	__clear_bit(KVM_APIC_EOI_PENDING, &apic->vcpu->arch.apic_attention);
+	vector = apic_set_eoi(apic);
+	trace_kvm_pv_eoi(apic, vector);
+}
+
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
 {
 	u32 data;
 	void *vapic;
 
+	if (test_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention))
+		apic_update_eoi(vcpu->arch.apic);
+
 	if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
 		return;
 
@@ -1394,3 +1483,14 @@ int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
 
 	return 0;
 }
+
+int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data)
+{
+	u64 addr = data & ~KVM_MSR_ENABLED;
+	if (eoi_enabled(vcpu))
+		eoi_clr_pending(vcpu);
+	vcpu->arch.eoi.msr_val = data;
+	if (!eoi_enabled(vcpu))
+		return 0;
+	return kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.eoi.data, addr);
+}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6f4ce25..042dace 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -60,4 +60,6 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
 }
+
+int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data);
 #endif
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 911d264..851914e 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -517,6 +517,40 @@ TRACE_EVENT(kvm_apic_accept_irq,
 		  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_eoi,
+	    TP_PROTO(struct kvm_lapic *apic, int vector),
+	    TP_ARGS(apic, vector),
+
+	TP_STRUCT__entry(
+		__field(	__u32,		apicid		)
+		__field(	int,		vector		)
+	),
+
+	TP_fast_assign(
+		__entry->apicid		= apic->vcpu->vcpu_id;
+		__entry->vector		= vector;
+	),
+
+	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
+);
+
+TRACE_EVENT(kvm_pv_eoi,
+	    TP_PROTO(struct kvm_lapic *apic, int vector),
+	    TP_ARGS(apic, vector),
+
+	TP_STRUCT__entry(
+		__field(	__u32,		apicid		)
+		__field(	int,		vector		)
+	),
+
+	TP_fast_assign(
+		__entry->apicid		= apic->vcpu->vcpu_id;
+		__entry->vector		= vector;
+	),
+
+	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
+);
+
 /*
  * Tracepoint for nested VMRUN
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 185a2b8..bfcd97d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -795,6 +795,7 @@ static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+	MSR_KVM_EOI_EN,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1653,6 +1654,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
+	case MSR_KVM_EOI_EN:
+		if (kvm_pv_enable_apic_eoi(vcpu, data))
+			return 1;
+		break;
 
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
@@ -5370,7 +5375,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (unlikely(vcpu->arch.tsc_always_catchup))
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
-	kvm_lapic_sync_from_vapic(vcpu);
+	if (unlikely(vcpu->arch.apic_attention))
+		kvm_lapic_sync_from_vapic(vcpu);
 
 	r = kvm_x86_ops->handle_exit(vcpu);
 out:
-- 
MST


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv3 8/8] kvm: eoi msi documentation
  2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2012-05-15 14:36 ` [PATCHv3 7/8] kvm: host side for eoi optimization Michael S. Tsirkin
@ 2012-05-15 14:36 ` Michael S. Tsirkin
  7 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-15 14:36 UTC (permalink / raw)
  To: x86, kvm
  Cc: Ingo Molnar, H. Peter Anvin, Avi Kivity, Marcelo Tosatti, gleb,
	Linus Torvalds, linux-kernel

Document the new EOI MSR. Couldn't decide whether this change belongs
conceptually on guest or host side, so a separate patch.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 Documentation/virtual/kvm/msr.txt |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 5031780..3481f1b 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -219,3 +219,35 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		steal: the amount of time in which this vCPU did not run, in
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
+
+MSR_KVM_EOI_EN: 0x4b564d04
+	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+	when disabled.  When enabled, bits 63-1 hold 2-byte aligned physical address
+	of a 2 byte memory area which must be in guest RAM and must be zeroed.
+
+	The first, least significant bit of 2 byte memory location will be
+	written to by the hypervisor, typically at the time of interrupt
+	injection.  Value of 1 means that guest can skip writing EOI to the apic
+	(using MSR or MMIO write); instead, it is sufficient to signal
+	EOI by clearing the bit in guest memory - this location will
+	later be polled by the hypervisor.
+	Value of 0 means that the EOI write is required.
+
+	It is always safe for the guest to ignore the optimization and perform
+	the APIC EOI write anyway.
+
+	Hypervisor is guaranteed to only modify this least
+	significant bit while in the current VCPU context, this means that
+	guest does not need to use either lock prefix or memory ordering
+	primitives to synchronise with the hypervisor.
+
+	However, hypervisor can set and clear this memory bit at any time:
+	therefore to make sure hypervisor does not interrupt the
+	guest and clear the least significant bit in the memory area
+	in the window between guest testing it to detect
+	whether it can skip EOI apic write and between guest
+	clearing it to signal EOI to the hypervisor,
+	guest must both read the least sgnificant bit in the memory area and
+	clear it using a single CPU instruction, such as test and clear, or
+	compare and exchange.
+
-- 
MST

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCHv3 5/8] kvm_para: guest side for eoi avoidance
  2012-05-15 14:36 ` [PATCHv3 5/8] kvm_para: guest side for eoi avoidance Michael S. Tsirkin
@ 2012-05-16  1:51   ` Marcelo Tosatti
  2012-05-16  6:22     ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2012-05-16  1:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: x86, kvm, Ingo Molnar, H. Peter Anvin, Avi Kivity, gleb,
	Linus Torvalds, linux-kernel

On Tue, May 15, 2012 at 05:36:12PM +0300, Michael S. Tsirkin wrote:
> The idea is simple: there's a bit, per APIC, in guest memory,
> that tells the guest that it does not need EOI.
> Guest tests it using a single est and clear operation - this is
> necessary so that host can detect interrupt nesting - and if set, it can
> skip the EOI MSR.
> 
> I run a simple microbenchmark to show exit reduction
> (note: for testing, need to apply patch 7 from the series + a qemu patch
>  I posted separately, on host):
> 
> Before:
> 
> Performance counter stats for 'sleep 1s':
> 
>             47,357 kvm:kvm_entry                                                [99.98%]
>                  0 kvm:kvm_hypercall                                            [99.98%]
>                  0 kvm:kvm_hv_hypercall                                         [99.98%]
>              5,001 kvm:kvm_pio                                                  [99.98%]
>                  0 kvm:kvm_cpuid                                                [99.98%]
>             22,124 kvm:kvm_apic                                                 [99.98%]
>             49,849 kvm:kvm_exit                                                 [99.98%]
>             21,115 kvm:kvm_inj_virq                                             [99.98%]
>                  0 kvm:kvm_inj_exception                                        [99.98%]
>                  0 kvm:kvm_page_fault                                           [99.98%]
>             22,937 kvm:kvm_msr                                                  [99.98%]
>                  0 kvm:kvm_cr                                                   [99.98%]
>                  0 kvm:kvm_pic_set_irq                                          [99.98%]
>                  0 kvm:kvm_apic_ipi                                             [99.98%]
>             22,207 kvm:kvm_apic_accept_irq                                      [99.98%]
>             22,421 kvm:kvm_eoi                                                  [99.98%]
>                  0 kvm:kvm_pv_eoi                                               [99.99%]
>                  0 kvm:kvm_nested_vmrun                                         [99.99%]
>                  0 kvm:kvm_nested_intercepts                                    [99.99%]
>                  0 kvm:kvm_nested_vmexit                                        [99.99%]
>                  0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
>                  0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
>                  0 kvm:kvm_invlpga                                              [99.99%]
>                  0 kvm:kvm_skinit                                               [99.99%]
>                 57 kvm:kvm_emulate_insn                                         [99.99%]
>                  0 kvm:vcpu_match_mmio                                          [99.99%]
>                  0 kvm:kvm_userspace_exit                                       [99.99%]
>                  2 kvm:kvm_set_irq                                              [99.99%]
>                  2 kvm:kvm_ioapic_set_irq                                       [99.99%]
>             23,609 kvm:kvm_msi_set_irq                                          [99.99%]
>                  1 kvm:kvm_ack_irq                                              [99.99%]
>                131 kvm:kvm_mmio                                                 [99.99%]
>                226 kvm:kvm_fpu                                                  [100.00%]
>                  0 kvm:kvm_age_page                                             [100.00%]
>                  0 kvm:kvm_try_async_get_page                                    [100.00%]
>                  0 kvm:kvm_async_pf_doublefault                                    [100.00%]
>                  0 kvm:kvm_async_pf_not_present                                    [100.00%]
>                  0 kvm:kvm_async_pf_ready                                       [100.00%]
>                  0 kvm:kvm_async_pf_completed
> 
>        1.002100578 seconds time elapsed
> 
> After:
> 
>  Performance counter stats for 'sleep 1s':
> 
>             28,354 kvm:kvm_entry                                                [99.98%]
>                  0 kvm:kvm_hypercall                                            [99.98%]
>                  0 kvm:kvm_hv_hypercall                                         [99.98%]
>              1,347 kvm:kvm_pio                                                  [99.98%]
>                  0 kvm:kvm_cpuid                                                [99.98%]
>              1,931 kvm:kvm_apic                                                 [99.98%]
>             29,595 kvm:kvm_exit                                                 [99.98%]
>             24,884 kvm:kvm_inj_virq                                             [99.98%]
>                  0 kvm:kvm_inj_exception                                        [99.98%]
>                  0 kvm:kvm_page_fault                                           [99.98%]
>              1,986 kvm:kvm_msr                                                  [99.98%]
>                  0 kvm:kvm_cr                                                   [99.98%]
>                  0 kvm:kvm_pic_set_irq                                          [99.98%]
>                  0 kvm:kvm_apic_ipi                                             [99.99%]
>             25,953 kvm:kvm_apic_accept_irq                                      [99.99%]
>             26,132 kvm:kvm_eoi                                                  [99.99%]
>             26,593 kvm:kvm_pv_eoi                                               [99.99%]
>                  0 kvm:kvm_nested_vmrun                                         [99.99%]
>                  0 kvm:kvm_nested_intercepts                                    [99.99%]
>                  0 kvm:kvm_nested_vmexit                                        [99.99%]
>                  0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
>                  0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
>                  0 kvm:kvm_invlpga                                              [99.99%]
>                  0 kvm:kvm_skinit                                               [99.99%]
>                284 kvm:kvm_emulate_insn                                         [99.99%]
>                 68 kvm:vcpu_match_mmio                                          [99.99%]
>                 68 kvm:kvm_userspace_exit                                       [99.99%]
>                  2 kvm:kvm_set_irq                                              [99.99%]
>                  2 kvm:kvm_ioapic_set_irq                                       [99.99%]
>             28,288 kvm:kvm_msi_set_irq                                          [99.99%]
>                  1 kvm:kvm_ack_irq                                              [99.99%]
>                131 kvm:kvm_mmio                                                 [100.00%]
>                588 kvm:kvm_fpu                                                  [100.00%]
>                  0 kvm:kvm_age_page                                             [100.00%]
>                  0 kvm:kvm_try_async_get_page                                    [100.00%]
>                  0 kvm:kvm_async_pf_doublefault                                    [100.00%]
>                  0 kvm:kvm_async_pf_not_present                                    [100.00%]
>                  0 kvm:kvm_async_pf_ready                                       [100.00%]
>                  0 kvm:kvm_async_pf_completed
> 
>        1.002039622 seconds time elapsed
> 
> We see that # of exits is almost halved.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/x86/include/asm/bitops.h |    6 +++-
>  arch/x86/kernel/kvm.c         |   50 ++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 51 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index b97596e..c9c70ea 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -26,11 +26,13 @@
>  #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
>  /* Technically wrong, but this avoids compilation errors on some gcc
>     versions. */
> -#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
> +#define BITOP_ADDR_CONSTRAINT "=m"
>  #else
> -#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
> +#define BITOP_ADDR_CONSTRAINT "+m"
>  #endif
>  
> +#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
> +
>  #define ADDR				BITOP_ADDR(addr)
>  
>  /*
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index e554e5a..31f85cb 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -39,6 +39,8 @@
>  #include <asm/desc.h>
>  #include <asm/tlbflush.h>
>  #include <asm/idle.h>
> +#include <asm/apic.h>
> +#include <asm/apicdef.h>
>  
>  static int kvmapf = 1;
>  
> @@ -283,6 +285,23 @@ static void kvm_register_steal_time(void)
>  		cpu, __pa(st));
>  }
>  
> +/* size alignment is implied but just to make it explicit. */
> +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) = 0;

DECLARE_PER_CPU_ALIGNED, can be heavily accessed.

> +static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
> +{
> +	/**
> +	 * This relies on __test_and_clear_bit to modify the memory
> +	 * in a way that is atomic with respect to the local CPU.
> +	 * The hypervisor only accesses this memory from the local CPU so
> +	 * there's no need for lock or memory barriers.
> +	 * An optimization barrier is implied in apic write.
> +	 */
> +	if (__test_and_clear_bit(0, &__get_cpu_var(kvm_apic_eoi)))
> +		return;

Name and #define for the bit, please.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv3 7/8] kvm: host side for eoi optimization
  2012-05-15 14:36 ` [PATCHv3 7/8] kvm: host side for eoi optimization Michael S. Tsirkin
@ 2012-05-16  1:55   ` Marcelo Tosatti
  2012-05-16  6:16     ` Michael S. Tsirkin
  2012-05-16  8:58     ` Michael S. Tsirkin
  0 siblings, 2 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2012-05-16  1:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: x86, kvm, Ingo Molnar, H. Peter Anvin, Avi Kivity, gleb,
	Linus Torvalds, linux-kernel

On Tue, May 15, 2012 at 05:36:19PM +0300, Michael S. Tsirkin wrote:
> Implementation of PV EOI using shared memory.
> This reduces the number of exits an interrupt
> causes as much as by half.
> 
> The idea is simple: there's a bit, per APIC, in guest memory,
> that tells the guest that it does not need EOI.
> We set it before injecting an interrupt and clear
> before injecting a nested one. Guest tests it using
> a test and clear operation - this is necessary
> so that host can detect interrupt nesting -
> and if set, it can skip the EOI MSR.
> 
> There's a new MSR to set the address of said register
> in guest memory. Otherwise not much changed:
> - Guest EOI is not required
> - Register is tested & ISR is automatically cleared on exit
> For tetsing results see description of patch 7 of the series.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    6 ++
>  arch/x86/include/asm/kvm_para.h |    2 +
>  arch/x86/kvm/cpuid.c            |    1 +
>  arch/x86/kvm/irq.c              |    2 +-
>  arch/x86/kvm/lapic.c            |  110 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/lapic.h            |    2 +
>  arch/x86/kvm/trace.h            |   34 ++++++++++++
>  arch/x86/kvm/x86.c              |    8 +++-
>  8 files changed, 158 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 32297f2..7673f4d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -174,6 +174,7 @@ enum {
>  
>  /* apic attention bits */
>  #define KVM_APIC_CHECK_VAPIC	0
> +#define KVM_APIC_EOI_PENDING	1
>  
>  /*
>   * We don't want allocation failures within the mmu code, so we preallocate
> @@ -485,6 +486,11 @@ struct kvm_vcpu_arch {
>  		u64 length;
>  		u64 status;
>  	} osvw;
> +
> +	struct {
> +		u64 msr_val;
> +		struct gfn_to_hva_cache data;
> +	} eoi;
>  };
>  
>  struct kvm_lpage_info {
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 734c376..195840d 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -22,6 +22,7 @@
>  #define KVM_FEATURE_CLOCKSOURCE2        3
>  #define KVM_FEATURE_ASYNC_PF		4
>  #define KVM_FEATURE_STEAL_TIME		5
> +#define KVM_FEATURE_EOI			6

SKIP_EOI?

>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -37,6 +38,7 @@
>  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>  #define MSR_KVM_STEAL_TIME  0x4b564d03
> +#define MSR_KVM_EOI_EN      0x4b564d04
>  
>  struct kvm_steal_time {
>  	__u64 steal;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index eba8345..bda4877 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -408,6 +408,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
>  			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
>  			     (1 << KVM_FEATURE_ASYNC_PF) |
> +			     (1 << KVM_FEATURE_EOI) |
>  			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>  
>  		if (sched_info_on())
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 7e06ba1..07bdfab 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -48,7 +48,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>  	if (!irqchip_in_kernel(v->kvm))
>  		return v->arch.interrupt.pending;
>  
> -	if (kvm_apic_has_interrupt(v) == -1) {	/* LAPIC */
> +	if (kvm_apic_has_interrupt(v) < 0) {	/* LAPIC */
>  		if (kvm_apic_accept_pic_intr(v)) {
>  			s = pic_irqchip(v->kvm);	/* PIC */
>  			return s->output;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 93c1574..c7e6ffb 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -270,6 +270,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
>  			irq->level, irq->trig_mode);
>  }
>  
> +static int eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
> +{
> +
> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
> +				      sizeof(val));
> +}
> +
> +static int eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
> +{
> +
> +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
> +				      sizeof(*val));
> +}
> +
> +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.eoi.msr_val & KVM_MSR_ENABLED;
> +}

Please find a more descriptive name, such as pv_eoi_* (or skip_eoi,
or...).

> +
> +static bool eoi_get_pending(struct kvm_vcpu *vcpu)
> +{
> +	u8 val;
> +	if (eoi_get_user(vcpu, &val) < 0)
> +		apic_debug("Can't read EOI MSR value: 0x%llx\n",
> +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> +	return val & 0x1;
> +}
> +
> +static void eoi_set_pending(struct kvm_vcpu *vcpu)
> +{
> +	if (eoi_put_user(vcpu, 0x1) < 0) {
> +		apic_debug("Can't set EOI MSR value: 0x%llx\n",
> +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> +		return;
> +	}
> +	__set_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> +}
> +
> +static void eoi_clr_pending(struct kvm_vcpu *vcpu)
> +{
> +	if (eoi_put_user(vcpu, 0x0) < 0) {
> +		apic_debug("Can't clear EOI MSR value: 0x%llx\n",
> +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> +		return;
> +	}
> +	__clear_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> +}
> +
>  static inline int apic_find_highest_isr(struct kvm_lapic *apic)
>  {
>  	int result;
> @@ -482,16 +530,21 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
>  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
>  }
>  
> -static void apic_set_eoi(struct kvm_lapic *apic)
> +static int apic_set_eoi(struct kvm_lapic *apic)
>  {
>  	int vector = apic_find_highest_isr(apic);
> +
> +	trace_kvm_eoi(apic, vector);
> +
>  	/*
>  	 * Not every write EOI will has corresponding ISR,
>  	 * one example is when Kernel check timer on setup_IO_APIC
>  	 */
>  	if (vector == -1)
> -		return;
> +		return vector;
>  
> +	if (eoi_enabled(apic->vcpu))
> +		eoi_clr_pending(apic->vcpu);

Why is this here again? If you got here, the bit should be cleared 
already?

>  	apic_clear_vector(vector, apic->regs + APIC_ISR);
>  	apic_update_ppr(apic);
>  
> @@ -505,6 +558,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
>  		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
>  	}
>  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> +	return vector;
>  }
>  
>  static void apic_send_ipi(struct kvm_lapic *apic)
> @@ -1085,6 +1139,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
>  	atomic_set(&apic->lapic_timer.pending, 0);
>  	if (kvm_vcpu_is_bsp(vcpu))
>  		vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
> +	vcpu->arch.eoi.msr_val = 0;
>  	apic_update_ppr(apic);
>  
>  	vcpu->arch.apic_arb_prio = 0;
> @@ -1211,9 +1266,10 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
>  
>  	apic_update_ppr(apic);
>  	highest_irr = apic_find_highest_irr(apic);
> -	if ((highest_irr == -1) ||
> -	    ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> +	if (highest_irr == -1)
>  		return -1;
> +	if (((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> +		return -2;
>  	return highest_irr;
>  }
>  
> @@ -1245,9 +1301,20 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>  	int vector = kvm_apic_has_interrupt(vcpu);
>  	struct kvm_lapic *apic = vcpu->arch.apic;
>  
> -	if (vector == -1)
> +	/* Detect interrupt nesting and disable EOI optimization */
> +	if (eoi_enabled(vcpu) && vector == -2)
> +		eoi_clr_pending(vcpu);
> +
> +	if (vector < 0)
>  		return -1;
>  
> +	if (eoi_enabled(vcpu)) {
> +		if (kvm_ioapic_handles_vector(vcpu->kvm, vector))
> +			eoi_clr_pending(vcpu);
> +		else
> +			eoi_set_pending(vcpu);
> +	}
> +
>  	apic_set_vector(vector, apic->regs + APIC_ISR);
>  	apic_update_ppr(apic);
>  	apic_clear_irr(vector, apic);
> @@ -1262,6 +1329,14 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
>  			     MSR_IA32_APICBASE_BASE;
>  	kvm_apic_set_version(vcpu);
>  
> +	if (eoi_enabled(apic->vcpu)) {
> +		if (eoi_get_pending(apic->vcpu))
> +			__set_bit(KVM_APIC_EOI_PENDING,
> +				  &vcpu->arch.apic_attention);
> +		else
> +			__clear_bit(KVM_APIC_EOI_PENDING,
> +				    &vcpu->arch.apic_attention);
> +	}


Why is this necessary? kvm_apic_post_state_restore cannot change the 
per cpu in memory value.

>  	apic_update_ppr(apic);
>  	hrtimer_cancel(&apic->lapic_timer.timer);
>  	update_divide_count(apic);
> @@ -1283,11 +1358,25 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
>  		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
>  }
>  
> +static void apic_update_eoi(struct kvm_lapic *apic)
> +{
> +	int vector;
> +	BUG_ON(!eoi_enabled(apic->vcpu));
> +	if (eoi_get_pending(apic->vcpu))
> +		return;
> +	__clear_bit(KVM_APIC_EOI_PENDING, &apic->vcpu->arch.apic_attention);
> +	vector = apic_set_eoi(apic);
> +	trace_kvm_pv_eoi(apic, vector);
> +}

KVM_APIC_EOI_PENDING appears to be a shadow of the pervcpu value,
to speed up processing. If it is, please make that clear. 

>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
>  {
>  	u32 data;
>  	void *vapic;
>  
> +	if (test_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention))
> +		apic_update_eoi(vcpu->arch.apic);
> +
>  	if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
>  		return;
>  
> @@ -1394,3 +1483,14 @@ int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
>  
>  	return 0;
>  }
> +
> +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	u64 addr = data & ~KVM_MSR_ENABLED;
> +	if (eoi_enabled(vcpu))
> +		eoi_clr_pending(vcpu);
> +	vcpu->arch.eoi.msr_val = data;
> +	if (!eoi_enabled(vcpu))
> +		return 0;
> +	return kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.eoi.data, addr);
> +}
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 6f4ce25..042dace 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -60,4 +60,6 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
>  }
> +
> +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data);
>  #endif
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 911d264..851914e 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -517,6 +517,40 @@ TRACE_EVENT(kvm_apic_accept_irq,
>  		  __entry->coalesced ? " (coalesced)" : "")
>  );
>  
> +TRACE_EVENT(kvm_eoi,
> +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> +	    TP_ARGS(apic, vector),
> +
> +	TP_STRUCT__entry(
> +		__field(	__u32,		apicid		)
> +		__field(	int,		vector		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->apicid		= apic->vcpu->vcpu_id;
> +		__entry->vector		= vector;
> +	),
> +
> +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> +);
> +
> +TRACE_EVENT(kvm_pv_eoi,
> +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> +	    TP_ARGS(apic, vector),
> +
> +	TP_STRUCT__entry(
> +		__field(	__u32,		apicid		)
> +		__field(	int,		vector		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->apicid		= apic->vcpu->vcpu_id;
> +		__entry->vector		= vector;
> +	),
> +
> +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> +);
> +
>  /*
>   * Tracepoint for nested VMRUN
>   */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 185a2b8..bfcd97d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -795,6 +795,7 @@ static u32 msrs_to_save[] = {
>  	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>  	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>  	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> +	MSR_KVM_EOI_EN,
>  	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>  	MSR_STAR,
>  #ifdef CONFIG_X86_64
> @@ -1653,6 +1654,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  
>  		break;
> +	case MSR_KVM_EOI_EN:
> +		if (kvm_pv_enable_apic_eoi(vcpu, data))
> +			return 1;
> +		break;
>  
>  	case MSR_IA32_MCG_CTL:
>  	case MSR_IA32_MCG_STATUS:
> @@ -5370,7 +5375,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (unlikely(vcpu->arch.tsc_always_catchup))
>  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  
> -	kvm_lapic_sync_from_vapic(vcpu);
> +	if (unlikely(vcpu->arch.apic_attention))
> +		kvm_lapic_sync_from_vapic(vcpu);

This apic_attention check is unrelated, please have in a separate patch.

Also please document that this is only possible because of interrupt window 
exiting. 

The Hyper-V spec, does it provide similar interface ? It mentions
that "Most EOI intercepts can be eliminated and done lazily if the guest
OS leaves a marker when it performs an EOI".

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv3 7/8] kvm: host side for eoi optimization
  2012-05-16  1:55   ` Marcelo Tosatti
@ 2012-05-16  6:16     ` Michael S. Tsirkin
  2012-05-16  8:58     ` Michael S. Tsirkin
  1 sibling, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-16  6:16 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: x86, kvm, Ingo Molnar, H. Peter Anvin, Avi Kivity, gleb,
	Linus Torvalds, linux-kernel

On Tue, May 15, 2012 at 10:55:19PM -0300, Marcelo Tosatti wrote:
> On Tue, May 15, 2012 at 05:36:19PM +0300, Michael S. Tsirkin wrote:
> > Implementation of PV EOI using shared memory.
> > This reduces the number of exits an interrupt
> > causes as much as by half.
> > 
> > The idea is simple: there's a bit, per APIC, in guest memory,
> > that tells the guest that it does not need EOI.
> > We set it before injecting an interrupt and clear
> > before injecting a nested one. Guest tests it using
> > a test and clear operation - this is necessary
> > so that host can detect interrupt nesting -
> > and if set, it can skip the EOI MSR.
> > 
> > There's a new MSR to set the address of said register
> > in guest memory. Otherwise not much changed:
> > - Guest EOI is not required
> > - Register is tested & ISR is automatically cleared on exit
> > For tetsing results see description of patch 7 of the series.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    6 ++
> >  arch/x86/include/asm/kvm_para.h |    2 +
> >  arch/x86/kvm/cpuid.c            |    1 +
> >  arch/x86/kvm/irq.c              |    2 +-
> >  arch/x86/kvm/lapic.c            |  110 +++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/lapic.h            |    2 +
> >  arch/x86/kvm/trace.h            |   34 ++++++++++++
> >  arch/x86/kvm/x86.c              |    8 +++-
> >  8 files changed, 158 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 32297f2..7673f4d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -174,6 +174,7 @@ enum {
> >  
> >  /* apic attention bits */
> >  #define KVM_APIC_CHECK_VAPIC	0
> > +#define KVM_APIC_EOI_PENDING	1
> >  
> >  /*
> >   * We don't want allocation failures within the mmu code, so we preallocate
> > @@ -485,6 +486,11 @@ struct kvm_vcpu_arch {
> >  		u64 length;
> >  		u64 status;
> >  	} osvw;
> > +
> > +	struct {
> > +		u64 msr_val;
> > +		struct gfn_to_hva_cache data;
> > +	} eoi;
> >  };
> >  
> >  struct kvm_lpage_info {
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 734c376..195840d 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -22,6 +22,7 @@
> >  #define KVM_FEATURE_CLOCKSOURCE2        3
> >  #define KVM_FEATURE_ASYNC_PF		4
> >  #define KVM_FEATURE_STEAL_TIME		5
> > +#define KVM_FEATURE_EOI			6
> 
> SKIP_EOI?
> 
> >  
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> > @@ -37,6 +38,7 @@
> >  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> >  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >  #define MSR_KVM_STEAL_TIME  0x4b564d03
> > +#define MSR_KVM_EOI_EN      0x4b564d04
> >  
> >  struct kvm_steal_time {
> >  	__u64 steal;
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index eba8345..bda4877 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -408,6 +408,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >  			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >  			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >  			     (1 << KVM_FEATURE_ASYNC_PF) |
> > +			     (1 << KVM_FEATURE_EOI) |
> >  			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >  
> >  		if (sched_info_on())
> > diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> > index 7e06ba1..07bdfab 100644
> > --- a/arch/x86/kvm/irq.c
> > +++ b/arch/x86/kvm/irq.c
> > @@ -48,7 +48,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
> >  	if (!irqchip_in_kernel(v->kvm))
> >  		return v->arch.interrupt.pending;
> >  
> > -	if (kvm_apic_has_interrupt(v) == -1) {	/* LAPIC */
> > +	if (kvm_apic_has_interrupt(v) < 0) {	/* LAPIC */
> >  		if (kvm_apic_accept_pic_intr(v)) {
> >  			s = pic_irqchip(v->kvm);	/* PIC */
> >  			return s->output;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 93c1574..c7e6ffb 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -270,6 +270,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> >  			irq->level, irq->trig_mode);
> >  }
> >  
> > +static int eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
> > +{
> > +
> > +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
> > +				      sizeof(val));
> > +}
> > +
> > +static int eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
> > +{
> > +
> > +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
> > +				      sizeof(*val));
> > +}
> > +
> > +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
> > +{
> > +	return vcpu->arch.eoi.msr_val & KVM_MSR_ENABLED;
> > +}
> 
> Please find a more descriptive name, such as pv_eoi_* (or skip_eoi,
> or...).
> 
> > +
> > +static bool eoi_get_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	u8 val;
> > +	if (eoi_get_user(vcpu, &val) < 0)
> > +		apic_debug("Can't read EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +	return val & 0x1;
> > +}
> > +
> > +static void eoi_set_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	if (eoi_put_user(vcpu, 0x1) < 0) {
> > +		apic_debug("Can't set EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +		return;
> > +	}
> > +	__set_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> > +static void eoi_clr_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	if (eoi_put_user(vcpu, 0x0) < 0) {
> > +		apic_debug("Can't clear EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +		return;
> > +	}
> > +	__clear_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> >  static inline int apic_find_highest_isr(struct kvm_lapic *apic)
> >  {
> >  	int result;
> > @@ -482,16 +530,21 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
> >  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
> >  }
> >  
> > -static void apic_set_eoi(struct kvm_lapic *apic)
> > +static int apic_set_eoi(struct kvm_lapic *apic)
> >  {
> >  	int vector = apic_find_highest_isr(apic);
> > +
> > +	trace_kvm_eoi(apic, vector);
> > +
> >  	/*
> >  	 * Not every write EOI will has corresponding ISR,
> >  	 * one example is when Kernel check timer on setup_IO_APIC
> >  	 */
> >  	if (vector == -1)
> > -		return;
> > +		return vector;
> >  
> > +	if (eoi_enabled(apic->vcpu))
> > +		eoi_clr_pending(apic->vcpu);
> 
> Why is this here again? If you got here, the bit should be cleared 
> already?

It's here in case guest invokes the EOI MSR instead of
clearing the bit. In this case we clear it instead of the guest.
The idea being, PV EOI is always optional never mandatory.
I'll add a comment.

> >  	apic_clear_vector(vector, apic->regs + APIC_ISR);
> >  	apic_update_ppr(apic);
> >  
> > @@ -505,6 +558,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
> >  		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> >  	}
> >  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> > +	return vector;
> >  }
> >  
> >  static void apic_send_ipi(struct kvm_lapic *apic)
> > @@ -1085,6 +1139,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> >  	atomic_set(&apic->lapic_timer.pending, 0);
> >  	if (kvm_vcpu_is_bsp(vcpu))
> >  		vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
> > +	vcpu->arch.eoi.msr_val = 0;
> >  	apic_update_ppr(apic);
> >  
> >  	vcpu->arch.apic_arb_prio = 0;
> > @@ -1211,9 +1266,10 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
> >  
> >  	apic_update_ppr(apic);
> >  	highest_irr = apic_find_highest_irr(apic);
> > -	if ((highest_irr == -1) ||
> > -	    ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> > +	if (highest_irr == -1)
> >  		return -1;
> > +	if (((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> > +		return -2;
> >  	return highest_irr;
> >  }
> >  
> > @@ -1245,9 +1301,20 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> >  	int vector = kvm_apic_has_interrupt(vcpu);
> >  	struct kvm_lapic *apic = vcpu->arch.apic;
> >  
> > -	if (vector == -1)
> > +	/* Detect interrupt nesting and disable EOI optimization */
> > +	if (eoi_enabled(vcpu) && vector == -2)
> > +		eoi_clr_pending(vcpu);
> > +
> > +	if (vector < 0)
> >  		return -1;
> >  
> > +	if (eoi_enabled(vcpu)) {
> > +		if (kvm_ioapic_handles_vector(vcpu->kvm, vector))
> > +			eoi_clr_pending(vcpu);
> > +		else
> > +			eoi_set_pending(vcpu);
> > +	}
> > +
> >  	apic_set_vector(vector, apic->regs + APIC_ISR);
> >  	apic_update_ppr(apic);
> >  	apic_clear_irr(vector, apic);
> > @@ -1262,6 +1329,14 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
> >  			     MSR_IA32_APICBASE_BASE;
> >  	kvm_apic_set_version(vcpu);
> >  
> > +	if (eoi_enabled(apic->vcpu)) {
> > +		if (eoi_get_pending(apic->vcpu))
> > +			__set_bit(KVM_APIC_EOI_PENDING,
> > +				  &vcpu->arch.apic_attention);
> > +		else
> > +			__clear_bit(KVM_APIC_EOI_PENDING,
> > +				    &vcpu->arch.apic_attention);
> > +	}
> 
> 
> Why is this necessary? kvm_apic_post_state_restore cannot change the 
> per cpu in memory value.

Exactly. Memory is migrated but the internal flag
in apic_attention isn't. So we make sure apic_attention
is in sync with the per cpu value.
I'll add a comment.

> >  	apic_update_ppr(apic);
> >  	hrtimer_cancel(&apic->lapic_timer.timer);
> >  	update_divide_count(apic);
> > @@ -1283,11 +1358,25 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
> >  		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
> >  }
> >  
> > +static void apic_update_eoi(struct kvm_lapic *apic)
> > +{
> > +	int vector;
> > +	BUG_ON(!eoi_enabled(apic->vcpu));
> > +	if (eoi_get_pending(apic->vcpu))
> > +		return;
> > +	__clear_bit(KVM_APIC_EOI_PENDING, &apic->vcpu->arch.apic_attention);
> > +	vector = apic_set_eoi(apic);
> > +	trace_kvm_pv_eoi(apic, vector);
> > +}
> 
> KVM_APIC_EOI_PENDING appears to be a shadow of the pervcpu value,
> to speed up processing. If it is, please make that clear. 

Exactly. I'll add a comment where this bit is defined.

> >  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
> >  {
> >  	u32 data;
> >  	void *vapic;
> >  
> > +	if (test_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention))
> > +		apic_update_eoi(vcpu->arch.apic);
> > +
> >  	if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
> >  		return;
> >  
> > @@ -1394,3 +1483,14 @@ int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
> >  
> >  	return 0;
> >  }
> > +
> > +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data)
> > +{
> > +	u64 addr = data & ~KVM_MSR_ENABLED;
> > +	if (eoi_enabled(vcpu))
> > +		eoi_clr_pending(vcpu);
> > +	vcpu->arch.eoi.msr_val = data;
> > +	if (!eoi_enabled(vcpu))
> > +		return 0;
> > +	return kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.eoi.data, addr);
> > +}
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 6f4ce25..042dace 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -60,4 +60,6 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
> >  {
> >  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
> >  }
> > +
> > +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data);
> >  #endif
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 911d264..851914e 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -517,6 +517,40 @@ TRACE_EVENT(kvm_apic_accept_irq,
> >  		  __entry->coalesced ? " (coalesced)" : "")
> >  );
> >  
> > +TRACE_EVENT(kvm_eoi,
> > +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> > +	    TP_ARGS(apic, vector),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	__u32,		apicid		)
> > +		__field(	int,		vector		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->apicid		= apic->vcpu->vcpu_id;
> > +		__entry->vector		= vector;
> > +	),
> > +
> > +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> > +);
> > +
> > +TRACE_EVENT(kvm_pv_eoi,
> > +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> > +	    TP_ARGS(apic, vector),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	__u32,		apicid		)
> > +		__field(	int,		vector		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->apicid		= apic->vcpu->vcpu_id;
> > +		__entry->vector		= vector;
> > +	),
> > +
> > +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> > +);
> > +
> >  /*
> >   * Tracepoint for nested VMRUN
> >   */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 185a2b8..bfcd97d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -795,6 +795,7 @@ static u32 msrs_to_save[] = {
> >  	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >  	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >  	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> > +	MSR_KVM_EOI_EN,
> >  	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >  	MSR_STAR,
> >  #ifdef CONFIG_X86_64
> > @@ -1653,6 +1654,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> >  
> >  		break;
> > +	case MSR_KVM_EOI_EN:
> > +		if (kvm_pv_enable_apic_eoi(vcpu, data))
> > +			return 1;
> > +		break;
> >  
> >  	case MSR_IA32_MCG_CTL:
> >  	case MSR_IA32_MCG_STATUS:
> > @@ -5370,7 +5375,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	if (unlikely(vcpu->arch.tsc_always_catchup))
> >  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >  
> > -	kvm_lapic_sync_from_vapic(vcpu);
> > +	if (unlikely(vcpu->arch.apic_attention))
> > +		kvm_lapic_sync_from_vapic(vcpu);
> 
> This apic_attention check is unrelated, please have in a separate patch.

OK.

> Also please document that this is only possible because of interrupt window 
> exiting. 

Not sure I understand what you are saying here.

When we try to inject an interrupt, this causes the
destination vcpu to exit. At that point we check whether guest has
finished handling the previous interrupt.
That's one reason to have apic_attention set.
Another is use of vapic.

So if we add a comment it belongs at kvm_lapic_sync_from_vapic
not here in the generic code.

> The Hyper-V spec, does it provide similar interface ? It mentions
> that "Most EOI intercepts can be eliminated and done lazily if the guest
> OS leaves a marker when it performs an EOI".

I think the kvm data path for that will be able to reuse our code.

-- 
MST

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv3 5/8] kvm_para: guest side for eoi avoidance
  2012-05-16  1:51   ` Marcelo Tosatti
@ 2012-05-16  6:22     ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-16  6:22 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: x86, kvm, Ingo Molnar, H. Peter Anvin, Avi Kivity, gleb,
	Linus Torvalds, linux-kernel

On Tue, May 15, 2012 at 10:51:24PM -0300, Marcelo Tosatti wrote:
> On Tue, May 15, 2012 at 05:36:12PM +0300, Michael S. Tsirkin wrote:
> > The idea is simple: there's a bit, per APIC, in guest memory,
> > that tells the guest that it does not need EOI.
> > Guest tests it using a single est and clear operation - this is
> > necessary so that host can detect interrupt nesting - and if set, it can
> > skip the EOI MSR.
> > 
> > I run a simple microbenchmark to show exit reduction
> > (note: for testing, need to apply patch 7 from the series + a qemu patch
> >  I posted separately, on host):
> > 
> > Before:
> > 
> > Performance counter stats for 'sleep 1s':
> > 
> >             47,357 kvm:kvm_entry                                                [99.98%]
> >                  0 kvm:kvm_hypercall                                            [99.98%]
> >                  0 kvm:kvm_hv_hypercall                                         [99.98%]
> >              5,001 kvm:kvm_pio                                                  [99.98%]
> >                  0 kvm:kvm_cpuid                                                [99.98%]
> >             22,124 kvm:kvm_apic                                                 [99.98%]
> >             49,849 kvm:kvm_exit                                                 [99.98%]
> >             21,115 kvm:kvm_inj_virq                                             [99.98%]
> >                  0 kvm:kvm_inj_exception                                        [99.98%]
> >                  0 kvm:kvm_page_fault                                           [99.98%]
> >             22,937 kvm:kvm_msr                                                  [99.98%]
> >                  0 kvm:kvm_cr                                                   [99.98%]
> >                  0 kvm:kvm_pic_set_irq                                          [99.98%]
> >                  0 kvm:kvm_apic_ipi                                             [99.98%]
> >             22,207 kvm:kvm_apic_accept_irq                                      [99.98%]
> >             22,421 kvm:kvm_eoi                                                  [99.98%]
> >                  0 kvm:kvm_pv_eoi                                               [99.99%]
> >                  0 kvm:kvm_nested_vmrun                                         [99.99%]
> >                  0 kvm:kvm_nested_intercepts                                    [99.99%]
> >                  0 kvm:kvm_nested_vmexit                                        [99.99%]
> >                  0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
> >                  0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
> >                  0 kvm:kvm_invlpga                                              [99.99%]
> >                  0 kvm:kvm_skinit                                               [99.99%]
> >                 57 kvm:kvm_emulate_insn                                         [99.99%]
> >                  0 kvm:vcpu_match_mmio                                          [99.99%]
> >                  0 kvm:kvm_userspace_exit                                       [99.99%]
> >                  2 kvm:kvm_set_irq                                              [99.99%]
> >                  2 kvm:kvm_ioapic_set_irq                                       [99.99%]
> >             23,609 kvm:kvm_msi_set_irq                                          [99.99%]
> >                  1 kvm:kvm_ack_irq                                              [99.99%]
> >                131 kvm:kvm_mmio                                                 [99.99%]
> >                226 kvm:kvm_fpu                                                  [100.00%]
> >                  0 kvm:kvm_age_page                                             [100.00%]
> >                  0 kvm:kvm_try_async_get_page                                    [100.00%]
> >                  0 kvm:kvm_async_pf_doublefault                                    [100.00%]
> >                  0 kvm:kvm_async_pf_not_present                                    [100.00%]
> >                  0 kvm:kvm_async_pf_ready                                       [100.00%]
> >                  0 kvm:kvm_async_pf_completed
> > 
> >        1.002100578 seconds time elapsed
> > 
> > After:
> > 
> >  Performance counter stats for 'sleep 1s':
> > 
> >             28,354 kvm:kvm_entry                                                [99.98%]
> >                  0 kvm:kvm_hypercall                                            [99.98%]
> >                  0 kvm:kvm_hv_hypercall                                         [99.98%]
> >              1,347 kvm:kvm_pio                                                  [99.98%]
> >                  0 kvm:kvm_cpuid                                                [99.98%]
> >              1,931 kvm:kvm_apic                                                 [99.98%]
> >             29,595 kvm:kvm_exit                                                 [99.98%]
> >             24,884 kvm:kvm_inj_virq                                             [99.98%]
> >                  0 kvm:kvm_inj_exception                                        [99.98%]
> >                  0 kvm:kvm_page_fault                                           [99.98%]
> >              1,986 kvm:kvm_msr                                                  [99.98%]
> >                  0 kvm:kvm_cr                                                   [99.98%]
> >                  0 kvm:kvm_pic_set_irq                                          [99.98%]
> >                  0 kvm:kvm_apic_ipi                                             [99.99%]
> >             25,953 kvm:kvm_apic_accept_irq                                      [99.99%]
> >             26,132 kvm:kvm_eoi                                                  [99.99%]
> >             26,593 kvm:kvm_pv_eoi                                               [99.99%]
> >                  0 kvm:kvm_nested_vmrun                                         [99.99%]
> >                  0 kvm:kvm_nested_intercepts                                    [99.99%]
> >                  0 kvm:kvm_nested_vmexit                                        [99.99%]
> >                  0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
> >                  0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
> >                  0 kvm:kvm_invlpga                                              [99.99%]
> >                  0 kvm:kvm_skinit                                               [99.99%]
> >                284 kvm:kvm_emulate_insn                                         [99.99%]
> >                 68 kvm:vcpu_match_mmio                                          [99.99%]
> >                 68 kvm:kvm_userspace_exit                                       [99.99%]
> >                  2 kvm:kvm_set_irq                                              [99.99%]
> >                  2 kvm:kvm_ioapic_set_irq                                       [99.99%]
> >             28,288 kvm:kvm_msi_set_irq                                          [99.99%]
> >                  1 kvm:kvm_ack_irq                                              [99.99%]
> >                131 kvm:kvm_mmio                                                 [100.00%]
> >                588 kvm:kvm_fpu                                                  [100.00%]
> >                  0 kvm:kvm_age_page                                             [100.00%]
> >                  0 kvm:kvm_try_async_get_page                                    [100.00%]
> >                  0 kvm:kvm_async_pf_doublefault                                    [100.00%]
> >                  0 kvm:kvm_async_pf_not_present                                    [100.00%]
> >                  0 kvm:kvm_async_pf_ready                                       [100.00%]
> >                  0 kvm:kvm_async_pf_completed
> > 
> >        1.002039622 seconds time elapsed
> > 
> > We see that # of exits is almost halved.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  arch/x86/include/asm/bitops.h |    6 +++-
> >  arch/x86/kernel/kvm.c         |   50 ++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 51 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index b97596e..c9c70ea 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -26,11 +26,13 @@
> >  #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
> >  /* Technically wrong, but this avoids compilation errors on some gcc
> >     versions. */
> > -#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
> > +#define BITOP_ADDR_CONSTRAINT "=m"
> >  #else
> > -#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
> > +#define BITOP_ADDR_CONSTRAINT "+m"
> >  #endif
> >  
> > +#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
> > +
> >  #define ADDR				BITOP_ADDR(addr)
> >  
> >  /*
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index e554e5a..31f85cb 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -39,6 +39,8 @@
> >  #include <asm/desc.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/idle.h>
> > +#include <asm/apic.h>
> > +#include <asm/apicdef.h>
> >  
> >  static int kvmapf = 1;
> >  
> > @@ -283,6 +285,23 @@ static void kvm_register_steal_time(void)
> >  		cpu, __pa(st));
> >  }
> >  
> > +/* size alignment is implied but just to make it explicit. */
> > +static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) __aligned(2) = 0;
> 
> DECLARE_PER_CPU_ALIGNED, can be heavily accessed.

Are you sure this will help? Without this, will all values
share a cache line?
I was unable to trigger any perf difference and it does
increase the cache footprint.

> > +static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
> > +{
> > +	/**
> > +	 * This relies on __test_and_clear_bit to modify the memory
> > +	 * in a way that is atomic with respect to the local CPU.
> > +	 * The hypervisor only accesses this memory from the local CPU so
> > +	 * there's no need for lock or memory barriers.
> > +	 * An optimization barrier is implied in apic write.
> > +	 */
> > +	if (__test_and_clear_bit(0, &__get_cpu_var(kvm_apic_eoi)))
> > +		return;
> 
> Name and #define for the bit, please.

Where would be a good place for this define?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv3 7/8] kvm: host side for eoi optimization
  2012-05-16  1:55   ` Marcelo Tosatti
  2012-05-16  6:16     ` Michael S. Tsirkin
@ 2012-05-16  8:58     ` Michael S. Tsirkin
  1 sibling, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2012-05-16  8:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: x86, kvm, Ingo Molnar, H. Peter Anvin, Avi Kivity, gleb,
	Linus Torvalds, linux-kernel

On Tue, May 15, 2012 at 10:55:19PM -0300, Marcelo Tosatti wrote:
> On Tue, May 15, 2012 at 05:36:19PM +0300, Michael S. Tsirkin wrote:
> > Implementation of PV EOI using shared memory.
> > This reduces the number of exits an interrupt
> > causes as much as by half.
> > 
> > The idea is simple: there's a bit, per APIC, in guest memory,
> > that tells the guest that it does not need EOI.
> > We set it before injecting an interrupt and clear
> > before injecting a nested one. Guest tests it using
> > a test and clear operation - this is necessary
> > so that host can detect interrupt nesting -
> > and if set, it can skip the EOI MSR.
> > 
> > There's a new MSR to set the address of said register
> > in guest memory. Otherwise not much changed:
> > - Guest EOI is not required
> > - Register is tested & ISR is automatically cleared on exit
> > For tetsing results see description of patch 7 of the series.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    6 ++
> >  arch/x86/include/asm/kvm_para.h |    2 +
> >  arch/x86/kvm/cpuid.c            |    1 +
> >  arch/x86/kvm/irq.c              |    2 +-
> >  arch/x86/kvm/lapic.c            |  110 +++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/lapic.h            |    2 +
> >  arch/x86/kvm/trace.h            |   34 ++++++++++++
> >  arch/x86/kvm/x86.c              |    8 +++-
> >  8 files changed, 158 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 32297f2..7673f4d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -174,6 +174,7 @@ enum {
> >  
> >  /* apic attention bits */
> >  #define KVM_APIC_CHECK_VAPIC	0
> > +#define KVM_APIC_EOI_PENDING	1
> >  
> >  /*
> >   * We don't want allocation failures within the mmu code, so we preallocate
> > @@ -485,6 +486,11 @@ struct kvm_vcpu_arch {
> >  		u64 length;
> >  		u64 status;
> >  	} osvw;
> > +
> > +	struct {
> > +		u64 msr_val;
> > +		struct gfn_to_hva_cache data;
> > +	} eoi;
> >  };
> >  
> >  struct kvm_lpage_info {
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 734c376..195840d 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -22,6 +22,7 @@
> >  #define KVM_FEATURE_CLOCKSOURCE2        3
> >  #define KVM_FEATURE_ASYNC_PF		4
> >  #define KVM_FEATURE_STEAL_TIME		5
> > +#define KVM_FEATURE_EOI			6
> 
> SKIP_EOI?
> 
> >  
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> > @@ -37,6 +38,7 @@
> >  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> >  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >  #define MSR_KVM_STEAL_TIME  0x4b564d03
> > +#define MSR_KVM_EOI_EN      0x4b564d04
> >  
> >  struct kvm_steal_time {
> >  	__u64 steal;
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index eba8345..bda4877 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -408,6 +408,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >  			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >  			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >  			     (1 << KVM_FEATURE_ASYNC_PF) |
> > +			     (1 << KVM_FEATURE_EOI) |
> >  			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >  
> >  		if (sched_info_on())
> > diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> > index 7e06ba1..07bdfab 100644
> > --- a/arch/x86/kvm/irq.c
> > +++ b/arch/x86/kvm/irq.c
> > @@ -48,7 +48,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
> >  	if (!irqchip_in_kernel(v->kvm))
> >  		return v->arch.interrupt.pending;
> >  
> > -	if (kvm_apic_has_interrupt(v) == -1) {	/* LAPIC */
> > +	if (kvm_apic_has_interrupt(v) < 0) {	/* LAPIC */
> >  		if (kvm_apic_accept_pic_intr(v)) {
> >  			s = pic_irqchip(v->kvm);	/* PIC */
> >  			return s->output;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 93c1574..c7e6ffb 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -270,6 +270,54 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> >  			irq->level, irq->trig_mode);
> >  }
> >  
> > +static int eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
> > +{
> > +
> > +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
> > +				      sizeof(val));
> > +}
> > +
> > +static int eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
> > +{
> > +
> > +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
> > +				      sizeof(*val));
> > +}
> > +
> > +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
> > +{
> > +	return vcpu->arch.eoi.msr_val & KVM_MSR_ENABLED;
> > +}
> 
> Please find a more descriptive name, such as pv_eoi_* (or skip_eoi,
> or...).
> 
> > +
> > +static bool eoi_get_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	u8 val;
> > +	if (eoi_get_user(vcpu, &val) < 0)
> > +		apic_debug("Can't read EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +	return val & 0x1;
> > +}
> > +
> > +static void eoi_set_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	if (eoi_put_user(vcpu, 0x1) < 0) {
> > +		apic_debug("Can't set EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +		return;
> > +	}
> > +	__set_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> > +static void eoi_clr_pending(struct kvm_vcpu *vcpu)
> > +{
> > +	if (eoi_put_user(vcpu, 0x0) < 0) {
> > +		apic_debug("Can't clear EOI MSR value: 0x%llx\n",
> > +			   (unsigned long long)vcpi->arch.eoi.msr_val);
> > +		return;
> > +	}
> > +	__clear_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention);
> > +}
> > +
> >  static inline int apic_find_highest_isr(struct kvm_lapic *apic)
> >  {
> >  	int result;
> > @@ -482,16 +530,21 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
> >  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
> >  }
> >  
> > -static void apic_set_eoi(struct kvm_lapic *apic)
> > +static int apic_set_eoi(struct kvm_lapic *apic)
> >  {
> >  	int vector = apic_find_highest_isr(apic);
> > +
> > +	trace_kvm_eoi(apic, vector);
> > +
> >  	/*
> >  	 * Not every write EOI will has corresponding ISR,
> >  	 * one example is when Kernel check timer on setup_IO_APIC
> >  	 */
> >  	if (vector == -1)
> > -		return;
> > +		return vector;
> >  
> > +	if (eoi_enabled(apic->vcpu))
> > +		eoi_clr_pending(apic->vcpu);
> 
> Why is this here again? If you got here, the bit should be cleared 
> already?
> 
> >  	apic_clear_vector(vector, apic->regs + APIC_ISR);
> >  	apic_update_ppr(apic);
> >  
> > @@ -505,6 +558,7 @@ static void apic_set_eoi(struct kvm_lapic *apic)
> >  		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> >  	}
> >  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> > +	return vector;
> >  }
> >  
> >  static void apic_send_ipi(struct kvm_lapic *apic)
> > @@ -1085,6 +1139,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> >  	atomic_set(&apic->lapic_timer.pending, 0);
> >  	if (kvm_vcpu_is_bsp(vcpu))
> >  		vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
> > +	vcpu->arch.eoi.msr_val = 0;
> >  	apic_update_ppr(apic);
> >  
> >  	vcpu->arch.apic_arb_prio = 0;
> > @@ -1211,9 +1266,10 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
> >  
> >  	apic_update_ppr(apic);
> >  	highest_irr = apic_find_highest_irr(apic);
> > -	if ((highest_irr == -1) ||
> > -	    ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> > +	if (highest_irr == -1)
> >  		return -1;
> > +	if (((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
> > +		return -2;
> >  	return highest_irr;
> >  }
> >  
> > @@ -1245,9 +1301,20 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> >  	int vector = kvm_apic_has_interrupt(vcpu);
> >  	struct kvm_lapic *apic = vcpu->arch.apic;
> >  
> > -	if (vector == -1)
> > +	/* Detect interrupt nesting and disable EOI optimization */
> > +	if (eoi_enabled(vcpu) && vector == -2)
> > +		eoi_clr_pending(vcpu);
> > +
> > +	if (vector < 0)
> >  		return -1;
> >  
> > +	if (eoi_enabled(vcpu)) {
> > +		if (kvm_ioapic_handles_vector(vcpu->kvm, vector))
> > +			eoi_clr_pending(vcpu);
> > +		else
> > +			eoi_set_pending(vcpu);
> > +	}
> > +
> >  	apic_set_vector(vector, apic->regs + APIC_ISR);
> >  	apic_update_ppr(apic);
> >  	apic_clear_irr(vector, apic);
> > @@ -1262,6 +1329,14 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
> >  			     MSR_IA32_APICBASE_BASE;
> >  	kvm_apic_set_version(vcpu);
> >  
> > +	if (eoi_enabled(apic->vcpu)) {
> > +		if (eoi_get_pending(apic->vcpu))
> > +			__set_bit(KVM_APIC_EOI_PENDING,
> > +				  &vcpu->arch.apic_attention);
> > +		else
> > +			__clear_bit(KVM_APIC_EOI_PENDING,
> > +				    &vcpu->arch.apic_attention);
> > +	}
> 
> 
> Why is this necessary? kvm_apic_post_state_restore cannot change the 
> per cpu in memory value.
> 
> >  	apic_update_ppr(apic);
> >  	hrtimer_cancel(&apic->lapic_timer.timer);
> >  	update_divide_count(apic);
> > @@ -1283,11 +1358,25 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
> >  		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
> >  }
> >  
> > +static void apic_update_eoi(struct kvm_lapic *apic)
> > +{
> > +	int vector;
> > +	BUG_ON(!eoi_enabled(apic->vcpu));
> > +	if (eoi_get_pending(apic->vcpu))
> > +		return;
> > +	__clear_bit(KVM_APIC_EOI_PENDING, &apic->vcpu->arch.apic_attention);
> > +	vector = apic_set_eoi(apic);
> > +	trace_kvm_pv_eoi(apic, vector);
> > +}
> 
> KVM_APIC_EOI_PENDING appears to be a shadow of the pervcpu value,
> to speed up processing. If it is, please make that clear. 

BTW yes it is a shadow but it's not just for speed:
without it there would be no easy way to distinguish between
two states:
	- we decided to not use pv eoi for a specific interrupt
	- we enabled pv eoi for an interrupt, and guest has cleared it

This bit let us distinguish: in 1st case it is cleared in second
case it is set.

> >  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
> >  {
> >  	u32 data;
> >  	void *vapic;
> >  
> > +	if (test_bit(KVM_APIC_EOI_PENDING, &vcpu->arch.apic_attention))
> > +		apic_update_eoi(vcpu->arch.apic);
> > +
> >  	if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
> >  		return;
> >  
> > @@ -1394,3 +1483,14 @@ int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
> >  
> >  	return 0;
> >  }
> > +
> > +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data)
> > +{
> > +	u64 addr = data & ~KVM_MSR_ENABLED;
> > +	if (eoi_enabled(vcpu))
> > +		eoi_clr_pending(vcpu);
> > +	vcpu->arch.eoi.msr_val = data;
> > +	if (!eoi_enabled(vcpu))
> > +		return 0;
> > +	return kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.eoi.data, addr);
> > +}
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 6f4ce25..042dace 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -60,4 +60,6 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
> >  {
> >  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
> >  }
> > +
> > +int kvm_pv_enable_apic_eoi(struct kvm_vcpu *vcpu, u64 data);
> >  #endif
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 911d264..851914e 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -517,6 +517,40 @@ TRACE_EVENT(kvm_apic_accept_irq,
> >  		  __entry->coalesced ? " (coalesced)" : "")
> >  );
> >  
> > +TRACE_EVENT(kvm_eoi,
> > +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> > +	    TP_ARGS(apic, vector),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	__u32,		apicid		)
> > +		__field(	int,		vector		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->apicid		= apic->vcpu->vcpu_id;
> > +		__entry->vector		= vector;
> > +	),
> > +
> > +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> > +);
> > +
> > +TRACE_EVENT(kvm_pv_eoi,
> > +	    TP_PROTO(struct kvm_lapic *apic, int vector),
> > +	    TP_ARGS(apic, vector),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(	__u32,		apicid		)
> > +		__field(	int,		vector		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->apicid		= apic->vcpu->vcpu_id;
> > +		__entry->vector		= vector;
> > +	),
> > +
> > +	TP_printk("apicid %x vector %d", __entry->apicid, __entry->vector)
> > +);
> > +
> >  /*
> >   * Tracepoint for nested VMRUN
> >   */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 185a2b8..bfcd97d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -795,6 +795,7 @@ static u32 msrs_to_save[] = {
> >  	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >  	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >  	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> > +	MSR_KVM_EOI_EN,
> >  	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >  	MSR_STAR,
> >  #ifdef CONFIG_X86_64
> > @@ -1653,6 +1654,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> >  
> >  		break;
> > +	case MSR_KVM_EOI_EN:
> > +		if (kvm_pv_enable_apic_eoi(vcpu, data))
> > +			return 1;
> > +		break;
> >  
> >  	case MSR_IA32_MCG_CTL:
> >  	case MSR_IA32_MCG_STATUS:
> > @@ -5370,7 +5375,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	if (unlikely(vcpu->arch.tsc_always_catchup))
> >  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >  
> > -	kvm_lapic_sync_from_vapic(vcpu);
> > +	if (unlikely(vcpu->arch.apic_attention))
> > +		kvm_lapic_sync_from_vapic(vcpu);
> 
> This apic_attention check is unrelated, please have in a separate patch.
> 
> Also please document that this is only possible because of interrupt window 
> exiting. 
> 
> The Hyper-V spec, does it provide similar interface ? It mentions
> that "Most EOI intercepts can be eliminated and done lazily if the guest
> OS leaves a marker when it performs an EOI".

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-05-16  8:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-15 14:35 [PATCHv3 0/8] apic: eoi optimization support Michael S. Tsirkin
2012-05-15 14:35 ` [PATCHv3 1/8] apic: fix typo EIO_ACK -> EOI_ACK and document Michael S. Tsirkin
2012-05-15 14:35 ` [PATCHv3 2/8] apic: use symbolic APIC_EOI_ACK Michael S. Tsirkin
2012-05-15 14:35 ` [PATCHv3 3/8] x86: add apic->eoi_write callback Michael S. Tsirkin
2012-05-15 14:36 ` [PATCHv3 4/8] x86: eoi micro-optimization Michael S. Tsirkin
2012-05-15 14:36 ` [PATCHv3 6/8] x86/bitops: note on __test_and_clear_bit atomicity Michael S. Tsirkin
2012-05-15 14:36 ` [PATCHv3 5/8] kvm_para: guest side for eoi avoidance Michael S. Tsirkin
2012-05-16  1:51   ` Marcelo Tosatti
2012-05-16  6:22     ` Michael S. Tsirkin
2012-05-15 14:36 ` [PATCHv3 7/8] kvm: host side for eoi optimization Michael S. Tsirkin
2012-05-16  1:55   ` Marcelo Tosatti
2012-05-16  6:16     ` Michael S. Tsirkin
2012-05-16  8:58     ` Michael S. Tsirkin
2012-05-15 14:36 ` [PATCHv3 8/8] kvm: eoi msi documentation Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.