All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/3] x86, apicv: Add APIC virtualization support
@ 2012-12-17  5:30 Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 1/3] x86, apicv: add APICv register " Yang Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Yang Zhang @ 2012-12-17  5:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, Yang Zhang

From: Yang Zhang <yang.z.zhang@Intel.com>

APIC virtualization is a new feature which can eliminate most of VM exit
when vcpu handle a interrupt:

APIC register virtualization:
        APIC read access doesn't cause APIC-access VM exits.
        APIC write becomes trap-like.

Virtual interrupt delivery:
        Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
        manually, which is fully taken care of by the hardware.

Please refer to Intel SDM volume 3, chapter 29 for more details.
Changes v6 to v7:
 * fix a bug when set exit bitmap.
 * Rebased on top of KVM upstream.

Changes v5 to v6:
 * minor adjustments according gleb's comments
 * Rebased on top of KVM upstream.

Changes v4 to v5:
 * Set eoi exit bitmap when an interrupt has notifier registered.
 * Use request to track ioapic entry's modification.
 * Rebased on top of KVM upstream.

Changes v3 to v4:
 * use one option to control both register virtualization and virtual interrupt
   delivery.
 * Update eoi exit bitmap when programing ioapic or programing apic's id/dfr/ldr.
 * Rebased on top of KVM upstream.

Changes v2 to v3:
 * Drop Posted Interrupt patch from v3.
   According Gleb's suggestion, we will use global vector for all VCPUs as notification
   event vector. So we will rewrite the Posted Interrupt patch. And resend it later.
 * Use TMR to set the eoi exiting bitmap. We only want to set eoi exiting bitmap for
   those interrupt which is level trigger or has notifier in EOI write path. So TMR is
   enough to distinguish the interrupt trigger mode.
 * Simplify some code according Gleb's comments.
 * rebased on top of KVM upstream.

Changes v1 to v2:
 * Add Posted Interrupt support in this series patch.
 * Since there is a notifer hook in vAPIC EOI for PIT interrupt. So always Set PIT
   interrupt in eoi exit bitmap to force vmexit when EOI to interrupt.
 * Rebased on top of KVM upstream

Yang Zhang (3):
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual interrupt delivery support
  x86, apicv: add virtual x2apic support

 arch/ia64/kvm/lapic.h           |    6 +
 arch/x86/include/asm/kvm_host.h |    6 +
 arch/x86/include/asm/vmx.h      |   13 ++
 arch/x86/kvm/irq.c              |   56 ++++++++-
 arch/x86/kvm/lapic.c            |   80 +++++++++----
 arch/x86/kvm/lapic.h            |   30 ++++-
 arch/x86/kvm/svm.c              |   24 ++++
 arch/x86/kvm/vmx.c              |  241 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   11 ++-
 include/linux/kvm_host.h        |    2 +
 virt/kvm/ioapic.c               |   36 ++++++
 virt/kvm/ioapic.h               |    1 +
 virt/kvm/irq_comm.c             |   20 ++++
 13 files changed, 481 insertions(+), 45 deletions(-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v7 1/3] x86, apicv: add APICv register virtualization support
  2012-12-17  5:30 [PATCH v7 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
@ 2012-12-17  5:30 ` Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 3/3] x86, apicv: add virtual x2apic support Yang Zhang
  2 siblings, 0 replies; 43+ messages in thread
From: Yang Zhang @ 2012-12-17  5:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, Yang Zhang, Kevin Tian

- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/x86/include/asm/vmx.h |    2 ++
 arch/x86/kvm/lapic.c       |   15 +++++++++++++++
 arch/x86/kvm/lapic.h       |    2 ++
 arch/x86/kvm/vmx.c         |   33 ++++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index e385df9..44c3f7e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
+#define EXIT_REASON_APIC_WRITE          56
 #define EXIT_REASON_INVPCID             58
 
 #define VMX_EXIT_REASONS \
@@ -141,6 +142,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9392f52..0664c13 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1212,6 +1212,21 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 
+/* emulate APIC access in a trap manner */
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+	u32 val = 0;
+
+	/* hw has done the conditional check and inst decode */
+	offset &= 0xff0;
+
+	apic_reg_read(vcpu->arch.apic, offset, 4, &val);
+
+	/* TODO: optimize to just emulate side effect w/o one more write */
+	apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e5ebf9f..9a8ee22 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b3101e3..7ed26ec 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -84,6 +84,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg_vid = 1;
+module_param(enable_apicv_reg_vid, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -762,6 +765,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2539,7 +2548,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_UNRESTRICTED_GUEST |
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
-			SECONDARY_EXEC_ENABLE_INVPCID;
+			SECONDARY_EXEC_ENABLE_INVPCID |
+			SECONDARY_EXEC_APIC_REGISTER_VIRT;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2550,6 +2560,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 		_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+		_cpu_based_2nd_exec_control &= ~(
+				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
 		   enabled */
@@ -2747,6 +2762,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
+	if (!cpu_has_vmx_apic_register_virt())
+		enable_apicv_reg_vid = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3861,6 +3879,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
 	if (!ple_gap)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	if (!enable_apicv_reg_vid)
+		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
 	return exec_control;
 }
 
@@ -4820,6 +4840,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_write(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 offset = exit_qualification & 0xfff;
+
+	/* APIC-write VM exit is trap-like and thus no need to adjust IP */
+	kvm_apic_write_nodecode(vcpu, offset);
+	return 1;
+}
+
 static int handle_task_switch(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5754,6 +5784,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
+	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-17  5:30 [PATCH v7 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 1/3] x86, apicv: add APICv register " Yang Zhang
@ 2012-12-17  5:30 ` Yang Zhang
  2012-12-20  0:59   ` Marcelo Tosatti
                     ` (3 more replies)
  2012-12-17  5:30 ` [PATCH v7 3/3] x86, apicv: add virtual x2apic support Yang Zhang
  2 siblings, 4 replies; 43+ messages in thread
From: Yang Zhang @ 2012-12-17  5:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, Yang Zhang, Kevin Tian, Yang Zhang

From: Yang Zhang <yang.z.zhang@Intel.com>

Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/ia64/kvm/lapic.h           |    6 ++
 arch/x86/include/asm/kvm_host.h |    6 ++
 arch/x86/include/asm/vmx.h      |   11 +++
 arch/x86/kvm/irq.c              |   56 +++++++++++++-
 arch/x86/kvm/lapic.c            |   65 ++++++++++-------
 arch/x86/kvm/lapic.h            |   28 ++++++-
 arch/x86/kvm/svm.c              |   24 ++++++
 arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   11 ++-
 include/linux/kvm_host.h        |    2 +
 virt/kvm/ioapic.c               |   36 +++++++++
 virt/kvm/ioapic.h               |    1 +
 virt/kvm/irq_comm.c             |   20 +++++
 13 files changed, 379 insertions(+), 41 deletions(-)

diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
index c5f92a9..cb59eb4 100644
--- a/arch/ia64/kvm/lapic.h
+++ b/arch/ia64/kvm/lapic.h
@@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
 #define kvm_apic_present(x) (true)
 #define kvm_lapic_enabled(x) (true)
 
+static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
+					struct kvm_lapic_irq *irq)
+{
+	/* IA64 has no apicv supporting, do nothing here */
+}
+
 #endif
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c431b33..b63a144 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,11 @@ struct kvm_x86_ops {
 	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
+	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
+	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
+	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
@@ -991,6 +996,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 44c3f7e..d1ab331 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -62,6 +62,7 @@
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
+#define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
@@ -143,6 +144,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
@@ -180,6 +182,7 @@ enum vmcs_field {
 	GUEST_GS_SELECTOR               = 0x0000080a,
 	GUEST_LDTR_SELECTOR             = 0x0000080c,
 	GUEST_TR_SELECTOR               = 0x0000080e,
+	GUEST_INTR_STATUS               = 0x00000810,
 	HOST_ES_SELECTOR                = 0x00000c00,
 	HOST_CS_SELECTOR                = 0x00000c02,
 	HOST_SS_SELECTOR                = 0x00000c04,
@@ -207,6 +210,14 @@ enum vmcs_field {
 	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
 	EPT_POINTER                     = 0x0000201a,
 	EPT_POINTER_HIGH                = 0x0000201b,
+	EOI_EXIT_BITMAP0                = 0x0000201c,
+	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
+	EOI_EXIT_BITMAP1                = 0x0000201e,
+	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
+	EOI_EXIT_BITMAP2                = 0x00002020,
+	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
+	EOI_EXIT_BITMAP3                = 0x00002022,
+	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index b111aee..e113440 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -38,6 +38,38 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
 
 /*
+ * check if there is pending interrupt from
+ * non-APIC source without intack.
+ */
+static int kvm_cpu_has_extint(struct kvm_vcpu *v)
+{
+	if (kvm_apic_accept_pic_intr(v))
+		return pic_irqchip(v->kvm)->output;	/* PIC */
+	else
+		return 0;
+}
+
+/*
+ * check if there is injectable interrupt:
+ * when virtual interrupt delivery enabled,
+ * interrupt from apic will handled by hardware,
+ * we don't need to check it here.
+ */
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
+{
+	if (!irqchip_in_kernel(v->kvm))
+		return v->arch.interrupt.pending;
+
+	if (kvm_cpu_has_extint(v))
+		return 1;
+
+	if (kvm_apic_vid_enabled(v))
+		return 0;
+
+	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
+}
+
+/*
  * check if there is pending interrupt without
  * intack.
  */
@@ -46,27 +78,41 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (!irqchip_in_kernel(v->kvm))
 		return v->arch.interrupt.pending;
 
-	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
-		return pic_irqchip(v->kvm)->output;	/* PIC */
+	if (kvm_cpu_has_extint(v))
+		return 1;
 
 	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
 
 /*
+ * Read pending interrupt(from non-APIC source)
+ * vector and intack.
+ */
+static int kvm_cpu_get_extint(struct kvm_vcpu *v)
+{
+	if (kvm_cpu_has_extint(v))
+		return kvm_pic_read_irq(v->kvm); /* PIC */
+	return -1;
+}
+
+/*
  * Read pending interrupt vector and intack.
  */
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 {
+	int vector;
+
 	if (!irqchip_in_kernel(v->kvm))
 		return v->arch.interrupt.nr;
 
-	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
-		return kvm_pic_read_irq(v->kvm);	/* PIC */
+	vector = kvm_cpu_get_extint(v);
+
+	if (kvm_apic_vid_enabled(v) || vector != -1)
+		return vector;			/* PIC */
 
 	return kvm_get_apic_interrupt(v);	/* APIC */
 }
-EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0664c13..61971bd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -133,6 +133,12 @@ static inline int apic_enabled(struct kvm_lapic *apic)
 	return kvm_apic_sw_enabled(apic) &&	kvm_apic_hw_enabled(apic);
 }
 
+bool kvm_apic_present(struct kvm_vcpu *vcpu)
+{
+	return kvm_vcpu_has_lapic(vcpu) && kvm_apic_hw_enabled(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_present);
+
 #define LVT_MASK	\
 	(APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
 
@@ -150,23 +156,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
-static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
-{
-	u16 cid;
-	ldr >>= 32 - map->ldr_bits;
-	cid = (ldr >> map->cid_shift) & map->cid_mask;
-
-	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
-
-	return cid;
-}
-
-static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
-{
-	ldr >>= (32 - map->ldr_bits);
-	return ldr & map->lid_mask;
-}
-
 static void recalculate_apic_map(struct kvm *kvm)
 {
 	struct kvm_apic_map *new, *old = NULL;
@@ -236,12 +225,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
 {
 	apic_set_reg(apic, APIC_ID, id << 24);
 	recalculate_apic_map(apic->vcpu->kvm);
+	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 }
 
 static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
 {
 	apic_set_reg(apic, APIC_LDR, id);
 	recalculate_apic_map(apic->vcpu->kvm);
+	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 }
 
 static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
@@ -740,6 +731,19 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
+static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
+{
+	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
+	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
+		int trigger_mode;
+		if (apic_test_vector(vector, apic->regs + APIC_TMR))
+			trigger_mode = IOAPIC_LEVEL_TRIG;
+		else
+			trigger_mode = IOAPIC_EDGE_TRIG;
+		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+	}
+}
+
 static int apic_set_eoi(struct kvm_lapic *apic)
 {
 	int vector = apic_find_highest_isr(apic);
@@ -756,19 +760,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 	apic_clear_isr(vector, apic);
 	apic_update_ppr(apic);
 
-	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
-	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
-		int trigger_mode;
-		if (apic_test_vector(vector, apic->regs + APIC_TMR))
-			trigger_mode = IOAPIC_LEVEL_TRIG;
-		else
-			trigger_mode = IOAPIC_EDGE_TRIG;
-		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
-	}
+	kvm_ioapic_send_eoi(apic, vector);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 	return vector;
 }
 
+/*
+ * this interface assumes a trap-like exit, which has already finished
+ * desired side effect including vISR and vPPR update.
+ */
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	trace_kvm_eoi(apic, vector);
+
+	kvm_ioapic_send_eoi(apic, vector);
+	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
+
 static void apic_send_ipi(struct kvm_lapic *apic)
 {
 	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
@@ -1071,6 +1082,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 		if (!apic_x2apic_mode(apic)) {
 			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
 			recalculate_apic_map(apic->vcpu->kvm);
+			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 		} else
 			ret = 1;
 		break;
@@ -1318,6 +1330,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		else
 			static_key_slow_inc(&apic_hw_disabled.key);
 		recalculate_apic_map(vcpu->kvm);
+		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 	}
 
 	if (!kvm_vcpu_is_bsp(apic->vcpu))
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 9a8ee22..9e3f888 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -39,6 +39,7 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu);
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
 int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
+int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
@@ -65,6 +66,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
 
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
@@ -75,6 +77,7 @@ int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 
 int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+bool kvm_apic_present(struct kvm_vcpu *vcpu);
 
 static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
 {
@@ -116,14 +119,31 @@ static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
 	return APIC_SPIV_APIC_ENABLED;
 }
 
-static inline bool kvm_apic_present(struct kvm_vcpu *vcpu)
+static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_has_lapic(vcpu) && kvm_apic_hw_enabled(vcpu->arch.apic);
+	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
 }
 
-static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
+static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
 {
-	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
+	return kvm_x86_ops->has_virtual_interrupt_delivery(vcpu);
+}
+
+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+	u16 cid;
+	ldr >>= 32 - map->ldr_bits;
+	cid = (ldr >> map->cid_shift) & map->cid_mask;
+
+	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
+
+	return cid;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+	ldr >>= (32 - map->ldr_bits);
+	return ldr & map->lid_mask;
 }
 
 #endif
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..951f1ec 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3571,6 +3571,26 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
 }
 
+static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+static void svm_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
+{
+	return ;
+}
+
+static void svm_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	return ;
+}
+
+static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	return ;
+}
+
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4290,6 +4310,10 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
+	.update_eoi_exitmap = svm_update_eoi_exitmap,
+	.reset_eoi_exitmap = svm_reset_eoi_exitmap,
+	.load_eoi_exitmap = svm_load_eoi_exitmap,
 
 	.set_tss_addr = svm_set_tss_addr,
 	.get_tdp_level = get_npt_level,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7ed26ec..be66c3e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -433,6 +433,8 @@ struct vcpu_vmx {
 
 	bool rdtscp_enabled;
 
+	unsigned long eoi_exit_bitmap[4];
+
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 };
@@ -771,6 +773,12 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
 		SECONDARY_EXEC_APIC_REGISTER_VIRT;
 }
 
+static inline bool cpu_has_vmx_virtual_intr_delivery(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2549,7 +2557,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
 			SECONDARY_EXEC_ENABLE_INVPCID |
-			SECONDARY_EXEC_APIC_REGISTER_VIRT;
+			SECONDARY_EXEC_APIC_REGISTER_VIRT |
+			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2563,7 +2572,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 
 	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
 		_cpu_based_2nd_exec_control &= ~(
-				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+				SECONDARY_EXEC_APIC_REGISTER_VIRT |
+				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
@@ -2762,9 +2772,15 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
-	if (!cpu_has_vmx_apic_register_virt())
+	if (!cpu_has_vmx_apic_register_virt() ||
+				!cpu_has_vmx_virtual_intr_delivery())
 		enable_apicv_reg_vid = 0;
 
+	if (enable_apicv_reg_vid)
+		kvm_x86_ops->update_cr8_intercept = NULL;
+	else
+		kvm_x86_ops->update_apic_irq = NULL;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3880,7 +3896,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!ple_gap)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 	if (!enable_apicv_reg_vid)
-		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
+		exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT |
+				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 	return exec_control;
 }
 
@@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 				vmx_secondary_exec_control(vmx));
 	}
 
+	if (enable_apicv_reg_vid) {
+		vmcs_write64(EOI_EXIT_BITMAP0, 0);
+		vmcs_write64(EOI_EXIT_BITMAP1, 0);
+		vmcs_write64(EOI_EXIT_BITMAP2, 0);
+		vmcs_write64(EOI_EXIT_BITMAP3, 0);
+
+		vmcs_write16(GUEST_INTR_STATUS, 0);
+	}
+
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
@@ -4840,6 +4866,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int vector = exit_qualification & 0xff;
+
+	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
+	kvm_apic_set_eoi_accelerated(vcpu, vector);
+	return 1;
+}
+
 static int handle_apic_write(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -5785,6 +5821,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
+	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
@@ -6134,6 +6171,110 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return enable_apicv_reg_vid;
+}
+
+static void vmx_update_rvi(int vector)
+{
+	u16 status;
+	u8 old;
+
+	status = vmcs_read16(GUEST_INTR_STATUS);
+	old = (u8)status & 0xff;
+	if ((u8)vector != old) {
+		status &= ~0xff;
+		status |= (u8)vector;
+		vmcs_write16(GUEST_INTR_STATUS, status);
+	}
+}
+
+static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
+{
+	if (max_irr == -1)
+		return ;
+
+	vmx_update_rvi(max_irr);
+}
+
+static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
+				u32 vector)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!enable_apicv_reg_vid)
+		return ;
+
+	if (WARN_ONCE((vector > 255),
+		"KVM VMX: vector (%d) out of range\n", vector))
+		return;
+
+	set_bit(vector, vmx->eoi_exit_bitmap);
+
+	kvm_make_request(KVM_REQ_EOIBITMAP, vcpu);
+}
+
+void vmx_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_lapic **dst;
+	struct kvm_apic_map *map;
+	unsigned long bitmap = 1;
+	int i;
+
+	rcu_read_lock();
+	map = rcu_dereference(kvm->arch.apic_map);
+
+	if (unlikely(!map)) {
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			set_eoi_exitmap_one(vcpu, irq->vector);
+		goto out;
+	}
+
+	if (irq->dest_mode == 0) { /* physical mode */
+		if (irq->delivery_mode == APIC_DM_LOWEST ||
+				irq->dest_id == 0xff) {
+			kvm_for_each_vcpu(i, vcpu, kvm)
+				set_eoi_exitmap_one(vcpu, irq->vector);
+			goto out;
+		}
+		dst = &map->phys_map[irq->dest_id & 0xff];
+	} else {
+		u32 mda = irq->dest_id << (32 - map->ldr_bits);
+
+		dst = map->logical_map[apic_cluster_id(map, mda)];
+
+		bitmap = apic_logical_id(map, mda);
+	}
+
+	for_each_set_bit(i, &bitmap, 16) {
+		if (!dst[i])
+			continue;
+		set_eoi_exitmap_one(dst[i]->vcpu, irq->vector);
+	}
+
+out:
+	rcu_read_unlock();
+}
+
+static void vmx_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	memset(vmx->eoi_exit_bitmap, 0, 32);
+}
+
+static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
+	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
+	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
+	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -7397,6 +7538,11 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
+	.update_apic_irq = vmx_update_apic_irq,
+	.update_eoi_exitmap = vmx_update_eoi_exitmap,
+	.reset_eoi_exitmap = vmx_reset_eoi_exitmap,
+	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1c9c834..ffacfef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
 			vcpu->arch.nmi_injected = true;
 			kvm_x86_ops->set_nmi(vcpu);
 		}
-	} else if (kvm_cpu_has_interrupt(vcpu)) {
+	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
 		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
 			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
 					    false);
@@ -5648,6 +5648,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_handle_pmu_event(vcpu);
 		if (kvm_check_request(KVM_REQ_PMI, vcpu))
 			kvm_deliver_pmi(vcpu);
+		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu))
+			kvm_x86_ops->load_eoi_exitmap(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
@@ -5656,10 +5658,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		/* enable NMI/IRQ window open exits if needed */
 		if (vcpu->arch.nmi_pending)
 			kvm_x86_ops->enable_nmi_window(vcpu);
-		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
+		else if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
+			/* update archtecture specific hints for APIC
+			 * virtual interrupt delivery */
+			if (kvm_x86_ops->update_apic_irq)
+				kvm_x86_ops->update_apic_irq(vcpu,
+					      kvm_lapic_find_highest_irr(vcpu));
 			update_cr8_intercept(vcpu);
 			kvm_lapic_sync_to_vapic(vcpu);
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 32fdc45..6d3b0aa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -122,6 +122,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_WATCHDOG          18
 #define KVM_REQ_MASTERCLOCK_UPDATE 19
 #define KVM_REQ_MCLOCK_INPROGRESS 20
+#define KVM_REQ_EOIBITMAP         21
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -689,6 +690,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
 		int irq_source_id, int level);
+bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index cfb7e4d..ea25331 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -115,6 +115,41 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
 	smp_wmb();
 }
 
+static void ioapic_update_eoi_exitmap_one(struct kvm_ioapic *ioapic, int pin)
+{
+	union kvm_ioapic_redirect_entry *e;
+
+	e = &ioapic->redirtbl[pin];
+
+	if ((e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
+		 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, pin))) {
+		struct kvm_lapic_irq irqe;
+
+		irqe.dest_id = e->fields.dest_id;
+		irqe.vector = e->fields.vector;
+		irqe.dest_mode = e->fields.dest_mode;
+		irqe.delivery_mode = e->fields.delivery_mode << 8;
+		kvm_x86_ops->update_eoi_exitmap(ioapic->kvm, &irqe);
+	}
+}
+
+void ioapic_update_eoi_exitmap(struct kvm *kvm)
+{
+	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+	union kvm_ioapic_redirect_entry *e;
+	int index;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(index, vcpu, kvm)
+		kvm_x86_ops->reset_eoi_exitmap(vcpu);
+
+	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
+		e = &ioapic->redirtbl[index];
+		if (!e->fields.mask)
+			ioapic_update_eoi_exitmap_one(ioapic, index);
+	}
+}
+
 static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
 {
 	unsigned index;
@@ -156,6 +191,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
 		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
 		    && ioapic->irr & (1 << index))
 			ioapic_service(ioapic, index);
+		ioapic_update_eoi_exitmap(ioapic->kvm);
 		break;
 	}
 }
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index a30abfe..e8d67cf 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -82,5 +82,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq);
 int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
 int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+void ioapic_update_eoi_exitmap(struct kvm *kvm);
 
 #endif
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 656fa45..5455f5a 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -237,6 +237,24 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
+{
+	struct kvm_irq_ack_notifier *kian;
+	struct hlist_node *n;
+	int gsi;
+
+	rcu_read_lock();
+	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	if (gsi != -1)
+		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
+					 link)
+			if (kian->gsi == gsi)
+				return true;
+	rcu_read_unlock();
+
+	return false;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
@@ -261,6 +279,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
 	mutex_lock(&kvm->irq_lock);
 	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
 	mutex_unlock(&kvm->irq_lock);
+	ioapic_update_eoi_exitmap(kvm);
 }
 
 void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
@@ -270,6 +289,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 	hlist_del_init_rcu(&kian->link);
 	mutex_unlock(&kvm->irq_lock);
 	synchronize_rcu();
+	ioapic_update_eoi_exitmap(kvm);
 }
 
 int kvm_request_irq_source_id(struct kvm *kvm)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-17  5:30 [PATCH v7 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 1/3] x86, apicv: add APICv register " Yang Zhang
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
@ 2012-12-17  5:30 ` Yang Zhang
  2012-12-20  8:31   ` Gleb Natapov
  2 siblings, 1 reply; 43+ messages in thread
From: Yang Zhang @ 2012-12-17  5:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, Yang Zhang, Kevin Tian

basically to benefit from apicv, we need clear MSR bitmap for
corresponding x2apic MSRs:
    0x800 - 0x8ff: no read intercept for apicv register virtualization
    TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
---
 arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index be66c3e..9b5e7a2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
 	spin_unlock(&vmx_vpid_lock);
 }
 
-static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
+#define MSR_TYPE_R	1
+#define MSR_TYPE_W	2
+static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
 {
 	int f = sizeof(unsigned long);
 
@@ -3786,20 +3789,52 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
 	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
 	 */
 	if (msr <= 0x1fff) {
-		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
-		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__clear_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__clear_bit(msr, msr_bitmap + 0x800 / f);
+
 	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
 		msr &= 0x1fff;
-		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
-		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__clear_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__clear_bit(msr, msr_bitmap + 0xc00 / f);
+
 	}
 }
 
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
-		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
-	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+}
+
+static void vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only)
+{
+	if (!longmode_only)
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_R);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+					msr, MSR_TYPE_R);
+}
+
+static void vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only)
+{
+	if (!longmode_only)
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_W);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+					msr, MSR_TYPE_W);
 }
 
 /*
@@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
 
+	if (enable_apicv_reg_vid) {
+		int msr;
+		for (msr = 0x800; msr <= 0x8ff; msr++)
+			vmx_disable_intercept_for_msr_read(msr, false);
+
+		/* TPR */
+		vmx_disable_intercept_for_msr_write(0x808, false);
+		/* EOI */
+		vmx_disable_intercept_for_msr_write(0x80b, false);
+		/* SELF-IPI */
+		vmx_disable_intercept_for_msr_write(0x83f, false);
+	}
+
 	if (enable_ept) {
 		kvm_mmu_set_mask_ptes(0ull,
 			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
@ 2012-12-20  0:59   ` Marcelo Tosatti
  2012-12-20  1:01     ` Marcelo Tosatti
  2012-12-20  6:42     ` Gleb Natapov
  2012-12-20  1:26   ` Marcelo Tosatti
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20  0:59 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> manually, which is fully taken care of by the hardware. This needs
> some special awareness into existing interrupr injection path:
> 
> - for pending interrupt, instead of direct injection, we may need
>   update architecture specific indicators before resuming to guest.
> 
> - A pending interrupt, which is masked by ISR, should be also
>   considered in above update action, since hardware will decide
>   when to inject it at right time. Current has_interrupt and
>   get_interrupt only returns a valid vector from injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>


Resuming previous discussion:

> > How about to recaculate irr_pending according the VIRR on each
> > vmexit?
> > 
> No need really. Since HW can only clear VIRR the only situation that
> may
> happen is that irr_pending will be true but VIRR is empty and
> apic_find_highest_irr() will return correct result in this case.

Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
Virtualization".

Also, an example of problem with ISR caching optimization (which was written
with ISR/IRR management entirely in software):

isr_count variable is never incremented (because HW sets ISR):
kvm_cpu_get_interrupt does not call into kvm_get_apic_interrupt 
if virtual interrupt delivery enabled.

Therefore apic_find_highest_isr can return -1 even though VISR
is not zero.

In that case (VISR non-zero but apic_find_highest_isr() == -1), 
apic_update_ppr does not correctly set

PPR = max(ISRV, TPR)

Which can result in kvm_arch_vcpu_runnable returning 1 when it should not.

Please disable usage of isr_count if virtual interrupt delivery enabled.
Given that self-IPI writes to VIRR, also please disable usage of
highest_isr_cache and irr_pending.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20  0:59   ` Marcelo Tosatti
@ 2012-12-20  1:01     ` Marcelo Tosatti
  2012-12-20  6:42     ` Gleb Natapov
  1 sibling, 0 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20  1:01 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> > 
> > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > manually, which is fully taken care of by the hardware. This needs
> > some special awareness into existing interrupr injection path:
> > 
> > - for pending interrupt, instead of direct injection, we may need
> >   update architecture specific indicators before resuming to guest.
> > 
> > - A pending interrupt, which is masked by ISR, should be also
> >   considered in above update action, since hardware will decide
> >   when to inject it at right time. Current has_interrupt and
> >   get_interrupt only returns a valid vector from injection p.o.v.
> > 
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> 
> 
> Resuming previous discussion:
> 
> > > How about to recaculate irr_pending according the VIRR on each
> > > vmexit?
> > > 
> > No need really. Since HW can only clear VIRR the only situation that
> > may
> > happen is that irr_pending will be true but VIRR is empty and
> > apic_find_highest_irr() will return correct result in this case.
> 
> Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
> Virtualization".
> 
> Also, an example of problem with ISR caching optimization (which was written
> with ISR/IRR management entirely in software):
> 
> isr_count variable is never incremented (because HW sets ISR):
> kvm_cpu_get_interrupt does not call into kvm_get_apic_interrupt 
> if virtual interrupt delivery enabled.
> 
> Therefore apic_find_highest_isr can return -1 even though VISR
> is not zero.
> 
> In that case (VISR non-zero but apic_find_highest_isr() == -1), 
> apic_update_ppr does not correctly set
> 
> PPR = max(ISRV, TPR)
> 
> Which can result in kvm_arch_vcpu_runnable returning 1 when it should not.
> 
> Please disable usage of isr_count if virtual interrupt delivery enabled.
> Given that self-IPI writes to VIRR, also please disable usage of
> highest_isr_cache and irr_pending.

BTW, no need to update irr_pending or the other caching variables on
VM-exit. Just skip them if virtual interrupt delivery is enabled.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2012-12-20  0:59   ` Marcelo Tosatti
@ 2012-12-20  1:26   ` Marcelo Tosatti
  2012-12-20  6:51     ` Gleb Natapov
  2012-12-20 22:00   ` Marcelo Tosatti
  2012-12-20 22:59   ` Marcelo Tosatti
  3 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20  1:26 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> manually, which is fully taken care of by the hardware. This needs
> some special awareness into existing interrupr injection path:
> 
> - for pending interrupt, instead of direct injection, we may need
>   update architecture specific indicators before resuming to guest.
> 
> - A pending interrupt, which is masked by ISR, should be also
>   considered in above update action, since hardware will decide
>   when to inject it at right time. Current has_interrupt and
>   get_interrupt only returns a valid vector from injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>

> +	if (enable_apicv_reg_vid)
> +		kvm_x86_ops->update_cr8_intercept = NULL;
> +	else
> +		kvm_x86_ops->update_apic_irq = NULL;

Loading the module with enable_apicv=0, then enable_apicv=1, 
fails?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20  0:59   ` Marcelo Tosatti
  2012-12-20  1:01     ` Marcelo Tosatti
@ 2012-12-20  6:42     ` Gleb Natapov
  2012-12-20 12:53       ` Marcelo Tosatti
  1 sibling, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-20  6:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> > 
> > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > manually, which is fully taken care of by the hardware. This needs
> > some special awareness into existing interrupr injection path:
> > 
> > - for pending interrupt, instead of direct injection, we may need
> >   update architecture specific indicators before resuming to guest.
> > 
> > - A pending interrupt, which is masked by ISR, should be also
> >   considered in above update action, since hardware will decide
> >   when to inject it at right time. Current has_interrupt and
> >   get_interrupt only returns a valid vector from injection p.o.v.
> > 
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> 
> 
> Resuming previous discussion:
> 
> > > How about to recaculate irr_pending according the VIRR on each
> > > vmexit?
> > > 
> > No need really. Since HW can only clear VIRR the only situation that
> > may
> > happen is that irr_pending will be true but VIRR is empty and
> > apic_find_highest_irr() will return correct result in this case.
> 
> Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
> Virtualization".
>
True. But as I said later in that discussion once irr_pending is set
to true it never becomes false, so the optimization is effectively
disable. We can set it to true doing apic initialization to make it
explicit.

> Also, an example of problem with ISR caching optimization (which was written
> with ISR/IRR management entirely in software):
> 
> isr_count variable is never incremented (because HW sets ISR):
> kvm_cpu_get_interrupt does not call into kvm_get_apic_interrupt 
> if virtual interrupt delivery enabled.
> 
> Therefore apic_find_highest_isr can return -1 even though VISR
> is not zero.
> 
> In that case (VISR non-zero but apic_find_highest_isr() == -1), 
> apic_update_ppr does not correctly set
> 
> PPR = max(ISRV, TPR)
> 
> Which can result in kvm_arch_vcpu_runnable returning 1 when it should not.
> 
> Please disable usage of isr_count if virtual interrupt delivery enabled.
> Given that self-IPI writes to VIRR, also please disable usage of
> highest_isr_cache and irr_pending.
> 
Good catch about isr_count. We can set it to 1 during apic
initialization too, or skip call to apic_update_ppr() in
kvm_apic_has_interrupt() if vid is enabled.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20  1:26   ` Marcelo Tosatti
@ 2012-12-20  6:51     ` Gleb Natapov
  2012-12-20 13:01       ` Marcelo Tosatti
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-20  6:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Wed, Dec 19, 2012 at 11:26:30PM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> > 
> > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > manually, which is fully taken care of by the hardware. This needs
> > some special awareness into existing interrupr injection path:
> > 
> > - for pending interrupt, instead of direct injection, we may need
> >   update architecture specific indicators before resuming to guest.
> > 
> > - A pending interrupt, which is masked by ISR, should be also
> >   considered in above update action, since hardware will decide
> >   when to inject it at right time. Current has_interrupt and
> >   get_interrupt only returns a valid vector from injection p.o.v.
> > 
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> 
> > +	if (enable_apicv_reg_vid)
> > +		kvm_x86_ops->update_cr8_intercept = NULL;
> > +	else
> > +		kvm_x86_ops->update_apic_irq = NULL;
> 
> Loading the module with enable_apicv=0, then enable_apicv=1, 
> fails?
It is not changeable after modules is loaded. This is true for all
kvm-intel.ko parameters.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-17  5:30 ` [PATCH v7 3/3] x86, apicv: add virtual x2apic support Yang Zhang
@ 2012-12-20  8:31   ` Gleb Natapov
  2012-12-24  1:41     ` Zhang, Yang Z
  2012-12-24  2:35     ` Zhang, Yang Z
  0 siblings, 2 replies; 43+ messages in thread
From: Gleb Natapov @ 2012-12-20  8:31 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, haitao.shan, Kevin Tian

On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> basically to benefit from apicv, we need clear MSR bitmap for
> corresponding x2apic MSRs:
>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> 
We do not set "Virtualize x2APIC mode" bit in secondary execution
control. If I read the spec correctly without that those MSR read/writes
will go straight to physical local APIC.

> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> ---
>  arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 files changed, 55 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index be66c3e..9b5e7a2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
>  	spin_unlock(&vmx_vpid_lock);
>  }
>  
> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> +#define MSR_TYPE_R	1
> +#define MSR_TYPE_W	2
> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> +						u32 msr, int type)
>  {
>  	int f = sizeof(unsigned long);
>  
> @@ -3786,20 +3789,52 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>  	 */
>  	if (msr <= 0x1fff) {
> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
> +		if (type & MSR_TYPE_R)
> +			/* read-low */
> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-low */
> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
> +
>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>  		msr &= 0x1fff;
> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
> +		if (type & MSR_TYPE_R)
> +			/* read-high */
> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-high */
> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
> +
>  	}
>  }
>  
>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>  {
>  	if (!longmode_only)
> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> +}
> +
> +static void vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only)
> +{
> +	if (!longmode_only)
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +						msr, MSR_TYPE_R);
> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +					msr, MSR_TYPE_R);
> +}
> +
> +static void vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only)
> +{
> +	if (!longmode_only)
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +						msr, MSR_TYPE_W);
> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +					msr, MSR_TYPE_W);
>  }
>  
>  /*
> @@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
>  
> +	if (enable_apicv_reg_vid) {
> +		int msr;
> +		for (msr = 0x800; msr <= 0x8ff; msr++)
> +			vmx_disable_intercept_for_msr_read(msr, false);
> +
> +		/* TPR */
> +		vmx_disable_intercept_for_msr_write(0x808, false);
> +		/* EOI */
> +		vmx_disable_intercept_for_msr_write(0x80b, false);
> +		/* SELF-IPI */
> +		vmx_disable_intercept_for_msr_write(0x83f, false);
> +	}
> +
>  	if (enable_ept) {
>  		kvm_mmu_set_mask_ptes(0ull,
>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20  6:42     ` Gleb Natapov
@ 2012-12-20 12:53       ` Marcelo Tosatti
  2012-12-20 13:12         ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 12:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 08:42:06AM +0200, Gleb Natapov wrote:
> On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
> > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > 
> > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > manually, which is fully taken care of by the hardware. This needs
> > > some special awareness into existing interrupr injection path:
> > > 
> > > - for pending interrupt, instead of direct injection, we may need
> > >   update architecture specific indicators before resuming to guest.
> > > 
> > > - A pending interrupt, which is masked by ISR, should be also
> > >   considered in above update action, since hardware will decide
> > >   when to inject it at right time. Current has_interrupt and
> > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > 
> > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > 
> > 
> > Resuming previous discussion:
> > 
> > > > How about to recaculate irr_pending according the VIRR on each
> > > > vmexit?
> > > > 
> > > No need really. Since HW can only clear VIRR the only situation that
> > > may
> > > happen is that irr_pending will be true but VIRR is empty and
> > > apic_find_highest_irr() will return correct result in this case.
> > 
> > Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
> > Virtualization".
> >
> True. But as I said later in that discussion once irr_pending is set
> to true it never becomes false, so the optimization is effectively
> disable. We can set it to true doing apic initialization to make it
> explicit.

Its just confusing, to have a variable which has different meanings
in different configurations. I would rather have it explicit that 
its not used rather than check every time the i read the code.

if (apic_vid() == 0 && !apic->irr_pending)
	return -1;

> > Also, an example of problem with ISR caching optimization (which was written
> > with ISR/IRR management entirely in software):
> > 
> > isr_count variable is never incremented (because HW sets ISR):
> > kvm_cpu_get_interrupt does not call into kvm_get_apic_interrupt 
> > if virtual interrupt delivery enabled.
> > 
> > Therefore apic_find_highest_isr can return -1 even though VISR
> > is not zero.
> > 
> > In that case (VISR non-zero but apic_find_highest_isr() == -1), 
> > apic_update_ppr does not correctly set
> > 
> > PPR = max(ISRV, TPR)
> > 
> > Which can result in kvm_arch_vcpu_runnable returning 1 when it should not.
> > 
> > Please disable usage of isr_count if virtual interrupt delivery enabled.
> > Given that self-IPI writes to VIRR, also please disable usage of
> > highest_isr_cache and irr_pending.
> > 
> Good catch about isr_count. We can set it to 1 during apic
> initialization too, or skip call to apic_update_ppr() in
> kvm_apic_has_interrupt() if vid is enabled.

Not sure if you can skip it, its probably necessary to calculate it
before HW does so (say migration etc).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20  6:51     ` Gleb Natapov
@ 2012-12-20 13:01       ` Marcelo Tosatti
  2012-12-20 13:02         ` Marcelo Tosatti
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 13:01 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 08:51:32AM +0200, Gleb Natapov wrote:
> On Wed, Dec 19, 2012 at 11:26:30PM -0200, Marcelo Tosatti wrote:
> > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > 
> > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > manually, which is fully taken care of by the hardware. This needs
> > > some special awareness into existing interrupr injection path:
> > > 
> > > - for pending interrupt, instead of direct injection, we may need
> > >   update architecture specific indicators before resuming to guest.
> > > 
> > > - A pending interrupt, which is masked by ISR, should be also
> > >   considered in above update action, since hardware will decide
> > >   when to inject it at right time. Current has_interrupt and
> > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > 
> > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > 
> > > +	if (enable_apicv_reg_vid)
> > > +		kvm_x86_ops->update_cr8_intercept = NULL;
> > > +	else
> > > +		kvm_x86_ops->update_apic_irq = NULL;
> > 
> > Loading the module with enable_apicv=0, then enable_apicv=1, 
> > fails?
> It is not changeable after modules is loaded. This is true for all
> kvm-intel.ko parameters.

Are you sure? Works just fine here.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 13:01       ` Marcelo Tosatti
@ 2012-12-20 13:02         ` Marcelo Tosatti
  0 siblings, 0 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 13:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 11:01:42AM -0200, Marcelo Tosatti wrote:
> On Thu, Dec 20, 2012 at 08:51:32AM +0200, Gleb Natapov wrote:
> > On Wed, Dec 19, 2012 at 11:26:30PM -0200, Marcelo Tosatti wrote:
> > > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > 
> > > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > > manually, which is fully taken care of by the hardware. This needs
> > > > some special awareness into existing interrupr injection path:
> > > > 
> > > > - for pending interrupt, instead of direct injection, we may need
> > > >   update architecture specific indicators before resuming to guest.
> > > > 
> > > > - A pending interrupt, which is masked by ISR, should be also
> > > >   considered in above update action, since hardware will decide
> > > >   when to inject it at right time. Current has_interrupt and
> > > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > > 
> > > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > > 
> > > > +	if (enable_apicv_reg_vid)
> > > > +		kvm_x86_ops->update_cr8_intercept = NULL;
> > > > +	else
> > > > +		kvm_x86_ops->update_apic_irq = NULL;
> > > 
> > > Loading the module with enable_apicv=0, then enable_apicv=1, 
> > > fails?
> > It is not changeable after modules is loaded. This is true for all
> > kvm-intel.ko parameters.
> 
> Are you sure? Works just fine here.

Its OK, nevermind.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 12:53       ` Marcelo Tosatti
@ 2012-12-20 13:12         ` Gleb Natapov
  2012-12-20 23:07           ` Marcelo Tosatti
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-20 13:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 10:53:16AM -0200, Marcelo Tosatti wrote:
> On Thu, Dec 20, 2012 at 08:42:06AM +0200, Gleb Natapov wrote:
> > On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
> > > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > 
> > > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > > manually, which is fully taken care of by the hardware. This needs
> > > > some special awareness into existing interrupr injection path:
> > > > 
> > > > - for pending interrupt, instead of direct injection, we may need
> > > >   update architecture specific indicators before resuming to guest.
> > > > 
> > > > - A pending interrupt, which is masked by ISR, should be also
> > > >   considered in above update action, since hardware will decide
> > > >   when to inject it at right time. Current has_interrupt and
> > > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > > 
> > > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > > 
> > > 
> > > Resuming previous discussion:
> > > 
> > > > > How about to recaculate irr_pending according the VIRR on each
> > > > > vmexit?
> > > > > 
> > > > No need really. Since HW can only clear VIRR the only situation that
> > > > may
> > > > happen is that irr_pending will be true but VIRR is empty and
> > > > apic_find_highest_irr() will return correct result in this case.
> > > 
> > > Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
> > > Virtualization".
> > >
> > True. But as I said later in that discussion once irr_pending is set
> > to true it never becomes false, so the optimization is effectively
> > disable. We can set it to true doing apic initialization to make it
> > explicit.
> 
> Its just confusing, to have a variable which has different meanings
> in different configurations. I would rather have it explicit that 
> its not used rather than check every time the i read the code.
> 
> if (apic_vid() == 0 && !apic->irr_pending)
> 	return -1;
> 
I'd prefer to avoid this additional if() especially as its sole purpose
is documentation.  We can add comment instead. Note that irr_pending
is just a hint anyway.  It can be true when no interrupt is pending in
irr. We can even rename it to irr_pending_hint or something.

> > > Also, an example of problem with ISR caching optimization (which was written
> > > with ISR/IRR management entirely in software):
> > > 
> > > isr_count variable is never incremented (because HW sets ISR):
> > > kvm_cpu_get_interrupt does not call into kvm_get_apic_interrupt 
> > > if virtual interrupt delivery enabled.
> > > 
> > > Therefore apic_find_highest_isr can return -1 even though VISR
> > > is not zero.
> > > 
> > > In that case (VISR non-zero but apic_find_highest_isr() == -1), 
> > > apic_update_ppr does not correctly set
> > > 
> > > PPR = max(ISRV, TPR)
> > > 
> > > Which can result in kvm_arch_vcpu_runnable returning 1 when it should not.
> > > 
> > > Please disable usage of isr_count if virtual interrupt delivery enabled.
> > > Given that self-IPI writes to VIRR, also please disable usage of
> > > highest_isr_cache and irr_pending.
> > > 
> > Good catch about isr_count. We can set it to 1 during apic
> > initialization too, or skip call to apic_update_ppr() in
> > kvm_apic_has_interrupt() if vid is enabled.
> 
> Not sure if you can skip it, its probably necessary to calculate it
> before HW does so (say migration etc).
kvm_apic_has_interrupt() is not called during migration and
kvm_apic_post_state_restore() calls apic_update_ppr() explicitly. I am
not sure it is needed though since migrated value should be already
correct anyway.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2012-12-20  0:59   ` Marcelo Tosatti
  2012-12-20  1:26   ` Marcelo Tosatti
@ 2012-12-20 22:00   ` Marcelo Tosatti
  2012-12-27  3:32     ` Zhang, Yang Z
  2012-12-20 22:59   ` Marcelo Tosatti
  3 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 22:00 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> manually, which is fully taken care of by the hardware. This needs
> some special awareness into existing interrupr injection path:
> 
> - for pending interrupt, instead of direct injection, we may need
>   update architecture specific indicators before resuming to guest.
> 
> - A pending interrupt, which is masked by ISR, should be also
>   considered in above update action, since hardware will decide
>   when to inject it at right time. Current has_interrupt and
>   get_interrupt only returns a valid vector from injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  arch/ia64/kvm/lapic.h           |    6 ++
>  arch/x86/include/asm/kvm_host.h |    6 ++
>  arch/x86/include/asm/vmx.h      |   11 +++
>  arch/x86/kvm/irq.c              |   56 +++++++++++++-
>  arch/x86/kvm/lapic.c            |   65 ++++++++++-------
>  arch/x86/kvm/lapic.h            |   28 ++++++-
>  arch/x86/kvm/svm.c              |   24 ++++++
>  arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   11 ++-
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/ioapic.c               |   36 +++++++++
>  virt/kvm/ioapic.h               |    1 +
>  virt/kvm/irq_comm.c             |   20 +++++
>  13 files changed, 379 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> index c5f92a9..cb59eb4 100644
> --- a/arch/ia64/kvm/lapic.h
> +++ b/arch/ia64/kvm/lapic.h
> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
>  #define kvm_apic_present(x) (true)
>  #define kvm_lapic_enabled(x) (true)
>  
> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> +					struct kvm_lapic_irq *irq)
> +{
> +	/* IA64 has no apicv supporting, do nothing here */
> +}
> +
>  #endif
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c431b33..b63a144 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -697,6 +697,11 @@ struct kvm_x86_ops {
>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
> +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
> +	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> @@ -991,6 +996,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 44c3f7e..d1ab331 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -62,6 +62,7 @@
>  #define EXIT_REASON_MCE_DURING_VMENTRY  41
>  #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
>  #define EXIT_REASON_APIC_ACCESS         44
> +#define EXIT_REASON_EOI_INDUCED         45
>  #define EXIT_REASON_EPT_VIOLATION       48
>  #define EXIT_REASON_EPT_MISCONFIG       49
>  #define EXIT_REASON_WBINVD              54
> @@ -143,6 +144,7 @@
>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>  #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
> +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
>  #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>  
> @@ -180,6 +182,7 @@ enum vmcs_field {
>  	GUEST_GS_SELECTOR               = 0x0000080a,
>  	GUEST_LDTR_SELECTOR             = 0x0000080c,
>  	GUEST_TR_SELECTOR               = 0x0000080e,
> +	GUEST_INTR_STATUS               = 0x00000810,
>  	HOST_ES_SELECTOR                = 0x00000c00,
>  	HOST_CS_SELECTOR                = 0x00000c02,
>  	HOST_SS_SELECTOR                = 0x00000c04,
> @@ -207,6 +210,14 @@ enum vmcs_field {
>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
>  	EPT_POINTER                     = 0x0000201a,
>  	EPT_POINTER_HIGH                = 0x0000201b,
> +	EOI_EXIT_BITMAP0                = 0x0000201c,
> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
> +	EOI_EXIT_BITMAP1                = 0x0000201e,
> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
> +	EOI_EXIT_BITMAP2                = 0x00002020,
> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
> +	EOI_EXIT_BITMAP3                = 0x00002022,
> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index b111aee..e113440 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -38,6 +38,38 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
>  EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
>  
>  /*
> + * check if there is pending interrupt from
> + * non-APIC source without intack.
> + */
> +static int kvm_cpu_has_extint(struct kvm_vcpu *v)
> +{
> +	if (kvm_apic_accept_pic_intr(v))
> +		return pic_irqchip(v->kvm)->output;	/* PIC */
> +	else
> +		return 0;
> +}
> +
> +/*
> + * check if there is injectable interrupt:
> + * when virtual interrupt delivery enabled,
> + * interrupt from apic will handled by hardware,
> + * we don't need to check it here.
> + */
> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
> +{
> +	if (!irqchip_in_kernel(v->kvm))
> +		return v->arch.interrupt.pending;
> +
> +	if (kvm_cpu_has_extint(v))
> +		return 1;
> +
> +	if (kvm_apic_vid_enabled(v))
> +		return 0;
> +
> +	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
> +}
> +
> +/*
>   * check if there is pending interrupt without
>   * intack.
>   */
> @@ -46,27 +78,41 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>  	if (!irqchip_in_kernel(v->kvm))
>  		return v->arch.interrupt.pending;
>  
> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
> -		return pic_irqchip(v->kvm)->output;	/* PIC */
> +	if (kvm_cpu_has_extint(v))
> +		return 1;
>  
>  	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
>  }
>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
>  
>  /*
> + * Read pending interrupt(from non-APIC source)
> + * vector and intack.
> + */
> +static int kvm_cpu_get_extint(struct kvm_vcpu *v)
> +{
> +	if (kvm_cpu_has_extint(v))
> +		return kvm_pic_read_irq(v->kvm); /* PIC */
> +	return -1;
> +}
> +
> +/*
>   * Read pending interrupt vector and intack.
>   */
>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
>  {
> +	int vector;
> +
>  	if (!irqchip_in_kernel(v->kvm))
>  		return v->arch.interrupt.nr;
>  
> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
> -		return kvm_pic_read_irq(v->kvm);	/* PIC */
> +	vector = kvm_cpu_get_extint(v);
> +
> +	if (kvm_apic_vid_enabled(v) || vector != -1)
> +		return vector;			/* PIC */
>  
>  	return kvm_get_apic_interrupt(v);	/* APIC */
>  }
> -EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
>  
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0664c13..61971bd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -133,6 +133,12 @@ static inline int apic_enabled(struct kvm_lapic *apic)
>  	return kvm_apic_sw_enabled(apic) &&	kvm_apic_hw_enabled(apic);
>  }
>  
> +bool kvm_apic_present(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_vcpu_has_lapic(vcpu) && kvm_apic_hw_enabled(vcpu->arch.apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_present);
> +
>  #define LVT_MASK	\
>  	(APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
>  
> @@ -150,23 +156,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
>  
> -static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
> -{
> -	u16 cid;
> -	ldr >>= 32 - map->ldr_bits;
> -	cid = (ldr >> map->cid_shift) & map->cid_mask;
> -
> -	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
> -
> -	return cid;
> -}
> -
> -static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
> -{
> -	ldr >>= (32 - map->ldr_bits);
> -	return ldr & map->lid_mask;
> -}
> -
>  static void recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -236,12 +225,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
>  {
>  	apic_set_reg(apic, APIC_ID, id << 24);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }
>  
>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>  {
>  	apic_set_reg(apic, APIC_LDR, id);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }
>  
>  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> @@ -740,6 +731,19 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
>  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
>  }
>  
> +static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
> +{
> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
> +	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> +		int trigger_mode;
> +		if (apic_test_vector(vector, apic->regs + APIC_TMR))
> +			trigger_mode = IOAPIC_LEVEL_TRIG;
> +		else
> +			trigger_mode = IOAPIC_EDGE_TRIG;
> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> +	}
> +}
> +
>  static int apic_set_eoi(struct kvm_lapic *apic)
>  {
>  	int vector = apic_find_highest_isr(apic);
> @@ -756,19 +760,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
>  	apic_clear_isr(vector, apic);
>  	apic_update_ppr(apic);
>  
> -	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
> -	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> -		int trigger_mode;
> -		if (apic_test_vector(vector, apic->regs + APIC_TMR))
> -			trigger_mode = IOAPIC_LEVEL_TRIG;
> -		else
> -			trigger_mode = IOAPIC_EDGE_TRIG;
> -		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> -	}
> +	kvm_ioapic_send_eoi(apic, vector);
>  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
>  	return vector;
>  }
>  
> +/*
> + * this interface assumes a trap-like exit, which has already finished
> + * desired side effect including vISR and vPPR update.
> + */
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	trace_kvm_eoi(apic, vector);
> +
> +	kvm_ioapic_send_eoi(apic, vector);
> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
> +
>  static void apic_send_ipi(struct kvm_lapic *apic)
>  {
>  	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
> @@ -1071,6 +1082,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  		if (!apic_x2apic_mode(apic)) {
>  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
>  			recalculate_apic_map(apic->vcpu->kvm);
> +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  		} else
>  			ret = 1;
>  		break;
> @@ -1318,6 +1330,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  		else
>  			static_key_slow_inc(&apic_hw_disabled.key);
>  		recalculate_apic_map(vcpu->kvm);
> +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  	}
>  
>  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 9a8ee22..9e3f888 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -39,6 +39,7 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu);
>  int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
>  int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
>  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> @@ -65,6 +66,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>  
>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
>  
>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
> @@ -75,6 +77,7 @@ int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
>  
>  int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
>  int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
> +bool kvm_apic_present(struct kvm_vcpu *vcpu);
>  
>  static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
>  {
> @@ -116,14 +119,31 @@ static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
>  	return APIC_SPIV_APIC_ENABLED;
>  }
>  
> -static inline bool kvm_apic_present(struct kvm_vcpu *vcpu)
> +static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_vcpu_has_lapic(vcpu) && kvm_apic_hw_enabled(vcpu->arch.apic);
> +	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
>  }
>  
> -static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
> +	return kvm_x86_ops->has_virtual_interrupt_delivery(vcpu);
> +}
> +
> +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
> +{
> +	u16 cid;
> +	ldr >>= 32 - map->ldr_bits;
> +	cid = (ldr >> map->cid_shift) & map->cid_mask;
> +
> +	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
> +
> +	return cid;
> +}
> +
> +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
> +{
> +	ldr >>= (32 - map->ldr_bits);
> +	return ldr & map->lid_mask;
>  }
>  
>  #endif
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d29d3cd..951f1ec 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3571,6 +3571,26 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>  }
>  
> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return 0;
> +}
> +
> +static void svm_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
> +{
> +	return ;
> +}
> +
> +static void svm_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	return ;
> +}
> +
> +static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	return ;
> +}
> +
>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4290,6 +4310,10 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.enable_nmi_window = enable_nmi_window,
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
> +	.update_eoi_exitmap = svm_update_eoi_exitmap,
> +	.reset_eoi_exitmap = svm_reset_eoi_exitmap,
> +	.load_eoi_exitmap = svm_load_eoi_exitmap,
>  
>  	.set_tss_addr = svm_set_tss_addr,
>  	.get_tdp_level = get_npt_level,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7ed26ec..be66c3e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -433,6 +433,8 @@ struct vcpu_vmx {
>  
>  	bool rdtscp_enabled;
>  
> +	unsigned long eoi_exit_bitmap[4];
> +
>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
>  };
> @@ -771,6 +773,12 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>  }
>  
> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> +{
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> @@ -2549,7 +2557,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>  			SECONDARY_EXEC_RDTSCP |
>  			SECONDARY_EXEC_ENABLE_INVPCID |
> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  		if (adjust_vmx_controls(min2, opt2,
>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>  					&_cpu_based_2nd_exec_control) < 0)
> @@ -2563,7 +2572,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  
>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>  		_cpu_based_2nd_exec_control &= ~(
> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>  
>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
>  		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
> @@ -2762,9 +2772,15 @@ static __init int hardware_setup(void)
>  	if (!cpu_has_vmx_ple())
>  		ple_gap = 0;
>  
> -	if (!cpu_has_vmx_apic_register_virt())
> +	if (!cpu_has_vmx_apic_register_virt() ||
> +				!cpu_has_vmx_virtual_intr_delivery())
>  		enable_apicv_reg_vid = 0;
>  
> +	if (enable_apicv_reg_vid)
> +		kvm_x86_ops->update_cr8_intercept = NULL;
> +	else
> +		kvm_x86_ops->update_apic_irq = NULL;
> +
>  	if (nested)
>  		nested_vmx_setup_ctls_msrs();
>  
> @@ -3880,7 +3896,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	if (!ple_gap)
>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>  	if (!enable_apicv_reg_vid)
> -		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +		exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>  	return exec_control;
>  }
>  
> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  				vmx_secondary_exec_control(vmx));
>  	}
>  
> +	if (enable_apicv_reg_vid) {
> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> +
> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> +	}

AFAICS SVI should be regenerated on migration. Consider:

1. vintr delivery, sets SVI = vector = RVI.
2. clears RVI.
3. migration.
4. RVI properly set from VIRR on entry.
5. SVI = 0.
6. EOI -> EOI virtualization with SVI = 0.

Could hook into kvm_apic_post_state_restore() to do that (set highest
index of bit set in VISR).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
                     ` (2 preceding siblings ...)
  2012-12-20 22:00   ` Marcelo Tosatti
@ 2012-12-20 22:59   ` Marcelo Tosatti
  2012-12-21  7:51     ` Gleb Natapov
  3 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 22:59 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> manually, which is fully taken care of by the hardware. This needs
> some special awareness into existing interrupr injection path:
> 
> - for pending interrupt, instead of direct injection, we may need
>   update architecture specific indicators before resuming to guest.
> 
> - A pending interrupt, which is masked by ISR, should be also
>   considered in above update action, since hardware will decide
>   when to inject it at right time. Current has_interrupt and
>   get_interrupt only returns a valid vector from injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  arch/ia64/kvm/lapic.h           |    6 ++
>  arch/x86/include/asm/kvm_host.h |    6 ++
>  arch/x86/include/asm/vmx.h      |   11 +++
>  arch/x86/kvm/irq.c              |   56 +++++++++++++-
>  arch/x86/kvm/lapic.c            |   65 ++++++++++-------
>  arch/x86/kvm/lapic.h            |   28 ++++++-
>  arch/x86/kvm/svm.c              |   24 ++++++
>  arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   11 ++-
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/ioapic.c               |   36 +++++++++
>  virt/kvm/ioapic.h               |    1 +
>  virt/kvm/irq_comm.c             |   20 +++++
>  13 files changed, 379 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> index c5f92a9..cb59eb4 100644
> --- a/arch/ia64/kvm/lapic.h
> +++ b/arch/ia64/kvm/lapic.h
> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
>  #define kvm_apic_present(x) (true)
>  #define kvm_lapic_enabled(x) (true)
>  
> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> +					struct kvm_lapic_irq *irq)
> +{
> +	/* IA64 has no apicv supporting, do nothing here */
> +}
> +
>  #endif
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c431b33..b63a144 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -697,6 +697,11 @@ struct kvm_x86_ops {
>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
> +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
> +	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

EOI exit bitmap is problematic (its racy). Please do this:

1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC register
modifications which require EOI exit bitmap updates.
2. On VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
and adjusts its own EOI exit bitmap VMCS registers.

1) that waits until remote executing VCPUs have acknowledge the request,
using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
remote TLB flushes.

What is the problem now: there is no control over _when_ a VCPU updates
its EOI exit bitmap VMCS register from the (remotely updated) master EOI
exit bitmap. The VCPU can be processing a KVM_REQ_EOIBITMAP relative to
a precedence IOAPIC register write while the current IOAPIC register
write is updating the EOI exit bitmap. There is no way to fix that
without locking (which can be avoided if the IOAPIC->EOI exit bitmap
synchronization is vcpu local).

Questions below.

> @@ -236,12 +225,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
>  {
>  	apic_set_reg(apic, APIC_ID, id << 24);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }

Why is it necessary to update EOI exit bitmap here?

>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>  {
>  	apic_set_reg(apic, APIC_LDR, id);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }

And here?

> +/*
> + * this interface assumes a trap-like exit, which has already finished
> + * desired side effect including vISR and vPPR update.
> + */
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	trace_kvm_eoi(apic, vector);
> +
> +	kvm_ioapic_send_eoi(apic, vector);
> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> +}

Why is it necessary to set KVM_REQ_EVENT here?

>  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
>  			recalculate_apic_map(apic->vcpu->kvm);
> +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  		} else
>  			ret = 1;
>  		break;
> @@ -1318,6 +1330,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  		else
>  			static_key_slow_inc(&apic_hw_disabled.key);
>  		recalculate_apic_map(vcpu->kvm);
> +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  	}

Again, why all this EOI exit bitmap updates?

> +static void svm_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
> +{
> +	return ;
> +}
> +
> +static void svm_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	return ;
> +}
> +
> +static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	return ;
> +}

Please mind coding style.

> +static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
> +{
> +	if (max_irr == -1)
> +		return ;

Coding style.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 13:12         ` Gleb Natapov
@ 2012-12-20 23:07           ` Marcelo Tosatti
  2012-12-25  7:49             ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-20 23:07 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 03:12:32PM +0200, Gleb Natapov wrote:
> On Thu, Dec 20, 2012 at 10:53:16AM -0200, Marcelo Tosatti wrote:
> > On Thu, Dec 20, 2012 at 08:42:06AM +0200, Gleb Natapov wrote:
> > > On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
> > > > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > > 
> > > > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > > > manually, which is fully taken care of by the hardware. This needs
> > > > > some special awareness into existing interrupr injection path:
> > > > > 
> > > > > - for pending interrupt, instead of direct injection, we may need
> > > > >   update architecture specific indicators before resuming to guest.
> > > > > 
> > > > > - A pending interrupt, which is masked by ISR, should be also
> > > > >   considered in above update action, since hardware will decide
> > > > >   when to inject it at right time. Current has_interrupt and
> > > > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > > > 
> > > > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > > > 
> > > > 
> > > > Resuming previous discussion:
> > > > 
> > > > > > How about to recaculate irr_pending according the VIRR on each
> > > > > > vmexit?
> > > > > > 
> > > > > No need really. Since HW can only clear VIRR the only situation that
> > > > > may
> > > > > happen is that irr_pending will be true but VIRR is empty and
> > > > > apic_find_highest_irr() will return correct result in this case.
> > > > 
> > > > Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI 
> > > > Virtualization".
> > > >
> > > True. But as I said later in that discussion once irr_pending is set
> > > to true it never becomes false, so the optimization is effectively
> > > disable. We can set it to true doing apic initialization to make it
> > > explicit.
> > 
> > Its just confusing, to have a variable which has different meanings
> > in different configurations. I would rather have it explicit that 
> > its not used rather than check every time the i read the code.
> > 
> > if (apic_vid() == 0 && !apic->irr_pending)
> > 	return -1;
> > 
> I'd prefer to avoid this additional if() especially as its sole purpose
> is documentation.  We can add comment instead. Note that irr_pending
> is just a hint anyway.  It can be true when no interrupt is pending in
> irr. We can even rename it to irr_pending_hint or something.

Works for me (documentation).

> > Not sure if you can skip it, its probably necessary to calculate it
> > before HW does so (say migration etc).
> kvm_apic_has_interrupt() is not called during migration and
> kvm_apic_post_state_restore() calls apic_update_ppr() explicitly.
> I am not sure it is needed though since migrated value should be already
> correct anyway.

Ok, best force isr_count to 1 if apic vintr enabled (and add a comment,
please).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 22:59   ` Marcelo Tosatti
@ 2012-12-21  7:51     ` Gleb Natapov
  2012-12-21 11:39       ` Marcelo Tosatti
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-21  7:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> > 
> > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > manually, which is fully taken care of by the hardware. This needs
> > some special awareness into existing interrupr injection path:
> > 
> > - for pending interrupt, instead of direct injection, we may need
> >   update architecture specific indicators before resuming to guest.
> > 
> > - A pending interrupt, which is masked by ISR, should be also
> >   considered in above update action, since hardware will decide
> >   when to inject it at right time. Current has_interrupt and
> >   get_interrupt only returns a valid vector from injection p.o.v.
> > 
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > ---
> >  arch/ia64/kvm/lapic.h           |    6 ++
> >  arch/x86/include/asm/kvm_host.h |    6 ++
> >  arch/x86/include/asm/vmx.h      |   11 +++
> >  arch/x86/kvm/irq.c              |   56 +++++++++++++-
> >  arch/x86/kvm/lapic.c            |   65 ++++++++++-------
> >  arch/x86/kvm/lapic.h            |   28 ++++++-
> >  arch/x86/kvm/svm.c              |   24 ++++++
> >  arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c              |   11 ++-
> >  include/linux/kvm_host.h        |    2 +
> >  virt/kvm/ioapic.c               |   36 +++++++++
> >  virt/kvm/ioapic.h               |    1 +
> >  virt/kvm/irq_comm.c             |   20 +++++
> >  13 files changed, 379 insertions(+), 41 deletions(-)
> > 
> > diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> > index c5f92a9..cb59eb4 100644
> > --- a/arch/ia64/kvm/lapic.h
> > +++ b/arch/ia64/kvm/lapic.h
> > @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> >  #define kvm_apic_present(x) (true)
> >  #define kvm_lapic_enabled(x) (true)
> >  
> > +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> > +					struct kvm_lapic_irq *irq)
> > +{
> > +	/* IA64 has no apicv supporting, do nothing here */
> > +}
> > +
> >  #endif
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index c431b33..b63a144 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -697,6 +697,11 @@ struct kvm_x86_ops {
> >  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> > +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> > +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> > +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
> > +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
> > +	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
> >  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >  	int (*get_tdp_level)(void);
> >  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> 
> EOI exit bitmap is problematic (its racy). Please do this:
> 
> 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC register
> modifications which require EOI exit bitmap updates.
> 2. On VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
> and adjusts its own EOI exit bitmap VMCS registers.
> 
> 1) that waits until remote executing VCPUs have acknowledge the request,
> using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
> remote TLB flushes.
> 
> What is the problem now: there is no control over _when_ a VCPU updates
> its EOI exit bitmap VMCS register from the (remotely updated) master EOI
> exit bitmap. The VCPU can be processing a KVM_REQ_EOIBITMAP relative to
> a precedence IOAPIC register write while the current IOAPIC register
> write is updating the EOI exit bitmap. There is no way to fix that
> without locking (which can be avoided if the IOAPIC->EOI exit bitmap
> synchronization is vcpu local).
> 
The race is benign. We have similar one for interrupt injection and the same race
exists on a real HW. The two cases that can happen due to the race are:

1. exitbitmap bit X is changed from 1 to 0
  No problem. It is harmless to do an exit, on the next entry exitbitmap
  will be fixed.
2. exitbitmap bit X is changed from 0 to 1
  If vcpu serves X at the time this happens it was delivered as edge, so
  no need to exit. The exitbitmap will be updated after the next vmexit
  which will happen due to KVM_REQ_EOIBITMAP processing.

But software really should take care of not changing interrupt vector
configuration while there is an interrupt in flight with the same vector.

> Questions below.
> 
> > @@ -236,12 +225,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> >  {
> >  	apic_set_reg(apic, APIC_ID, id << 24);
> >  	recalculate_apic_map(apic->vcpu->kvm);
> > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  }
> 
> Why is it necessary to update EOI exit bitmap here?
> 
All the places where apic_map, which is used to build exit
bitmap, can change should call ioapic_update_eoi_exitmap().

> >  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> >  {
> >  	apic_set_reg(apic, APIC_LDR, id);
> >  	recalculate_apic_map(apic->vcpu->kvm);
> > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  }
> 
> And here?
> 
> > +/*
> > + * this interface assumes a trap-like exit, which has already finished
> > + * desired side effect including vISR and vPPR update.
> > + */
> > +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> > +{
> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +	trace_kvm_eoi(apic, vector);
> > +
> > +	kvm_ioapic_send_eoi(apic, vector);
> > +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> > +}
> 
> Why is it necessary to set KVM_REQ_EVENT here?
> 
> >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> >  			recalculate_apic_map(apic->vcpu->kvm);
> > +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  		} else
> >  			ret = 1;
> >  		break;
> > @@ -1318,6 +1330,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> >  		else
> >  			static_key_slow_inc(&apic_hw_disabled.key);
> >  		recalculate_apic_map(vcpu->kvm);
> > +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  	}
> 
> Again, why all this EOI exit bitmap updates?
> 
> > +static void svm_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
> > +{
> > +	return ;
> > +}
> > +
> > +static void svm_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
> > +{
> > +	return ;
> > +}
> > +
> > +static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > +{
> > +	return ;
> > +}
> 
> Please mind coding style.
> 
> > +static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
> > +{
> > +	if (max_irr == -1)
> > +		return ;
> 
> Coding style.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-21  7:51     ` Gleb Natapov
@ 2012-12-21 11:39       ` Marcelo Tosatti
  2012-12-21 12:08         ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-21 11:39 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
> On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
> > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > 
> > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > manually, which is fully taken care of by the hardware. This needs
> > > some special awareness into existing interrupr injection path:
> > > 
> > > - for pending interrupt, instead of direct injection, we may need
> > >   update architecture specific indicators before resuming to guest.
> > > 
> > > - A pending interrupt, which is masked by ISR, should be also
> > >   considered in above update action, since hardware will decide
> > >   when to inject it at right time. Current has_interrupt and
> > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > 
> > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > > ---
> > >  arch/ia64/kvm/lapic.h           |    6 ++
> > >  arch/x86/include/asm/kvm_host.h |    6 ++
> > >  arch/x86/include/asm/vmx.h      |   11 +++
> > >  arch/x86/kvm/irq.c              |   56 +++++++++++++-
> > >  arch/x86/kvm/lapic.c            |   65 ++++++++++-------
> > >  arch/x86/kvm/lapic.h            |   28 ++++++-
> > >  arch/x86/kvm/svm.c              |   24 ++++++
> > >  arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
> > >  arch/x86/kvm/x86.c              |   11 ++-
> > >  include/linux/kvm_host.h        |    2 +
> > >  virt/kvm/ioapic.c               |   36 +++++++++
> > >  virt/kvm/ioapic.h               |    1 +
> > >  virt/kvm/irq_comm.c             |   20 +++++
> > >  13 files changed, 379 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> > > index c5f92a9..cb59eb4 100644
> > > --- a/arch/ia64/kvm/lapic.h
> > > +++ b/arch/ia64/kvm/lapic.h
> > > @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> > >  #define kvm_apic_present(x) (true)
> > >  #define kvm_lapic_enabled(x) (true)
> > >  
> > > +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> > > +					struct kvm_lapic_irq *irq)
> > > +{
> > > +	/* IA64 has no apicv supporting, do nothing here */
> > > +}
> > > +
> > >  #endif
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index c431b33..b63a144 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -697,6 +697,11 @@ struct kvm_x86_ops {
> > >  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> > >  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> > >  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> > > +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> > > +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> > > +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
> > > +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
> > > +	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
> > >  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> > >  	int (*get_tdp_level)(void);
> > >  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> > 
> > EOI exit bitmap is problematic (its racy). Please do this:
> > 
> > 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC register
> > modifications which require EOI exit bitmap updates.
> > 2. On VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
> > and adjusts its own EOI exit bitmap VMCS registers.
> > 
> > 1) that waits until remote executing VCPUs have acknowledge the request,
> > using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
> > remote TLB flushes.
> > 
> > What is the problem now: there is no control over _when_ a VCPU updates
> > its EOI exit bitmap VMCS register from the (remotely updated) master EOI
> > exit bitmap. The VCPU can be processing a KVM_REQ_EOIBITMAP relative to
> > a precedence IOAPIC register write while the current IOAPIC register
> > write is updating the EOI exit bitmap. There is no way to fix that
> > without locking (which can be avoided if the IOAPIC->EOI exit bitmap
> > synchronization is vcpu local).
> > 
> The race is benign. We have similar one for interrupt injection and the same race
> exists on a real HW. The two cases that can happen due to the race are:
> 
> 1. exitbitmap bit X is changed from 1 to 0
>   No problem. It is harmless to do an exit, on the next entry exitbitmap
>   will be fixed.
> 2. exitbitmap bit X is changed from 0 to 1
>   If vcpu serves X at the time this happens it was delivered as edge, so
>   no need to exit. The exitbitmap will be updated after the next vmexit
>   which will happen due to KVM_REQ_EOIBITMAP processing.

1. Missed the case where bitmap is being reset (where EOI exit bitmaps
are zeroed). In this case vcpu enters guest with incorrect EOI exit
bitmap.

2. Missed the case where current code allows vcpu to enter guest 
with EOI exit bitmap unsynchronized relative to IOAPIC registers
(see one KVM_REQ made at a time, no IPI sent). In that case interrupt
can be delivered.

Thus the suggestions to update bitmap locally, on entry. Do you 
see any disadvantage?

Other than that, there is a window between IOAPIC map update and 
EOI bitmap request, where an interrupt can be delivered without 
EOI bitmap being updated (which i think local updates don't cover,
either).

> But software really should take care of not changing interrupt vector
> configuration while there is an interrupt in flight with the same vector.

None of these are guest faults. As soon as interrupts are allowed they 
must be handled properly (including synchronized EOI bitmap etc).

> 
> > Questions below.
> > 
> > > @@ -236,12 +225,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > >  {
> > >  	apic_set_reg(apic, APIC_ID, id << 24);
> > >  	recalculate_apic_map(apic->vcpu->kvm);
> > > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> > >  }
> > 
> > Why is it necessary to update EOI exit bitmap here?
> > 
> All the places where apic_map, which is used to build exit
> bitmap, can change should call ioapic_update_eoi_exitmap().
> 
> > >  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > >  {
> > >  	apic_set_reg(apic, APIC_LDR, id);
> > >  	recalculate_apic_map(apic->vcpu->kvm);
> > > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> > >  }
> > 
> > And here?
> > 
> > > +/*
> > > + * this interface assumes a trap-like exit, which has already finished
> > > + * desired side effect including vISR and vPPR update.
> > > + */
> > > +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> > > +{
> > > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > > +
> > > +	trace_kvm_eoi(apic, vector);
> > > +
> > > +	kvm_ioapic_send_eoi(apic, vector);
> > > +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> > > +}
> > 
> > Why is it necessary to set KVM_REQ_EVENT here?
> > 
> > >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> > >  			recalculate_apic_map(apic->vcpu->kvm);
> > > +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> > >  		} else
> > >  			ret = 1;
> > >  		break;
> > > @@ -1318,6 +1330,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > >  		else
> > >  			static_key_slow_inc(&apic_hw_disabled.key);
> > >  		recalculate_apic_map(vcpu->kvm);
> > > +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> > >  	}
> > 
> > Again, why all this EOI exit bitmap updates?
> > 
> > > +static void svm_update_eoi_exitmap(struct kvm *kvm, struct kvm_lapic_irq *irq)
> > > +{
> > > +	return ;
> > > +}
> > > +
> > > +static void svm_reset_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > +{
> > > +	return ;
> > > +}
> > > +
> > > +static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > +{
> > > +	return ;
> > > +}
> > 
> > Please mind coding style.
> > 
> > > +static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
> > > +{
> > > +	if (max_irr == -1)
> > > +		return ;
> > 
> > Coding style.
> 
> --
> 			Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-21 11:39       ` Marcelo Tosatti
@ 2012-12-21 12:08         ` Gleb Natapov
  2012-12-27  2:24           ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-21 12:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Fri, Dec 21, 2012 at 09:39:20AM -0200, Marcelo Tosatti wrote:
> On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
> > On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
> > > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > 
> > > > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > > > manually, which is fully taken care of by the hardware. This needs
> > > > some special awareness into existing interrupr injection path:
> > > > 
> > > > - for pending interrupt, instead of direct injection, we may need
> > > >   update architecture specific indicators before resuming to guest.
> > > > 
> > > > - A pending interrupt, which is masked by ISR, should be also
> > > >   considered in above update action, since hardware will decide
> > > >   when to inject it at right time. Current has_interrupt and
> > > >   get_interrupt only returns a valid vector from injection p.o.v.
> > > > 
> > > > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> > > > ---
> > > >  arch/ia64/kvm/lapic.h           |    6 ++
> > > >  arch/x86/include/asm/kvm_host.h |    6 ++
> > > >  arch/x86/include/asm/vmx.h      |   11 +++
> > > >  arch/x86/kvm/irq.c              |   56 +++++++++++++-
> > > >  arch/x86/kvm/lapic.c            |   65 ++++++++++-------
> > > >  arch/x86/kvm/lapic.h            |   28 ++++++-
> > > >  arch/x86/kvm/svm.c              |   24 ++++++
> > > >  arch/x86/kvm/vmx.c              |  154 ++++++++++++++++++++++++++++++++++++++-
> > > >  arch/x86/kvm/x86.c              |   11 ++-
> > > >  include/linux/kvm_host.h        |    2 +
> > > >  virt/kvm/ioapic.c               |   36 +++++++++
> > > >  virt/kvm/ioapic.h               |    1 +
> > > >  virt/kvm/irq_comm.c             |   20 +++++
> > > >  13 files changed, 379 insertions(+), 41 deletions(-)
> > > > 
> > > > diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> > > > index c5f92a9..cb59eb4 100644
> > > > --- a/arch/ia64/kvm/lapic.h
> > > > +++ b/arch/ia64/kvm/lapic.h
> > > > @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> > > >  #define kvm_apic_present(x) (true)
> > > >  #define kvm_lapic_enabled(x) (true)
> > > >  
> > > > +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> > > > +					struct kvm_lapic_irq *irq)
> > > > +{
> > > > +	/* IA64 has no apicv supporting, do nothing here */
> > > > +}
> > > > +
> > > >  #endif
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index c431b33..b63a144 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -697,6 +697,11 @@ struct kvm_x86_ops {
> > > >  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> > > >  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> > > >  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> > > > +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> > > > +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> > > > +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq *irq);
> > > > +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu);
> > > > +	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
> > > >  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> > > >  	int (*get_tdp_level)(void);
> > > >  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> > > 
> > > EOI exit bitmap is problematic (its racy). Please do this:
> > > 
> > > 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC register
> > > modifications which require EOI exit bitmap updates.
> > > 2. On VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
> > > and adjusts its own EOI exit bitmap VMCS registers.
> > > 
> > > 1) that waits until remote executing VCPUs have acknowledge the request,
> > > using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
> > > remote TLB flushes.
> > > 
> > > What is the problem now: there is no control over _when_ a VCPU updates
> > > its EOI exit bitmap VMCS register from the (remotely updated) master EOI
> > > exit bitmap. The VCPU can be processing a KVM_REQ_EOIBITMAP relative to
> > > a precedence IOAPIC register write while the current IOAPIC register
> > > write is updating the EOI exit bitmap. There is no way to fix that
> > > without locking (which can be avoided if the IOAPIC->EOI exit bitmap
> > > synchronization is vcpu local).
> > > 
> > The race is benign. We have similar one for interrupt injection and the same race
> > exists on a real HW. The two cases that can happen due to the race are:
> > 
> > 1. exitbitmap bit X is changed from 1 to 0
> >   No problem. It is harmless to do an exit, on the next entry exitbitmap
> >   will be fixed.
> > 2. exitbitmap bit X is changed from 0 to 1
> >   If vcpu serves X at the time this happens it was delivered as edge, so
> >   no need to exit. The exitbitmap will be updated after the next vmexit
> >   which will happen due to KVM_REQ_EOIBITMAP processing.
> 
> 1. Missed the case where bitmap is being reset (where EOI exit bitmaps
> are zeroed). In this case vcpu enters guest with incorrect EOI exit
> bitmap.
> 
Right, the bitmap reset us problematic indeed since it does not
represent real vector configuration change.

> 2. Missed the case where current code allows vcpu to enter guest 
> with EOI exit bitmap unsynchronized relative to IOAPIC registers
> (see one KVM_REQ made at a time, no IPI sent). In that case interrupt
> can be delivered.
> 
Ugh, I was sure there is a kick there. Missing kick is just a bug of
course.

> Thus the suggestions to update bitmap locally, on entry. Do you 
> see any disadvantage?
> 
Only one. The recalculation logic is such that given a vector it
calculates set of vcpus, so each vcpu will do this calculation for each
vector and see if it is in the set instead of recalculating once. But
this should be rare enough for us to not care.

> Other than that, there is a window between IOAPIC map update and 
> EOI bitmap request, where an interrupt can be delivered without 
> EOI bitmap being updated (which i think local updates don't cover,
> either).
Interrupt cannot be delivered through IOAPIC while bitmap is updated
since IOAPIC has a lock.

> 
> > But software really should take care of not changing interrupt vector
> > configuration while there is an interrupt in flight with the same vector.
> 
> None of these are guest faults. As soon as interrupts are allowed they 
> must be handled properly (including synchronized EOI bitmap etc).
> 
All my cases are guest faults and the guest will get in trouble on real
HW too with such behaviour. The case where we clear bitmap before
recalculation is not a guest fault though and have to be dealt with
somehow either with locks or your suggestion.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-20  8:31   ` Gleb Natapov
@ 2012-12-24  1:41     ` Zhang, Yang Z
  2012-12-24  2:35     ` Zhang, Yang Z
  1 sibling, 0 replies; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-24  1:41 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-20:
> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>> basically to benefit from apicv, we need clear MSR bitmap for
>> corresponding x2apic MSRs:
>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> We do not set "Virtualize x2APIC mode" bit in secondary execution
> control. If I read the spec correctly without that those MSR read/writes
> will go straight to physical local APIC.
Right. Now it cannot get benefit, but we may enable it in future and then we can benefit from it.

>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>>  arch/x86/kvm/vmx.c |   62
>>  ++++++++++++++++++++++++++++++++++++++++++++++------ 1 files changed,
>>  55 insertions(+), 7 deletions(-)
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index be66c3e..9b5e7a2 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>  	spin_unlock(&vmx_vpid_lock);
>>  }
>> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>> u32 msr) +#define MSR_TYPE_R	1 +#define MSR_TYPE_W	2 +static void
>> __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, +						u32
>> msr, int type)
>>  {
>>  	int f = sizeof(unsigned long);
>> @@ -3786,20 +3789,52 @@ static void
> __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>  	 */
>>  	if (msr <= 0x1fff) {
>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>> +		if (type & MSR_TYPE_R)
>> +			/* read-low */
>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-low */
>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>> +
>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>  		msr &= 0x1fff;
>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>> +		if (type & MSR_TYPE_R)
>> +			/* read-high */
>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-high */
>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>> +
>>  	}
>>  }
>>  
>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>>  {
>>  	if (!longmode_only)
>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +						msr,
>> MSR_TYPE_R | MSR_TYPE_W);
>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +						msr,
>> MSR_TYPE_R | MSR_TYPE_W); +} + +static void
>> vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only) +{ +	if
>> (!longmode_only)
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +						msr,
>> MSR_TYPE_R); +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +					msr, MSR_TYPE_R); +} + +static void
>> vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only) +{
>> +	if (!longmode_only)
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +						msr,
>> MSR_TYPE_W); +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +					msr, MSR_TYPE_W);
>>  }
>>  
>>  /* @@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
>> +	if (enable_apicv_reg_vid) {
>> +		int msr;
>> +		for (msr = 0x800; msr <= 0x8ff; msr++)
>> +			vmx_disable_intercept_for_msr_read(msr, false);
>> +
>> +		/* TPR */
>> +		vmx_disable_intercept_for_msr_write(0x808, false);
>> +		/* EOI */
>> +		vmx_disable_intercept_for_msr_write(0x80b, false);
>> +		/* SELF-IPI */
>> +		vmx_disable_intercept_for_msr_write(0x83f, false);
>> +	}
>> +
>>  	if (enable_ept) {
>>  		kvm_mmu_set_mask_ptes(0ull,
>>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
>> --
>> 1.7.1
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-20  8:31   ` Gleb Natapov
  2012-12-24  1:41     ` Zhang, Yang Z
@ 2012-12-24  2:35     ` Zhang, Yang Z
  2012-12-24  9:23       ` Gleb Natapov
  1 sibling, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-24  2:35 UTC (permalink / raw)
  To: Zhang, Yang Z, Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Zhang, Yang Z wrote on 2012-12-24:
> Gleb Natapov wrote on 2012-12-20:
>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>> basically to benefit from apicv, we need clear MSR bitmap for
>>> corresponding x2apic MSRs:
>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
>> We do not set "Virtualize x2APIC mode" bit in secondary execution
>> control. If I read the spec correctly without that those MSR read/writes
>> will go straight to physical local APIC.
> Right. Now it cannot get benefit, but we may enable it in future and then we can
> benefit from it.
how about to add the following check:
if (apicv_enabled && virtual_x2apic_enabled)
	clear_msr();


>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>> ---
>>>  arch/x86/kvm/vmx.c |   62
>>>  ++++++++++++++++++++++++++++++++++++++++++++++------ 1 files changed,
>>>  55 insertions(+), 7 deletions(-)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index be66c3e..9b5e7a2 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>>  	spin_unlock(&vmx_vpid_lock);
>>>  }
>>> -static void __vmx_disable_intercept_for_msr(unsigned long
>>> *msr_bitmap, u32 msr) +#define MSR_TYPE_R	1 +#define MSR_TYPE_W	2
>>> +static void __vmx_disable_intercept_for_msr(unsigned long
>>> *msr_bitmap, + 				u32 msr, int type)
>>>  {
>>>  	int f = sizeof(unsigned long);
>>> @@ -3786,20 +3789,52 @@ static void
>> __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>  	 * We can control MSRs 0x00000000-0x00001fff and
>>>  0xc0000000-0xc0001fff. 	 */ 	if (msr <= 0x1fff) {
>>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>>> +		if (type & MSR_TYPE_R)
>>> +			/* read-low */
>>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>>> +
>>> +		if (type & MSR_TYPE_W)
>>> +			/* write-low */
>>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>>> +
>>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>  		msr &= 0x1fff;
>>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>>> +		if (type & MSR_TYPE_R)
>>> +			/* read-high */
>>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>>> +
>>> +		if (type & MSR_TYPE_W)
>>> +			/* write-high */
>>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>>> +
>>>  	}
>>>  }
>>>  
>>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>>>  {
>>>  	if (!longmode_only)
>>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
>>> MSR_TYPE_R | MSR_TYPE_W);
>>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, + 					msr,
>>> MSR_TYPE_R | MSR_TYPE_W); +} + +static void
>>> vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only) +{ +
>>> 	if (!longmode_only)
>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
>>> MSR_TYPE_R); +
>>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +					msr,
>>> MSR_TYPE_R); +} + +static void vmx_disable_intercept_for_msr_write(u32
>>> msr, bool longmode_only) +{ +	if (!longmode_only)
>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
>>> MSR_TYPE_W); +
>>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +					msr,
>>> MSR_TYPE_W);
>>>  }
>>>  
>>>  /* @@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
>>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
>>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
>>> +	if (enable_apicv_reg_vid) {
>>> +		int msr;
>>> +		for (msr = 0x800; msr <= 0x8ff; msr++)
>>> +			vmx_disable_intercept_for_msr_read(msr, false);
>>> +
>>> +		/* TPR */
>>> +		vmx_disable_intercept_for_msr_write(0x808, false);
>>> +		/* EOI */
>>> +		vmx_disable_intercept_for_msr_write(0x80b, false);
>>> +		/* SELF-IPI */
>>> +		vmx_disable_intercept_for_msr_write(0x83f, false);
>>> +	}
>>> +
>>>  	if (enable_ept) {
>>>  		kvm_mmu_set_mask_ptes(0ull,
>>>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
>>> --
>>> 1.7.1
>> 
>> --
>> 			Gleb.
> 
> 
> Best regards,
> Yang
>


Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-24  2:35     ` Zhang, Yang Z
@ 2012-12-24  9:23       ` Gleb Natapov
  2012-12-24 23:53         ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-24  9:23 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> Zhang, Yang Z wrote on 2012-12-24:
> > Gleb Natapov wrote on 2012-12-20:
> >> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>> basically to benefit from apicv, we need clear MSR bitmap for
> >>> corresponding x2apic MSRs:
> >>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
> >>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> >> We do not set "Virtualize x2APIC mode" bit in secondary execution
> >> control. If I read the spec correctly without that those MSR read/writes
> >> will go straight to physical local APIC.
> > Right. Now it cannot get benefit, but we may enable it in future and then we can
> > benefit from it.
Without enabling it you cannot disable MSR intercept for x2apic MSRs.

> how about to add the following check:
> if (apicv_enabled && virtual_x2apic_enabled)
> 	clear_msr();
> 
I do not understand what do you mean here.

> 
> >>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >>> ---
> >>>  arch/x86/kvm/vmx.c |   62
> >>>  ++++++++++++++++++++++++++++++++++++++++++++++------ 1 files changed,
> >>>  55 insertions(+), 7 deletions(-)
> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>> index be66c3e..9b5e7a2 100644
> >>> --- a/arch/x86/kvm/vmx.c
> >>> +++ b/arch/x86/kvm/vmx.c
> >>> @@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
> >>>  	spin_unlock(&vmx_vpid_lock);
> >>>  }
> >>> -static void __vmx_disable_intercept_for_msr(unsigned long
> >>> *msr_bitmap, u32 msr) +#define MSR_TYPE_R	1 +#define MSR_TYPE_W	2
> >>> +static void __vmx_disable_intercept_for_msr(unsigned long
> >>> *msr_bitmap, + 				u32 msr, int type)
> >>>  {
> >>>  	int f = sizeof(unsigned long);
> >>> @@ -3786,20 +3789,52 @@ static void
> >> __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> >>>  	 * We can control MSRs 0x00000000-0x00001fff and
> >>>  0xc0000000-0xc0001fff. 	 */ 	if (msr <= 0x1fff) {
> >>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
> >>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
> >>> +		if (type & MSR_TYPE_R)
> >>> +			/* read-low */
> >>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
> >>> +
> >>> +		if (type & MSR_TYPE_W)
> >>> +			/* write-low */
> >>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
> >>> +
> >>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> >>>  		msr &= 0x1fff;
> >>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
> >>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
> >>> +		if (type & MSR_TYPE_R)
> >>> +			/* read-high */
> >>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
> >>> +
> >>> +		if (type & MSR_TYPE_W)
> >>> +			/* write-high */
> >>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
> >>> +
> >>>  	}
> >>>  }
> >>>  
> >>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
> >>>  {
> >>>  	if (!longmode_only)
> >>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
> >>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
> >>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
> >>> MSR_TYPE_R | MSR_TYPE_W);
> >>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, + 					msr,
> >>> MSR_TYPE_R | MSR_TYPE_W); +} + +static void
> >>> vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only) +{ +
> >>> 	if (!longmode_only)
> >>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
> >>> MSR_TYPE_R); +
> >>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +					msr,
> >>> MSR_TYPE_R); +} + +static void vmx_disable_intercept_for_msr_write(u32
> >>> msr, bool longmode_only) +{ +	if (!longmode_only)
> >>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, + 					msr,
> >>> MSR_TYPE_W); +
> >>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +					msr,
> >>> MSR_TYPE_W);
> >>>  }
> >>>  
> >>>  /* @@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
> >>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
> >>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
> >>> +	if (enable_apicv_reg_vid) {
> >>> +		int msr;
> >>> +		for (msr = 0x800; msr <= 0x8ff; msr++)
> >>> +			vmx_disable_intercept_for_msr_read(msr, false);
> >>> +
> >>> +		/* TPR */
> >>> +		vmx_disable_intercept_for_msr_write(0x808, false);
> >>> +		/* EOI */
> >>> +		vmx_disable_intercept_for_msr_write(0x80b, false);
> >>> +		/* SELF-IPI */
> >>> +		vmx_disable_intercept_for_msr_write(0x83f, false);
> >>> +	}
> >>> +
> >>>  	if (enable_ept) {
> >>>  		kvm_mmu_set_mask_ptes(0ull,
> >>>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> >>> --
> >>> 1.7.1
> >> 
> >> --
> >> 			Gleb.
> > 
> > 
> > Best regards,
> > Yang
> >
> 
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-24  9:23       ` Gleb Natapov
@ 2012-12-24 23:53         ` Zhang, Yang Z
  2012-12-25  6:38           ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-24 23:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-24:
> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
>> Zhang, Yang Z wrote on 2012-12-24:
>>> Gleb Natapov wrote on 2012-12-20:
>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>>>> basically to benefit from apicv, we need clear MSR bitmap for
>>>>> corresponding x2apic MSRs:
>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
>>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
>>>> control. If I read the spec correctly without that those MSR read/writes
>>>> will go straight to physical local APIC.
>>> Right. Now it cannot get benefit, but we may enable it in future and
>>> then we can benefit from it.
> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
> 
>> how about to add the following check:
>> if (apicv_enabled && virtual_x2apic_enabled)
>> 	clear_msr();
>> 
> I do not understand what do you mean here.
In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As you said, since kvm doesn't set "virtualize x2apic mode", APIC register virtualization never take effect. So we need to clear MSR bitmap only when apicv enabled and virtualize x2apic mode set.

>> 
>>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>>> ---
>>>>>  arch/x86/kvm/vmx.c |   62
>>>>>  ++++++++++++++++++++++++++++++++++++++++++++++------ 1 files
>>>>>  changed, 55 insertions(+), 7 deletions(-)
>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>> index be66c3e..9b5e7a2 100644
>>>>> --- a/arch/x86/kvm/vmx.c
>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>> @@ -3773,7 +3773,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>>>>  	spin_unlock(&vmx_vpid_lock);
>>>>>  }
>>>>> -static void __vmx_disable_intercept_for_msr(unsigned long
>>>>> *msr_bitmap, u32 msr) +#define MSR_TYPE_R	1 +#define MSR_TYPE_W	2
>>>>> +static void __vmx_disable_intercept_for_msr(unsigned long
>>>>> *msr_bitmap, + 				u32 msr, int type)
>>>>>  {
>>>>>  	int f = sizeof(unsigned long);
>>>>> @@ -3786,20 +3789,52 @@ static void
>>>> __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>>>  	 * We can control MSRs 0x00000000-0x00001fff and
>>>>>  0xc0000000-0xc0001fff. 	 */ 	if (msr <= 0x1fff) {
>>>>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>>>>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>>>>> +		if (type & MSR_TYPE_R)
>>>>> +			/* read-low */
>>>>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>>>>> +
>>>>> +		if (type & MSR_TYPE_W)
>>>>> +			/* write-low */
>>>>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>>>>> +
>>>>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>>>  		msr &= 0x1fff;
>>>>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>>>>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>>>>> +		if (type & MSR_TYPE_R)
>>>>> +			/* read-high */
>>>>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>>>>> +
>>>>> +		if (type & MSR_TYPE_W)
>>>>> +			/* write-high */
>>>>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>>>>> +
>>>>>  	}
>>>>>  }
>>>>>  
>>>>>  static void vmx_disable_intercept_for_msr(u32 msr, bool
>>>>>  longmode_only) { 	if (!longmode_only)
>>>>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>>>>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +
>>>>> 					msr, MSR_TYPE_R | MSR_TYPE_W);
>>>>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +
>>>>> 					msr, MSR_TYPE_R | MSR_TYPE_W); +} + +static void
>>>>> vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only) +{ +
>>>>> 	if (!longmode_only)
>>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +
>>>>> 					msr, MSR_TYPE_R); +
>>>>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +
>>>>> 					msr, MSR_TYPE_R); +} + +static void
>>>>> vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only) +{
>>>>> +	if (!longmode_only)
>>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, +
>>>>> 					msr, MSR_TYPE_W); +
>>>>> 	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, +
>>>>> 					msr, MSR_TYPE_W);
>>>>>  }
>>>>>  
>>>>>  /* @@ -7633,6 +7668,19 @@ static int __init vmx_init(void)
>>>>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
>>>>>  	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
>>>>> +	if (enable_apicv_reg_vid) {
>>>>> +		int msr;
>>>>> +		for (msr = 0x800; msr <= 0x8ff; msr++)
>>>>> +			vmx_disable_intercept_for_msr_read(msr, false);
>>>>> +
>>>>> +		/* TPR */
>>>>> +		vmx_disable_intercept_for_msr_write(0x808, false);
>>>>> +		/* EOI */
>>>>> +		vmx_disable_intercept_for_msr_write(0x80b, false);
>>>>> +		/* SELF-IPI */
>>>>> +		vmx_disable_intercept_for_msr_write(0x83f, false);
>>>>> +	}
>>>>> +
>>>>>  	if (enable_ept) {
>>>>>  		kvm_mmu_set_mask_ptes(0ull,
>>>>>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
>>>>> --
>>>>> 1.7.1
>>>> 
>>>> --
>>>> 			Gleb.
>>> 
>>> 
>>> Best regards,
>>> Yang
>>> 
>> 
>> 
>> Best regards,
>> Yang
>> 
> 
> --
> 			Gleb.


Best regards,
Yang


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-24 23:53         ` Zhang, Yang Z
@ 2012-12-25  6:38           ` Gleb Natapov
  2012-12-25  6:42             ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-25  6:38 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-24:
> > On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> >> Zhang, Yang Z wrote on 2012-12-24:
> >>> Gleb Natapov wrote on 2012-12-20:
> >>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>>>> basically to benefit from apicv, we need clear MSR bitmap for
> >>>>> corresponding x2apic MSRs:
> >>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
> >>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> >>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
> >>>> control. If I read the spec correctly without that those MSR read/writes
> >>>> will go straight to physical local APIC.
> >>> Right. Now it cannot get benefit, but we may enable it in future and
> >>> then we can benefit from it.
> > Without enabling it you cannot disable MSR intercept for x2apic MSRs.
> > 
> >> how about to add the following check:
> >> if (apicv_enabled && virtual_x2apic_enabled)
> >> 	clear_msr();
> >> 
> > I do not understand what do you mean here.
> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As you said, since kvm doesn't set "virtualize x2apic mode", APIC register virtualization never take effect. So we need to clear MSR bitmap only when apicv enabled and virtualize x2apic mode set.
> 
But currently it is never set.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  6:38           ` Gleb Natapov
@ 2012-12-25  6:42             ` Zhang, Yang Z
  2012-12-25  6:50               ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-25  6:42 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-25:
> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-24:
>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
>>>> Zhang, Yang Z wrote on 2012-12-24:
>>>>> Gleb Natapov wrote on 2012-12-20:
>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
>>>>>>> corresponding x2apic MSRs:
>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
>>>>>> control. If I read the spec correctly without that those MSR read/writes
>>>>>> will go straight to physical local APIC.
>>>>> Right. Now it cannot get benefit, but we may enable it in future and
>>>>> then we can benefit from it.
>>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
>>> 
>>>> how about to add the following check:
>>>> if (apicv_enabled && virtual_x2apic_enabled)
>>>> 	clear_msr();
>>>> 
>>> I do not understand what do you mean here.
>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As you
> said, since kvm doesn't set "virtualize x2apic mode", APIC register virtualization
> never take effect. So we need to clear MSR bitmap only when apicv enabled and
> virtualize x2apic mode set.
>> 
> But currently it is never set.
So you think the third patch is not necessary currently? Unless we enabled "virtualize x2apic mode".

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  6:42             ` Zhang, Yang Z
@ 2012-12-25  6:50               ` Gleb Natapov
  2012-12-25  7:25                 ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-25  6:50 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-25:
> > On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-12-24:
> >>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> >>>> Zhang, Yang Z wrote on 2012-12-24:
> >>>>> Gleb Natapov wrote on 2012-12-20:
> >>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
> >>>>>>> corresponding x2apic MSRs:
> >>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
> >>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> >>>>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
> >>>>>> control. If I read the spec correctly without that those MSR read/writes
> >>>>>> will go straight to physical local APIC.
> >>>>> Right. Now it cannot get benefit, but we may enable it in future and
> >>>>> then we can benefit from it.
> >>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
> >>> 
> >>>> how about to add the following check:
> >>>> if (apicv_enabled && virtual_x2apic_enabled)
> >>>> 	clear_msr();
> >>>> 
> >>> I do not understand what do you mean here.
> >> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As you
> > said, since kvm doesn't set "virtualize x2apic mode", APIC register virtualization
> > never take effect. So we need to clear MSR bitmap only when apicv enabled and
> > virtualize x2apic mode set.
> >> 
> > But currently it is never set.
> So you think the third patch is not necessary currently? Unless we enabled "virtualize x2apic mode".
> 
Without third patch vid will not work properly if a guest is in x2apic
mode. Actually second and third patches need to be reordered to not have
a windows where x2apic is broken. The problem is that this patch itself
is buggy since it does not set "virtualize x2apic mode" flag. It should
set the flag if vid is enabled and if the flag cannot be set vid should
be forced off.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  6:50               ` Gleb Natapov
@ 2012-12-25  7:25                 ` Zhang, Yang Z
  2012-12-25  7:31                   ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-25  7:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-25:
> On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-25:
>>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2012-12-24:
>>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
>>>>>> Zhang, Yang Z wrote on 2012-12-24:
>>>>>>> Gleb Natapov wrote on 2012-12-20:
>>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
>>>>>>>>> corresponding x2apic MSRs:
>>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>>>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
>>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
>>>>>>>> control. If I read the spec correctly without that those MSR read/writes
>>>>>>>> will go straight to physical local APIC.
>>>>>>> Right. Now it cannot get benefit, but we may enable it in future and
>>>>>>> then we can benefit from it.
>>>>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
>>>>> 
>>>>>> how about to add the following check:
>>>>>> if (apicv_enabled && virtual_x2apic_enabled)
>>>>>> 	clear_msr();
>>>>>> 
>>>>> I do not understand what do you mean here.
>>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As
> you
>>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
>>> virtualization never take effect. So we need to clear MSR bitmap only
>>> when apicv enabled and virtualize x2apic mode set.
>>>> 
>>> But currently it is never set.
>> So you think the third patch is not necessary currently? Unless we
>> enabled "virtualize x2apic mode".
>> 
> Without third patch vid will not work properly if a guest is in x2apic
> mode. Actually second and third patches need to be reordered to not have
> a windows where x2apic is broken. The problem is that this patch itself
> is buggy since it does not set "virtualize x2apic mode" flag. It should
> set the flag if vid is enabled and if the flag cannot be set vid should
> be forced off.
In what conditions this flag cannot be set? I think the only case is that KVM doesn't expose the x2apic capability to guest, if this is true, the guest will never use x2apic and we still can use vid.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  7:25                 ` Zhang, Yang Z
@ 2012-12-25  7:31                   ` Gleb Natapov
  2012-12-25  7:46                     ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-25  7:31 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Tue, Dec 25, 2012 at 07:25:15AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-25:
> > On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-12-25:
> >>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-12-24:
> >>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> >>>>>> Zhang, Yang Z wrote on 2012-12-24:
> >>>>>>> Gleb Natapov wrote on 2012-12-20:
> >>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
> >>>>>>>>> corresponding x2apic MSRs:
> >>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
> >>>>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery
> >>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary execution
> >>>>>>>> control. If I read the spec correctly without that those MSR read/writes
> >>>>>>>> will go straight to physical local APIC.
> >>>>>>> Right. Now it cannot get benefit, but we may enable it in future and
> >>>>>>> then we can benefit from it.
> >>>>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
> >>>>> 
> >>>>>> how about to add the following check:
> >>>>>> if (apicv_enabled && virtual_x2apic_enabled)
> >>>>>> 	clear_msr();
> >>>>>> 
> >>>>> I do not understand what do you mean here.
> >>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled. As
> > you
> >>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
> >>> virtualization never take effect. So we need to clear MSR bitmap only
> >>> when apicv enabled and virtualize x2apic mode set.
> >>>> 
> >>> But currently it is never set.
> >> So you think the third patch is not necessary currently? Unless we
> >> enabled "virtualize x2apic mode".
> >> 
> > Without third patch vid will not work properly if a guest is in x2apic
> > mode. Actually second and third patches need to be reordered to not have
> > a windows where x2apic is broken. The problem is that this patch itself
> > is buggy since it does not set "virtualize x2apic mode" flag. It should
> > set the flag if vid is enabled and if the flag cannot be set vid should
> > be forced off.
> In what conditions this flag cannot be set? I think the only case is that KVM doesn't expose the x2apic capability to guest, if this is true, the guest will never use x2apic and we still can use vid.
> 
We can indeed set "virtualize x2apic mode" unconditionally since it does
not take any effect if x2apic MSRs are intercepted.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  7:31                   ` Gleb Natapov
@ 2012-12-25  7:46                     ` Zhang, Yang Z
  2012-12-25  7:52                       ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-25  7:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-25:
> On Tue, Dec 25, 2012 at 07:25:15AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-25:
>>> On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2012-12-25:
>>>>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
>>>>>> Gleb Natapov wrote on 2012-12-24:
>>>>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
>>>>>>>> Zhang, Yang Z wrote on 2012-12-24:
>>>>>>>>> Gleb Natapov wrote on 2012-12-20:
>>>>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>>>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
>>>>>>>>>>> corresponding x2apic MSRs:
>>>>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
>>>>>>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt
> delivery
>>>>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary
>>>>>>>>>> execution control. If I read the spec correctly without that
>>>>>>>>>> those MSR read/writes will go straight to physical local APIC.
>>>>>>>>> Right. Now it cannot get benefit, but we may enable it in future and
>>>>>>>>> then we can benefit from it.
>>>>>>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
>>>>>>> 
>>>>>>>> how about to add the following check:
>>>>>>>> if (apicv_enabled && virtual_x2apic_enabled)
>>>>>>>> 	clear_msr();
>>>>>>>> 
>>>>>>> I do not understand what do you mean here.
>>>>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled.
> As
>>> you
>>>>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
>>>>> virtualization never take effect. So we need to clear MSR bitmap only
>>>>> when apicv enabled and virtualize x2apic mode set.
>>>>>> 
>>>>> But currently it is never set.
>>>> So you think the third patch is not necessary currently? Unless we
>>>> enabled "virtualize x2apic mode".
>>>> 
>>> Without third patch vid will not work properly if a guest is in x2apic
>>> mode. Actually second and third patches need to be reordered to not have
>>> a windows where x2apic is broken. The problem is that this patch itself
>>> is buggy since it does not set "virtualize x2apic mode" flag. It should
>>> set the flag if vid is enabled and if the flag cannot be set vid should
>>> be forced off.
>> In what conditions this flag cannot be set? I think the only case is that KVM
> doesn't expose the x2apic capability to guest, if this is true, the guest will never
> use x2apic and we still can use vid.
>> 
> We can indeed set "virtualize x2apic mode" unconditionally since it does
> not take any effect if x2apic MSRs are intercepted.
No. Since "Virtual APIC access" must be cleared if "virtualize x2apic mode" is set, and if guest still use xAPIC, then there should be lots of ept violations for apic access emulation. This will hurt performance.
We should only set "virtualize x2apic mode" when guest really uses x2apic(guest set bit 11 of APIC_BASE_MSR).

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 23:07           ` Marcelo Tosatti
@ 2012-12-25  7:49             ` Zhang, Yang Z
  0 siblings, 0 replies; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-25  7:49 UTC (permalink / raw)
  To: Marcelo Tosatti, Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Marcelo Tosatti wrote on 2012-12-21:
> On Thu, Dec 20, 2012 at 03:12:32PM +0200, Gleb Natapov wrote:
>> On Thu, Dec 20, 2012 at 10:53:16AM -0200, Marcelo Tosatti wrote:
>>> On Thu, Dec 20, 2012 at 08:42:06AM +0200, Gleb Natapov wrote:
>>>> On Wed, Dec 19, 2012 at 10:59:36PM -0200, Marcelo Tosatti wrote:
>>>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>>>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>>>> 
>>>>>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
>>>>>> manually, which is fully taken care of by the hardware. This needs
>>>>>> some special awareness into existing interrupr injection path:
>>>>>> 
>>>>>> - for pending interrupt, instead of direct injection, we may need
>>>>>>   update architecture specific indicators before resuming to guest.
>>>>>> - A pending interrupt, which is masked by ISR, should be also
>>>>>>   considered in above update action, since hardware will decide
>>>>>>   when to inject it at right time. Current has_interrupt and
>>>>>>   get_interrupt only returns a valid vector from injection p.o.v.
>>>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>>>>> 
>>>>> 
>>>>> Resuming previous discussion:
>>>>> 
>>>>>>> How about to recaculate irr_pending according the VIRR on each
>>>>>>> vmexit?
>>>>>>> 
>>>>>> No need really. Since HW can only clear VIRR the only situation that
>>>>>> may
>>>>>> happen is that irr_pending will be true but VIRR is empty and
>>>>>> apic_find_highest_irr() will return correct result in this case.
>>>>> 
>>>>> Self-IPI does cause VIRR to be set, see "29.1.5 Self-IPI
>>>>> Virtualization".
>>>>> 
>>>> True. But as I said later in that discussion once irr_pending is set
>>>> to true it never becomes false, so the optimization is effectively
>>>> disable. We can set it to true doing apic initialization to make it
>>>> explicit.
>>> 
>>> Its just confusing, to have a variable which has different meanings
>>> in different configurations. I would rather have it explicit that
>>> its not used rather than check every time the i read the code.
>>> 
>>> if (apic_vid() == 0 && !apic->irr_pending)
>>> 	return -1;
>>> 
>> I'd prefer to avoid this additional if() especially as its sole purpose
>> is documentation.  We can add comment instead. Note that irr_pending
>> is just a hint anyway.  It can be true when no interrupt is pending in
>> irr. We can even rename it to irr_pending_hint or something.
> 
> Works for me (documentation).
> 
>>> Not sure if you can skip it, its probably necessary to calculate it
>>> before HW does so (say migration etc).
>> kvm_apic_has_interrupt() is not called during migration and
>> kvm_apic_post_state_restore() calls apic_update_ppr() explicitly.
>> I am not sure it is needed though since migrated value should be already
>> correct anyway.
> 
> Ok, best force isr_count to 1 if apic vintr enabled (and add a comment,
> please).

Sorry for the later reply. I will address those problems according your comments.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  7:46                     ` Zhang, Yang Z
@ 2012-12-25  7:52                       ` Gleb Natapov
  2012-12-25  8:24                         ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-25  7:52 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Tue, Dec 25, 2012 at 07:46:53AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-25:
> > On Tue, Dec 25, 2012 at 07:25:15AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-12-25:
> >>> On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-12-25:
> >>>>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
> >>>>>> Gleb Natapov wrote on 2012-12-24:
> >>>>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> >>>>>>>> Zhang, Yang Z wrote on 2012-12-24:
> >>>>>>>>> Gleb Natapov wrote on 2012-12-20:
> >>>>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>>>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
> >>>>>>>>>>> corresponding x2apic MSRs:
> >>>>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization
> >>>>>>>>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt
> > delivery
> >>>>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary
> >>>>>>>>>> execution control. If I read the spec correctly without that
> >>>>>>>>>> those MSR read/writes will go straight to physical local APIC.
> >>>>>>>>> Right. Now it cannot get benefit, but we may enable it in future and
> >>>>>>>>> then we can benefit from it.
> >>>>>>> Without enabling it you cannot disable MSR intercept for x2apic MSRs.
> >>>>>>> 
> >>>>>>>> how about to add the following check:
> >>>>>>>> if (apicv_enabled && virtual_x2apic_enabled)
> >>>>>>>> 	clear_msr();
> >>>>>>>> 
> >>>>>>> I do not understand what do you mean here.
> >>>>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv enabled.
> > As
> >>> you
> >>>>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
> >>>>> virtualization never take effect. So we need to clear MSR bitmap only
> >>>>> when apicv enabled and virtualize x2apic mode set.
> >>>>>> 
> >>>>> But currently it is never set.
> >>>> So you think the third patch is not necessary currently? Unless we
> >>>> enabled "virtualize x2apic mode".
> >>>> 
> >>> Without third patch vid will not work properly if a guest is in x2apic
> >>> mode. Actually second and third patches need to be reordered to not have
> >>> a windows where x2apic is broken. The problem is that this patch itself
> >>> is buggy since it does not set "virtualize x2apic mode" flag. It should
> >>> set the flag if vid is enabled and if the flag cannot be set vid should
> >>> be forced off.
> >> In what conditions this flag cannot be set? I think the only case is that KVM
> > doesn't expose the x2apic capability to guest, if this is true, the guest will never
> > use x2apic and we still can use vid.
> >> 
> > We can indeed set "virtualize x2apic mode" unconditionally since it does
> > not take any effect if x2apic MSRs are intercepted.
> No. Since "Virtual APIC access" must be cleared if "virtualize x2apic mode" is set, and if guest still use xAPIC, then there should be lots of ept violations for apic access emulation. This will hurt performance.
Stupid HW, why this pointless limitation? Can you point me where SDM says that?

> We should only set "virtualize x2apic mode" when guest really uses x2apic(guest set bit 11 of APIC_BASE_MSR).
> 
Looks like SDM force us to.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  7:52                       ` Gleb Natapov
@ 2012-12-25  8:24                         ` Zhang, Yang Z
  2012-12-25 11:58                           ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-25  8:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-25:
> On Tue, Dec 25, 2012 at 07:46:53AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-25:
>>> On Tue, Dec 25, 2012 at 07:25:15AM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2012-12-25:
>>>>> On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
>>>>>> Gleb Natapov wrote on 2012-12-25:
>>>>>>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
>>>>>>>> Gleb Natapov wrote on 2012-12-24:
>>>>>>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
>>>>>>>>>> Zhang, Yang Z wrote on 2012-12-24:
>>>>>>>>>>> Gleb Natapov wrote on 2012-12-20:
>>>>>>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
>>>>>>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
>>>>>>>>>>>>> corresponding x2apic MSRs:
>>>>>>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register
>>>>>>>>>>>>>     virtualization TPR,EOI,SELF-IPI: no write intercept for
>>>>>>>>>>>>>     virtual interrupt
>>> delivery
>>>>>>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary
>>>>>>>>>>>> execution control. If I read the spec correctly without that
>>>>>>>>>>>> those MSR read/writes will go straight to physical local APIC.
>>>>>>>>>>> Right. Now it cannot get benefit, but we may enable it in
>>>>>>>>>>> future and then we can benefit from it.
>>>>>>>>> Without enabling it you cannot disable MSR intercept for x2apic
>>>>>>>>> MSRs.
>>>>>>>>> 
>>>>>>>>>> how about to add the following check:
>>>>>>>>>> if (apicv_enabled && virtual_x2apic_enabled)
>>>>>>>>>> 	clear_msr();
>>>>>>>>>> 
>>>>>>>>> I do not understand what do you mean here.
>>>>>>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv
> enabled.
>>> As
>>>>> you
>>>>>>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
>>>>>>> virtualization never take effect. So we need to clear MSR bitmap only
>>>>>>> when apicv enabled and virtualize x2apic mode set.
>>>>>>>> 
>>>>>>> But currently it is never set.
>>>>>> So you think the third patch is not necessary currently? Unless we
>>>>>> enabled "virtualize x2apic mode".
>>>>>> 
>>>>> Without third patch vid will not work properly if a guest is in x2apic
>>>>> mode. Actually second and third patches need to be reordered to not have
>>>>> a windows where x2apic is broken. The problem is that this patch itself
>>>>> is buggy since it does not set "virtualize x2apic mode" flag. It should
>>>>> set the flag if vid is enabled and if the flag cannot be set vid should
>>>>> be forced off.
>>>> In what conditions this flag cannot be set? I think the only case is that KVM
>>> doesn't expose the x2apic capability to guest, if this is true, the
>>> guest will never use x2apic and we still can use vid.
>>>> 
>>> We can indeed set "virtualize x2apic mode" unconditionally since it does
>>> not take any effect if x2apic MSRs are intercepted.
>> No. Since "Virtual APIC access" must be cleared if "virtualize x2apic mode" is set,
> and if guest still use xAPIC, then there should be lots of ept violations for apic
> access emulation. This will hurt performance.
> Stupid HW, why this pointless limitation? Can you point me where SDM says that?
Vol 3, 26.2.1.1

>> We should only set "virtualize x2apic mode" when guest really uses
>> x2apic(guest set bit 11 of APIC_BASE_MSR).
>> 
> Looks like SDM force us to.
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/3] x86, apicv: add virtual x2apic support
  2012-12-25  8:24                         ` Zhang, Yang Z
@ 2012-12-25 11:58                           ` Gleb Natapov
  0 siblings, 0 replies; 43+ messages in thread
From: Gleb Natapov @ 2012-12-25 11:58 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm, Shan, Haitao, Tian, Kevin

On Tue, Dec 25, 2012 at 08:24:43AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-25:
> > On Tue, Dec 25, 2012 at 07:46:53AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-12-25:
> >>> On Tue, Dec 25, 2012 at 07:25:15AM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-12-25:
> >>>>> On Tue, Dec 25, 2012 at 06:42:59AM +0000, Zhang, Yang Z wrote:
> >>>>>> Gleb Natapov wrote on 2012-12-25:
> >>>>>>> On Mon, Dec 24, 2012 at 11:53:37PM +0000, Zhang, Yang Z wrote:
> >>>>>>>> Gleb Natapov wrote on 2012-12-24:
> >>>>>>>>> On Mon, Dec 24, 2012 at 02:35:35AM +0000, Zhang, Yang Z wrote:
> >>>>>>>>>> Zhang, Yang Z wrote on 2012-12-24:
> >>>>>>>>>>> Gleb Natapov wrote on 2012-12-20:
> >>>>>>>>>>>> On Mon, Dec 17, 2012 at 01:30:50PM +0800, Yang Zhang wrote:
> >>>>>>>>>>>>> basically to benefit from apicv, we need clear MSR bitmap for
> >>>>>>>>>>>>> corresponding x2apic MSRs:
> >>>>>>>>>>>>>     0x800 - 0x8ff: no read intercept for apicv register
> >>>>>>>>>>>>>     virtualization TPR,EOI,SELF-IPI: no write intercept for
> >>>>>>>>>>>>>     virtual interrupt
> >>> delivery
> >>>>>>>>>>>> We do not set "Virtualize x2APIC mode" bit in secondary
> >>>>>>>>>>>> execution control. If I read the spec correctly without that
> >>>>>>>>>>>> those MSR read/writes will go straight to physical local APIC.
> >>>>>>>>>>> Right. Now it cannot get benefit, but we may enable it in
> >>>>>>>>>>> future and then we can benefit from it.
> >>>>>>>>> Without enabling it you cannot disable MSR intercept for x2apic
> >>>>>>>>> MSRs.
> >>>>>>>>> 
> >>>>>>>>>> how about to add the following check:
> >>>>>>>>>> if (apicv_enabled && virtual_x2apic_enabled)
> >>>>>>>>>> 	clear_msr();
> >>>>>>>>>> 
> >>>>>>>>> I do not understand what do you mean here.
> >>>>>>>> In this patch, it will clear MSR bitmap(0x800 -0x8ff) when apicv
> > enabled.
> >>> As
> >>>>> you
> >>>>>>> said, since kvm doesn't set "virtualize x2apic mode", APIC register
> >>>>>>> virtualization never take effect. So we need to clear MSR bitmap only
> >>>>>>> when apicv enabled and virtualize x2apic mode set.
> >>>>>>>> 
> >>>>>>> But currently it is never set.
> >>>>>> So you think the third patch is not necessary currently? Unless we
> >>>>>> enabled "virtualize x2apic mode".
> >>>>>> 
> >>>>> Without third patch vid will not work properly if a guest is in x2apic
> >>>>> mode. Actually second and third patches need to be reordered to not have
> >>>>> a windows where x2apic is broken. The problem is that this patch itself
> >>>>> is buggy since it does not set "virtualize x2apic mode" flag. It should
> >>>>> set the flag if vid is enabled and if the flag cannot be set vid should
> >>>>> be forced off.
> >>>> In what conditions this flag cannot be set? I think the only case is that KVM
> >>> doesn't expose the x2apic capability to guest, if this is true, the
> >>> guest will never use x2apic and we still can use vid.
> >>>> 
> >>> We can indeed set "virtualize x2apic mode" unconditionally since it does
> >>> not take any effect if x2apic MSRs are intercepted.
> >> No. Since "Virtual APIC access" must be cleared if "virtualize x2apic mode" is set,
> > and if guest still use xAPIC, then there should be lots of ept violations for apic
> > access emulation. This will hurt performance.
> > Stupid HW, why this pointless limitation? Can you point me where SDM says that?
> Vol 3, 26.2.1.1
> 
Thanks.

> >> We should only set "virtualize x2apic mode" when guest really uses
> >> x2apic(guest set bit 11 of APIC_BASE_MSR).
> >> 
> > Looks like SDM force us to.
> > 
And we can disable x2apic MSR interception only after "virtualize x2apic
mode" is set i.e when guest sets bit 11 of APIC_BASE_MSR.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-21 12:08         ` Gleb Natapov
@ 2012-12-27  2:24           ` Zhang, Yang Z
  2012-12-27  6:23             ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-27  2:24 UTC (permalink / raw)
  To: Gleb Natapov, Marcelo Tosatti; +Cc: kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-21:
> On Fri, Dec 21, 2012 at 09:39:20AM -0200, Marcelo Tosatti wrote:
>> On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
>>> On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
>>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>>> 
>>>>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
>>>>> manually, which is fully taken care of by the hardware. This needs
>>>>> some special awareness into existing interrupr injection path:
>>>>> 
>>>>> - for pending interrupt, instead of direct injection, we may need
>>>>>   update architecture specific indicators before resuming to guest.
>>>>> - A pending interrupt, which is masked by ISR, should be also
>>>>>   considered in above update action, since hardware will decide
>>>>>   when to inject it at right time. Current has_interrupt and
>>>>>   get_interrupt only returns a valid vector from injection p.o.v.
>>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>>>>> ---
>>>>>  arch/ia64/kvm/lapic.h           |    6 ++
>>>>>  arch/x86/include/asm/kvm_host.h |    6 ++
>>>>>  arch/x86/include/asm/vmx.h      |   11 +++ arch/x86/kvm/irq.c      
>>>>>         |   56 +++++++++++++- arch/x86/kvm/lapic.c            |   65
>>>>>  ++++++++++------- arch/x86/kvm/lapic.h            |   28 ++++++-
>>>>>  arch/x86/kvm/svm.c              |   24 ++++++ arch/x86/kvm/vmx.c   
>>>>>            |  154 ++++++++++++++++++++++++++++++++++++++-
>>>>>  arch/x86/kvm/x86.c              |   11 ++- include/linux/kvm_host.h
>>>>>         |    2 + virt/kvm/ioapic.c               |   36 +++++++++
>>>>>  virt/kvm/ioapic.h               |    1 + virt/kvm/irq_comm.c       
>>>>>       |   20 +++++ 13 files changed, 379 insertions(+), 41
>>>>>  deletions(-)
>>>>> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
>>>>> index c5f92a9..cb59eb4 100644
>>>>> --- a/arch/ia64/kvm/lapic.h
>>>>> +++ b/arch/ia64/kvm/lapic.h
>>>>> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct
> kvm_lapic_irq *irq);
>>>>>  #define kvm_apic_present(x) (true)
>>>>>  #define kvm_lapic_enabled(x) (true)
>>>>> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
>>>>> +					struct kvm_lapic_irq *irq)
>>>>> +{
>>>>> +	/* IA64 has no apicv supporting, do nothing here */
>>>>> +}
>>>>> +
>>>>>  #endif
>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>> b/arch/x86/include/asm/kvm_host.h index c431b33..b63a144 100644 ---
>>>>> a/arch/x86/include/asm/kvm_host.h +++
>>>>> b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,11 @@ struct
>>>>> kvm_x86_ops {
>>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>>>>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>>>>> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
>>>>> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq
>>>>> *irq); +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu); +	void
>>>>> (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
>>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>>>>  	int (*get_tdp_level)(void);
>>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool
> is_mmio);
>>>> 
>>>> EOI exit bitmap is problematic (its racy). Please do this:
>>>> 
>>>> 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC
>>>> register modifications which require EOI exit bitmap updates. 2. On
>>>> VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
>>>> and adjusts its own EOI exit bitmap VMCS registers.
>>>> 
>>>> 1) that waits until remote executing VCPUs have acknowledge the request,
>>>> using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
>>>> remote TLB flushes.
>>>> 
>>>> What is the problem now: there is no control over _when_ a VCPU
>>>> updates its EOI exit bitmap VMCS register from the (remotely updated)
>>>> master EOI exit bitmap. The VCPU can be processing a
>>>> KVM_REQ_EOIBITMAP relative to a precedence IOAPIC register write
>>>> while the current IOAPIC register write is updating the EOI exit
>>>> bitmap. There is no way to fix that without locking (which can be
>>>> avoided if the IOAPIC->EOI exit bitmap synchronization is vcpu local).
>>>> 
>>> The race is benign. We have similar one for interrupt injection and
>>> the same race exists on a real HW. The two cases that can happen due
>>> to the race are:
>>> 
>>> 1. exitbitmap bit X is changed from 1 to 0
>>>   No problem. It is harmless to do an exit, on the next entry
>>>   exitbitmap will be fixed. 2. exitbitmap bit X is changed from 0 to 1
>>>   If vcpu serves X at the time this happens it was delivered as edge,
>>>   so no need to exit. The exitbitmap will be updated after the next
>>>   vmexit which will happen due to KVM_REQ_EOIBITMAP processing.
>> 
>> 1. Missed the case where bitmap is being reset (where EOI exit bitmaps
>> are zeroed). In this case vcpu enters guest with incorrect EOI exit
>> bitmap.
>> 
> Right, the bitmap reset us problematic indeed since it does not
> represent real vector configuration change.
> 
>> 2. Missed the case where current code allows vcpu to enter guest
>> with EOI exit bitmap unsynchronized relative to IOAPIC registers
>> (see one KVM_REQ made at a time, no IPI sent). In that case interrupt
>> can be delivered.
>> 
> Ugh, I was sure there is a kick there. Missing kick is just a bug of
> course.
Do you mean add a kick when making KVM_REQ_EOIBITMAP request?

>> Thus the suggestions to update bitmap locally, on entry. Do you
>> see any disadvantage?
>> 
> Only one. The recalculation logic is such that given a vector it
> calculates set of vcpus, so each vcpu will do this calculation for each
> vector and see if it is in the set instead of recalculating once. But
> this should be rare enough for us to not care.
> 
>> Other than that, there is a window between IOAPIC map update and
>> EOI bitmap request, where an interrupt can be delivered without
>> EOI bitmap being updated (which i think local updates don't cover,
>> either).
> Interrupt cannot be delivered through IOAPIC while bitmap is updated
> since IOAPIC has a lock.
> 
>> 
>>> But software really should take care of not changing interrupt vector
>>> configuration while there is an interrupt in flight with the same vector.
>> 
>> None of these are guest faults. As soon as interrupts are allowed they
>> must be handled properly (including synchronized EOI bitmap etc).
>> 
> All my cases are guest faults and the guest will get in trouble on real
> HW too with such behaviour. The case where we clear bitmap before
> recalculation is not a guest fault though and have to be dealt with
> somehow either with locks or your suggestion.
How about to set all bits in eoi bitmap before recalculation. As you said it's harmless to do an vmexit. And eoibitmap will be updated on next vmentry.

Best regards,
Yang


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-20 22:00   ` Marcelo Tosatti
@ 2012-12-27  3:32     ` Zhang, Yang Z
  2012-12-27  6:20       ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-27  3:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, gleb, Shan, Haitao, Tian, Kevin

Marcelo Tosatti wrote on 2012-12-21:
> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>> From: Yang Zhang <yang.z.zhang@Intel.com>
>> 
>> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  				vmx_secondary_exec_control(vmx));
>>  	}
>> +	if (enable_apicv_reg_vid) {
>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
>> +
>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
>> +	}
> 
> AFAICS SVI should be regenerated on migration. Consider:
> 
> 1. vintr delivery, sets SVI = vector = RVI.
> 2. clears RVI.
> 3. migration.
> 4. RVI properly set from VIRR on entry.
> 5. SVI = 0.
> 6. EOI -> EOI virtualization with SVI = 0.
> 
> Could hook into kvm_apic_post_state_restore() to do that (set highest
> index of bit set in VISR).
Ok. How about to make a request(KVM_REQ_UPDATE_SVI) and handle it in vmentry to set highest index of bit in VISR to RVI.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  3:32     ` Zhang, Yang Z
@ 2012-12-27  6:20       ` Gleb Natapov
  2012-12-27  6:34         ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-27  6:20 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

On Thu, Dec 27, 2012 at 03:32:47AM +0000, Zhang, Yang Z wrote:
> Marcelo Tosatti wrote on 2012-12-21:
> > On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> >> From: Yang Zhang <yang.z.zhang@Intel.com>
> >> 
> >> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>  				vmx_secondary_exec_control(vmx));
> >>  	}
> >> +	if (enable_apicv_reg_vid) {
> >> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> >> +
> >> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> >> +	}
> > 
> > AFAICS SVI should be regenerated on migration. Consider:
> > 
> > 1. vintr delivery, sets SVI = vector = RVI.
> > 2. clears RVI.
> > 3. migration.
> > 4. RVI properly set from VIRR on entry.
> > 5. SVI = 0.
> > 6. EOI -> EOI virtualization with SVI = 0.
> > 
> > Could hook into kvm_apic_post_state_restore() to do that (set highest
> > index of bit set in VISR).
> Ok. How about to make a request(KVM_REQ_UPDATE_SVI) and handle it in vmentry to set highest index of bit in VISR to RVI.
> 
Just do it in kvm_apic_post_state_restore() directly, no need to do
a request.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  2:24           ` Zhang, Yang Z
@ 2012-12-27  6:23             ` Gleb Natapov
  2012-12-27  6:25               ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-27  6:23 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

On Thu, Dec 27, 2012 at 02:24:04AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-21:
> > On Fri, Dec 21, 2012 at 09:39:20AM -0200, Marcelo Tosatti wrote:
> >> On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
> >>> On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
> >>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> >>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
> >>>>> 
> >>>>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> >>>>> manually, which is fully taken care of by the hardware. This needs
> >>>>> some special awareness into existing interrupr injection path:
> >>>>> 
> >>>>> - for pending interrupt, instead of direct injection, we may need
> >>>>>   update architecture specific indicators before resuming to guest.
> >>>>> - A pending interrupt, which is masked by ISR, should be also
> >>>>>   considered in above update action, since hardware will decide
> >>>>>   when to inject it at right time. Current has_interrupt and
> >>>>>   get_interrupt only returns a valid vector from injection p.o.v.
> >>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >>>>> ---
> >>>>>  arch/ia64/kvm/lapic.h           |    6 ++
> >>>>>  arch/x86/include/asm/kvm_host.h |    6 ++
> >>>>>  arch/x86/include/asm/vmx.h      |   11 +++ arch/x86/kvm/irq.c      
> >>>>>         |   56 +++++++++++++- arch/x86/kvm/lapic.c            |   65
> >>>>>  ++++++++++------- arch/x86/kvm/lapic.h            |   28 ++++++-
> >>>>>  arch/x86/kvm/svm.c              |   24 ++++++ arch/x86/kvm/vmx.c   
> >>>>>            |  154 ++++++++++++++++++++++++++++++++++++++-
> >>>>>  arch/x86/kvm/x86.c              |   11 ++- include/linux/kvm_host.h
> >>>>>         |    2 + virt/kvm/ioapic.c               |   36 +++++++++
> >>>>>  virt/kvm/ioapic.h               |    1 + virt/kvm/irq_comm.c       
> >>>>>       |   20 +++++ 13 files changed, 379 insertions(+), 41
> >>>>>  deletions(-)
> >>>>> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> >>>>> index c5f92a9..cb59eb4 100644
> >>>>> --- a/arch/ia64/kvm/lapic.h
> >>>>> +++ b/arch/ia64/kvm/lapic.h
> >>>>> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct
> > kvm_lapic_irq *irq);
> >>>>>  #define kvm_apic_present(x) (true)
> >>>>>  #define kvm_lapic_enabled(x) (true)
> >>>>> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> >>>>> +					struct kvm_lapic_irq *irq)
> >>>>> +{
> >>>>> +	/* IA64 has no apicv supporting, do nothing here */
> >>>>> +}
> >>>>> +
> >>>>>  #endif
> >>>>> diff --git a/arch/x86/include/asm/kvm_host.h
> >>>>> b/arch/x86/include/asm/kvm_host.h index c431b33..b63a144 100644 ---
> >>>>> a/arch/x86/include/asm/kvm_host.h +++
> >>>>> b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,11 @@ struct
> >>>>> kvm_x86_ops {
> >>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> >>>>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> >>>>> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> >>>>> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq
> >>>>> *irq); +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu); +	void
> >>>>> (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
> >>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>>>>  	int (*get_tdp_level)(void);
> >>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool
> > is_mmio);
> >>>> 
> >>>> EOI exit bitmap is problematic (its racy). Please do this:
> >>>> 
> >>>> 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC
> >>>> register modifications which require EOI exit bitmap updates. 2. On
> >>>> VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC map
> >>>> and adjusts its own EOI exit bitmap VMCS registers.
> >>>> 
> >>>> 1) that waits until remote executing VCPUs have acknowledge the request,
> >>>> using make_all_cpus_request (see virt/kvm/kvm_main.c), similarly to
> >>>> remote TLB flushes.
> >>>> 
> >>>> What is the problem now: there is no control over _when_ a VCPU
> >>>> updates its EOI exit bitmap VMCS register from the (remotely updated)
> >>>> master EOI exit bitmap. The VCPU can be processing a
> >>>> KVM_REQ_EOIBITMAP relative to a precedence IOAPIC register write
> >>>> while the current IOAPIC register write is updating the EOI exit
> >>>> bitmap. There is no way to fix that without locking (which can be
> >>>> avoided if the IOAPIC->EOI exit bitmap synchronization is vcpu local).
> >>>> 
> >>> The race is benign. We have similar one for interrupt injection and
> >>> the same race exists on a real HW. The two cases that can happen due
> >>> to the race are:
> >>> 
> >>> 1. exitbitmap bit X is changed from 1 to 0
> >>>   No problem. It is harmless to do an exit, on the next entry
> >>>   exitbitmap will be fixed. 2. exitbitmap bit X is changed from 0 to 1
> >>>   If vcpu serves X at the time this happens it was delivered as edge,
> >>>   so no need to exit. The exitbitmap will be updated after the next
> >>>   vmexit which will happen due to KVM_REQ_EOIBITMAP processing.
> >> 
> >> 1. Missed the case where bitmap is being reset (where EOI exit bitmaps
> >> are zeroed). In this case vcpu enters guest with incorrect EOI exit
> >> bitmap.
> >> 
> > Right, the bitmap reset us problematic indeed since it does not
> > represent real vector configuration change.
> > 
> >> 2. Missed the case where current code allows vcpu to enter guest
> >> with EOI exit bitmap unsynchronized relative to IOAPIC registers
> >> (see one KVM_REQ made at a time, no IPI sent). In that case interrupt
> >> can be delivered.
> >> 
> > Ugh, I was sure there is a kick there. Missing kick is just a bug of
> > course.
> Do you mean add a kick when making KVM_REQ_EOIBITMAP request?
> 
Of course, otherwise vcpu is running with stale bitmap.

> >> Thus the suggestions to update bitmap locally, on entry. Do you
> >> see any disadvantage?
> >> 
> > Only one. The recalculation logic is such that given a vector it
> > calculates set of vcpus, so each vcpu will do this calculation for each
> > vector and see if it is in the set instead of recalculating once. But
> > this should be rare enough for us to not care.
> > 
> >> Other than that, there is a window between IOAPIC map update and
> >> EOI bitmap request, where an interrupt can be delivered without
> >> EOI bitmap being updated (which i think local updates don't cover,
> >> either).
> > Interrupt cannot be delivered through IOAPIC while bitmap is updated
> > since IOAPIC has a lock.
> > 
> >> 
> >>> But software really should take care of not changing interrupt vector
> >>> configuration while there is an interrupt in flight with the same vector.
> >> 
> >> None of these are guest faults. As soon as interrupts are allowed they
> >> must be handled properly (including synchronized EOI bitmap etc).
> >> 
> > All my cases are guest faults and the guest will get in trouble on real
> > HW too with such behaviour. The case where we clear bitmap before
> > recalculation is not a guest fault though and have to be dealt with
> > somehow either with locks or your suggestion.
> How about to set all bits in eoi bitmap before recalculation. As you said it's harmless to do an vmexit. And eoibitmap will be updated on next vmentry.
> 
If you set all bits before recalculation they will remain all set after
recalculation too. I do not get what you propose here.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  6:23             ` Gleb Natapov
@ 2012-12-27  6:25               ` Zhang, Yang Z
  2012-12-31 15:02                 ` Marcelo Tosatti
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-27  6:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-27:
> On Thu, Dec 27, 2012 at 02:24:04AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-21:
>>> On Fri, Dec 21, 2012 at 09:39:20AM -0200, Marcelo Tosatti wrote:
>>>> On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
>>>>> On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
>>>>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>>>>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>>>>> 
>>>>>>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
>>>>>>> manually, which is fully taken care of by the hardware. This needs
>>>>>>> some special awareness into existing interrupr injection path:
>>>>>>> 
>>>>>>> - for pending interrupt, instead of direct injection, we may need
>>>>>>>   update architecture specific indicators before resuming to
>>>>>>>   guest. - A pending interrupt, which is masked by ISR, should be
>>>>>>>   also considered in above update action, since hardware will
>>>>>>>   decide when to inject it at right time. Current has_interrupt
>>>>>>>   and get_interrupt only returns a valid vector from injection
>>>>>>>   p.o.v.
>>>>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>>>>>>> ---
>>>>>>>  arch/ia64/kvm/lapic.h           |    6 ++
>>>>>>>  arch/x86/include/asm/kvm_host.h |    6 ++
>>>>>>>  arch/x86/include/asm/vmx.h      |   11 +++ arch/x86/kvm/irq.c
>>>>>>>         |   56 +++++++++++++- arch/x86/kvm/lapic.c            |
> 65
>>>>>>>  ++++++++++------- arch/x86/kvm/lapic.h            |   28 ++++++-
>>>>>>>  arch/x86/kvm/svm.c              |   24 ++++++
> arch/x86/kvm/vmx.c
>>>>>>>            |  154 ++++++++++++++++++++++++++++++++++++++-
>>>>>>>  arch/x86/kvm/x86.c              |   11 ++-
> include/linux/kvm_host.h
>>>>>>>         |    2 + virt/kvm/ioapic.c               |   36 +++++++++
>>>>>>>  virt/kvm/ioapic.h               |    1 + virt/kvm/irq_comm.c
>>>>>>>       |   20 +++++ 13 files changed, 379 insertions(+), 41
>>>>>>>  deletions(-)
>>>>>>> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
>>>>>>> index c5f92a9..cb59eb4 100644
>>>>>>> --- a/arch/ia64/kvm/lapic.h
>>>>>>> +++ b/arch/ia64/kvm/lapic.h
>>>>>>> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu,
> struct
>>> kvm_lapic_irq *irq);
>>>>>>>  #define kvm_apic_present(x) (true)
>>>>>>>  #define kvm_lapic_enabled(x) (true)
>>>>>>> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
>>>>>>> +					struct kvm_lapic_irq *irq)
>>>>>>> +{
>>>>>>> +	/* IA64 has no apicv supporting, do nothing here */
>>>>>>> +}
>>>>>>> +
>>>>>>>  #endif
>>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>>>> b/arch/x86/include/asm/kvm_host.h index c431b33..b63a144 100644 ---
>>>>>>> a/arch/x86/include/asm/kvm_host.h +++
>>>>>>> b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,11 @@ struct
>>>>>>> kvm_x86_ops {
>>>>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>>>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>>>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int
> irr);
>>>>>>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>>>>>>> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
>>>>>>> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq
>>>>>>> *irq); +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu); +	void
>>>>>>> (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
>>>>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>>>>>>  	int (*get_tdp_level)(void);
>>>>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool
>>> is_mmio);
>>>>>> 
>>>>>> EOI exit bitmap is problematic (its racy). Please do this:
>>>>>> 
>>>>>> 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC
>>>>>> register modifications which require EOI exit bitmap updates. 2. On
>>>>>> VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC
>>>>>> map and adjusts its own EOI exit bitmap VMCS registers.
>>>>>> 
>>>>>> 1) that waits until remote executing VCPUs have acknowledge the
>>>>>> request, using make_all_cpus_request (see virt/kvm/kvm_main.c),
>>>>>> similarly to remote TLB flushes.
>>>>>> 
>>>>>> What is the problem now: there is no control over _when_ a VCPU
>>>>>> updates its EOI exit bitmap VMCS register from the (remotely updated)
>>>>>> master EOI exit bitmap. The VCPU can be processing a
>>>>>> KVM_REQ_EOIBITMAP relative to a precedence IOAPIC register write
>>>>>> while the current IOAPIC register write is updating the EOI exit
>>>>>> bitmap. There is no way to fix that without locking (which can be
>>>>>> avoided if the IOAPIC->EOI exit bitmap synchronization is vcpu local).
>>>>>> 
>>>>> The race is benign. We have similar one for interrupt injection and
>>>>> the same race exists on a real HW. The two cases that can happen due
>>>>> to the race are:
>>>>> 
>>>>> 1. exitbitmap bit X is changed from 1 to 0
>>>>>   No problem. It is harmless to do an exit, on the next entry
>>>>>   exitbitmap will be fixed. 2. exitbitmap bit X is changed from 0 to 1
>>>>>   If vcpu serves X at the time this happens it was delivered as edge,
>>>>>   so no need to exit. The exitbitmap will be updated after the next
>>>>>   vmexit which will happen due to KVM_REQ_EOIBITMAP processing.
>>>> 
>>>> 1. Missed the case where bitmap is being reset (where EOI exit bitmaps
>>>> are zeroed). In this case vcpu enters guest with incorrect EOI exit
>>>> bitmap.
>>>> 
>>> Right, the bitmap reset us problematic indeed since it does not
>>> represent real vector configuration change.
>>> 
>>>> 2. Missed the case where current code allows vcpu to enter guest
>>>> with EOI exit bitmap unsynchronized relative to IOAPIC registers
>>>> (see one KVM_REQ made at a time, no IPI sent). In that case interrupt
>>>> can be delivered.
>>>> 
>>> Ugh, I was sure there is a kick there. Missing kick is just a bug of
>>> course.
>> Do you mean add a kick when making KVM_REQ_EOIBITMAP request?
>> 
> Of course, otherwise vcpu is running with stale bitmap.
Right.
 
>>>> Thus the suggestions to update bitmap locally, on entry. Do you
>>>> see any disadvantage?
>>>> 
>>> Only one. The recalculation logic is such that given a vector it
>>> calculates set of vcpus, so each vcpu will do this calculation for each
>>> vector and see if it is in the set instead of recalculating once. But
>>> this should be rare enough for us to not care.
>>> 
>>>> Other than that, there is a window between IOAPIC map update and
>>>> EOI bitmap request, where an interrupt can be delivered without
>>>> EOI bitmap being updated (which i think local updates don't cover,
>>>> either).
>>> Interrupt cannot be delivered through IOAPIC while bitmap is updated
>>> since IOAPIC has a lock.
>>> 
>>>> 
>>>>> But software really should take care of not changing interrupt vector
>>>>> configuration while there is an interrupt in flight with the same vector.
>>>> 
>>>> None of these are guest faults. As soon as interrupts are allowed they
>>>> must be handled properly (including synchronized EOI bitmap etc).
>>>> 
>>> All my cases are guest faults and the guest will get in trouble on real
>>> HW too with such behaviour. The case where we clear bitmap before
>>> recalculation is not a guest fault though and have to be dealt with
>>> somehow either with locks or your suggestion.
>> How about to set all bits in eoi bitmap before recalculation. As you
>> said it's harmless to do an vmexit. And eoibitmap will be updated on
>> next vmentry.
>> 
> If you set all bits before recalculation they will remain all set after
> recalculation too. I do not get what you propose here.
Yes. I forget it.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  6:20       ` Gleb Natapov
@ 2012-12-27  6:34         ` Zhang, Yang Z
  2012-12-27  6:38           ` Gleb Natapov
  0 siblings, 1 reply; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-27  6:34 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-27:
> On Thu, Dec 27, 2012 at 03:32:47AM +0000, Zhang, Yang Z wrote:
>> Marcelo Tosatti wrote on 2012-12-21:
>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>> 
>>>> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
>>>>  				vmx_secondary_exec_control(vmx));
>>>>  	}
>>>> +	if (enable_apicv_reg_vid) {
>>>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>>>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>>>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>>>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
>>>> +
>>>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
>>>> +	}
>>> 
>>> AFAICS SVI should be regenerated on migration. Consider:
>>> 
>>> 1. vintr delivery, sets SVI = vector = RVI.
>>> 2. clears RVI.
>>> 3. migration.
>>> 4. RVI properly set from VIRR on entry.
>>> 5. SVI = 0.
>>> 6. EOI -> EOI virtualization with SVI = 0.
>>> 
>>> Could hook into kvm_apic_post_state_restore() to do that (set highest
>>> index of bit set in VISR).
>> Ok. How about to make a request(KVM_REQ_UPDATE_SVI) and handle it in
>> vmentry to set highest index of bit in VISR to RVI.
>> 
> Just do it in kvm_apic_post_state_restore() directly, no need to do
> a request.
What you mean "do it directly". Since we are not in target vcpu's context, we cannot access vmcs at this point. We still need a request or variable to indicate the migration happened.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  6:34         ` Zhang, Yang Z
@ 2012-12-27  6:38           ` Gleb Natapov
  2012-12-27  6:50             ` Zhang, Yang Z
  0 siblings, 1 reply; 43+ messages in thread
From: Gleb Natapov @ 2012-12-27  6:38 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

On Thu, Dec 27, 2012 at 06:34:37AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-27:
> > On Thu, Dec 27, 2012 at 03:32:47AM +0000, Zhang, Yang Z wrote:
> >> Marcelo Tosatti wrote on 2012-12-21:
> >>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> >>>> From: Yang Zhang <yang.z.zhang@Intel.com>
> >>>> 
> >>>> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> > *vmx)
> >>>>  				vmx_secondary_exec_control(vmx));
> >>>>  	}
> >>>> +	if (enable_apicv_reg_vid) {
> >>>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> >>>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> >>>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> >>>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> >>>> +
> >>>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>> +	}
> >>> 
> >>> AFAICS SVI should be regenerated on migration. Consider:
> >>> 
> >>> 1. vintr delivery, sets SVI = vector = RVI.
> >>> 2. clears RVI.
> >>> 3. migration.
> >>> 4. RVI properly set from VIRR on entry.
> >>> 5. SVI = 0.
> >>> 6. EOI -> EOI virtualization with SVI = 0.
> >>> 
> >>> Could hook into kvm_apic_post_state_restore() to do that (set highest
> >>> index of bit set in VISR).
> >> Ok. How about to make a request(KVM_REQ_UPDATE_SVI) and handle it in
> >> vmentry to set highest index of bit in VISR to RVI.
> >> 
> > Just do it in kvm_apic_post_state_restore() directly, no need to do
> > a request.
> What you mean "do it directly". Since we are not in target vcpu's context, we cannot access vmcs at this point. We still need a request or variable to indicate the migration happened.
> 
We are in a target vcpu context.

--
			Gleb.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  6:38           ` Gleb Natapov
@ 2012-12-27  6:50             ` Zhang, Yang Z
  0 siblings, 0 replies; 43+ messages in thread
From: Zhang, Yang Z @ 2012-12-27  6:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Marcelo Tosatti, kvm, Shan, Haitao, Tian, Kevin

Gleb Natapov wrote on 2012-12-27:
> On Thu, Dec 27, 2012 at 06:34:37AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-12-27:
>>> On Thu, Dec 27, 2012 at 03:32:47AM +0000, Zhang, Yang Z wrote:
>>>> Marcelo Tosatti wrote on 2012-12-21:
>>>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
>>>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>>>> 
>>>>>> @@ -3925,6 +3942,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>> *vmx)
>>>>>>  				vmx_secondary_exec_control(vmx));
>>>>>>  	}
>>>>>> +	if (enable_apicv_reg_vid) {
>>>>>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>>>>>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>>>>>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>>>>>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
>>>>>> +
>>>>>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
>>>>>> +	}
>>>>> 
>>>>> AFAICS SVI should be regenerated on migration. Consider:
>>>>> 
>>>>> 1. vintr delivery, sets SVI = vector = RVI.
>>>>> 2. clears RVI.
>>>>> 3. migration.
>>>>> 4. RVI properly set from VIRR on entry.
>>>>> 5. SVI = 0.
>>>>> 6. EOI -> EOI virtualization with SVI = 0.
>>>>> 
>>>>> Could hook into kvm_apic_post_state_restore() to do that (set highest
>>>>> index of bit set in VISR).
>>>> Ok. How about to make a request(KVM_REQ_UPDATE_SVI) and handle it in
>>>> vmentry to set highest index of bit in VISR to RVI.
>>>> 
>>> Just do it in kvm_apic_post_state_restore() directly, no need to do
>>> a request.
>> What you mean "do it directly". Since we are not in target vcpu's context, we
> cannot access vmcs at this point. We still need a request or variable to indicate
> the migration happened.
>> 
> We are in a target vcpu context.
Ok, I misread the code.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support
  2012-12-27  6:25               ` Zhang, Yang Z
@ 2012-12-31 15:02                 ` Marcelo Tosatti
  0 siblings, 0 replies; 43+ messages in thread
From: Marcelo Tosatti @ 2012-12-31 15:02 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Gleb Natapov, kvm, Shan, Haitao, Tian, Kevin

On Thu, Dec 27, 2012 at 06:25:25AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-12-27:
> > On Thu, Dec 27, 2012 at 02:24:04AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-12-21:
> >>> On Fri, Dec 21, 2012 at 09:39:20AM -0200, Marcelo Tosatti wrote:
> >>>> On Fri, Dec 21, 2012 at 09:51:40AM +0200, Gleb Natapov wrote:
> >>>>> On Thu, Dec 20, 2012 at 08:59:11PM -0200, Marcelo Tosatti wrote:
> >>>>>> On Mon, Dec 17, 2012 at 01:30:49PM +0800, Yang Zhang wrote:
> >>>>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
> >>>>>>> 
> >>>>>>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> >>>>>>> manually, which is fully taken care of by the hardware. This needs
> >>>>>>> some special awareness into existing interrupr injection path:
> >>>>>>> 
> >>>>>>> - for pending interrupt, instead of direct injection, we may need
> >>>>>>>   update architecture specific indicators before resuming to
> >>>>>>>   guest. - A pending interrupt, which is masked by ISR, should be
> >>>>>>>   also considered in above update action, since hardware will
> >>>>>>>   decide when to inject it at right time. Current has_interrupt
> >>>>>>>   and get_interrupt only returns a valid vector from injection
> >>>>>>>   p.o.v.
> >>>>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >>>>>>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >>>>>>> ---
> >>>>>>>  arch/ia64/kvm/lapic.h           |    6 ++
> >>>>>>>  arch/x86/include/asm/kvm_host.h |    6 ++
> >>>>>>>  arch/x86/include/asm/vmx.h      |   11 +++ arch/x86/kvm/irq.c
> >>>>>>>         |   56 +++++++++++++- arch/x86/kvm/lapic.c            |
> > 65
> >>>>>>>  ++++++++++------- arch/x86/kvm/lapic.h            |   28 ++++++-
> >>>>>>>  arch/x86/kvm/svm.c              |   24 ++++++
> > arch/x86/kvm/vmx.c
> >>>>>>>            |  154 ++++++++++++++++++++++++++++++++++++++-
> >>>>>>>  arch/x86/kvm/x86.c              |   11 ++-
> > include/linux/kvm_host.h
> >>>>>>>         |    2 + virt/kvm/ioapic.c               |   36 +++++++++
> >>>>>>>  virt/kvm/ioapic.h               |    1 + virt/kvm/irq_comm.c
> >>>>>>>       |   20 +++++ 13 files changed, 379 insertions(+), 41
> >>>>>>>  deletions(-)
> >>>>>>> diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
> >>>>>>> index c5f92a9..cb59eb4 100644
> >>>>>>> --- a/arch/ia64/kvm/lapic.h
> >>>>>>> +++ b/arch/ia64/kvm/lapic.h
> >>>>>>> @@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu,
> > struct
> >>> kvm_lapic_irq *irq);
> >>>>>>>  #define kvm_apic_present(x) (true)
> >>>>>>>  #define kvm_lapic_enabled(x) (true)
> >>>>>>> +static inline void kvm_update_eoi_exitmap(struct kvm *kvm,
> >>>>>>> +					struct kvm_lapic_irq *irq)
> >>>>>>> +{
> >>>>>>> +	/* IA64 has no apicv supporting, do nothing here */
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  #endif
> >>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
> >>>>>>> b/arch/x86/include/asm/kvm_host.h index c431b33..b63a144 100644 ---
> >>>>>>> a/arch/x86/include/asm/kvm_host.h +++
> >>>>>>> b/arch/x86/include/asm/kvm_host.h @@ -697,6 +697,11 @@ struct
> >>>>>>> kvm_x86_ops {
> >>>>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >>>>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>>>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int
> > irr);
> >>>>>>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> >>>>>>> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> >>>>>>> +	void (*update_eoi_exitmap)(struct kvm *kvm, struct kvm_lapic_irq
> >>>>>>> *irq); +	void (*reset_eoi_exitmap)(struct kvm_vcpu *vcpu); +	void
> >>>>>>> (*load_eoi_exitmap)(struct kvm_vcpu *vcpu);
> >>>>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>>>>>>  	int (*get_tdp_level)(void);
> >>>>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool
> >>> is_mmio);
> >>>>>> 
> >>>>>> EOI exit bitmap is problematic (its racy). Please do this:
> >>>>>> 
> >>>>>> 1. Make a synchronous (1) KVM_REQ_EOIBITMAP request on IOAPIC
> >>>>>> register modifications which require EOI exit bitmap updates. 2. On
> >>>>>> VM-entry, during KVM_REQ_EOIBITMAP processing, each checks IOAPIC
> >>>>>> map and adjusts its own EOI exit bitmap VMCS registers.
> >>>>>> 
> >>>>>> 1) that waits until remote executing VCPUs have acknowledge the
> >>>>>> request, using make_all_cpus_request (see virt/kvm/kvm_main.c),
> >>>>>> similarly to remote TLB flushes.
> >>>>>> 
> >>>>>> What is the problem now: there is no control over _when_ a VCPU
> >>>>>> updates its EOI exit bitmap VMCS register from the (remotely updated)
> >>>>>> master EOI exit bitmap. The VCPU can be processing a
> >>>>>> KVM_REQ_EOIBITMAP relative to a precedence IOAPIC register write
> >>>>>> while the current IOAPIC register write is updating the EOI exit
> >>>>>> bitmap. There is no way to fix that without locking (which can be
> >>>>>> avoided if the IOAPIC->EOI exit bitmap synchronization is vcpu local).
> >>>>>> 
> >>>>> The race is benign. We have similar one for interrupt injection and
> >>>>> the same race exists on a real HW. The two cases that can happen due
> >>>>> to the race are:
> >>>>> 
> >>>>> 1. exitbitmap bit X is changed from 1 to 0
> >>>>>   No problem. It is harmless to do an exit, on the next entry
> >>>>>   exitbitmap will be fixed. 2. exitbitmap bit X is changed from 0 to 1
> >>>>>   If vcpu serves X at the time this happens it was delivered as edge,
> >>>>>   so no need to exit. The exitbitmap will be updated after the next
> >>>>>   vmexit which will happen due to KVM_REQ_EOIBITMAP processing.
> >>>> 
> >>>> 1. Missed the case where bitmap is being reset (where EOI exit bitmaps
> >>>> are zeroed). In this case vcpu enters guest with incorrect EOI exit
> >>>> bitmap.
> >>>> 
> >>> Right, the bitmap reset us problematic indeed since it does not
> >>> represent real vector configuration change.
> >>> 
> >>>> 2. Missed the case where current code allows vcpu to enter guest
> >>>> with EOI exit bitmap unsynchronized relative to IOAPIC registers
> >>>> (see one KVM_REQ made at a time, no IPI sent). In that case interrupt
> >>>> can be delivered.
> >>>> 
> >>> Ugh, I was sure there is a kick there. Missing kick is just a bug of
> >>> course.
> >> Do you mean add a kick when making KVM_REQ_EOIBITMAP request?
> >> 
> > Of course, otherwise vcpu is running with stale bitmap.
> Right.
>  
> >>>> Thus the suggestions to update bitmap locally, on entry. Do you
> >>>> see any disadvantage?
> >>>> 
> >>> Only one. The recalculation logic is such that given a vector it
> >>> calculates set of vcpus, so each vcpu will do this calculation for each
> >>> vector and see if it is in the set instead of recalculating once. But
> >>> this should be rare enough for us to not care.
> >>> 
> >>>> Other than that, there is a window between IOAPIC map update and
> >>>> EOI bitmap request, where an interrupt can be delivered without
> >>>> EOI bitmap being updated (which i think local updates don't cover,
> >>>> either).
> >>> Interrupt cannot be delivered through IOAPIC while bitmap is updated
> >>> since IOAPIC has a lock.
> >>> 
> >>>> 
> >>>>> But software really should take care of not changing interrupt vector
> >>>>> configuration while there is an interrupt in flight with the same vector.
> >>>> 
> >>>> None of these are guest faults. As soon as interrupts are allowed they
> >>>> must be handled properly (including synchronized EOI bitmap etc).
> >>>> 
> >>> All my cases are guest faults and the guest will get in trouble on real
> >>> HW too with such behaviour. The case where we clear bitmap before
> >>> recalculation is not a guest fault though and have to be dealt with
> >>> somehow either with locks or your suggestion.
> >> How about to set all bits in eoi bitmap before recalculation. As you
> >> said it's harmless to do an vmexit. And eoibitmap will be updated on
> >> next vmentry.
> >> 
> > If you set all bits before recalculation they will remain all set after
> > recalculation too. I do not get what you propose here.
> Yes. I forget it.
> 
> Best regards,
> Yang

Please just perform the recalculation on KVM_REQ_EOIBITMAP. As noted, IOAPIC updates 
are not performance critical. This will save countless number of code
audits later.


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2013-01-01 23:42 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-17  5:30 [PATCH v7 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
2012-12-17  5:30 ` [PATCH v7 1/3] x86, apicv: add APICv register " Yang Zhang
2012-12-17  5:30 ` [PATCH v7 2/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
2012-12-20  0:59   ` Marcelo Tosatti
2012-12-20  1:01     ` Marcelo Tosatti
2012-12-20  6:42     ` Gleb Natapov
2012-12-20 12:53       ` Marcelo Tosatti
2012-12-20 13:12         ` Gleb Natapov
2012-12-20 23:07           ` Marcelo Tosatti
2012-12-25  7:49             ` Zhang, Yang Z
2012-12-20  1:26   ` Marcelo Tosatti
2012-12-20  6:51     ` Gleb Natapov
2012-12-20 13:01       ` Marcelo Tosatti
2012-12-20 13:02         ` Marcelo Tosatti
2012-12-20 22:00   ` Marcelo Tosatti
2012-12-27  3:32     ` Zhang, Yang Z
2012-12-27  6:20       ` Gleb Natapov
2012-12-27  6:34         ` Zhang, Yang Z
2012-12-27  6:38           ` Gleb Natapov
2012-12-27  6:50             ` Zhang, Yang Z
2012-12-20 22:59   ` Marcelo Tosatti
2012-12-21  7:51     ` Gleb Natapov
2012-12-21 11:39       ` Marcelo Tosatti
2012-12-21 12:08         ` Gleb Natapov
2012-12-27  2:24           ` Zhang, Yang Z
2012-12-27  6:23             ` Gleb Natapov
2012-12-27  6:25               ` Zhang, Yang Z
2012-12-31 15:02                 ` Marcelo Tosatti
2012-12-17  5:30 ` [PATCH v7 3/3] x86, apicv: add virtual x2apic support Yang Zhang
2012-12-20  8:31   ` Gleb Natapov
2012-12-24  1:41     ` Zhang, Yang Z
2012-12-24  2:35     ` Zhang, Yang Z
2012-12-24  9:23       ` Gleb Natapov
2012-12-24 23:53         ` Zhang, Yang Z
2012-12-25  6:38           ` Gleb Natapov
2012-12-25  6:42             ` Zhang, Yang Z
2012-12-25  6:50               ` Gleb Natapov
2012-12-25  7:25                 ` Zhang, Yang Z
2012-12-25  7:31                   ` Gleb Natapov
2012-12-25  7:46                     ` Zhang, Yang Z
2012-12-25  7:52                       ` Gleb Natapov
2012-12-25  8:24                         ` Zhang, Yang Z
2012-12-25 11:58                           ` Gleb Natapov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.