linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/6] IPI virtualization support for VM
@ 2021-08-09  3:29 Zeng Guang
  2021-08-09  3:29 ` [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Current IPI process in guest VM will virtualize the writing to interrupt
command register(ICR) of the local APIC which will cause VM-exit anyway
on source vCPU. Frequent VM-exit could induce much overhead accumulated
if running IPI intensive task.

IPI virtualization as a new VT-x feature targets to eliminate VM-exits
when issuing IPI on source vCPU. It introduces a new VM-execution
control - "IPI virtualization"(bit4) in the tertiary processor-based
VM-execution controls and a new data structure - "PID-pointer table
address" and "Last PID-pointer index" referenced by the VMCS. When "IPI
virtualization" is enabled, processor emulates following kind of writes
to APIC registers that would send IPIs, moreover without causing VM-exits.
- Memory-mapped ICR writes
- MSR-mapped ICR writes
- SENDUIPI execution

This patch series implements IPI virtualization support in KVM.

Patches 1-4 add tertiary processor-based VM-execution support
framework.

Patch 5 implements interrupt dispatch support in x2APIC mode with
APIC-write VM exit. In previous platform, no CPU would produce
APIC-write VM exit with exit qualification 300H when the "virtual x2APIC
mode" VM-execution control was 1.

Patch 6 implement IPI virtualization related function including
feature enabling through tertiary processor-based VM-execution in
various scenarios of VMCS configuration, PID table setup in vCPU creation
and vCPU block consideration.

Document for IPI virtualization is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".

Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

We did experiment to measure average time sending IPI from source vCPU
to the target vCPU completing the IPI handling by kvm unittest w/ and
w/o IPI virtualization. When IPI virtualization enabled, it will reduce
22.21% and 15.98% cycles consuming in xAPIC mode and x2APIC mode
respectively.

KVM unittest:vmexit/ipi, 2 vCPU, AP was modified to run in idle loop
instead of halt to ensure no VM exit impact on target vCPU.

                Cycles of IPI
                xAPIC mode              x2APIC mode
        test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
        1       6106      4816          4265      3768
        2       6244      4656          4404      3546
        3       6165      4658          4233      3474
        4       5992      4710          4363      3430
        5       6083      4741          4215      3551
        6       6238      4904          4304      3547
        7       6164      4617          4263      3709
        8       5984      4763          4518      3779
        9       5931      4712          4645      3667
        10      5955      4530          4332      3724
        11      5897      4673          4283      3569
        12      6140      4794          4178      3598
        13      6183      4728          4363      3628
        14      5991      4994          4509      3842
        15      5866      4665          4520      3739
        16      6032      4654          4229      3701
        17      6050      4653          4185      3726
        18      6004      4792          4319      3746
        19      5961      4626          4196      3392
        20      6194      4576          4433      3760

Average cycles  6059      4713.1        4337.85   3644.8
%Reduction                -22.21%                 -15.98%

--------------------------------------
IPI microbenchmark:
(https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)

2 vCPUs, 1:1 pin vCPU to pCPU, guest VM runs with idle=poll, x2APIC mode

Result with IPIv enabled:

Dry-run:                         0,             272798 ns
Self-IPI:                  5094123,           11114037 ns
Normal IPI:              131697087,          173321200 ns
Broadcast IPI:                   0,          155649075 ns
Broadcast lock:                  0,          161518031 ns

Result with IPIv disabled:

Dry-run:                         0,             272766 ns
Self-IPI:                  5091788,           11123699 ns
Normal IPI:              145215772,          174558920 ns
Broadcast IPI:                   0,          175785384 ns
Broadcast lock:                  0,          149076195 ns


As IPIv can benefit unicast IPI to other CPU, Normal IPI test case gain
about 9.73% time saving on average out of 15 test runs when IPIv is
enabled.

Normal IPI statistics (unit:ns):
	test	w/o IPIv	w/ IPIv
	1	153346049	140907046
	2	147218648	141660618
	3	145215772	117890672
	4	146621682	136430470
	5	144821472	136199421
	6	144704378	131676928
	7	141403224	131697087
	8	144775766	125476250
	9	140658192	137263330
	10      144768626	138593127
	11	145166679	131946752
	12	145020451	116852889
	13	148161353	131406280
	14	148378655	130174353
	15	148903652	127969674 

Average time	145944306.6	131742993.1 ns
%Reduction			-9.73%

--------------------------------------
hackbench:

8 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 100000

                w/o IPIv        w/ IPIv
Time            91.887          74.605
%Reduction                      -18.808%

96 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 1000000

                w/o IPIv        w/ IPIv
Time            287.504         235.185
%Reduction                      -18.198%

--------------------------------------
v3 -> v4:
1. Refine code style of patch 2
2. Move tertiary control shadow build into patch 3
3. Make vmx_tertiary_exec_control to be static function

v2 -> v3:
1. Misc change on tertiary execution control
   definition and capability setup
2. Alternative to get tertiary execution
   control configuration

v1 -> v2:
1. Refine the IPIv enabling logic for VM.
   Remove ipiv_active definition per vCPU.

Gao Chao (1):
  KVM: VMX: enable IPI virtualization

Robert Hoo (4):
  x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control
  KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
    variation
  KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well

Zeng Guang (1):
  KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM
    exit

 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/include/asm/vmx.h         |  11 +++
 arch/x86/include/asm/vmxfeatures.h |   5 +-
 arch/x86/kernel/cpu/feat_ctl.c     |  11 ++-
 arch/x86/kvm/lapic.c               |   9 ++-
 arch/x86/kvm/vmx/capabilities.h    |  14 ++++
 arch/x86/kvm/vmx/evmcs.c           |   2 +
 arch/x86/kvm/vmx/evmcs.h           |   1 +
 arch/x86/kvm/vmx/posted_intr.c     |  22 ++++--
 arch/x86/kvm/vmx/vmcs.h            |   1 +
 arch/x86/kvm/vmx/vmx.c             | 114 +++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h             |  55 ++++++++------
 12 files changed, 208 insertions(+), 38 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-09-10 21:25   ` Sean Christopherson
  2021-08-09  3:29 ` [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

New VMX capability MSR IA32_VMX_PROCBASED_CTLS3 conresponse to this new
VM-Execution control field. And it is 64bit allow-1 semantics, not like
previous capability MSRs 32bit allow-0 and 32bit allow-1. So with Tertiary
VM-Execution control field introduced, 2 vmx_feature leaves are introduced,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/include/asm/vmxfeatures.h |  3 ++-
 arch/x86/kernel/cpu/feat_ctl.c     | 11 ++++++++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..3df26e27b554 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -919,6 +919,7 @@
 #define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
+#define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
 
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_VMCS_SIZE_SHIFT	32
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..b264f5c43b5f 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -5,7 +5,7 @@
 /*
  * Defines VMX CPU feature bits
  */
-#define NVMXINTS			3 /* N 32-bit words worth of info */
+#define NVMXINTS			5 /* N 32-bit words worth of info */
 
 /*
  * Note: If the comment begins with a quoted string, that string is used
@@ -43,6 +43,7 @@
 #define VMX_FEATURE_RDTSC_EXITING	( 1*32+ 12) /* "" VM-Exit on RDTSC */
 #define VMX_FEATURE_CR3_LOAD_EXITING	( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
 #define VMX_FEATURE_CR3_STORE_EXITING	( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
+#define VMX_FEATURE_TERTIARY_CONTROLS	(1*32 + 17) /* "" Enable Tertiary VM-Execution Controls */
 #define VMX_FEATURE_CR8_LOAD_EXITING	( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
 #define VMX_FEATURE_CR8_STORE_EXITING	( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
 #define VMX_FEATURE_VIRTUAL_TPR		( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index da696eb4821a..4aab4def5000 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -15,6 +15,8 @@ enum vmx_feature_leafs {
 	MISC_FEATURES = 0,
 	PRIMARY_CTLS,
 	SECONDARY_CTLS,
+	TERTIARY_CTLS_LOW,
+	TERTIARY_CTLS_HIGH,
 	NR_VMX_FEATURE_WORDS,
 };
 
@@ -22,7 +24,7 @@ enum vmx_feature_leafs {
 
 static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 {
-	u32 supported, funcs, ept, vpid, ign;
+	u32 supported, funcs, ept, vpid, ign, low, high;
 
 	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
 
@@ -42,6 +44,13 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
 	c->vmx_capability[SECONDARY_CTLS] = supported;
 
+	/*
+	 * For tertiary execution controls MSR, it's actually a 64bit allowed-1.
+	 */
+	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
 	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
 	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
  2021-08-09  3:29 ` [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-09-10 21:28   ` Sean Christopherson
  2021-08-09  3:29 ` [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

The Tertiary VM-Exec Control, different from previous control fields, is 64
bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
support both 32 bit and 64 bit fields' auxiliary functions building.
Also, define the auxiliary functions for Tertiary control field here, using
the new BUILD_CONTROLS_SHADOW().

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.h | 51 ++++++++++++++++++++++--------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 3979a947933a..558f61208a6f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -413,31 +413,34 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }
 
-#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
-{									    \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
-		vmcs_write32(uname, val);				    \
-		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
-	}								    \
-}									    \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
-{									    \
-	return vmx->loaded_vmcs->controls_shadow.lname;			    \
-}									    \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
-}									    \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
+static inline								\
+void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
+		vmcs_write##bits(uname, val);				\
+		vmx->loaded_vmcs->controls_shadow.lname = val;		\
+	}								\
+}									\
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
+{									\
+	return vmx->loaded_vmcs->controls_shadow.lname;			\
+}									\
+static inline								\
+void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
+}									\
+static inline								\
+void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
 }
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
 
 static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
  2021-08-09  3:29 ` [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
  2021-08-09  3:29 ` [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-09-10 21:35   ` Sean Christopherson
  2021-08-09  3:29 ` [PATCH v4 4/6] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Check VMX feature on tertiary execution control in VMCS config setup.
Currently it's not supported for hyper-v and disabled for now.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h      |  3 +++
 arch/x86/kvm/vmx/capabilities.h |  7 ++++++
 arch/x86/kvm/vmx/evmcs.c        |  2 ++
 arch/x86/kvm/vmx/evmcs.h        |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 40 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h          |  1 +
 7 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..8c929596a299 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -31,6 +31,7 @@
 #define CPU_BASED_RDTSC_EXITING                 VMCS_CONTROL_BIT(RDTSC_EXITING)
 #define CPU_BASED_CR3_LOAD_EXITING		VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
 #define CPU_BASED_CR3_STORE_EXITING		VMCS_CONTROL_BIT(CR3_STORE_EXITING)
+#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS	VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
 #define CPU_BASED_CR8_LOAD_EXITING              VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
 #define CPU_BASED_CR8_STORE_EXITING             VMCS_CONTROL_BIT(CR8_STORE_EXITING)
 #define CPU_BASED_TPR_SHADOW                    VMCS_CONTROL_BIT(VIRTUAL_TPR)
@@ -221,6 +222,8 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
+	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..38d414f64e61 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,6 +59,7 @@ struct vmcs_config {
 	u32 pin_based_exec_ctrl;
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
+	u64 cpu_based_3rd_exec_ctrl;
 	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
 	struct nested_vmx_msrs nested;
@@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 }
 
+static inline bool cpu_has_tertiary_exec_ctrls(void)
+{
+	return vmcs_config.cpu_based_exec_ctrl &
+		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
index 896b2a50b4aa..03c15e1e5807 100644
--- a/arch/x86/kvm/vmx/evmcs.c
+++ b/arch/x86/kvm/vmx/evmcs.c
@@ -299,8 +299,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
 
 __init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
 {
+	vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
 	vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
 	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
 
 	vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
 	vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 2ec9b46f0d0c..8a20295f4f0f 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
  */
 #define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
 				    PIN_BASED_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
 #define EVMCS1_UNSUPPORTED_2NDEXEC					\
 	(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |				\
 	 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |			\
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 4b9957e2bf5b..83e2065a955d 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -48,6 +48,7 @@ struct vmcs_controls_shadow {
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
+	u64 tertiary_exec;
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 927a552393b9..ee8c5664dc95 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2391,6 +2391,23 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
 	return 0;
 }
 
+static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
+					 u32 msr, u64 *result)
+{
+	u64 vmx_msr;
+	u64 ctl = ctl_min | ctl_opt;
+
+	rdmsrl(msr, vmx_msr);
+	ctl &= vmx_msr; /* bit == 1 means it can be set */
+
+	/* Ensure minimum (required) set of control bits are supported. */
+	if (ctl_min & ~ctl)
+		return -EIO;
+
+	*result = ctl;
+	return 0;
+}
+
 static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				    struct vmx_capability *vmx_cap)
 {
@@ -2399,6 +2416,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _pin_based_exec_control = 0;
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
+	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
@@ -2420,7 +2438,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
-	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
+	      CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
 				&_cpu_based_exec_control) < 0)
 		return -EIO;
@@ -2494,6 +2513,16 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				"1-setting enable VPID VM-execution control\n");
 	}
 
+	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
+		u64 opt3 = 0;
+		u64 min3 = 0;
+
+		if (adjust_vmx_controls_64(min3, opt3,
+					   MSR_IA32_VMX_PROCBASED_CTLS3,
+					   &_cpu_based_3rd_exec_control))
+			return -EIO;
+	}
+
 	min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -2581,6 +2610,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
 	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
 
@@ -4148,6 +4178,11 @@ u32 vmx_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl;
+}
+
 /*
  * Adjust a single secondary execution control bit to intercept/allow an
  * instruction in the guest.  This is usually done based on whether or not a
@@ -4319,6 +4354,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		secondary_exec_controls_set(vmx, vmx->secondary_exec_control);
 	}
 
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
+
 	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 558f61208a6f..f19ef1e14d08 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -441,6 +441,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
 
 static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 4/6] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
                   ` (2 preceding siblings ...)
  2021-08-09  3:29 ` [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-08-09  3:29 ` [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit Zeng Guang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Add tertiary_exec_control field report in dump_vmcs()

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ee8c5664dc95..9eb351c351ce 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5745,6 +5745,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
+	u64 tertiary_exec_control = 0;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -5762,6 +5763,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
 
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
+
 	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
 	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
 	pr_err("*** Guest State ***\n");
@@ -5860,8 +5864,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
 
 	pr_err("*** Control State ***\n");
-	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
-	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
+	pr_err("PinBased=0x%08x CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
+	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control,
+	       tertiary_exec_control);
 	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
                   ` (3 preceding siblings ...)
  2021-08-09  3:29 ` [PATCH v4 4/6] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-09-10 22:28   ` Sean Christopherson
  2021-08-09  3:29 ` [PATCH v4 6/6] KVM: VMX: enable IPI virtualization Zeng Guang
  2021-08-20 13:19 ` [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
  6 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Since IA x86 platform introduce features of IPI virtualization and
User Interrupts, new behavior applies to the execution of WRMSR ICR
register that causes APIC-write VM exit instead of MSR-write VM exit
in x2APIC mode.

This requires KVM to emulate writing 64-bit value to offset 300H on
the virtual-APIC page(VICR) for guest running in x2APIC mode when
APIC-wrtie VM exit occurs. Prevoisely KVM doesn't consider this
situation as CPU never produce APIC-write VM exit in x2APIC mode before.

Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ba5a27879f1d..0b0f0ce96679 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2188,7 +2188,14 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 	/* hw has done the conditional check and inst decode */
 	offset &= 0xff0;
 
-	kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
+	if (apic_x2apic_mode(vcpu->arch.apic) && (offset == APIC_ICR)) {
+		u64 icr_val = *((u64 *)(vcpu->arch.apic->regs + offset));
+
+		kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR2, (u32)(icr_val>>32));
+		val = (u32)icr_val;
+	} else {
+		kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
+	}
 
 	/* TODO: optimize to just emulate side effect w/o one more write */
 	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
                   ` (4 preceding siblings ...)
  2021-08-09  3:29 ` [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit Zeng Guang
@ 2021-08-09  3:29 ` Zeng Guang
  2021-09-10 23:43   ` Sean Christopherson
  2021-08-20 13:19 ` [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
  6 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-08-09  3:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

From: Gao Chao <chao.gao@intel.com>

With IPI virtualization enabled, the processor emulates writes
to APIC registers that would send IPIs. The processor sets the
bit corresponding to the vector in target vCPU's PIR and may send
a notification (IPI) specified by NDST and NV fields in target vCPU's
PID. It is similar to what IOMMU engine does when dealing with posted
interrupt from devices.

A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID.

Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.

Note that IPI virtualization can only virualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
VM exit and need to be handled by VMM.

Signed-off-by: Gao Chao <chao.gao@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h         |  8 ++++
 arch/x86/include/asm/vmxfeatures.h |  2 +
 arch/x86/kvm/vmx/capabilities.h    |  7 +++
 arch/x86/kvm/vmx/posted_intr.c     | 22 +++++++---
 arch/x86/kvm/vmx/vmx.c             | 69 ++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h             |  3 ++
 6 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8c929596a299..b79b6438acaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -76,6 +76,11 @@
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
 
+/*
+ * Definitions of Tertiary Processor-Based VM-Execution Controls.
+ */
+#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
+
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
 #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
@@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
 	POSTED_INTR_NV                  = 0x00000002,
+	LAST_PID_POINTER_INDEX		= 0x00000008,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -224,6 +230,8 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	PID_POINTER_TABLE		= 0x00002042,
+	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index b264f5c43b5f..e7b368a68c7c 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -86,4 +86,6 @@
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
 
+/* Tertiary Processor-Based VM-Execution Controls, word 3 */
+#define VMX_FEATURE_IPI_VIRT		( 3*32 + 4) /* "" Enable IPI virtualization */
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 38d414f64e61..78b0525dd991 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -12,6 +12,7 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
+extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
@@ -283,6 +284,12 @@ static inline bool cpu_has_vmx_apicv(void)
 		cpu_has_vmx_posted_intr();
 }
 
+static inline bool cpu_has_vmx_ipiv(void)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl &
+		TERTIARY_EXEC_IPI_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..8c1400aaa1e7 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -81,9 +81,12 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
-		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
-		!kvm_vcpu_apicv_active(vcpu))
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return;
+
+	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
+		!irq_remapping_cap(IRQ_POSTING_CAP)) &&
+		!enable_ipiv)
 		return;
 
 	/* Set SN when the vCPU is preempted */
@@ -141,9 +144,16 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
 	struct pi_desc old, new;
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
-		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
-		!kvm_vcpu_apicv_active(vcpu))
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return 0;
+
+	/* Put vCPU into a list and set NV to wakeup vector if it is
+	 * one of the following cases:
+	 * 1. any assigned device is in use.
+	 * 2. IPI virtualization is enabled.
+	 */
+	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
+		!irq_remapping_cap(IRQ_POSTING_CAP)) && !enable_ipiv)
 		return 0;
 
 	WARN_ON(irqs_disabled());
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9eb351c351ce..684c556395bf 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
 
 module_param(enable_apicv, bool, S_IRUGO);
 
+bool __read_mostly enable_ipiv = 1;
+module_param(enable_ipiv, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -225,6 +228,7 @@ static const struct {
 };
 
 #define L1D_CACHE_ORDER 4
+#define PID_TABLE_ORDER get_order(KVM_MAX_VCPU_ID << 3)
 static void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
@@ -2514,7 +2518,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	}
 
 	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
-		u64 opt3 = 0;
+		u64 opt3 = enable_ipiv ? TERTIARY_EXEC_IPI_VIRT : 0;
 		u64 min3 = 0;
 
 		if (adjust_vmx_controls_64(min3, opt3,
@@ -3870,6 +3874,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
 		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+		vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
+				MSR_TYPE_RW, !enable_ipiv);
 	}
 }
 
@@ -4138,14 +4144,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
 	if (cpu_has_secondary_exec_ctrls()) {
-		if (kvm_vcpu_apicv_active(vcpu))
+		if (kvm_vcpu_apicv_active(vcpu)) {
 			secondary_exec_controls_setbit(vmx,
 				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
 				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		else
+			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
+				tertiary_exec_controls_setbit(vmx,
+					TERTIARY_EXEC_IPI_VIRT);
+		} else {
 			secondary_exec_controls_clearbit(vmx,
 					SECONDARY_EXEC_APIC_REGISTER_VIRT |
 					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+			if (cpu_has_tertiary_exec_ctrls())
+				tertiary_exec_controls_clearbit(vmx,
+					TERTIARY_EXEC_IPI_VIRT);
+		}
 	}
 
 	if (cpu_has_vmx_msr_bitmap())
@@ -4180,7 +4193,13 @@ u32 vmx_exec_control(struct vcpu_vmx *vmx)
 
 static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
 {
-	return vmcs_config.cpu_based_3rd_exec_ctrl;
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
+	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
+
+	if (!kvm_vcpu_apicv_active(vcpu))
+		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
+
+	return exec_control;
 }
 
 /*
@@ -4330,6 +4349,17 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
 
 #define VMX_XSS_EXIT_BITMAP 0
 
+static void install_pid(struct vcpu_vmx *vmx)
+{
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
+
+	BUG_ON(vmx->vcpu.vcpu_id > kvm_vmx->pid_last_index);
+	/* Bit 0 is the valid bit */
+	kvm_vmx->pid_table[vmx->vcpu.vcpu_id] = __pa(&vmx->pi_desc) | 1;
+	vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+	vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+}
+
 /*
  * Noting that the initialization of Guest-state Area of VMCS is in
  * vmx_vcpu_reset().
@@ -4367,6 +4397,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		if (enable_ipiv)
+			install_pid(vmx);
 	}
 
 	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
@@ -6965,6 +6998,22 @@ static int vmx_vm_init(struct kvm *kvm)
 			break;
 		}
 	}
+
+	if (enable_ipiv) {
+		struct page *pages;
+
+		/* Allocate pages for PID table in order of PID_TABLE_ORDER
+		 * depending on KVM_MAX_VCPU_ID. Each PID entry is 8 bytes.
+		 */
+		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, PID_TABLE_ORDER);
+
+		if (!pages)
+			return -ENOMEM;
+
+		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
+		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;
+	}
+
 	return 0;
 }
 
@@ -7575,6 +7624,14 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+	if (kvm_vmx->pid_table)
+		free_pages((unsigned long)kvm_vmx->pid_table, PID_TABLE_ORDER);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,
 
@@ -7585,6 +7642,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
+	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
@@ -7824,6 +7882,9 @@ static __init int hardware_setup(void)
 		vmx_x86_ops.sync_pir_to_irr = NULL;
 	}
 
+	if (!enable_apicv || !cpu_has_vmx_ipiv())
+		enable_ipiv = 0;
+
 	if (cpu_has_vmx_tsc_scaling()) {
 		kvm_has_tsc_control = true;
 		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f19ef1e14d08..41262a4ff87a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -351,6 +351,9 @@ struct kvm_vmx {
 	unsigned int tss_addr;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	/* PID table for IPI virtualization */
+	u64 *pid_table;
+	u16 pid_last_index;
 };
 
 bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 0/6] IPI virtualization support for VM
  2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
                   ` (5 preceding siblings ...)
  2021-08-09  3:29 ` [PATCH v4 6/6] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2021-08-20 13:19 ` Zeng Guang
  6 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-08-20 13:19 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck, Tony,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai
  Cc: x86, linux-kernel, Hu, Robert, Gao, Chao

On 8/9/2021 11:29 AM, Zeng, Guang wrote:

Gentle ping.
@Paolo, @Sean, @All maintainers
Appreciated if any comment to improve this patch set.  Hope it could be 
accepted soon. :)

Thanks.

> Current IPI process in guest VM will virtualize the writing to interrupt
> command register(ICR) of the local APIC which will cause VM-exit anyway
> on source vCPU. Frequent VM-exit could induce much overhead accumulated
> if running IPI intensive task.
>
> IPI virtualization as a new VT-x feature targets to eliminate VM-exits
> when issuing IPI on source vCPU. It introduces a new VM-execution
> control - "IPI virtualization"(bit4) in the tertiary processor-based
> VM-execution controls and a new data structure - "PID-pointer table
> address" and "Last PID-pointer index" referenced by the VMCS. When "IPI
> virtualization" is enabled, processor emulates following kind of writes
> to APIC registers that would send IPIs, moreover without causing VM-exits.
> - Memory-mapped ICR writes
> - MSR-mapped ICR writes
> - SENDUIPI execution
>
> This patch series implements IPI virtualization support in KVM.
>
> Patches 1-4 add tertiary processor-based VM-execution support
> framework.
>
> Patch 5 implements interrupt dispatch support in x2APIC mode with
> APIC-write VM exit. In previous platform, no CPU would produce
> APIC-write VM exit with exit qualification 300H when the "virtual x2APIC
> mode" VM-execution control was 1.
>
> Patch 6 implement IPI virtualization related function including
> feature enabling through tertiary processor-based VM-execution in
> various scenarios of VMCS configuration, PID table setup in vCPU creation
> and vCPU block consideration.
>
> Document for IPI virtualization is now available at the latest "Intel
> Architecture Instruction Set Extensions Programming Reference".
>
> Document Link:
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> We did experiment to measure average time sending IPI from source vCPU
> to the target vCPU completing the IPI handling by kvm unittest w/ and
> w/o IPI virtualization. When IPI virtualization enabled, it will reduce
> 22.21% and 15.98% cycles consuming in xAPIC mode and x2APIC mode
> respectively.
>
> KVM unittest:vmexit/ipi, 2 vCPU, AP was modified to run in idle loop
> instead of halt to ensure no VM exit impact on target vCPU.
>
>                  Cycles of IPI
>                  xAPIC mode              x2APIC mode
>          test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
>          1       6106      4816          4265      3768
>          2       6244      4656          4404      3546
>          3       6165      4658          4233      3474
>          4       5992      4710          4363      3430
>          5       6083      4741          4215      3551
>          6       6238      4904          4304      3547
>          7       6164      4617          4263      3709
>          8       5984      4763          4518      3779
>          9       5931      4712          4645      3667
>          10      5955      4530          4332      3724
>          11      5897      4673          4283      3569
>          12      6140      4794          4178      3598
>          13      6183      4728          4363      3628
>          14      5991      4994          4509      3842
>          15      5866      4665          4520      3739
>          16      6032      4654          4229      3701
>          17      6050      4653          4185      3726
>          18      6004      4792          4319      3746
>          19      5961      4626          4196      3392
>          20      6194      4576          4433      3760
>
> Average cycles  6059      4713.1        4337.85   3644.8
> %Reduction                -22.21%                 -15.98%
>
> --------------------------------------
> IPI microbenchmark:
> (https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)
>
> 2 vCPUs, 1:1 pin vCPU to pCPU, guest VM runs with idle=poll, x2APIC mode
>
> Result with IPIv enabled:
>
> Dry-run:                         0,             272798 ns
> Self-IPI:                  5094123,           11114037 ns
> Normal IPI:              131697087,          173321200 ns
> Broadcast IPI:                   0,          155649075 ns
> Broadcast lock:                  0,          161518031 ns
>
> Result with IPIv disabled:
>
> Dry-run:                         0,             272766 ns
> Self-IPI:                  5091788,           11123699 ns
> Normal IPI:              145215772,          174558920 ns
> Broadcast IPI:                   0,          175785384 ns
> Broadcast lock:                  0,          149076195 ns
>
>
> As IPIv can benefit unicast IPI to other CPU, Normal IPI test case gain
> about 9.73% time saving on average out of 15 test runs when IPIv is
> enabled.
>
> Normal IPI statistics (unit:ns):
> 	test	w/o IPIv	w/ IPIv
> 	1	153346049	140907046
> 	2	147218648	141660618
> 	3	145215772	117890672
> 	4	146621682	136430470
> 	5	144821472	136199421
> 	6	144704378	131676928
> 	7	141403224	131697087
> 	8	144775766	125476250
> 	9	140658192	137263330
> 	10      144768626	138593127
> 	11	145166679	131946752
> 	12	145020451	116852889
> 	13	148161353	131406280
> 	14	148378655	130174353
> 	15	148903652	127969674
>
> Average time	145944306.6	131742993.1 ns
> %Reduction			-9.73%
>
> --------------------------------------
> hackbench:
>
> 8 vCPUs, guest VM free run, x2APIC mode
> ./hackbench -p -l 100000
>
>                  w/o IPIv        w/ IPIv
> Time            91.887          74.605
> %Reduction                      -18.808%
>
> 96 vCPUs, guest VM free run, x2APIC mode
> ./hackbench -p -l 1000000
>
>                  w/o IPIv        w/ IPIv
> Time            287.504         235.185
> %Reduction                      -18.198%
>
> --------------------------------------
> v3 -> v4:
> 1. Refine code style of patch 2
> 2. Move tertiary control shadow build into patch 3
> 3. Make vmx_tertiary_exec_control to be static function
>
> v2 -> v3:
> 1. Misc change on tertiary execution control
>     definition and capability setup
> 2. Alternative to get tertiary execution
>     control configuration
>
> v1 -> v2:
> 1. Refine the IPIv enabling logic for VM.
>     Remove ipiv_active definition per vCPU.
>
> Gao Chao (1):
>    KVM: VMX: enable IPI virtualization
>
> Robert Hoo (4):
>    x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control
>    KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
>      variation
>    KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
>    KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
>
> Zeng Guang (1):
>    KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM
>      exit
>
>   arch/x86/include/asm/msr-index.h   |   1 +
>   arch/x86/include/asm/vmx.h         |  11 +++
>   arch/x86/include/asm/vmxfeatures.h |   5 +-
>   arch/x86/kernel/cpu/feat_ctl.c     |  11 ++-
>   arch/x86/kvm/lapic.c               |   9 ++-
>   arch/x86/kvm/vmx/capabilities.h    |  14 ++++
>   arch/x86/kvm/vmx/evmcs.c           |   2 +
>   arch/x86/kvm/vmx/evmcs.h           |   1 +
>   arch/x86/kvm/vmx/posted_intr.c     |  22 ++++--
>   arch/x86/kvm/vmx/vmcs.h            |   1 +
>   arch/x86/kvm/vmx/vmx.c             | 114 +++++++++++++++++++++++++++--
>   arch/x86/kvm/vmx/vmx.h             |  55 ++++++++------
>   12 files changed, 208 insertions(+), 38 deletions(-)
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control
  2021-08-09  3:29 ` [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
@ 2021-09-10 21:25   ` Sean Christopherson
  2021-09-17 16:10     ` Zeng Guang
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 21:25 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

x86/cpu: is probaby more appropriate, this touches more than just feat_ctl.

On Mon, Aug 09, 2021, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> New VMX capability MSR IA32_VMX_PROCBASED_CTLS3 conresponse to this new
> VM-Execution control field. And it is 64bit allow-1 semantics, not like
> previous capability MSRs 32bit allow-0 and 32bit allow-1. So with Tertiary
> VM-Execution control field introduced, 2 vmx_feature leaves are introduced,
> TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
> 
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---

Nits aside,

Reviewed-by: Sean Christopherson <seanjc@google.com>

> @@ -22,7 +24,7 @@ enum vmx_feature_leafs {
>  
>  static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>  {
> -	u32 supported, funcs, ept, vpid, ign;
> +	u32 supported, funcs, ept, vpid, ign, low, high;
>  
>  	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
>  
> @@ -42,6 +44,13 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>  	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
>  	c->vmx_capability[SECONDARY_CTLS] = supported;
>  
> +	/*
> +	 * For tertiary execution controls MSR, it's actually a 64bit allowed-1.
> +	 */

Maybe something like this to better fit on one line?

	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */

> +	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
> +	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
> +	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
> +
>  	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
>  	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
>  
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2021-08-09  3:29 ` [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2021-09-10 21:28   ` Sean Christopherson
  2021-09-17 16:13     ` Zeng Guang
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 21:28 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Mon, Aug 09, 2021, Zeng Guang wrote:
> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
> +{									\
> +	return vmx->loaded_vmcs->controls_shadow.lname;			\
> +}									\

This conflicts with commit 389ab25216c9 ("KVM: nVMX: Pull KVM L0's desired controls
directly from vmcs01"), I believe the correct resolution is:

---
 arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4858c5fd95f2..1ae43afe52a7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -408,35 +408,38 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }

-#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
-{									    \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
-		vmcs_write32(uname, val);				    \
-		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
-	}								    \
-}									    \
-static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
-{									    \
-	return vmcs->controls_shadow.lname;				    \
-}									    \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
-{									    \
-	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
-}									    \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
-}									    \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
+static inline								\
+void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
+		vmcs_write##bits(uname, val);				\
+		vmx->loaded_vmcs->controls_shadow.lname = val;		\
+	}								\
+}									\
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
+{									\
+	return vmcs->controls_shadow.lname;				\
+}									\
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
+{									\
+	return __##lname##_controls_get(vmx->loaded_vmcs);		\
+}									\
+static inline								\
+void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
+}									\
+static inline								\
+void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
 }
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)

 static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 {
--

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2021-08-09  3:29 ` [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2021-09-10 21:35   ` Sean Christopherson
  2021-09-17 16:15     ` Zeng Guang
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 21:35 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Mon, Aug 09, 2021, Zeng Guang wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 927a552393b9..ee8c5664dc95 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2391,6 +2391,23 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>  	return 0;
>  }
>  
> +static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
> +					 u32 msr, u64 *result)
> +{
> +	u64 vmx_msr;
> +	u64 ctl = ctl_min | ctl_opt;
> +
> +	rdmsrl(msr, vmx_msr);
> +	ctl &= vmx_msr; /* bit == 1 means it can be set */
> +
> +	/* Ensure minimum (required) set of control bits are supported. */
> +	if (ctl_min & ~ctl)
> +		return -EIO;
> +
> +	*result = ctl;
> +	return 0;
> +}

More succinctly, since we don't need to force-set bits in the final value:

	u64 allowed1;

	rdmsrl(msr, allowed1);

	/* Ensure minimum (required) set of control bits are supported. */
	if (ctl_min & ~allowed1)
		return -EIO;

	*result = (ctl_min | ctl_opt) & allowed1;
	return 0;

>  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  				    struct vmx_capability *vmx_cap)
>  {


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit
  2021-08-09  3:29 ` [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit Zeng Guang
@ 2021-09-10 22:28   ` Sean Christopherson
  2021-09-17 16:00     ` Zeng Guang
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 22:28 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Mon, Aug 09, 2021, Zeng Guang wrote:
> Since IA x86 platform introduce features of IPI virtualization and
> User Interrupts, new behavior applies to the execution of WRMSR ICR

What do User Interrupts have to do with anything?

> register that causes APIC-write VM exit instead of MSR-write VM exit
> in x2APIC mode.

Please lead with what support is actually being added, and more directly state
what the new behavior actually is, e.g. when should KVM expect these types of
traps.  The shortlog helps a bit, but APIC-write is somewhat ambiguous without
the context that it refers to the trap-like exits, not exception-like exits on
the WRMSR itself.

Peeking ahead, this probably should be squashed with the next patch that adds
IPI virtualizatio support.  Without that patch's code that disables ICR MSR
intercepts for IPIv, this patch makes zero sense.

I'm not totally opposed to splitting IPIv support into two patches, I just don't
like splitting out this tiny subset that makes zero sense without the IPIv
code/context.  I assume you took this approach so that the shortlog could be
"KVM: VMX:" for the IPIv code.  IMO it's perfectly ok to keep that shortlog even
though there are minor changes outside of vmx/.  VMX is the only user of
kvm_apic_write_nodecode(), so it's not wrong to say it affects only VMX.

> This requires KVM to emulate writing 64-bit value to offset 300H on
> the virtual-APIC page(VICR) for guest running in x2APIC mode when

Maybe stylize that as vICR to make it stand out as virtual ICR?

> APIC-wrtie VM exit occurs. Prevoisely KVM doesn't consider this
       ^^^^^                 ^^^^^^^^^^
       write                 Previously

> situation as CPU never produce APIC-write VM exit in x2APIC mode before.
> 
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ba5a27879f1d..0b0f0ce96679 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2188,7 +2188,14 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>  	/* hw has done the conditional check and inst decode */
>  	offset &= 0xff0;
>  
> -	kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);

Probably worth snapshotting vcpu->arch.apic.

> +	if (apic_x2apic_mode(vcpu->arch.apic) && (offset == APIC_ICR)) {


A comment here would be _extremely_ helpful.  IIUC, this path is reached when IPIv
is enabled for all ICR writes that can't be virtualized, e.g. broadcast IPIs.

And I'm tempted to say this should WARN and do nothing if KVM gets an exit on
anything except ICR writes.

> +		u64 icr_val = *((u64 *)(vcpu->arch.apic->regs + offset));

Maybe just bump "val" to a u64?

Rather than open code this, can't this be:

		kvm_lapic_reg_read(apic, offset, 8, &val);
> +
> +		kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR2, (u32)(icr_val>>32));
> +		val = (u32)icr_val;

Hmm, this is the third path that open codes the ICR2:ICR split.  I think it's
probably worth adding a helper (patch below), and this can become:

void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
{
	struct kvm_lapic *apic = vcpu->arch.apic;
	u64 val = 0;

	/* hw has done the conditional check and inst decode */
	offset &= 0xff0;

	/* TODO: optimize to just emulate side effect w/o one more write */
	if (apic_x2apic_mode(apic)) {
		if (WARN_ON_ONCE(offset != APIC_ICR))
			return 1;

		kvm_lapic_reg_read(apic, offset, 8, &val);
		kvm_lapic_reg_write64(apic, offset, val);
	} else {
		kvm_lapic_reg_read(apic, offset, 4, &val);
		kvm_lapic_reg_write(apic, offset, val);
	}
}

There is some risk my idea will backfire if the CPU traps other WRMSRs, but even
then the pedant in me thinks the code for that should be:


	if (apic_x2apic_mode(apic)) {
		int size = offset == APIC_ICR ? 8 : 4;

		kvm_lapic_reg_read(apic, offset, size, &val);
		kvm_lapic_reg_write64(apic, offset, val);
	} else {
		...
	}

or worst case scenario, move the APIC_ICR check back so that the non-ICR path
back to "if (apic_x2apic_mode(vcpu->arch.apic) && (offset == APIC_ICR))" so that
it naturally falls into the 4-byte read+write.

> +	} else {
> +		kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
> +	}
>  
>  	/* TODO: optimize to just emulate side effect w/o one more write */
>  	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
> -- 
> 2.25.1


From c7641cf0c2ea2a1c5e6dda4007f8d285595ff82d Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 10 Sep 2021 15:07:57 -0700
Subject: [PATCH] KVM: x86: Add a helper to handle 64-bit APIC writes to ICR

Add a helper to handle 64-bit APIC writes, e.g. for x2APIC WRMSR, to
deduplicate the handling of ICR writes, which KVM needs to emulate as
back-to-back writes to ICR2 and then ICR.  Future support for IPI
virtualization will add yet another path where KVM must handle a 64-bit
APIC write.

Opportunistically fix the comment; ICR2 holds the destination (if there's
no shorthand), not the vector.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 76fb00921203..5f526ee10301 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2183,6 +2183,14 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);

+static int kvm_lapic_reg_write64(struct kvm_lapic *apic, u32 reg, u64 data)
+{
+	/* For 64-bit ICR writes, set ICR2 (dest) before ICR (command). */
+	if (reg == APIC_ICR)
+		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
+	return kvm_lapic_reg_write(apic, reg, (u32)data);
+}
+
 /* emulate APIC access in a trap manner */
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 {
@@ -2794,10 +2802,7 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	if (reg == APIC_ICR2)
 		return 1;

-	/* if this is ICR write vector before command */
-	if (reg == APIC_ICR)
-		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
-	return kvm_lapic_reg_write(apic, reg, (u32)data);
+	return kvm_lapic_reg_write64(apic, reg, data);
 }

 int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
@@ -2828,10 +2833,7 @@ int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
 	if (!lapic_in_kernel(vcpu))
 		return 1;

-	/* if this is ICR write vector before command */
-	if (reg == APIC_ICR)
-		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
-	return kvm_lapic_reg_write(apic, reg, (u32)data);
+	return kvm_lapic_reg_write64(apic, reg, data);
 }

 int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
--


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-08-09  3:29 ` [PATCH v4 6/6] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2021-09-10 23:43   ` Sean Christopherson
  2021-09-10 23:55     ` Sean Christopherson
  2021-09-17 16:00     ` Zeng Guang
  0 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 23:43 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Mon, Aug 09, 2021, Zeng Guang wrote:
> From: Gao Chao <chao.gao@intel.com>
> 
> With IPI virtualization enabled, the processor emulates writes
> to APIC registers that would send IPIs. The processor sets the
> bit corresponding to the vector in target vCPU's PIR and may send
> a notification (IPI) specified by NDST and NV fields in target vCPU's
> PID.

PID needs to be dis-ambiguated.  Normal kernel terminology for PID is Process ID.
Skimming through the code without paying attention to the changelog, that was my
initial reaction.  My next guest was that it was some new CPU identifier.  Turns
out it's Posted-Interrupt Descriptors.

> It is similar to what IOMMU engine does when dealing with posted
> interrupt from devices.
> 
> A PID-pointer table is used by the processor to locate the PID of a
> vCPU with the vCPU's APIC ID.
> 
> Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
> notification vector to wakeup vector. This can ensure that when an IPI
> for blocked vCPUs arrives, VMM can get control and wake up blocked
> vCPUs. And if a VCPU is preempted, its posted interrupt notification
> is suppressed.
> 
> Note that IPI virtualization can only virualize physical-addressing,
> flat mode, unicast IPIs. Sending other IPIs would still cause a
> VM exit and need to be handled by VMM.

It's worth calling out that they are the trap-like APIC-write exits.

> Signed-off-by: Gao Chao <chao.gao@intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/include/asm/vmx.h         |  8 ++++
>  arch/x86/include/asm/vmxfeatures.h |  2 +
>  arch/x86/kvm/vmx/capabilities.h    |  7 +++
>  arch/x86/kvm/vmx/posted_intr.c     | 22 +++++++---
>  arch/x86/kvm/vmx/vmx.c             | 69 ++++++++++++++++++++++++++++--
>  arch/x86/kvm/vmx/vmx.h             |  3 ++
>  6 files changed, 101 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 8c929596a299..b79b6438acaa 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -76,6 +76,11 @@
>  #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
>  #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
>  
> +/*
> + * Definitions of Tertiary Processor-Based VM-Execution Controls.
> + */
> +#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
> +
>  #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
>  #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
>  #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
> @@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
>  enum vmcs_field {
>  	VIRTUAL_PROCESSOR_ID            = 0x00000000,
>  	POSTED_INTR_NV                  = 0x00000002,
> +	LAST_PID_POINTER_INDEX		= 0x00000008,
>  	GUEST_ES_SELECTOR               = 0x00000800,
>  	GUEST_CS_SELECTOR               = 0x00000802,
>  	GUEST_SS_SELECTOR               = 0x00000804,
> @@ -224,6 +230,8 @@ enum vmcs_field {
>  	TSC_MULTIPLIER_HIGH             = 0x00002033,
>  	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
>  	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
> +	PID_POINTER_TABLE		= 0x00002042,
> +	PID_POINTER_TABLE_HIGH		= 0x00002043,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index b264f5c43b5f..e7b368a68c7c 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -86,4 +86,6 @@
>  #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
>  #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
>  
> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
> +#define VMX_FEATURE_IPI_VIRT		( 3*32 + 4) /* "" Enable IPI virtualization */

I don't think this should be suppressed, finding CPUs with IPIv support is
definitely interesting.

>  #endif /* _ASM_X86_VMXFEATURES_H */

...

> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 5f81ef092bd4..8c1400aaa1e7 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -81,9 +81,12 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>  {
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> -		!kvm_vcpu_apicv_active(vcpu))
> +	if (!kvm_vcpu_apicv_active(vcpu))
> +		return;
> +
> +	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
> +		!irq_remapping_cap(IRQ_POSTING_CAP)) &&
> +		!enable_ipiv)

Please fix the indentation while you're at it.

And maybe check for enable_ipi first so that the fast path (IPIv enabled) is
optimized to reduce mispredicts?

>  		return;
>  
>  	/* Set SN when the vCPU is preempted */
> @@ -141,9 +144,16 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>  	struct pi_desc old, new;
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> -		!kvm_vcpu_apicv_active(vcpu))
> +	if (!kvm_vcpu_apicv_active(vcpu))
> +		return 0;
> +
> +	/* Put vCPU into a list and set NV to wakeup vector if it is

	/*
	 * Multi-line comment goes here...
	 */

> +	 * one of the following cases:
> +	 * 1. any assigned device is in use.
> +	 * 2. IPI virtualization is enabled.
> +	 */
> +	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
> +		!irq_remapping_cap(IRQ_POSTING_CAP)) && !enable_ipiv)

Can we encapsulate this logic in a helper?  No idea what the name would be.

>  		return 0;
>  
>  	WARN_ON(irqs_disabled());
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9eb351c351ce..684c556395bf 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
>  
>  module_param(enable_apicv, bool, S_IRUGO);
>  
> +bool __read_mostly enable_ipiv = 1;
> +module_param(enable_ipiv, bool, S_IRUGO);

Please use octal, i.e. 0444.

> +
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -225,6 +228,7 @@ static const struct {
>  };
>  
>  #define L1D_CACHE_ORDER 4
> +#define PID_TABLE_ORDER get_order(KVM_MAX_VCPU_ID << 3)

IMO, the shift is unnecessary obfuscation:

  #define PID_TABLE_ORDER get_order(KVM_MAX_VCPU_ID * sizeof(u64))

>  static void *vmx_l1d_flush_pages;
>  
>  static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
> @@ -2514,7 +2518,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	}
>  
>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
> -		u64 opt3 = 0;
> +		u64 opt3 = enable_ipiv ? TERTIARY_EXEC_IPI_VIRT : 0;

This is not the right place to deal with module params.  It's not horrific when
there's only one control, but it'll devolve into a mess as more features are
added, e.g. try applying this pattern to secondary execution controls :-)

>  		u64 min3 = 0;
>  
>  		if (adjust_vmx_controls_64(min3, opt3,
> @@ -3870,6 +3874,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
>  		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
> +		vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
> +				MSR_TYPE_RW, !enable_ipiv);

This needs to account for kvm_vcpu_apicv_active(), otherwise KVM will expose the
"real" ICR to the guest if APICv and thus IPIv are deactivated for whatever reason.
It might be worth adding kvm_vcpu_ipiv_active().

>  	}
>  }
>  
> @@ -4138,14 +4144,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>  	if (cpu_has_secondary_exec_ctrls()) {
> -		if (kvm_vcpu_apicv_active(vcpu))
> +		if (kvm_vcpu_apicv_active(vcpu)) {
>  			secondary_exec_controls_setbit(vmx,
>  				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> -		else
> +			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
> +				tertiary_exec_controls_setbit(vmx,
> +					TERTIARY_EXEC_IPI_VIRT);
> +		} else {
>  			secondary_exec_controls_clearbit(vmx,
>  					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +			if (cpu_has_tertiary_exec_ctrls())
> +				tertiary_exec_controls_clearbit(vmx,
> +					TERTIARY_EXEC_IPI_VIRT);
> +		}
>  	}
>  
>  	if (cpu_has_vmx_msr_bitmap())
> @@ -4180,7 +4193,13 @@ u32 vmx_exec_control(struct vcpu_vmx *vmx)
>  
>  static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>  {
> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
> +
> +	if (!kvm_vcpu_apicv_active(vcpu))

This should be:

	if (!enable_ipiv || !kvm_vcpu_apicv_active(vcpu))

or with a helper

	if (!kvm_vcpu_ipiv_active(vcpu))

I believe you took the dependency only on kvm_vcpu_apicv_active() because
enable_ipiv is handled in setup_vmcs_config(), but that's fragile since it falls
apart if enable_ipiv is forced off for some other reason.

> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
> +
> +	return exec_control;
>  }
>  
>  /*
> @@ -4330,6 +4349,17 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
>  
>  #define VMX_XSS_EXIT_BITMAP 0
>  
> +static void install_pid(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
> +
> +	BUG_ON(vmx->vcpu.vcpu_id > kvm_vmx->pid_last_index);

This is pointless since pid_last_index is hardcoded.  E.g. if we drop
pid_last_index then this becomes

	BUG_ON(vmx->vcpu.vcpu_id > KVM_MAX_VCPU_ID)

which is silly.  And if pid_last_index is ever needed because the table grows
dynamically, the BUG_ON would still be silly since pid_last_index would be
derived directly from the max vcpu_id.

> +	/* Bit 0 is the valid bit */

Use a #define instead of a comment.

> +	kvm_vmx->pid_table[vmx->vcpu.vcpu_id] = __pa(&vmx->pi_desc) | 1;

This needs WRITE_ONCE to avoid theoretical badness if userspace creates a vCPU
while others are running and a vCPU concurrently accesses the entry, e.g. CPU
sees '1' but not the PA (which I think the compiler can technically do?).

And IIUC, IPIv works for both xAPIC and x2APIC.  With xAPIC, the guest can change
its APIC ID at will, and all of this goes kaput.  I assume kvm_apic_set_xapic_id()
and maybe even kvm_apic_set_x2apic_id() need to hook into IPIv.  E.g. see

	case APIC_ID:		/* Local APIC ID */
		if (!apic_x2apic_mode(apic))
			kvm_apic_set_xapic_id(apic, val >> 24);
		else
			ret = 1;
		break;


> +	vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
> +	vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
> +}
> +
>  /*
>   * Noting that the initialization of Guest-state Area of VMCS is in
>   * vmx_vcpu_reset().
> @@ -4367,6 +4397,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  
>  		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (enable_ipiv)
> +			install_pid(vmx);

I'd say avoid the whole "pid" naming mess and drop the helper.  Without context,
intall_pid() absolutely looks like it's installing a process ID somewhere.  The
line wraps can be largely avoided with a bit of value caching, e.g.

 static void init_vmcs(struct vcpu_vmx *vmx)
 {
+       struct kvm_vcpu *vcpu = &vmx->vcpu;
+       u64 *pid_table = to_kvm_vmx(vcpu->kvm)->pid_table;
+
        if (nested)
                nested_vmx_set_vmcs_shadowing_bitmap();

@@ -4428,8 +4420,12 @@ static void init_vmcs(struct vcpu_vmx *vmx)
                vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
                vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));

-               if (enable_ipiv)
-                       install_pid(vmx);
+               if (enable_ipiv) {
+                       pid_table[vcpu->vcpu_id] = __pa(&vmx->pi_desc) |
+                                                  PID_TABLE_ENTRY_VALID;
+                       vmcs_write64(PID_POINTER_TABLE, __pa(pid_table));
+                       vmcs_write16(LAST_PID_POINTER_INDEX, KVM_MAX_VCPU_ID);
+               }
        }

>  	}
>  
>  	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
> @@ -6965,6 +6998,22 @@ static int vmx_vm_init(struct kvm *kvm)
>  			break;
>  		}
>  	}
> +
> +	if (enable_ipiv) {
> +		struct page *pages;
> +
> +		/* Allocate pages for PID table in order of PID_TABLE_ORDER
> +		 * depending on KVM_MAX_VCPU_ID. Each PID entry is 8 bytes.
> +		 */

Not a helpful comment, it doesn't provide any info that isn't readily available
in the code.

> +		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, PID_TABLE_ORDER);
> +

Unnecessary space.

> +		if (!pages)
> +			return -ENOMEM;
> +
> +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
> +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;

I don't see the point of pid_last_index if we're hardcoding it to KVM_MAX_VCPU_ID.
If I understand the ucode pseudocode, there's no performance hit in the happy
case, i.e. it only guards against out-of-bounds accesses.

And I wonder if we want to fail the build if this grows beyond an order-1
allocation, e.g.

		BUILD_BUG_ON(PID_TABLE_ORDER > 1);

Allocating two pages per VM isn't terrible, but 4+ starts to get painful when
considering the fact that most VMs aren't going to need more than one page.  For
now I agree the simplicity of not dynamically growing the table is worth burning
a page.

> +	}
> +
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-09-10 23:43   ` Sean Christopherson
@ 2021-09-10 23:55     ` Sean Christopherson
  2021-09-17 16:10       ` Zeng Guang
  2021-09-17 16:00     ` Zeng Guang
  1 sibling, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-09-10 23:55 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Fri, Sep 10, 2021, Sean Christopherson wrote:
> On Mon, Aug 09, 2021, Zeng Guang wrote:
> > +		if (!pages)
> > +			return -ENOMEM;
> > +
> > +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
> > +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;
> 
> I don't see the point of pid_last_index if we're hardcoding it to KVM_MAX_VCPU_ID.
> If I understand the ucode pseudocode, there's no performance hit in the happy
> case, i.e. it only guards against out-of-bounds accesses.
> 
> And I wonder if we want to fail the build if this grows beyond an order-1
> allocation, e.g.
> 
> 		BUILD_BUG_ON(PID_TABLE_ORDER > 1);
> 
> Allocating two pages per VM isn't terrible, but 4+ starts to get painful when
> considering the fact that most VMs aren't going to need more than one page.  For
> now I agree the simplicity of not dynamically growing the table is worth burning
> a page.

Ugh, Paolo has queued a series which bumps KVM_MAX_VCPU_ID to 4096[*].  That makes
this an order-3 allocation, which is quite painful.  One thought would be to let
userspace declare the max vCPU it wants to create, not sure if that would work for
xAPIC though.

[*] https://lkml.kernel.org/r/1111efc8-b32f-bd50-2c0f-4c6f506b544b@redhat.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit
  2021-09-10 22:28   ` Sean Christopherson
@ 2021-09-17 16:00     ` Zeng Guang
  0 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On 9/11/2021 6:28 AM, Sean Christopherson wrote:
> On Mon, Aug 09, 2021, Zeng Guang wrote:
>> Since IA x86 platform introduce features of IPI virtualization and
>> User Interrupts, new behavior applies to the execution of WRMSR ICR
> What do User Interrupts have to do with anything?

User interrupt will also use APIC-Write VM exit to emulate the MSR write 
process on vICR register.
May be better here to give a general description rather than list 
detailed feature samples.  Actually
this new behavior start to be common implementation in Intel platform now.

>> register that causes APIC-write VM exit instead of MSR-write VM exit
>> in x2APIC mode.
> Please lead with what support is actually being added, and more directly state
> what the new behavior actually is, e.g. when should KVM expect these types of
> traps.  The shortlog helps a bit, but APIC-write is somewhat ambiguous without
> the context that it refers to the trap-like exits, not exception-like exits on
> the WRMSR itself.

IIUC, APIC-write is always trap-like exits. I would give more clear and 
accurate description here as you suggested.

> Peeking ahead, this probably should be squashed with the next patch that adds
> IPI virtualizatio support.  Without that patch's code that disables ICR MSR
> intercepts for IPIv, this patch makes zero sense.
>
> I'm not totally opposed to splitting IPIv support into two patches, I just don't
> like splitting out this tiny subset that makes zero sense without the IPIv
> code/context.  I assume you took this approach so that the shortlog could be
> "KVM: VMX:" for the IPIv code.  IMO it's perfectly ok to keep that shortlog even
> though there are minor changes outside of vmx/.  VMX is the only user of
> kvm_apic_write_nodecode(), so it's not wrong to say it affects only VMX.

IMO, this patch targets to solve the mishandling in APIC-write VM exit 
if it presents 64-bit value writing to vICR in x2APIC mode.
Since Intel hardware with some new features will support such function, 
KVM have to enhance its solution and also be backward-compatible with
previous platform. I think it could be an independent fix patch.

>> This requires KVM to emulate writing 64-bit value to offset 300H on
>> the virtual-APIC page(VICR) for guest running in x2APIC mode when
> Maybe stylize that as vICR to make it stand out as virtual ICR?
OK.
>> APIC-wrtie VM exit occurs. Prevoisely KVM doesn't consider this
>         ^^^^^                 ^^^^^^^^^^
>         write                 Previously
Thanks for correction.
>> situation as CPU never produce APIC-write VM exit in x2APIC mode before.
>>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>   arch/x86/kvm/lapic.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index ba5a27879f1d..0b0f0ce96679 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2188,7 +2188,14 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>>   	/* hw has done the conditional check and inst decode */
>>   	offset &= 0xff0;
>>   
>> -	kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
> Probably worth snapshotting vcpu->arch.apic.
>
>> +	if (apic_x2apic_mode(vcpu->arch.apic) && (offset == APIC_ICR)) {
>
> A comment here would be _extremely_ helpful.  IIUC, this path is reached when IPIv
> is enabled for all ICR writes that can't be virtualized, e.g. broadcast IPIs.
>
> And I'm tempted to say this should WARN and do nothing if KVM gets an exit on
> anything except ICR writes.
>
>> +		u64 icr_val = *((u64 *)(vcpu->arch.apic->regs + offset));
> Maybe just bump "val" to a u64?
>
> Rather than open code this, can't this be:
>
> 		kvm_lapic_reg_read(apic, offset, 8, &val);
>> +
>> +		kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR2, (u32)(icr_val>>32));
>> +		val = (u32)icr_val;
> Hmm, this is the third path that open codes the ICR2:ICR split.  I think it's
> probably worth adding a helper (patch below), and this can become:
>
> void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> 	u64 val = 0;
>
> 	/* hw has done the conditional check and inst decode */
> 	offset &= 0xff0;
>
> 	/* TODO: optimize to just emulate side effect w/o one more write */
> 	if (apic_x2apic_mode(apic)) {
> 		if (WARN_ON_ONCE(offset != APIC_ICR))
> 			return 1;
>
> 		kvm_lapic_reg_read(apic, offset, 8, &val);
> 		kvm_lapic_reg_write64(apic, offset, val);
> 	} else {
> 		kvm_lapic_reg_read(apic, offset, 4, &val);
> 		kvm_lapic_reg_write(apic, offset, val);
> 	}
> }
>
> There is some risk my idea will backfire if the CPU traps other WRMSRs, but even
> then the pedant in me thinks the code for that should be:
>
>
> 	if (apic_x2apic_mode(apic)) {
> 		int size = offset == APIC_ICR ? 8 : 4;
>
> 		kvm_lapic_reg_read(apic, offset, size, &val);
> 		kvm_lapic_reg_write64(apic, offset, val);
> 	} else {
> 		...
> 	}
>
> or worst case scenario, move the APIC_ICR check back so that the non-ICR path
> back to "if (apic_x2apic_mode(vcpu->arch.apic) && (offset == APIC_ICR))" so that
> it naturally falls into the 4-byte read+write.
Only vICR in x2APIC mode is an exception to process 64-bit data 
instead.  That could be expensive to use
kvm_lapic_reg_read which will do many check unnecessarily specific for 
vICR, especially in case there are
intensive interrupts arising. Besides, kvm_lapic_reg_write64() requires 
caller to ensure 64-bit data containing
right ICR2 and ICR value along with respective apic mode. Otherwise it 
will corrupt APIC_ICR2 setting.

Probably it's better to optimize code as below. How do you think?

void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
{
	struct kvm_lapic *apic = vcpu->arch.apic;
	u64 val = 0;

	/* hw has done the conditional check and inst decode */
	offset &= 0xff0;

	/* TODO: optimize to just emulate side effect w/o one more write */
	if ((offset == APIC_ICR) && apic_x2apic_mode(apic)) {
		val = *((u64 *)(apic->regs + offset));
		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(val>>32));
	 } else {
		kvm_lapic_reg_read(apic, offset, 4, &val);
	}
		
	kvm_lapic_reg_write(apic, offset, (u32)val);
  }

+ } else {
>> +		kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
>> +	}
>>   
>>   	/* TODO: optimize to just emulate side effect w/o one more write */
>>   	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
>> -- 
>> 2.25.1
>
>  From c7641cf0c2ea2a1c5e6dda4007f8d285595ff82d Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 10 Sep 2021 15:07:57 -0700
> Subject: [PATCH] KVM: x86: Add a helper to handle 64-bit APIC writes to ICR
>
> Add a helper to handle 64-bit APIC writes, e.g. for x2APIC WRMSR, to
> deduplicate the handling of ICR writes, which KVM needs to emulate as
> back-to-back writes to ICR2 and then ICR.  Future support for IPI
> virtualization will add yet another path where KVM must handle a 64-bit
> APIC write.
>
> Opportunistically fix the comment; ICR2 holds the destination (if there's
> no shorthand), not the vector.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/lapic.c | 18 ++++++++++--------
>   1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 76fb00921203..5f526ee10301 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2183,6 +2183,14 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
>   }
>   EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>
> +static int kvm_lapic_reg_write64(struct kvm_lapic *apic, u32 reg, u64 data)
> +{
> +	/* For 64-bit ICR writes, set ICR2 (dest) before ICR (command). */
> +	if (reg == APIC_ICR)
> +		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> +	return kvm_lapic_reg_write(apic, reg, (u32)data);
> +}
> +
>   /* emulate APIC access in a trap manner */
>   void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>   {
> @@ -2794,10 +2802,7 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   	if (reg == APIC_ICR2)
>   		return 1;
>
> -	/* if this is ICR write vector before command */
> -	if (reg == APIC_ICR)
> -		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> -	return kvm_lapic_reg_write(apic, reg, (u32)data);
> +	return kvm_lapic_reg_write64(apic, reg, data);
>   }
>
>   int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> @@ -2828,10 +2833,7 @@ int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
>   	if (!lapic_in_kernel(vcpu))
>   		return 1;
>
> -	/* if this is ICR write vector before command */
> -	if (reg == APIC_ICR)
> -		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> -	return kvm_lapic_reg_write(apic, reg, (u32)data);
> +	return kvm_lapic_reg_write64(apic, reg, data);
>   }
>
>   int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
> --
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-09-10 23:43   ` Sean Christopherson
  2021-09-10 23:55     ` Sean Christopherson
@ 2021-09-17 16:00     ` Zeng Guang
  1 sibling, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On 9/11/2021 7:43 AM, Sean Christopherson wrote:
> On Mon, Aug 09, 2021, Zeng Guang wrote:
>> From: Gao Chao <chao.gao@intel.com>
>>
>> With IPI virtualization enabled, the processor emulates writes
>> to APIC registers that would send IPIs. The processor sets the
>> bit corresponding to the vector in target vCPU's PIR and may send
>> a notification (IPI) specified by NDST and NV fields in target vCPU's
>> PID.
> PID needs to be dis-ambiguated.  Normal kernel terminology for PID is Process ID.
> Skimming through the code without paying attention to the changelog, that was my
> initial reaction.  My next guest was that it was some new CPU identifier.  Turns
> out it's Posted-Interrupt Descriptors.
>
> It is similar to what IOMMU engine does when dealing with posted
> interrupt from devices.
>
> A PID-pointer table is used by the processor to locate the PID of a
> vCPU with the vCPU's APIC ID.
>
> Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
> notification vector to wakeup vector. This can ensure that when an IPI
> for blocked vCPUs arrives, VMM can get control and wake up blocked
> vCPUs. And if a VCPU is preempted, its posted interrupt notification
> is suppressed.
>
> Note that IPI virtualization can only virualize physical-addressing,
> flat mode, unicast IPIs. Sending other IPIs would still cause a
> VM exit and need to be handled by VMM.
> It's worth calling out that they are the trap-like APIC-write exits.
OK. I will modify above accordingly.
>> Signed-off-by: Gao Chao <chao.gao@intel.com>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>   arch/x86/include/asm/vmx.h         |  8 ++++
>>   arch/x86/include/asm/vmxfeatures.h |  2 +
>>   arch/x86/kvm/vmx/capabilities.h    |  7 +++
>>   arch/x86/kvm/vmx/posted_intr.c     | 22 +++++++---
>>   arch/x86/kvm/vmx/vmx.c             | 69 ++++++++++++++++++++++++++++--
>>   arch/x86/kvm/vmx/vmx.h             |  3 ++
>>   6 files changed, 101 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 8c929596a299..b79b6438acaa 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -76,6 +76,11 @@
>>   #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
>>   #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
>>   
>> +/*
>> + * Definitions of Tertiary Processor-Based VM-Execution Controls.
>> + */
>> +#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
>> +
>>   #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
>>   #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
>>   #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
>> @@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
>>   enum vmcs_field {
>>   	VIRTUAL_PROCESSOR_ID            = 0x00000000,
>>   	POSTED_INTR_NV                  = 0x00000002,
>> +	LAST_PID_POINTER_INDEX		= 0x00000008,
>>   	GUEST_ES_SELECTOR               = 0x00000800,
>>   	GUEST_CS_SELECTOR               = 0x00000802,
>>   	GUEST_SS_SELECTOR               = 0x00000804,
>> @@ -224,6 +230,8 @@ enum vmcs_field {
>>   	TSC_MULTIPLIER_HIGH             = 0x00002033,
>>   	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
>>   	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
>> +	PID_POINTER_TABLE		= 0x00002042,
>> +	PID_POINTER_TABLE_HIGH		= 0x00002043,
>>   	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>>   	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>>   	VMCS_LINK_POINTER               = 0x00002800,
>> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
>> index b264f5c43b5f..e7b368a68c7c 100644
>> --- a/arch/x86/include/asm/vmxfeatures.h
>> +++ b/arch/x86/include/asm/vmxfeatures.h
>> @@ -86,4 +86,6 @@
>>   #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
>>   #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
>>   
>> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
>> +#define VMX_FEATURE_IPI_VIRT		( 3*32 + 4) /* "" Enable IPI virtualization */
> I don't think this should be suppressed, finding CPUs with IPIv support is
> definitely interesting.
>
>>   #endif /* _ASM_X86_VMXFEATURES_H */
> ...
>
>> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
>> index 5f81ef092bd4..8c1400aaa1e7 100644
>> --- a/arch/x86/kvm/vmx/posted_intr.c
>> +++ b/arch/x86/kvm/vmx/posted_intr.c
>> @@ -81,9 +81,12 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>>   {
>>   	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>>   
>> -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
>> -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
>> -		!kvm_vcpu_apicv_active(vcpu))
>> +	if (!kvm_vcpu_apicv_active(vcpu))
>> +		return;
>> +
>> +	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
>> +		!irq_remapping_cap(IRQ_POSTING_CAP)) &&
>> +		!enable_ipiv)
> Please fix the indentation while you're at it.

Any indentation error here ?  Passed the checkpatch scan :-(

>
> And maybe check for enable_ipi first so that the fast path (IPIv enabled) is
> optimized to reduce mispredicts?
Good idea to check enable_ipi first.
>>   		return;
>>   
>>   	/* Set SN when the vCPU is preempted */
>> @@ -141,9 +144,16 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>>   	struct pi_desc old, new;
>>   	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>>   
>> -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
>> -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
>> -		!kvm_vcpu_apicv_active(vcpu))
>> +	if (!kvm_vcpu_apicv_active(vcpu))
>> +		return 0;
>> +
>> +	/* Put vCPU into a list and set NV to wakeup vector if it is
> 	/*
> 	 * Multi-line comment goes here...
> 	 */
OK. Got it.
>> +	 * one of the following cases:
>> +	 * 1. any assigned device is in use.
>> +	 * 2. IPI virtualization is enabled.
>> +	 */
>> +	if ((!kvm_arch_has_assigned_device(vcpu->kvm) ||
>> +		!irq_remapping_cap(IRQ_POSTING_CAP)) && !enable_ipiv)
> Can we encapsulate this logic in a helper?  No idea what the name would be.
Is it worth using helper here ? Will the helper involve additional 
overhead ?
>>   		return 0;
>>   
>>   	WARN_ON(irqs_disabled());
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 9eb351c351ce..684c556395bf 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
>>   
>>   module_param(enable_apicv, bool, S_IRUGO);
>>   
>> +bool __read_mostly enable_ipiv = 1;
>> +module_param(enable_ipiv, bool, S_IRUGO);
> Please use octal, i.e. 0444.
OK.
>
>> +
>>   /*
>>    * If nested=1, nested virtualization is supported, i.e., guests may use
>>    * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>> @@ -225,6 +228,7 @@ static const struct {
>>   };
>>   
>>   #define L1D_CACHE_ORDER 4
>> +#define PID_TABLE_ORDER get_order(KVM_MAX_VCPU_ID << 3)
> IMO, the shift is unnecessary obfuscation:
>
>    #define PID_TABLE_ORDER get_order(KVM_MAX_VCPU_ID * sizeof(u64))
Or put comments here to clarify why it's shifted by 3. Shift should be 
faster anyway :-)
>>   static void *vmx_l1d_flush_pages;
>>   
>>   static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>> @@ -2514,7 +2518,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>>   	}
>>   
>>   	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
>> -		u64 opt3 = 0;
>> +		u64 opt3 = enable_ipiv ? TERTIARY_EXEC_IPI_VIRT : 0;
> This is not the right place to deal with module params.  It's not horrific when
> there's only one control, but it'll devolve into a mess as more features are
> added, e.g. try applying this pattern to secondary execution controls :-)

Paolo seems have similar idea to move the handling of other module 
parameters from hardware_setup() to setup_vmcs_config().
As of now he may look into this idea further.

>>   		u64 min3 = 0;
>>   
>>   		if (adjust_vmx_controls_64(min3, opt3,
>> @@ -3870,6 +3874,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
>>   		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>>   		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>>   		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
>> +		vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
>> +				MSR_TYPE_RW, !enable_ipiv);
> This needs to account for kvm_vcpu_apicv_active(), otherwise KVM will expose the
> "real" ICR to the guest if APICv and thus IPIv are deactivated for whatever reason.
> It might be worth adding kvm_vcpu_ipiv_active().

Here it's executing in MSR_BITMAP_MODE_X2APIC_APICV mode. So it needn't 
consider apicv status checking again.
>
>>   	}
>>   }
>>   
>> @@ -4138,14 +4144,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>>   
>>   	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>>   	if (cpu_has_secondary_exec_ctrls()) {
>> -		if (kvm_vcpu_apicv_active(vcpu))
>> +		if (kvm_vcpu_apicv_active(vcpu)) {
>>   			secondary_exec_controls_setbit(vmx,
>>   				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>   				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> -		else
>> +			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
>> +				tertiary_exec_controls_setbit(vmx,
>> +					TERTIARY_EXEC_IPI_VIRT);
>> +		} else {
>>   			secondary_exec_controls_clearbit(vmx,
>>   					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>   					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> +			if (cpu_has_tertiary_exec_ctrls())
>> +				tertiary_exec_controls_clearbit(vmx,
>> +					TERTIARY_EXEC_IPI_VIRT);
>> +		}
>>   	}
>>   
>>   	if (cpu_has_vmx_msr_bitmap())
>> @@ -4180,7 +4193,13 @@ u32 vmx_exec_control(struct vcpu_vmx *vmx)
>>   
>>   static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>>   {
>> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
>> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
>> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
>> +
>> +	if (!kvm_vcpu_apicv_active(vcpu))
> This should be:
>
> 	if (!enable_ipiv || !kvm_vcpu_apicv_active(vcpu))
>
> or with a helper
>
> 	if (!kvm_vcpu_ipiv_active(vcpu))
>
> I believe you took the dependency only on kvm_vcpu_apicv_active() because
> enable_ipiv is handled in setup_vmcs_config(), but that's fragile since it falls
> apart if enable_ipiv is forced off for some other reason.

The other possibility to force off enable_ipiv results from disabled 
enable_apicv. kvm_vcpu_apicv_active() already cover this situation.
So it's safe enough to check apicv active status only here.

>> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
>> +
>> +	return exec_control;
>>   }
>>   
>>   /*
>> @@ -4330,6 +4349,17 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
>>   
>>   #define VMX_XSS_EXIT_BITMAP 0
>>   
>> +static void install_pid(struct vcpu_vmx *vmx)
>> +{
>> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
>> +
>> +	BUG_ON(vmx->vcpu.vcpu_id > kvm_vmx->pid_last_index);
> This is pointless since pid_last_index is hardcoded.  E.g. if we drop
> pid_last_index then this becomes
>
> 	BUG_ON(vmx->vcpu.vcpu_id > KVM_MAX_VCPU_ID)
>
> which is silly.  And if pid_last_index is ever needed because the table grows
> dynamically, the BUG_ON would still be silly since pid_last_index would be
> derived directly from the max vcpu_id.
Then remove it. :-)
>> +	/* Bit 0 is the valid bit */
> Use a #define instead of a comment.
>
>> +	kvm_vmx->pid_table[vmx->vcpu.vcpu_id] = __pa(&vmx->pi_desc) | 1;
> This needs WRITE_ONCE to avoid theoretical badness if userspace creates a vCPU
> while others are running and a vCPU concurrently accesses the entry, e.g. CPU
> sees '1' but not the PA (which I think the compiler can technically do?).

IIUC, before this vCPU becomes online, other vCPU wouldn't make 
interrupts via access of its PI descriptor table entry.
I think it safe without WRITE_ONCE().

>
> And IIUC, IPIv works for both xAPIC and x2APIC.  With xAPIC, the guest can change
> its APIC ID at will, and all of this goes kaput.  I assume kvm_apic_set_xapic_id()
> and maybe even kvm_apic_set_x2apic_id() need to hook into IPIv.  E.g. see
>
> 	case APIC_ID:		/* Local APIC ID */
> 		if (!apic_x2apic_mode(apic))
> 			kvm_apic_set_xapic_id(apic, val >> 24);
> 		else
> 			ret = 1;
> 		break;
>
kvm set APIC id with vCPU id. And I don't find kernel will change it in 
runtime.  Probably we can ignore this api.

>> +	vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
>> +	vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
>> +}
>> +
>>   /*
>>    * Noting that the initialization of Guest-state Area of VMCS is in
>>    * vmx_vcpu_reset().
>> @@ -4367,6 +4397,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>   
>>   		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>>   		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>> +
>> +		if (enable_ipiv)
>> +			install_pid(vmx);
> I'd say avoid the whole "pid" naming mess and drop the helper.  Without context,
> intall_pid() absolutely looks like it's installing a process ID somewhere.  The
> line wraps can be largely avoided with a bit of value caching, e.g.
>
>   static void init_vmcs(struct vcpu_vmx *vmx)
>   {
> +       struct kvm_vcpu *vcpu = &vmx->vcpu;
> +       u64 *pid_table = to_kvm_vmx(vcpu->kvm)->pid_table;
> +
>          if (nested)
>                  nested_vmx_set_vmcs_shadowing_bitmap();
>
> @@ -4428,8 +4420,12 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>                  vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>                  vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>
> -               if (enable_ipiv)
> -                       install_pid(vmx);
> +               if (enable_ipiv) {
> +                       pid_table[vcpu->vcpu_id] = __pa(&vmx->pi_desc) |
> +                                                  PID_TABLE_ENTRY_VALID;
> +                       vmcs_write64(PID_POINTER_TABLE, __pa(pid_table));
> +                       vmcs_write16(LAST_PID_POINTER_INDEX, KVM_MAX_VCPU_ID);
> +               }
>          }

I'll adapt it as you suggested. Thanks.

>>   	}
>>   
>>   	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
>> @@ -6965,6 +6998,22 @@ static int vmx_vm_init(struct kvm *kvm)
>>   			break;
>>   		}
>>   	}
>> +
>> +	if (enable_ipiv) {
>> +		struct page *pages;
>> +
>> +		/* Allocate pages for PID table in order of PID_TABLE_ORDER
>> +		 * depending on KVM_MAX_VCPU_ID. Each PID entry is 8 bytes.
>> +		 */
> Not a helpful comment, it doesn't provide any info that isn't readily available
> in the code.

I will move comments to the definition of PID_TABLE_ORDER.

>> +		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, PID_TABLE_ORDER);
>> +
> Unnecessary space.
>
>> +		if (!pages)
>> +			return -ENOMEM;
>> +
>> +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
>> +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;
> I don't see the point of pid_last_index if we're hardcoding it to KVM_MAX_VCPU_ID.
> If I understand the ucode pseudocode, there's no performance hit in the happy
> case, i.e. it only guards against out-of-bounds accesses.
>
> And I wonder if we want to fail the build if this grows beyond an order-1
> allocation, e.g.
>
> 		BUILD_BUG_ON(PID_TABLE_ORDER > 1);
>
> Allocating two pages per VM isn't terrible, but 4+ starts to get painful when
> considering the fact that most VMs aren't going to need more than one page.  For
> now I agree the simplicity of not dynamically growing the table is worth burning
> a page.
>
>> +	}
>> +
>>   	return 0;
>>   }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-09-10 23:55     ` Sean Christopherson
@ 2021-09-17 16:10       ` Zeng Guang
  2021-10-14 18:56         ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On 9/11/2021 7:55 AM, Sean Christopherson wrote:
> On Fri, Sep 10, 2021, Sean Christopherson wrote:
>> On Mon, Aug 09, 2021, Zeng Guang wrote:
>>> +		if (!pages)
>>> +			return -ENOMEM;
>>> +
>>> +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
>>> +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;
>> I don't see the point of pid_last_index if we're hardcoding it to KVM_MAX_VCPU_ID.
>> If I understand the ucode pseudocode, there's no performance hit in the happy
>> case, i.e. it only guards against out-of-bounds accesses.
>>
>> And I wonder if we want to fail the build if this grows beyond an order-1
>> allocation, e.g.
>>
>> 		BUILD_BUG_ON(PID_TABLE_ORDER > 1);
>>
>> Allocating two pages per VM isn't terrible, but 4+ starts to get painful when
>> considering the fact that most VMs aren't going to need more than one page.  For
>> now I agree the simplicity of not dynamically growing the table is worth burning
>> a page.
> Ugh, Paolo has queued a series which bumps KVM_MAX_VCPU_ID to 4096[*].  That makes
> this an order-3 allocation, which is quite painful.  One thought would be to let
> userspace declare the max vCPU it wants to create, not sure if that would work for
> xAPIC though.
>
> [*] https://lkml.kernel.org/r/1111efc8-b32f-bd50-2c0f-4c6f506b544b@redhat.com
Thus we keep current design as no change.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control
  2021-09-10 21:25   ` Sean Christopherson
@ 2021-09-17 16:10     ` Zeng Guang
  0 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo

On 9/11/2021 5:25 AM, Sean Christopherson wrote:
> x86/cpu: is probaby more appropriate, this touches more than just feat_ctl.
>
> On Mon, Aug 09, 2021, Zeng Guang wrote:
>> From: Robert Hoo <robert.hu@linux.intel.com>
>>
>> New VMX capability MSR IA32_VMX_PROCBASED_CTLS3 conresponse to this new
>> VM-Execution control field. And it is 64bit allow-1 semantics, not like
>> previous capability MSRs 32bit allow-0 and 32bit allow-1. So with Tertiary
>> VM-Execution control field introduced, 2 vmx_feature leaves are introduced,
>> TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
>>
>> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
> Nits aside,
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
>
>> @@ -22,7 +24,7 @@ enum vmx_feature_leafs {
>>   
>>   static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>>   {
>> -	u32 supported, funcs, ept, vpid, ign;
>> +	u32 supported, funcs, ept, vpid, ign, low, high;
>>   
>>   	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
>>   
>> @@ -42,6 +44,13 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>>   	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
>>   	c->vmx_capability[SECONDARY_CTLS] = supported;
>>   
>> +	/*
>> +	 * For tertiary execution controls MSR, it's actually a 64bit allowed-1.
>> +	 */
> Maybe something like this to better fit on one line?
>
> 	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */
>
>> +	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
>> +	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
>> +	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
>> +
>>   	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
>>   	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
>>   
>> -- 
>> 2.25.1
Thanks for reviewed-by.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2021-09-10 21:28   ` Sean Christopherson
@ 2021-09-17 16:13     ` Zeng Guang
  0 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo

On 9/11/2021 5:28 AM, Sean Christopherson wrote:
> On Mon, Aug 09, 2021, Zeng Guang wrote:
>> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
>> +{									\
>> +	return vmx->loaded_vmcs->controls_shadow.lname;			\
>> +}									\
> This conflicts with commit 389ab25216c9 ("KVM: nVMX: Pull KVM L0's desired controls
> directly from vmcs01"), I believe the correct resolution is:
>
> ---
>   arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
>   1 file changed, 31 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 4858c5fd95f2..1ae43afe52a7 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -408,35 +408,38 @@ static inline u8 vmx_get_rvi(void)
>   	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
>   }
>
> -#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
> -static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
> -{									    \
> -	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
> -		vmcs_write32(uname, val);				    \
> -		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
> -	}								    \
> -}									    \
> -static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
> -{									    \
> -	return vmcs->controls_shadow.lname;				    \
> -}									    \
> -static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
> -{									    \
> -	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
> -}									    \
> -static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
> -{									    \
> -	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
> -}									    \
> -static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
> -{									    \
> -	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
> +#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
> +static inline								\
> +void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
> +		vmcs_write##bits(uname, val);				\
> +		vmx->loaded_vmcs->controls_shadow.lname = val;		\
> +	}								\
> +}									\
> +static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
> +{									\
> +	return vmcs->controls_shadow.lname;				\
> +}									\
> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
> +{									\
> +	return __##lname##_controls_get(vmx->loaded_vmcs);		\
> +}									\
> +static inline								\
> +void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
> +}									\
> +static inline								\
> +void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
>   }
> -BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
> -BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
> -BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
> +BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
>
>   static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
>   {
> --

I will rebase it. Thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2021-09-10 21:35   ` Sean Christopherson
@ 2021-09-17 16:15     ` Zeng Guang
  0 siblings, 0 replies; 21+ messages in thread
From: Zeng Guang @ 2021-09-17 16:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo

On 9/11/2021 5:35 AM, Sean Christopherson wrote:
> On Mon, Aug 09, 2021, Zeng Guang wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 927a552393b9..ee8c5664dc95 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2391,6 +2391,23 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>>   	return 0;
>>   }
>>   
>> +static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
>> +					 u32 msr, u64 *result)
>> +{
>> +	u64 vmx_msr;
>> +	u64 ctl = ctl_min | ctl_opt;
>> +
>> +	rdmsrl(msr, vmx_msr);
>> +	ctl &= vmx_msr; /* bit == 1 means it can be set */
>> +
>> +	/* Ensure minimum (required) set of control bits are supported. */
>> +	if (ctl_min & ~ctl)
>> +		return -EIO;
>> +
>> +	*result = ctl;
>> +	return 0;
>> +}
> More succinctly, since we don't need to force-set bits in the final value:
>
> 	u64 allowed1;
>
> 	rdmsrl(msr, allowed1);
>
> 	/* Ensure minimum (required) set of control bits are supported. */
> 	if (ctl_min & ~allowed1)
> 		return -EIO;
>
> 	*result = (ctl_min | ctl_opt) & allowed1;
> 	return 0;
Yes, it becomes more concise. I will change it . Thanks.
>>   static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>>   				    struct vmx_capability *vmx_cap)
>>   {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: enable IPI virtualization
  2021-09-17 16:10       ` Zeng Guang
@ 2021-10-14 18:56         ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-10-14 18:56 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On Sat, Sep 18, 2021, Zeng Guang wrote:
> On 9/11/2021 7:55 AM, Sean Christopherson wrote:
> > On Fri, Sep 10, 2021, Sean Christopherson wrote:
> > > On Mon, Aug 09, 2021, Zeng Guang wrote:
> > > > +		if (!pages)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
> > > > +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_ID;
> > > I don't see the point of pid_last_index if we're hardcoding it to KVM_MAX_VCPU_ID.
> > > If I understand the ucode pseudocode, there's no performance hit in the happy
> > > case, i.e. it only guards against out-of-bounds accesses.
> > > 
> > > And I wonder if we want to fail the build if this grows beyond an order-1
> > > allocation, e.g.
> > > 
> > > 		BUILD_BUG_ON(PID_TABLE_ORDER > 1);
> > > 
> > > Allocating two pages per VM isn't terrible, but 4+ starts to get painful when
> > > considering the fact that most VMs aren't going to need more than one page.  For
> > > now I agree the simplicity of not dynamically growing the table is worth burning
> > > a page.
> > Ugh, Paolo has queued a series which bumps KVM_MAX_VCPU_ID to 4096[*].  That makes
> > this an order-3 allocation, which is quite painful.  One thought would be to let
> > userspace declare the max vCPU it wants to create, not sure if that would work for
> > xAPIC though.
> > 
> > [*] https://lkml.kernel.org/r/1111efc8-b32f-bd50-2c0f-4c6f506b544b@redhat.com
> Thus we keep current design as no change.

Not necessarily.  I was pointing out that the current design is already problematic
from a memory allocation perspective.  Burning a few pages per vCPU isn't the end
of the world, but 32kb of _contiguous_ memory is rough, especially when 28kb is
unlikely to be used in many cases.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-10-14 18:56 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-09  3:29 [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang
2021-08-09  3:29 ` [PATCH v4 1/6] x86/feat_ctl: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
2021-09-10 21:25   ` Sean Christopherson
2021-09-17 16:10     ` Zeng Guang
2021-08-09  3:29 ` [PATCH v4 2/6] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
2021-09-10 21:28   ` Sean Christopherson
2021-09-17 16:13     ` Zeng Guang
2021-08-09  3:29 ` [PATCH v4 3/6] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
2021-09-10 21:35   ` Sean Christopherson
2021-09-17 16:15     ` Zeng Guang
2021-08-09  3:29 ` [PATCH v4 4/6] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
2021-08-09  3:29 ` [PATCH v4 5/6] KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM exit Zeng Guang
2021-09-10 22:28   ` Sean Christopherson
2021-09-17 16:00     ` Zeng Guang
2021-08-09  3:29 ` [PATCH v4 6/6] KVM: VMX: enable IPI virtualization Zeng Guang
2021-09-10 23:43   ` Sean Christopherson
2021-09-10 23:55     ` Sean Christopherson
2021-09-17 16:10       ` Zeng Guang
2021-10-14 18:56         ` Sean Christopherson
2021-09-17 16:00     ` Zeng Guang
2021-08-20 13:19 ` [PATCH v4 0/6] IPI virtualization support for VM Zeng Guang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).