linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/8] IPI virtualization support for VM
@ 2022-03-04  8:07 Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 1/8] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Currently, issuing an IPI except self-ipi in guest on Intel CPU
always causes a VM-exit. It can lead to non-negligible overhead
to some workloads involving frequent IPIs when running in VMs.

IPI virtualization is a new VT-x feature, targeting to eliminate
VM-exits on source vCPUs when issuing unicast, physical-addressing
IPIs. Once it is enabled, the processor virtualizes following kinds
of operations that send IPIs without causing VM-exits:
- Memory-mapped ICR writes
- MSR-mapped ICR writes
- SENDUIPI execution

This patch series implements IPI virtualization support in KVM.

Patches 1-4 add tertiary processor-based VM-execution support
framework, which is used to enumerate IPI virtualization.

Patch 5 handles APIC-write VM exit due to writes to ICR MSR when
guest works in x2APIC mode. This is a new case introduced by
Intel VT-x.

Patch 6 disallow the APIC ID change unconditionally.

Patch 7 provide userspace capability to set maximum possible VCPU
ID for current VM. IPIv can refer to this value to allocate memory
for PID-pointer table. 

Patch 8 implements IPI virtualization related function including
feature enabling through tertiary processor-based VM-execution in
various scenarios of VMCS configuration, PID table setup in vCPU
creation and vCPU block consideration.

Document for IPI virtualization is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".

Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

We did experiment to measure average time sending IPI from source vCPU
to the target vCPU completing the IPI handling by kvm unittest w/ and
w/o IPI virtualization. When IPI virtualization enabled, it will reduce
22.21% and 15.98% cycles consuming in xAPIC mode and x2APIC mode
respectively.

--------------------------------------
KVM unittest:vmexit/ipi

2 vCPU, AP was modified to run in idle loop instead of halt to ensure
no VM exit impact on target vCPU.

                Cycles of IPI
                xAPIC mode              x2APIC mode
        test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
        1       6106      4816          4265      3768
        2       6244      4656          4404      3546
        3       6165      4658          4233      3474
        4       5992      4710          4363      3430
        5       6083      4741          4215      3551
        6       6238      4904          4304      3547
        7       6164      4617          4263      3709
        8       5984      4763          4518      3779
        9       5931      4712          4645      3667
        10      5955      4530          4332      3724
        11      5897      4673          4283      3569
        12      6140      4794          4178      3598
        13      6183      4728          4363      3628
        14      5991      4994          4509      3842
        15      5866      4665          4520      3739
        16      6032      4654          4229      3701
        17      6050      4653          4185      3726
        18      6004      4792          4319      3746
        19      5961      4626          4196      3392
        20      6194      4576          4433      3760

Average cycles  6059      4713.1        4337.85   3644.8
%Reduction                -22.21%                 -15.98%

--------------------------------------
IPI microbenchmark:
(https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)

2 vCPUs, 1:1 pin vCPU to pCPU, guest VM runs with idle=poll, x2APIC mode

Result with IPIv enabled:

Dry-run:                         0,             272798 ns
Self-IPI:                  5094123,           11114037 ns
Normal IPI:              131697087,          173321200 ns
Broadcast IPI:                   0,          155649075 ns
Broadcast lock:                  0,          161518031 ns

Result with IPIv disabled:

Dry-run:                         0,             272766 ns
Self-IPI:                  5091788,           11123699 ns
Normal IPI:              145215772,          174558920 ns
Broadcast IPI:                   0,          175785384 ns
Broadcast lock:                  0,          149076195 ns


As IPIv can benefit unicast IPI to other CPU, Normal IPI test case gain
about 9.73% time saving on average out of 15 test runs when IPIv is
enabled.

Normal IPI statistics (unit:ns):
        test    w/o IPIv        w/ IPIv
        1       153346049       140907046
        2       147218648       141660618
        3       145215772       117890672
        4       146621682       136430470
        5       144821472       136199421
        6       144704378       131676928
        7       141403224       131697087
        8       144775766       125476250
        9       140658192       137263330
        10      144768626       138593127
        11      145166679       131946752
        12      145020451       116852889
        13      148161353       131406280
        14      148378655       130174353
        15      148903652       127969674

Average time    145944306.6     131742993.1 ns
%Reduction                      -9.73%

--------------------------------------
hackbench:

8 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 100000

                w/o IPIv        w/ IPIv
Time            91.887          74.605
%Reduction                      -18.808%

96 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 1000000

                w/o IPIv        w/ IPIv
Time            287.504         235.185
%Reduction                      -18.198%

--------------------------------------
v6->v7:
1. Revise kvm_apic_write_nodecode() on dealing with
   vICR busy bit in x2apic mode
2. Merge PID-table memory allocation with max_vcpu_id
   into IPIv enabling patch
3. Change to allocate PID-table, setup vCPU's PID-table
   entry and IPIv related VMCS fields once IPIv can
   be enabled, which support runtime enabling IPIv.

v5->v6:
1. Adapt kvm_apic_write_nodecode() implementation based
   on Sean's fix of x2apic's ICR register process.
2. Drop the patch handling IPIv table entry setting in
   case APIC ID changed, instead applying Levitsky's patch
   to disallow setting APIC ID in any case.
3. Drop the patch resizing the PID-pointer table on demand.
   Allow userspace to set maximum vcpu id at runtime that
   IPIv can refer to the practical value to allocate memory
   for PID-pointer table.

v4 -> v5:
1. Deal with enable_ipiv parameter following current
   vmcs configuration rule.
2. Allocate memory for PID-pointer table dynamically
3. Support guest runtime modify APIC ID in xAPIC mode
4. Helper to judge possibility to take PI block in IPIv case

v3 -> v4:
1. Refine code style of patch 2
2. Move tertiary control shadow build into patch 3
3. Make vmx_tertiary_exec_control to be static function

v2 -> v3:
1. Misc change on tertiary execution control
   definition and capability setup
2. Alternative to get tertiary execution
   control configuration

v1 -> v2:
1. Refine the IPIv enabling logic for VM.
   Remove ipiv_active definition per vCPU.

--------------------------------------

Chao Gao (1):
  KVM: VMX: enable IPI virtualization

Maxim Levitsky (1):
  KVM: x86: lapic: don't allow to change APIC ID unconditionally

Robert Hoo (4):
  x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
    variation
  KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well

Zeng Guang (2):
  KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  KVM: x86: Allow userspace set maximum VCPU id for VM

 arch/x86/include/asm/kvm_host.h    |   6 ++
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/include/asm/vmx.h         |  11 +++
 arch/x86/include/asm/vmxfeatures.h |   5 +-
 arch/x86/kernel/cpu/feat_ctl.c     |   9 +-
 arch/x86/kvm/lapic.c               |  47 ++++++++--
 arch/x86/kvm/vmx/capabilities.h    |  13 +++
 arch/x86/kvm/vmx/evmcs.c           |   2 +
 arch/x86/kvm/vmx/evmcs.h           |   1 +
 arch/x86/kvm/vmx/posted_intr.c     |  15 ++-
 arch/x86/kvm/vmx/vmcs.h            |   1 +
 arch/x86/kvm/vmx/vmx.c             | 141 +++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h             |  65 +++++++------
 arch/x86/kvm/x86.c                 |  11 +++
 14 files changed, 280 insertions(+), 48 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v7 1/8] x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

A new 64-bit control field "tertiary processor-based VM-execution
controls", is defined [1]. It's controlled by bit 17 of the primary
processor-based VM-execution controls.

Different from its brother VM-execution fields, this tertiary VM-
execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.

Its companion VMX capability reporting MSR,MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from its brothers, whose 64 bits
consist of all allow-1, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
Therefore, its init_vmx_capabilities() is a little different from others.

[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

[2] SDM Vol3. Appendix A.3

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/msr-index.h   | 1 +
 arch/x86/include/asm/vmxfeatures.h | 3 ++-
 arch/x86/kernel/cpu/feat_ctl.c     | 9 ++++++++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3faf0f97edb1..1d180f883c32 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -938,6 +938,7 @@
 #define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
+#define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
 
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_VMCS_SIZE_SHIFT	32
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..ff20776dc83b 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -5,7 +5,7 @@
 /*
  * Defines VMX CPU feature bits
  */
-#define NVMXINTS			3 /* N 32-bit words worth of info */
+#define NVMXINTS			5 /* N 32-bit words worth of info */
 
 /*
  * Note: If the comment begins with a quoted string, that string is used
@@ -43,6 +43,7 @@
 #define VMX_FEATURE_RDTSC_EXITING	( 1*32+ 12) /* "" VM-Exit on RDTSC */
 #define VMX_FEATURE_CR3_LOAD_EXITING	( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
 #define VMX_FEATURE_CR3_STORE_EXITING	( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
+#define VMX_FEATURE_TERTIARY_CONTROLS	( 1*32+ 17) /* "" Enable Tertiary VM-Execution Controls */
 #define VMX_FEATURE_CR8_LOAD_EXITING	( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
 #define VMX_FEATURE_CR8_STORE_EXITING	( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
 #define VMX_FEATURE_VIRTUAL_TPR		( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index da696eb4821a..993697e71854 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -15,6 +15,8 @@ enum vmx_feature_leafs {
 	MISC_FEATURES = 0,
 	PRIMARY_CTLS,
 	SECONDARY_CTLS,
+	TERTIARY_CTLS_LOW,
+	TERTIARY_CTLS_HIGH,
 	NR_VMX_FEATURE_WORDS,
 };
 
@@ -22,7 +24,7 @@ enum vmx_feature_leafs {
 
 static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 {
-	u32 supported, funcs, ept, vpid, ign;
+	u32 supported, funcs, ept, vpid, ign, low, high;
 
 	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
 
@@ -42,6 +44,11 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
 	c->vmx_capability[SECONDARY_CTLS] = supported;
 
+	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */
+	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
 	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
 	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 1/8] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-31 22:27   ` Sean Christopherson
  2022-03-04  8:07 ` [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

The Tertiary VM-Exec Control, different from previous control fields, is 64
bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
support both 32 bit and 64 bit fields' auxiliary functions building.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7f2c82e7f38f..e07c76974fb0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -456,35 +456,38 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }
 
-#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
-{									    \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
-		vmcs_write32(uname, val);				    \
-		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
-	}								    \
-}									    \
-static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
-{									    \
-	return vmcs->controls_shadow.lname;				    \
-}									    \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
-{									    \
-	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
-}									    \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
-}									    \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
+static inline								\
+void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
+		vmcs_write##bits(uname, val);				\
+		vmx->loaded_vmcs->controls_shadow.lname = val;		\
+	}								\
+}									\
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
+{									\
+	return vmcs->controls_shadow.lname;				\
+}									\
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
+{									\
+	return __##lname##_controls_get(vmx->loaded_vmcs);		\
+}									\
+static inline								\
+void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
+}									\
+static inline								\
+void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
 }
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 1/8] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-31 22:41   ` Sean Christopherson
  2022-03-04  8:07 ` [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Check VMX features on tertiary execution control in VMCS config setup.
Sub-features in tertiary execution control to be enabled are adjusted
according to hardware capabilities although no sub-feature is enabled
in this patch.

EVMCSv1 doesn't support tertiary VM-execution control, so disable it
when EVMCSv1 is in use. And define the auxiliary functions for Tertiary
control field here, using the new BUILD_CONTROLS_SHADOW().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h      |  3 +++
 arch/x86/kvm/vmx/capabilities.h |  7 ++++++
 arch/x86/kvm/vmx/evmcs.c        |  2 ++
 arch/x86/kvm/vmx/evmcs.h        |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 38 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h          |  1 +
 7 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..8c929596a299 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -31,6 +31,7 @@
 #define CPU_BASED_RDTSC_EXITING                 VMCS_CONTROL_BIT(RDTSC_EXITING)
 #define CPU_BASED_CR3_LOAD_EXITING		VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
 #define CPU_BASED_CR3_STORE_EXITING		VMCS_CONTROL_BIT(CR3_STORE_EXITING)
+#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS	VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
 #define CPU_BASED_CR8_LOAD_EXITING              VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
 #define CPU_BASED_CR8_STORE_EXITING             VMCS_CONTROL_BIT(CR8_STORE_EXITING)
 #define CPU_BASED_TPR_SHADOW                    VMCS_CONTROL_BIT(VIRTUAL_TPR)
@@ -221,6 +222,8 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
+	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..31f3d88b3e4d 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,6 +59,7 @@ struct vmcs_config {
 	u32 pin_based_exec_ctrl;
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
+	u64 cpu_based_3rd_exec_ctrl;
 	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
 	struct nested_vmx_msrs nested;
@@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 }
 
+static inline bool cpu_has_tertiary_exec_ctrls(void)
+{
+	return vmcs_config.cpu_based_exec_ctrl &
+		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
index 87e3dc10edf4..6a61b1ae7942 100644
--- a/arch/x86/kvm/vmx/evmcs.c
+++ b/arch/x86/kvm/vmx/evmcs.c
@@ -297,8 +297,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
 #if IS_ENABLED(CONFIG_HYPERV)
 __init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
 {
+	vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
 	vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
 	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
 
 	vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
 	vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 8d70f9aea94b..f886a8ff0342 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
  */
 #define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
 				    PIN_BASED_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
 #define EVMCS1_UNSUPPORTED_2NDEXEC					\
 	(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |				\
 	 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |			\
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a816..e18dc68eeeeb 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -50,6 +50,7 @@ struct vmcs_controls_shadow {
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
+	u64 tertiary_exec;
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c569dc2b9192..8a5713d49635 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2422,6 +2422,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
 	return 0;
 }
 
+static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
+					 u32 msr, u64 *result)
+{
+	u64 allowed1;
+
+	rdmsrl(msr, allowed1);
+
+	/* Ensure minimum (required) set of control bits are supported. */
+	if (ctl_min & ~allowed1)
+		return -EIO;
+
+	*result = (ctl_min | ctl_opt) & allowed1;
+	return 0;
+}
+
 static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				    struct vmx_capability *vmx_cap)
 {
@@ -2430,6 +2445,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _pin_based_exec_control = 0;
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
+	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
@@ -2451,7 +2467,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
-	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
+	      CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
 				&_cpu_based_exec_control) < 0)
 		return -EIO;
@@ -2525,6 +2542,16 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				"1-setting enable VPID VM-execution control\n");
 	}
 
+	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
+		u64 opt3 = 0;
+		u64 min3 = 0;
+
+		if (adjust_vmx_controls_64(min3, opt3,
+					   MSR_IA32_VMX_PROCBASED_CTLS3,
+					   &_cpu_based_3rd_exec_control))
+			return -EIO;
+	}
+
 	min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -2611,6 +2638,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
 	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
 
@@ -4230,6 +4258,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl;
+}
+
 /*
  * Adjust a single secondary execution control bit to intercept/allow an
  * instruction in the guest.  This is usually done based on whether or not a
@@ -4395,6 +4428,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
 
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
+
 	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e07c76974fb0..d4a647d3ed4a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -488,6 +488,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (2 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-31 22:46   ` Sean Christopherson
  2022-03-04  8:07 ` [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Add tertiary_exec_control field report in dump_vmcs()

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8a5713d49635..7beba7a9f247 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5891,6 +5891,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
+	u64 tertiary_exec_control;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -5904,9 +5905,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
 	cr4 = vmcs_readl(GUEST_CR4);
-	secondary_exec_control = 0;
+
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	else
+		secondary_exec_control = 0;
+
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
+	else
+		tertiary_exec_control = 0;
 
 	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
 	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
@@ -6006,9 +6014,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
 
 	pr_err("*** Control State ***\n");
-	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
-	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
-	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
+	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
+	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
+	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
+	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (3 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-31 23:07   ` Sean Christopherson
  2022-03-04  8:07 ` [PATCH v7 6/8] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
were previously impossible.

Note, x2APIC MSR writes are 64 bits wide.

Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 629c116b0d3e..22929b5b3f9b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
 #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
 
 static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
 {
@@ -2227,10 +2228,25 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 /* emulate APIC access in a trap manner */
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 {
-	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 val;
+
+	if (apic_x2apic_mode(apic)) {
+		/*
+		 * When guest APIC is in x2APIC mode and IPI virtualization
+		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
+		 * on Intel hardware. Other offsets are not possible.
+		 */
+		if (WARN_ON_ONCE(offset != APIC_ICR))
+			return;
 
-	/* TODO: optimize to just emulate side effect w/o one more write */
-	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
+		kvm_lapic_msr_read(apic, offset, &val);
+		kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
+	} else {
+		val = kvm_lapic_get_reg(apic, offset);
+		/* TODO: optimize to just emulate side effect w/o one more write */
+		kvm_lapic_reg_write(apic, offset, (u32)val);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 6/8] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (4 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-03-04  8:07 ` [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Maxim Levitsky

From: Maxim Levitsky <mlevitsk@redhat.com>

No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into APIC acceleration code.

And Intel recent hardware just ignores writes to APIC_ID in
xAPIC mode. More background can be found at:
https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/

Looks there is no much value to support writable xAPIC ID in
guest except supporting some old and crazy use cases which
probably would fail on real hardware. So, make xAPIC ID
read-only for KVM guests.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 22929b5b3f9b..76b50c77527b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
-			kvm_apic_set_xapic_id(apic, val >> 24);
-		else
+		if (apic_x2apic_mode(apic)) {
 			ret = 1;
+			break;
+		}
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((val >> 24) != apic->vcpu->vcpu_id) {
+			kvm_vm_bugged(apic->vcpu->kvm);
+			break;
+		}
+
+		kvm_apic_set_xapic_id(apic, val >> 24);
 		break;
 
 	case APIC_TASKPRI:
@@ -2628,11 +2635,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		struct kvm_lapic_state *s, bool set)
 {
-	if (apic_x2apic_mode(vcpu->arch.apic)) {
-		u32 *id = (u32 *)(s->regs + APIC_ID);
-		u32 *ldr = (u32 *)(s->regs + APIC_LDR);
-		u64 icr;
+	u32 *id = (u32 *)(s->regs + APIC_ID);
+	u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+	u64 icr;
 
+	if (!apic_x2apic_mode(vcpu->arch.apic)) {
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((*id >> 24) != vcpu->vcpu_id)
+			return -EINVAL;
+	} else {
 		if (vcpu->kvm->arch.x2apic_format) {
 			if (*id != vcpu->vcpu_id)
 				return -EINVAL;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (5 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 6/8] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-04-01  2:01   ` Sean Christopherson
  2022-03-04  8:07 ` [PATCH v7 8/8] KVM: VMX: enable IPI virtualization Zeng Guang
  2022-03-18  8:15 ` [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Introduce new max_vcpu_id in KVM for x86 architecture. Userspace
can assign maximum possible vcpu id for current VM session using
KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for PID-pointer table, a structure needed to
enable VMX IPI.

By default, max_vcpu_id set as KVM_MAX_VCPU_IDS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++++
 arch/x86/kvm/x86.c              | 11 +++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6dcccb304775..db16aebd946c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1233,6 +1233,12 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+	/*
+	 * VM-scope maximum vCPU ID. Used to determine the size of structures
+	 * that increase along with the maximum vCPU ID, in which case, using
+	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
+	 */
+	u32 max_vcpu_id;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f6fe9974cb5..ca17cc452bd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5994,6 +5994,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exit_on_emulation_error = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_MAX_VCPU_ID:
+		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
+			kvm->arch.max_vcpu_id = cap->args[0];
+			r = 0;
+		} else
+			r = -E2BIG;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -11067,6 +11074,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	struct page *page;
 	int r;
 
+	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_id)
+		return -E2BIG;
+
 	vcpu->arch.last_vmentry_cpu = -1;
 	vcpu->arch.regs_avail = ~0;
 	vcpu->arch.regs_dirty = ~0;
@@ -11589,6 +11599,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
+	kvm->arch.max_vcpu_id = KVM_MAX_VCPU_IDS;
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (6 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-03-04  8:07 ` Zeng Guang
  2022-04-01  2:37   ` Sean Christopherson
  2022-03-18  8:15 ` [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
  8 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-03-04  8:07 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

From: Chao Gao <chao.gao@intel.com>

With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in target vCPU's PIR and may send a
notification (IPI) specified by NDST and NV fields in target vCPU's
Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
engine does when dealing with posted interrupt from devices.

A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
ID assigned for current VM session from userspace. Allocating memory
for PID-pointer table is deferred to vCPU creation, because irqchip
mode and VM-scope maximum APIC ID is settled at that point. KVM can
skip PID-pointer table allocation if !irqchip_in_kernel().

Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.

Note that IPI virtualization can only virualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
trap-like APIC-write VM-exit and need to be handled by VMM.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h         |  8 +++
 arch/x86/include/asm/vmxfeatures.h |  2 +
 arch/x86/kvm/vmx/capabilities.h    |  6 ++
 arch/x86/kvm/vmx/posted_intr.c     | 15 ++++-
 arch/x86/kvm/vmx/vmx.c             | 90 ++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h             |  5 ++
 6 files changed, 121 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8c929596a299..b79b6438acaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -76,6 +76,11 @@
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
 
+/*
+ * Definitions of Tertiary Processor-Based VM-Execution Controls.
+ */
+#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
+
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
 #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
@@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
 	POSTED_INTR_NV                  = 0x00000002,
+	LAST_PID_POINTER_INDEX		= 0x00000008,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -224,6 +230,8 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	PID_POINTER_TABLE		= 0x00002042,
+	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index ff20776dc83b..7ce616af2db2 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -86,4 +86,6 @@
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
 
+/* Tertiary Processor-Based VM-Execution Controls, word 3 */
+#define VMX_FEATURE_IPI_VIRT		(3*32 +  4) /* "" Enable IPI virtualization */
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 31f3d88b3e4d..5f656c9e33be 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
+extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
@@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
 		cpu_has_vmx_posted_intr();
 }
 
+static inline bool cpu_has_vmx_ipiv(void)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index aa1fe9085d77..0882115a9b7a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -177,11 +177,24 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	local_irq_restore(flags);
 }
 
+static bool vmx_can_use_pi_wakeup(struct kvm *kvm)
+{
+	/*
+	 * If a blocked vCPU can be the target of posted interrupts,
+	 * switching notification vector is needed so that kernel can
+	 * be informed when an interrupt is posted and get the chance
+	 * to wake up the blocked vCPU. For now, using posted interrupt
+	 * for vCPU wakeup when IPI virtualization or VT-d PI can be
+	 * enabled.
+	 */
+	return vmx_can_use_ipiv(kvm) || vmx_can_use_vtd_pi(kvm);
+}
+
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!vmx_can_use_vtd_pi(vcpu->kvm))
+	if (!vmx_can_use_pi_wakeup(vcpu->kvm))
 		return;
 
 	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7beba7a9f247..121d4f0b35b9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
 
 module_param(enable_apicv, bool, S_IRUGO);
 
+bool __read_mostly enable_ipiv = true;
+module_param(enable_ipiv, bool, 0444);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -227,6 +230,8 @@ static const struct {
 };
 
 #define L1D_CACHE_ORDER 4
+#define PID_TABLE_ENTRY_VALID 1
+
 static void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
@@ -2543,7 +2548,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	}
 
 	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
-		u64 opt3 = 0;
+		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
 		u64 min3 = 0;
 
 		if (adjust_vmx_controls_64(min3, opt3,
@@ -3898,6 +3903,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
 		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+		if (enable_ipiv)
+			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),MSR_TYPE_RW);
 	}
 }
 
@@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
 	if (cpu_has_secondary_exec_ctrls()) {
-		if (kvm_vcpu_apicv_active(vcpu))
+		if (kvm_vcpu_apicv_active(vcpu)) {
 			secondary_exec_controls_setbit(vmx,
 				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
 				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		else
+			if (enable_ipiv)
+				tertiary_exec_controls_setbit(vmx,
+						TERTIARY_EXEC_IPI_VIRT);
+		} else {
 			secondary_exec_controls_clearbit(vmx,
 					SECONDARY_EXEC_APIC_REGISTER_VIRT |
 					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+			if (enable_ipiv)
+				tertiary_exec_controls_clearbit(vmx,
+						TERTIARY_EXEC_IPI_VIRT);
+		}
 	}
 
 	vmx_update_msr_bitmap_x2apic(vcpu);
@@ -4260,7 +4274,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 
 static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
 {
-	return vmcs_config.cpu_based_3rd_exec_ctrl;
+	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
+
+	/*
+	 * IPI virtualization relies on APICv. Disable IPI
+	 * virtualization if APICv is inhibited.
+	 */
+	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
+		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
+
+	return exec_control;
 }
 
 /*
@@ -4408,6 +4431,29 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
+{
+	struct page *pages;
+
+	if(kvm_vmx->pid_table)
+		return 0;
+
+	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));
+
+	if (!pages)
+		return -ENOMEM;
+
+	kvm_vmx->pid_table = (void *)page_address(pages);
+	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
+	return 0;
+}
+
+bool vmx_can_use_ipiv(struct kvm *kvm)
+{
+	return irqchip_in_kernel(kvm) && enable_ipiv;
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 
 static void init_vmcs(struct vcpu_vmx *vmx)
@@ -4443,6 +4489,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
 	}
 
+	if (vmx_can_use_ipiv(vmx->vcpu.kvm)) {
+		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
+
+		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+	}
+
 	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmx->ple_window = ple_window;
@@ -7123,6 +7176,22 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 			goto free_vmcs;
 	}
 
+	/*
+	 * Allocate PID-table and program this vCPU's PID-table
+	 * entry if IPI virtualization can be enabled.
+	 */
+	if (vmx_can_use_ipiv(vcpu->kvm)) {
+		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
+		mutex_lock(&vcpu->kvm->lock);
+		err = vmx_alloc_pid_table(kvm_vmx);
+		mutex_unlock(&vcpu->kvm->lock);
+		if (err)
+			goto free_vmcs;
+		WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
+			__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+	}
+
 	return 0;
 
 free_vmcs:
@@ -7756,6 +7825,15 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+	if (kvm_vmx->pid_table)
+		free_pages((unsigned long)kvm_vmx->pid_table,
+			get_order((kvm_vmx->pid_last_index + 1) * sizeof(u64)));
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -7768,6 +7846,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
+	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
@@ -8022,6 +8101,9 @@ static __init int hardware_setup(void)
 	if (!enable_apicv)
 		vmx_x86_ops.sync_pir_to_irr = NULL;
 
+	if (!enable_apicv || !cpu_has_vmx_ipiv())
+		enable_ipiv = false;
+
 	if (cpu_has_vmx_tsc_scaling()) {
 		kvm_has_tsc_control = true;
 		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d4a647d3ed4a..5b65930a750e 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -365,6 +365,9 @@ struct kvm_vmx {
 	unsigned int tss_addr;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	/* PID table for IPI virtualization */
+	u64 *pid_table;
+	u16 pid_last_index;
 };
 
 bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
@@ -584,4 +587,6 @@ static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
 	return (vmx_instr_info >> 28) & 0xf;
 }
 
+bool vmx_can_use_ipiv(struct kvm *kvm);
+
 #endif /* __KVM_X86_VMX_H */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 0/8] IPI virtualization support for VM
  2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
                   ` (7 preceding siblings ...)
  2022-03-04  8:07 ` [PATCH v7 8/8] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2022-03-18  8:15 ` Zeng Guang
  8 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-03-18  8:15 UTC (permalink / raw)
  To: Paolo Bonzini, Christopherson,,
	Sean, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm, Dave Hansen, Luck, Tony, Kan Liang, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Kim Phillips,
	Jarkko Sakkinen, Jethro Beekman, Huang, Kai
  Cc: x86, linux-kernel, Hu, Robert, Gao, Chao

On 3/4/2022 4:07 PM, Zeng, Guang wrote:
> Currently, issuing an IPI except self-ipi in guest on Intel CPU
> always causes a VM-exit. It can lead to non-negligible overhead
> to some workloads involving frequent IPIs when running in VMs.
>
> IPI virtualization is a new VT-x feature, targeting to eliminate
> VM-exits on source vCPUs when issuing unicast, physical-addressing
> IPIs. Once it is enabled, the processor virtualizes following kinds
> of operations that send IPIs without causing VM-exits:
> - Memory-mapped ICR writes
> - MSR-mapped ICR writes
> - SENDUIPI execution
>
> This patch series implements IPI virtualization support in KVM.
>
> Patches 1-4 add tertiary processor-based VM-execution support
> framework, which is used to enumerate IPI virtualization.
>
> Patch 5 handles APIC-write VM exit due to writes to ICR MSR when
> guest works in x2APIC mode. This is a new case introduced by
> Intel VT-x.
>
> Patch 6 disallow the APIC ID change unconditionally.
>
> Patch 7 provide userspace capability to set maximum possible VCPU
> ID for current VM. IPIv can refer to this value to allocate memory
> for PID-pointer table.
>
> Patch 8 implements IPI virtualization related function including
> feature enabling through tertiary processor-based VM-execution in
> various scenarios of VMCS configuration, PID table setup in vCPU
> creation and vCPU block consideration.
>
> Document for IPI virtualization is now available at the latest "Intel
> Architecture Instruction Set Extensions Programming Reference".
>
> Document Link:
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> We did experiment to measure average time sending IPI from source vCPU
> to the target vCPU completing the IPI handling by kvm unittest w/ and
> w/o IPI virtualization. When IPI virtualization enabled, it will reduce
> 22.21% and 15.98% cycles consuming in xAPIC mode and x2APIC mode
> respectively.
>
> --------------------------------------
> KVM unittest:vmexit/ipi
>
> 2 vCPU, AP was modified to run in idle loop instead of halt to ensure
> no VM exit impact on target vCPU.
>
>                  Cycles of IPI
>                  xAPIC mode              x2APIC mode
>          test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
>          1       6106      4816          4265      3768
>          2       6244      4656          4404      3546
>          3       6165      4658          4233      3474
>          4       5992      4710          4363      3430
>          5       6083      4741          4215      3551
>          6       6238      4904          4304      3547
>          7       6164      4617          4263      3709
>          8       5984      4763          4518      3779
>          9       5931      4712          4645      3667
>          10      5955      4530          4332      3724
>          11      5897      4673          4283      3569
>          12      6140      4794          4178      3598
>          13      6183      4728          4363      3628
>          14      5991      4994          4509      3842
>          15      5866      4665          4520      3739
>          16      6032      4654          4229      3701
>          17      6050      4653          4185      3726
>          18      6004      4792          4319      3746
>          19      5961      4626          4196      3392
>          20      6194      4576          4433      3760
>
> Average cycles  6059      4713.1        4337.85   3644.8
> %Reduction                -22.21%                 -15.98%
>
> --------------------------------------
> IPI microbenchmark:
> (https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)
>
> 2 vCPUs, 1:1 pin vCPU to pCPU, guest VM runs with idle=poll, x2APIC mode
>
> Result with IPIv enabled:
>
> Dry-run:                         0,             272798 ns
> Self-IPI:                  5094123,           11114037 ns
> Normal IPI:              131697087,          173321200 ns
> Broadcast IPI:                   0,          155649075 ns
> Broadcast lock:                  0,          161518031 ns
>
> Result with IPIv disabled:
>
> Dry-run:                         0,             272766 ns
> Self-IPI:                  5091788,           11123699 ns
> Normal IPI:              145215772,          174558920 ns
> Broadcast IPI:                   0,          175785384 ns
> Broadcast lock:                  0,          149076195 ns
>
>
> As IPIv can benefit unicast IPI to other CPU, Normal IPI test case gain
> about 9.73% time saving on average out of 15 test runs when IPIv is
> enabled.
>
> Normal IPI statistics (unit:ns):
>          test    w/o IPIv        w/ IPIv
>          1       153346049       140907046
>          2       147218648       141660618
>          3       145215772       117890672
>          4       146621682       136430470
>          5       144821472       136199421
>          6       144704378       131676928
>          7       141403224       131697087
>          8       144775766       125476250
>          9       140658192       137263330
>          10      144768626       138593127
>          11      145166679       131946752
>          12      145020451       116852889
>          13      148161353       131406280
>          14      148378655       130174353
>          15      148903652       127969674
>
> Average time    145944306.6     131742993.1 ns
> %Reduction                      -9.73%
>
> --------------------------------------
> hackbench:
>
> 8 vCPUs, guest VM free run, x2APIC mode
> ./hackbench -p -l 100000
>
>                  w/o IPIv        w/ IPIv
> Time            91.887          74.605
> %Reduction                      -18.808%
>
> 96 vCPUs, guest VM free run, x2APIC mode
> ./hackbench -p -l 1000000
>
>                  w/o IPIv        w/ IPIv
> Time            287.504         235.185
> %Reduction                      -18.198%
>
> --------------------------------------
> v6->v7:
> 1. Revise kvm_apic_write_nodecode() on dealing with
>     vICR busy bit in x2apic mode
> 2. Merge PID-table memory allocation with max_vcpu_id
>     into IPIv enabling patch
> 3. Change to allocate PID-table, setup vCPU's PID-table
>     entry and IPIv related VMCS fields once IPIv can
>     be enabled, which support runtime enabling IPIv.
>
> v5->v6:
> 1. Adapt kvm_apic_write_nodecode() implementation based
>     on Sean's fix of x2apic's ICR register process.
> 2. Drop the patch handling IPIv table entry setting in
>     case APIC ID changed, instead applying Levitsky's patch
>     to disallow setting APIC ID in any case.
> 3. Drop the patch resizing the PID-pointer table on demand.
>     Allow userspace to set maximum vcpu id at runtime that
>     IPIv can refer to the practical value to allocate memory
>     for PID-pointer table.
>
> v4 -> v5:
> 1. Deal with enable_ipiv parameter following current
>     vmcs configuration rule.
> 2. Allocate memory for PID-pointer table dynamically
> 3. Support guest runtime modify APIC ID in xAPIC mode
> 4. Helper to judge possibility to take PI block in IPIv case
>
> v3 -> v4:
> 1. Refine code style of patch 2
> 2. Move tertiary control shadow build into patch 3
> 3. Make vmx_tertiary_exec_control to be static function
>
> v2 -> v3:
> 1. Misc change on tertiary execution control
>     definition and capability setup
> 2. Alternative to get tertiary execution
>     control configuration
>
> v1 -> v2:
> 1. Refine the IPIv enabling logic for VM.
>     Remove ipiv_active definition per vCPU.
>
> --------------------------------------
>
> Chao Gao (1):
>    KVM: VMX: enable IPI virtualization
>
> Maxim Levitsky (1):
>    KVM: x86: lapic: don't allow to change APIC ID unconditionally
>
> Robert Hoo (4):
>    x86/cpu: Add new VMX feature, Tertiary VM-Execution control
>    KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
>      variation
>    KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
>    KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
>
> Zeng Guang (2):
>    KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
>    KVM: x86: Allow userspace set maximum VCPU id for VM
>
>   arch/x86/include/asm/kvm_host.h    |   6 ++
>   arch/x86/include/asm/msr-index.h   |   1 +
>   arch/x86/include/asm/vmx.h         |  11 +++
>   arch/x86/include/asm/vmxfeatures.h |   5 +-
>   arch/x86/kernel/cpu/feat_ctl.c     |   9 +-
>   arch/x86/kvm/lapic.c               |  47 ++++++++--
>   arch/x86/kvm/vmx/capabilities.h    |  13 +++
>   arch/x86/kvm/vmx/evmcs.c           |   2 +
>   arch/x86/kvm/vmx/evmcs.h           |   1 +
>   arch/x86/kvm/vmx/posted_intr.c     |  15 ++-
>   arch/x86/kvm/vmx/vmcs.h            |   1 +
>   arch/x86/kvm/vmx/vmx.c             | 141 +++++++++++++++++++++++++++--
>   arch/x86/kvm/vmx/vmx.h             |  65 +++++++------
>   arch/x86/kvm/x86.c                 |  11 +++
>   14 files changed, 280 insertions(+), 48 deletions(-)
Kindly ping ! :)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2022-03-04  8:07 ` [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2022-03-31 22:27   ` Sean Christopherson
  2022-04-02 12:47     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-03-31 22:27 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, Mar 04, 2022, Zeng Guang wrote:
> +#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
> +static inline								\
> +void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
> +		vmcs_write##bits(uname, val);				\
> +		vmx->loaded_vmcs->controls_shadow.lname = val;		\
> +	}								\
> +}									\
> +static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
> +{									\
> +	return vmcs->controls_shadow.lname;				\
> +}									\
> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
> +{									\
> +	return __##lname##_controls_get(vmx->loaded_vmcs);		\
> +}									\
> +static inline								\

Drop the newline, there's no need to split this across two lines.  Aligning the
backslashes will mean they all poke past the 80 char soft limit, but that's totally
ok.  The whole point of the line limit is to improve readability, and a trivial
runover is much less painful than a split function declaration.  As a bonus, all
the backslashes are aligned, have leading whitespace, and still land on a tab stop :-)

#define BUILD_CONTROLS_SHADOW(lname, uname, bits)				\
static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)	\
{										\
	if (vmx->loaded_vmcs->controls_shadow.lname != val) {			\
		vmcs_write##bits(uname, val);					\
		vmx->loaded_vmcs->controls_shadow.lname = val;			\
	}									\
}										\
static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)	\
{										\
	return vmcs->controls_shadow.lname;					\
}										\
static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)		\
{										\
	return __##lname##_controls_get(vmx->loaded_vmcs);			\
}										\
static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)	\
{										\
	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);		\
}										\
static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
{										\
	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);		\
}

With that fixed,

Reviewed-by: Sean Christopherson <seanjc@google.com>

> +void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
> +}									\
> +static inline								\
> +void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
>  }
> -BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
> -BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
> -BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
> +BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
>  
>  /*
>   * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2022-03-04  8:07 ` [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2022-03-31 22:41   ` Sean Christopherson
  2022-04-02 12:58     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-03-31 22:41 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, Mar 04, 2022, Zeng Guang wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c569dc2b9192..8a5713d49635 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2422,6 +2422,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>  	return 0;
>  }
>  
> +static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,

I slightly prefer controls64 over controls_64.  As usual, KVM is inconsistent as
a whole, but vmcs_read/write64 omit the underscore, so we can at least be somewhat
consistent within VMX.

> +					 u32 msr, u64 *result)
> +{
> +	u64 allowed1;
> +
> +	rdmsrl(msr, allowed1);
> +
> +	/* Ensure minimum (required) set of control bits are supported. */
> +	if (ctl_min & ~allowed1)

Eh, just drop @ctl_min.  Practically speaking, there is zero chance tertiary
controls or any other control of this nature will ever be mandatory.  Secondary
controls would fall into the same boat, but specifying min=0 allows it to share
helpers, so it's the lesser of evils.

With the error return gone, this can be

  static __init u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr)
  {
	u64 allowed;

	rdmsrl(msr, allowed);

	return ctl_opt & allowed;
  }

Alternatively, we could take the control-to-modify directly and have no return,
but I like having the "u64 opt = ..." in the caller.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
  2022-03-04  8:07 ` [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
@ 2022-03-31 22:46   ` Sean Christopherson
  2022-04-02 13:09     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-03-31 22:46 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Robert Hoo

Nit, shortlog is funky, it'd read better as

  KVM: VMX: Report tertiary_exec_control field in dump_vmcs()

On Fri, Mar 04, 2022, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> Add tertiary_exec_control field report in dump_vmcs()

Please call out the shuffling of PinBased and provide a sample dump.  It's not
mandatory to put that sort of info in the changelog, but it really does help
reviewers, e.g. I remember discussing the shuffling and seeing the sample output,
but other reviewers coming into this blind won't have that luxury.

> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8a5713d49635..7beba7a9f247 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5891,6 +5891,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 vmentry_ctl, vmexit_ctl;
>  	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> +	u64 tertiary_exec_control;
>  	unsigned long cr4;
>  	int efer_slot;
>  
> @@ -5904,9 +5905,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
>  	cr4 = vmcs_readl(GUEST_CR4);
> -	secondary_exec_control = 0;
> +
>  	if (cpu_has_secondary_exec_ctrls())
>  		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	else
> +		secondary_exec_control = 0;
> +
> +	if (cpu_has_tertiary_exec_ctrls())
> +		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
> +	else
> +		tertiary_exec_control = 0;
>  
>  	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
>  	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
> @@ -6006,9 +6014,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
>  
>  	pr_err("*** Control State ***\n");
> -	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
> -	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
> -	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
> +	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
> +	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
> +	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
> +	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
>  	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
>  	       vmcs_read32(EXCEPTION_BITMAP),
>  	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-03-04  8:07 ` [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
@ 2022-03-31 23:07   ` Sean Christopherson
  2022-04-02 13:33     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-03-31 23:07 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Fri, Mar 04, 2022, Zeng Guang wrote:
> Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
> i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
> the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
> were previously impossible.
> 
> Note, x2APIC MSR writes are 64 bits wide.
> 
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 629c116b0d3e..22929b5b3f9b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
>  #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
>  /* step-by-step approximation to mitigate fluctuation */
>  #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
> +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
>  
>  static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
>  {
> @@ -2227,10 +2228,25 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>  /* emulate APIC access in a trap manner */
>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>  {
> -	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 val;
> +
> +	if (apic_x2apic_mode(apic)) {
> +		/*
> +		 * When guest APIC is in x2APIC mode and IPI virtualization
> +		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
> +		 * on Intel hardware. Other offsets are not possible.
> +		 */
> +		if (WARN_ON_ONCE(offset != APIC_ICR))
> +			return;
>  
> -	/* TODO: optimize to just emulate side effect w/o one more write */
> -	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
> +		kvm_lapic_msr_read(apic, offset, &val);
> +		kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));

This needs to clear the APIC_ICR_BUSY bit.  It'd also be nice to trace this write.
The easiest thing is to use kvm_x2apic_icr_write().  Kinda silly as it'll generate
an extra write, but on the plus side the TODO comment doesn't have to move :-D

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c4c3155d98db..58bf296ee313 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2230,6 +2230,7 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
        struct kvm_lapic *apic = vcpu->arch.apic;
        u64 val;

+       /* TODO: optimize to just emulate side effect w/o one more write */
        if (apic_x2apic_mode(apic)) {
                /*
                 * When guest APIC is in x2APIC mode and IPI virtualization
@@ -2240,10 +2241,9 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
                        return;

                kvm_lapic_msr_read(apic, offset, &val);
-               kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
+               kvm_x2apic_icr_write(apic, val);
        } else {
                val = kvm_lapic_get_reg(apic, offset);
-               /* TODO: optimize to just emulate side effect w/o one more write */
                kvm_lapic_reg_write(apic, offset, (u32)val);
        }
 }


> +	} else {
> +		val = kvm_lapic_get_reg(apic, offset);
> +		/* TODO: optimize to just emulate side effect w/o one more write */
> +		kvm_lapic_reg_write(apic, offset, (u32)val);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
>  
> -- 
> 2.27.0
> 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-03-04  8:07 ` [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-04-01  2:01   ` Sean Christopherson
  2022-04-03 10:17     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-01  2:01 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Fri, Mar 04, 2022, Zeng Guang wrote:
> Introduce new max_vcpu_id in KVM for x86 architecture. Userspace
> can assign maximum possible vcpu id for current VM session using
> KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().
> 
> This is done for x86 only because the sole use case is to guide
> memory allocation for PID-pointer table, a structure needed to
> enable VMX IPI.
> 
> By default, max_vcpu_id set as KVM_MAX_VCPU_IDS.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  6 ++++++
>  arch/x86/kvm/x86.c              | 11 +++++++++++

The new behavior needs to be documented in api.rst.

>  2 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6dcccb304775..db16aebd946c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1233,6 +1233,12 @@ struct kvm_arch {
>  	hpa_t	hv_root_tdp;
>  	spinlock_t hv_root_tdp_lock;
>  #endif
> +	/*
> +	 * VM-scope maximum vCPU ID. Used to determine the size of structures
> +	 * that increase along with the maximum vCPU ID, in which case, using
> +	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
> +	 */
> +	u32 max_vcpu_id;

This should be max_vcpu_ids.  I agree the it _should_ be max_vcpu_id, but KVM's API
for this is awful and we're stuck with the plural name.

>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f6fe9974cb5..ca17cc452bd3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5994,6 +5994,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		kvm->arch.exit_on_emulation_error = cap->args[0];
>  		r = 0;
>  		break;
> +	case KVM_CAP_MAX_VCPU_ID:

I think it makes sense to change kvm_vm_ioctl_check_extension() to return the
current max, it is a VM-scoped ioctl after all.

Amusingly, I think we also need a capability to enumerate that KVM_CAP_MAX_VCPU_ID
is writable.  

> +		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
> +			kvm->arch.max_vcpu_id = cap->args[0];

This needs to be rejected if kvm->created_vcpus > 0, and that check needs to be
done under kvm_lock, otherwise userspace can bump the max ID after KVM allocates
per-VM structures and trigger buffer overflow.

> +			r = 0;
> +		} else

If-elif-else statements need curly braces for all paths if any path needs braces.
Probably a moot point for this patch due to the above changes.

> +			r = -E2BIG;

This should be -EINVAL, not -E2BIG.

E.g.

	case KVM_CAP_MAX_VCPU_ID:
		r = -EINVAL;
		if (cap->args[0] > KVM_MAX_VCPU_IDS)
			break;

		mutex_lock(&kvm->lock);
		if (!kvm->created_vcpus) {
			kvm->arch.max_vcpu_id = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;


> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> @@ -11067,6 +11074,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	struct page *page;
>  	int r;
>  
> +	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_id)
> +		return -E2BIG;

Same here, it should be -EINVAL.

> +
>  	vcpu->arch.last_vmentry_cpu = -1;
>  	vcpu->arch.regs_avail = ~0;
>  	vcpu->arch.regs_dirty = ~0;
> @@ -11589,6 +11599,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
>  	kvm->arch.hv_root_tdp = INVALID_PAGE;
>  #endif
> +	kvm->arch.max_vcpu_id = KVM_MAX_VCPU_IDS;
>  
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-03-04  8:07 ` [PATCH v7 8/8] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2022-04-01  2:37   ` Sean Christopherson
  2022-04-03 14:38     ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-01  2:37 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Fri, Mar 04, 2022, Zeng Guang wrote:
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index ff20776dc83b..7ce616af2db2 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -86,4 +86,6 @@
>  #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
>  #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
>  
> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
> +#define VMX_FEATURE_IPI_VIRT		(3*32 +  4) /* "" Enable IPI virtualization */

Please follow the existing (weird) spacing and style.  And this should definitely
be enumerated to userspace, it's one of the more interesting VMX features, i.e. drop
the "".

#define VMX_FEATURE_IPI_VIRT		( 3*32+  4) /* Enable IPI virtualization */

>  #endif /* _ASM_X86_VMXFEATURES_H */
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 31f3d88b3e4d..5f656c9e33be 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
>  extern bool __read_mostly enable_unrestricted_guest;
>  extern bool __read_mostly enable_ept_ad_bits;
>  extern bool __read_mostly enable_pml;
> +extern bool __read_mostly enable_ipiv;
>  extern int __read_mostly pt_mode;
>  
>  #define PT_MODE_SYSTEM		0
> @@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
>  		cpu_has_vmx_posted_intr();
>  }
>  
> +static inline bool cpu_has_vmx_ipiv(void)
> +{
> +	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index aa1fe9085d77..0882115a9b7a 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -177,11 +177,24 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	local_irq_restore(flags);
>  }
>  
> +static bool vmx_can_use_pi_wakeup(struct kvm *kvm)
> +{
> +	/*
> +	 * If a blocked vCPU can be the target of posted interrupts,
> +	 * switching notification vector is needed so that kernel can
> +	 * be informed when an interrupt is posted and get the chance
> +	 * to wake up the blocked vCPU. For now, using posted interrupt
> +	 * for vCPU wakeup when IPI virtualization or VT-d PI can be
> +	 * enabled.
> +	 */
> +	return vmx_can_use_ipiv(kvm) || vmx_can_use_vtd_pi(kvm);
> +}
> +
>  void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>  {
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> -	if (!vmx_can_use_vtd_pi(vcpu->kvm))
> +	if (!vmx_can_use_pi_wakeup(vcpu->kvm))
>  		return;
>  
>  	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7beba7a9f247..121d4f0b35b9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
>  
>  module_param(enable_apicv, bool, S_IRUGO);
>  
> +bool __read_mostly enable_ipiv = true;
> +module_param(enable_ipiv, bool, 0444);
> +
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -227,6 +230,8 @@ static const struct {
>  };
>  
>  #define L1D_CACHE_ORDER 4
> +#define PID_TABLE_ENTRY_VALID 1

Put this in posted_intr.h to give the "PID" part some context.

diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 9a45d5c9f116..26992076552e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -5,6 +5,8 @@
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1

+#define PID_TABLE_ENTRY_VALID 1
+
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
        u32 pir[8];     /* Posted interrupt requested */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bbdd77a0388f..6a757e31d1d1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -230,7 +230,6 @@ static const struct {
 };

 #define L1D_CACHE_ORDER 4
-#define PID_TABLE_ENTRY_VALID 1

 static void *vmx_l1d_flush_pages;

> +
>  static void *vmx_l1d_flush_pages;
>  
>  static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
> @@ -2543,7 +2548,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	}
>  
>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
> -		u64 opt3 = 0;
> +		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
>  		u64 min3 = 0;
>  
>  		if (adjust_vmx_controls_64(min3, opt3,
> @@ -3898,6 +3903,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
>  		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
> +		if (enable_ipiv)
> +			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),MSR_TYPE_RW);

Missing space after the last comma.

>  	}
>  }
>  
> @@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>  	if (cpu_has_secondary_exec_ctrls()) {
> -		if (kvm_vcpu_apicv_active(vcpu))
> +		if (kvm_vcpu_apicv_active(vcpu)) {
>  			secondary_exec_controls_setbit(vmx,
>  				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> -		else
> +			if (enable_ipiv)
> +				tertiary_exec_controls_setbit(vmx,
> +						TERTIARY_EXEC_IPI_VIRT);
> +		} else {
>  			secondary_exec_controls_clearbit(vmx,
>  					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +			if (enable_ipiv)
> +				tertiary_exec_controls_clearbit(vmx,
> +						TERTIARY_EXEC_IPI_VIRT);

Oof.  The existing code is kludgy.  We should never reach this point without
enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
let alone seconary exec being support.

Unless I'm missing something, throw a prep patch earlier in the series to drop
the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.

> +		}
>  	}
>  
>  	vmx_update_msr_bitmap_x2apic(vcpu);
> @@ -4260,7 +4274,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>  
>  static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>  {
> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
> +
> +	/*
> +	 * IPI virtualization relies on APICv. Disable IPI
> +	 * virtualization if APICv is inhibited.

Wrap comments at 80 chars.

> +	 */
> +	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
> +
> +	return exec_control;
>  }
>  
>  /*
> @@ -4408,6 +4431,29 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	return exec_control;
>  }
>  
> +static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
> +{
> +	struct page *pages;
> +
> +	if(kvm_vmx->pid_table)

Needs a space after the "if".  Moot point though, this shouldn't exist.

> +		return 0;
> +
> +	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> +			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));

Instead of sizeof(u64), do sizeof(*kvm_vmx->pid_table), that way the code is more
self documenting and less fragile.  The PID table size obviously shouldn't change
since it architecturally, but it's a good habit/style.

> +
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	kvm_vmx->pid_table = (void *)page_address(pages);
> +	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;

No need to cache pid_last_index, it's only used in one place (initializing the
VMCS field).  The allocation/free paths can use max_vcpu_id directly.  Actually,
for the alloc/free, add a helper to provide the order, that'll clean up both
call sites and avoid duplicate math.  E.g.

int vmx_get_pid_table_order(struct kvm_vmx *kvm_vmx)
{
	return get_order(kvm_vmx->kvm.arch.max_vcpu_ids * sizeof(*kvm_vmx->pid_table));
}

> +	return 0;
> +}
> +
> +bool vmx_can_use_ipiv(struct kvm *kvm)
> +{
> +	return irqchip_in_kernel(kvm) && enable_ipiv;
> +}

Move this helper to posted_intr.h (or maybe vmx.h, though I think posted_intr.h
is a slightly better fit) and make it static inline.  This patch already exposes
enable_ipiv, and the usage in vmx_can_use_pi_wakeup() will be frequent enough
that making it inline is worthwhile.

> +
>  #define VMX_XSS_EXIT_BITMAP 0
>  
>  static void init_vmcs(struct vcpu_vmx *vmx)
> @@ -4443,6 +4489,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>  	}
>  
> +	if (vmx_can_use_ipiv(vmx->vcpu.kvm)) {
> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);

Hoist this to the top of the function, that way we don't end up with variable
shadowing and don't have to move it if future code also needs to access kvm_vmx.

> +
> +		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
> +		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
> +	}
> +
>  	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
>  		vmcs_write32(PLE_GAP, ple_gap);
>  		vmx->ple_window = ple_window;
> @@ -7123,6 +7176,22 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>  			goto free_vmcs;
>  	}
>  
> +	/*
> +	 * Allocate PID-table and program this vCPU's PID-table
> +	 * entry if IPI virtualization can be enabled.

Please wrap comments at 80 chars.  But I'd just drop this one entirely, the code
is self-explanatory once the allocation and setting of the vCPU's entry are split.

> +	 */
> +	if (vmx_can_use_ipiv(vcpu->kvm)) {
> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
> +
> +		mutex_lock(&vcpu->kvm->lock);
> +		err = vmx_alloc_pid_table(kvm_vmx);
> +		mutex_unlock(&vcpu->kvm->lock);

This belongs in vmx_vm_init(), doing it in vCPU creation is a remnant of the
dynamic resize approach that's no longer needed.

 
> +		if (err)
> +			goto free_vmcs;
> +		WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
> +			__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);

This gets to stay though.  Please align the indentation, i.e.


	if (vmx_can_use_ipiv(vcpu->kvm))
		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);

> +	}
> +
>  	return 0;
>  
>  free_vmcs:
> @@ -7756,6 +7825,15 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>  	return supported & BIT(bit);
>  }
>  
> +static void vmx_vm_destroy(struct kvm *kvm)
> +{
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
> +
> +	if (kvm_vmx->pid_table)
> +		free_pages((unsigned long)kvm_vmx->pid_table,
> +			get_order((kvm_vmx->pid_last_index + 1) * sizeof(u64)));
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  	.name = "kvm_intel",
>  
> @@ -7768,6 +7846,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  
>  	.vm_size = sizeof(struct kvm_vmx),
>  	.vm_init = vmx_vm_init,
> +	.vm_destroy = vmx_vm_destroy,
>  
>  	.vcpu_create = vmx_create_vcpu,
>  	.vcpu_free = vmx_free_vcpu,
> @@ -8022,6 +8101,9 @@ static __init int hardware_setup(void)
>  	if (!enable_apicv)
>  		vmx_x86_ops.sync_pir_to_irr = NULL;
>  
> +	if (!enable_apicv || !cpu_has_vmx_ipiv())
> +		enable_ipiv = false;
> +
>  	if (cpu_has_vmx_tsc_scaling()) {
>  		kvm_has_tsc_control = true;
>  		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index d4a647d3ed4a..5b65930a750e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -365,6 +365,9 @@ struct kvm_vmx {
>  	unsigned int tss_addr;
>  	bool ept_identity_pagetable_done;
>  	gpa_t ept_identity_map_addr;
> +	/* PID table for IPI virtualization */

I like having a comment here since "pid_table" is ambiguous, but take the opportunity
to explain PID, i.e.

	/* Posted Interrupt Descriptor (PID) table for IPI virtualization. */

> +	u64 *pid_table;
> +	u16 pid_last_index;
>  };
>  
>  bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
> @@ -584,4 +587,6 @@ static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
>  	return (vmx_instr_info >> 28) & 0xf;
>  }
>  
> +bool vmx_can_use_ipiv(struct kvm *kvm);
> +
>  #endif /* __KVM_X86_VMX_H */
> -- 
> 2.27.0
> 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2022-03-31 22:27   ` Sean Christopherson
@ 2022-04-02 12:47     ` Zeng Guang
  0 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-04-02 12:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo


On 4/1/2022 6:27 AM, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> +#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
>> +static inline								\
>> +void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
>> +{									\
>> +	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
>> +		vmcs_write##bits(uname, val);				\
>> +		vmx->loaded_vmcs->controls_shadow.lname = val;		\
>> +	}								\
>> +}									\
>> +static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
>> +{									\
>> +	return vmcs->controls_shadow.lname;				\
>> +}									\
>> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
>> +{									\
>> +	return __##lname##_controls_get(vmx->loaded_vmcs);		\
>> +}									\
>> +static inline								\
> Drop the newline, there's no need to split this across two lines.  Aligning the
> backslashes will mean they all poke past the 80 char soft limit, but that's totally
> ok.  The whole point of the line limit is to improve readability, and a trivial
> runover is much less painful than a split function declaration.  As a bonus, all
> the backslashes are aligned, have leading whitespace, and still land on a tab stop :-)
>
> #define BUILD_CONTROLS_SHADOW(lname, uname, bits)				\
> static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)	\
> {										\
> 	if (vmx->loaded_vmcs->controls_shadow.lname != val) {			\
> 		vmcs_write##bits(uname, val);					\
> 		vmx->loaded_vmcs->controls_shadow.lname = val;			\
> 	}									\
> }										\
> static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)	\
> {										\
> 	return vmcs->controls_shadow.lname;					\
> }										\
> static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)		\
> {										\
> 	return __##lname##_controls_get(vmx->loaded_vmcs);			\
> }										\
> static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)	\
> {										\
> 	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);		\
> }										\
> static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
> {										\
> 	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);		\
> }
>
> With that fixed,
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>

OK. I'll revise it.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2022-03-31 22:41   ` Sean Christopherson
@ 2022-04-02 12:58     ` Zeng Guang
  0 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-04-02 12:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo


On 4/1/2022 6:41 AM, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index c569dc2b9192..8a5713d49635 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2422,6 +2422,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>>   	return 0;
>>   }
>>   
>> +static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
> I slightly prefer controls64 over controls_64.  As usual, KVM is inconsistent as
> a whole, but vmcs_read/write64 omit the underscore, so we can at least be somewhat
> consistent within VMX.
>
>> +					 u32 msr, u64 *result)
>> +{
>> +	u64 allowed1;
>> +
>> +	rdmsrl(msr, allowed1);
>> +
>> +	/* Ensure minimum (required) set of control bits are supported. */
>> +	if (ctl_min & ~allowed1)
> Eh, just drop @ctl_min.  Practically speaking, there is zero chance tertiary
> controls or any other control of this nature will ever be mandatory.  Secondary
> controls would fall into the same boat, but specifying min=0 allows it to share
> helpers, so it's the lesser of evils.
>
> With the error return gone, this can be
>
>    static __init u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr)
>    {
> 	u64 allowed;
>
> 	rdmsrl(msr, allowed);
>
> 	return ctl_opt & allowed;
>    }

Make sense. I will change it.  Thanks.


> Alternatively, we could take the control-to-modify directly and have no return,
> but I like having the "u64 opt = ..." in the caller.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
  2022-03-31 22:46   ` Sean Christopherson
@ 2022-04-02 13:09     ` Zeng Guang
  0 siblings, 0 replies; 27+ messages in thread
From: Zeng Guang @ 2022-04-02 13:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao, Robert Hoo


On 4/1/2022 6:46 AM, Sean Christopherson wrote:
> Nit, shortlog is funky, it'd read better as
>
>    KVM: VMX: Report tertiary_exec_control field in dump_vmcs()
>
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> From: Robert Hoo <robert.hu@linux.intel.com>
>>
>> Add tertiary_exec_control field report in dump_vmcs()
> Please call out the shuffling of PinBased and provide a sample dump.  It's not
> mandatory to put that sort of info in the changelog, but it really does help
> reviewers, e.g. I remember discussing the shuffling and seeing the sample output,
> but other reviewers coming into this blind won't have that luxury.

OK. Will give more details.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-03-31 23:07   ` Sean Christopherson
@ 2022-04-02 13:33     ` Zeng Guang
  2022-04-04 15:29       ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-04-02 13:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao


On 4/1/2022 7:07 AM, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
>> i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
>> the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
>> were previously impossible.
>>
>> Note, x2APIC MSR writes are 64 bits wide.
>>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>   arch/x86/kvm/lapic.c | 22 +++++++++++++++++++---
>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 629c116b0d3e..22929b5b3f9b 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
>>   #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
>>   /* step-by-step approximation to mitigate fluctuation */
>>   #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
>> +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
>>   
>>   static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
>>   {
>> @@ -2227,10 +2228,25 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>>   /* emulate APIC access in a trap manner */
>>   void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>>   {
>> -	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	u64 val;
>> +
>> +	if (apic_x2apic_mode(apic)) {
>> +		/*
>> +		 * When guest APIC is in x2APIC mode and IPI virtualization
>> +		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
>> +		 * on Intel hardware. Other offsets are not possible.
>> +		 */
>> +		if (WARN_ON_ONCE(offset != APIC_ICR))
>> +			return;
>>   
>> -	/* TODO: optimize to just emulate side effect w/o one more write */
>> -	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
>> +		kvm_lapic_msr_read(apic, offset, &val);
>> +		kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
> This needs to clear the APIC_ICR_BUSY bit.  It'd also be nice to trace this write.
> The easiest thing is to use kvm_x2apic_icr_write().  Kinda silly as it'll generate
> an extra write, but on the plus side the TODO comment doesn't have to move :-D
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c4c3155d98db..58bf296ee313 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2230,6 +2230,7 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>          struct kvm_lapic *apic = vcpu->arch.apic;
>          u64 val;
>
> +       /* TODO: optimize to just emulate side effect w/o one more write */
>          if (apic_x2apic_mode(apic)) {
>                  /*
>                   * When guest APIC is in x2APIC mode and IPI virtualization
> @@ -2240,10 +2241,9 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>                          return;
>
>                  kvm_lapic_msr_read(apic, offset, &val);
> -               kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
> +               kvm_x2apic_icr_write(apic, val);

As SDM section 10.12.9 "ICR Operation in X2APIC mode" says "Delivery status
bit is removed since it is not needed in x2APIC mode" , so that's not 
necessary
to clear the APIC_ICR_BUSY bit here. Alternatively we can add trace to 
this write
by hardware.


>          } else {
>                  val = kvm_lapic_get_reg(apic, offset);
> -               /* TODO: optimize to just emulate side effect w/o one more write */
>                  kvm_lapic_reg_write(apic, offset, (u32)val);
>          }
>   }
>
>
>> +	} else {
>> +		val = kvm_lapic_get_reg(apic, offset);
>> +		/* TODO: optimize to just emulate side effect w/o one more write */
>> +		kvm_lapic_reg_write(apic, offset, (u32)val);
>> +	}
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
>>   
>> -- 
>> 2.27.0
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-04-01  2:01   ` Sean Christopherson
@ 2022-04-03 10:17     ` Zeng Guang
  2022-04-04 17:25       ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-04-03 10:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao


On 4/1/2022 10:01 AM, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> Introduce new max_vcpu_id in KVM for x86 architecture. Userspace
>> can assign maximum possible vcpu id for current VM session using
>> KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().
>>
>> This is done for x86 only because the sole use case is to guide
>> memory allocation for PID-pointer table, a structure needed to
>> enable VMX IPI.
>>
>> By default, max_vcpu_id set as KVM_MAX_VCPU_IDS.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |  6 ++++++
>>   arch/x86/kvm/x86.c              | 11 +++++++++++
> The new behavior needs to be documented in api.rst.

OK. I will prepare document for it.


>>   2 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 6dcccb304775..db16aebd946c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1233,6 +1233,12 @@ struct kvm_arch {
>>   	hpa_t	hv_root_tdp;
>>   	spinlock_t hv_root_tdp_lock;
>>   #endif
>> +	/*
>> +	 * VM-scope maximum vCPU ID. Used to determine the size of structures
>> +	 * that increase along with the maximum vCPU ID, in which case, using
>> +	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
>> +	 */
>> +	u32 max_vcpu_id;
> This should be max_vcpu_ids.  I agree the it _should_ be max_vcpu_id, but KVM's API
> for this is awful and we're stuck with the plural name.
>
>>   };
>>   
>>   struct kvm_vm_stat {
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 4f6fe9974cb5..ca17cc452bd3 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5994,6 +5994,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>   		kvm->arch.exit_on_emulation_error = cap->args[0];
>>   		r = 0;
>>   		break;
>> +	case KVM_CAP_MAX_VCPU_ID:
> I think it makes sense to change kvm_vm_ioctl_check_extension() to return the
> current max, it is a VM-scoped ioctl after all.

kvm_vm_ioctl_check_extension() can return kvm->arch.max_vcpu_ids
as it reflects runtime capability supported on current vm. I will
change it.

> Amusingly, I think we also need a capability to enumerate that KVM_CAP_MAX_VCPU_ID
> is writable.

IIUC, KVM_CAP_*  has intrinsic writable attribute. KVM will return invalid
If not implemented.

>> +		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
>> +			kvm->arch.max_vcpu_id = cap->args[0];
> This needs to be rejected if kvm->created_vcpus > 0, and that check needs to be
> done under kvm_lock, otherwise userspace can bump the max ID after KVM allocates
> per-VM structures and trigger buffer overflow.

Is it necessary to use kvm_lock ? Seems no use case to call it from multi-threads.

>> +			r = 0;
>> +		} else
> If-elif-else statements need curly braces for all paths if any path needs braces.
> Probably a moot point for this patch due to the above changes.
>
>> +			r = -E2BIG;
> This should be -EINVAL, not -E2BIG.
>
> E.g.
>
> 	case KVM_CAP_MAX_VCPU_ID:
> 		r = -EINVAL;
> 		if (cap->args[0] > KVM_MAX_VCPU_IDS)
> 			break;
>
> 		mutex_lock(&kvm->lock);
> 		if (!kvm->created_vcpus) {
> 			kvm->arch.max_vcpu_id = cap->args[0];
> 			r = 0;
> 		}
> 		mutex_unlock(&kvm->lock);
> 		break;
>
>
>> +		break;
>>   	default:
>>   		r = -EINVAL;
>>   		break;
>> @@ -11067,6 +11074,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>   	struct page *page;
>>   	int r;
>>   
>> +	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_id)
>> +		return -E2BIG;
> Same here, it should be -EINVAL.
>
>> +
>>   	vcpu->arch.last_vmentry_cpu = -1;
>>   	vcpu->arch.regs_avail = ~0;
>>   	vcpu->arch.regs_dirty = ~0;
>> @@ -11589,6 +11599,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>   	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
>>   	kvm->arch.hv_root_tdp = INVALID_PAGE;
>>   #endif
>> +	kvm->arch.max_vcpu_id = KVM_MAX_VCPU_IDS;
>>   
>>   	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>>   	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>> -- 
>> 2.27.0
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-04-01  2:37   ` Sean Christopherson
@ 2022-04-03 14:38     ` Zeng Guang
  2022-04-04 17:57       ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-04-03 14:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao


On 4/1/2022 10:37 AM, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Zeng Guang wrote:
>> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
>> index ff20776dc83b..7ce616af2db2 100644
>> --- a/arch/x86/include/asm/vmxfeatures.h
>> +++ b/arch/x86/include/asm/vmxfeatures.h
>> @@ -86,4 +86,6 @@
>>   #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
>>   #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
>>   
>> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
>> +#define VMX_FEATURE_IPI_VIRT		(3*32 +  4) /* "" Enable IPI virtualization */
> Please follow the existing (weird) spacing and style.  And this should definitely
> be enumerated to userspace, it's one of the more interesting VMX features, i.e. drop
> the "".
>
> #define VMX_FEATURE_IPI_VIRT		( 3*32+  4) /* Enable IPI virtualization */


Now understand the meaning of comment without string "". :)

>>   #endif /* _ASM_X86_VMXFEATURES_H */
>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>> index 31f3d88b3e4d..5f656c9e33be 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
>>   extern bool __read_mostly enable_unrestricted_guest;
>>   extern bool __read_mostly enable_ept_ad_bits;
>>   extern bool __read_mostly enable_pml;
>> +extern bool __read_mostly enable_ipiv;
>>   extern int __read_mostly pt_mode;
>>   
>>   #define PT_MODE_SYSTEM		0
>> @@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
>>   		cpu_has_vmx_posted_intr();
>>   }
>>   
>> +static inline bool cpu_has_vmx_ipiv(void)
>> +{
>> +	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
>> +}
>> +
>>   static inline bool cpu_has_vmx_flexpriority(void)
>>   {
>>   	return cpu_has_vmx_tpr_shadow() &&
>> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
>> index aa1fe9085d77..0882115a9b7a 100644
>> --- a/arch/x86/kvm/vmx/posted_intr.c
>> +++ b/arch/x86/kvm/vmx/posted_intr.c
>> @@ -177,11 +177,24 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>>   	local_irq_restore(flags);
>>   }
>>   
>> +static bool vmx_can_use_pi_wakeup(struct kvm *kvm)
>> +{
>> +	/*
>> +	 * If a blocked vCPU can be the target of posted interrupts,
>> +	 * switching notification vector is needed so that kernel can
>> +	 * be informed when an interrupt is posted and get the chance
>> +	 * to wake up the blocked vCPU. For now, using posted interrupt
>> +	 * for vCPU wakeup when IPI virtualization or VT-d PI can be
>> +	 * enabled.
>> +	 */
>> +	return vmx_can_use_ipiv(kvm) || vmx_can_use_vtd_pi(kvm);
>> +}
>> +
>>   void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>>   {
>>   	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>>   
>> -	if (!vmx_can_use_vtd_pi(vcpu->kvm))
>> +	if (!vmx_can_use_pi_wakeup(vcpu->kvm))
>>   		return;
>>   
>>   	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 7beba7a9f247..121d4f0b35b9 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
>>   
>>   module_param(enable_apicv, bool, S_IRUGO);
>>   
>> +bool __read_mostly enable_ipiv = true;
>> +module_param(enable_ipiv, bool, 0444);
>> +
>>   /*
>>    * If nested=1, nested virtualization is supported, i.e., guests may use
>>    * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>> @@ -227,6 +230,8 @@ static const struct {
>>   };
>>   
>>   #define L1D_CACHE_ORDER 4
>> +#define PID_TABLE_ENTRY_VALID 1
> Put this in posted_intr.h to give the "PID" part some context.
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 9a45d5c9f116..26992076552e 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -5,6 +5,8 @@
>   #define POSTED_INTR_ON  0
>   #define POSTED_INTR_SN  1
>
> +#define PID_TABLE_ENTRY_VALID 1
> +
>   /* Posted-Interrupt Descriptor */
>   struct pi_desc {
>          u32 pir[8];     /* Posted interrupt requested */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index bbdd77a0388f..6a757e31d1d1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -230,7 +230,6 @@ static const struct {
>   };
>
>   #define L1D_CACHE_ORDER 4
> -#define PID_TABLE_ENTRY_VALID 1
>
>   static void *vmx_l1d_flush_pages;


OK.


>> +
>>   static void *vmx_l1d_flush_pages;
>>   
>>   static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>> @@ -2543,7 +2548,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>>   	}
>>   
>>   	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
>> -		u64 opt3 = 0;
>> +		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
>>   		u64 min3 = 0;
>>   
>>   		if (adjust_vmx_controls_64(min3, opt3,
>> @@ -3898,6 +3903,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
>>   		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>>   		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>>   		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
>> +		if (enable_ipiv)
>> +			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),MSR_TYPE_RW);
> Missing space after the last comma.
>
>>   	}
>>   }
>>   
>> @@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>>   
>>   	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>>   	if (cpu_has_secondary_exec_ctrls()) {
>> -		if (kvm_vcpu_apicv_active(vcpu))
>> +		if (kvm_vcpu_apicv_active(vcpu)) {
>>   			secondary_exec_controls_setbit(vmx,
>>   				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>   				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> -		else
>> +			if (enable_ipiv)
>> +				tertiary_exec_controls_setbit(vmx,
>> +						TERTIARY_EXEC_IPI_VIRT);
>> +		} else {
>>   			secondary_exec_controls_clearbit(vmx,
>>   					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>   					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> +			if (enable_ipiv)
>> +				tertiary_exec_controls_clearbit(vmx,
>> +						TERTIARY_EXEC_IPI_VIRT);
> Oof.  The existing code is kludgy.  We should never reach this point without
> enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
> let alone seconary exec being support.
>
> Unless I'm missing something, throw a prep patch earlier in the series to drop
> the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.

cpu_has_secondary_exec_ctrls() check can avoid wrong vmcs write in case mistaken
invocation.

>> +		}
>>   	}
>>   
>>   	vmx_update_msr_bitmap_x2apic(vcpu);
>> @@ -4260,7 +4274,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>>   
>>   static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>>   {
>> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
>> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
>> +
>> +	/*
>> +	 * IPI virtualization relies on APICv. Disable IPI
>> +	 * virtualization if APICv is inhibited.
> Wrap comments at 80 chars.
>
>> +	 */
>> +	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
>> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
>> +
>> +	return exec_control;
>>   }
>>   
>>   /*
>> @@ -4408,6 +4431,29 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>   	return exec_control;
>>   }
>>   
>> +static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
>> +{
>> +	struct page *pages;
>> +
>> +	if(kvm_vmx->pid_table)
> Needs a space after the "if".  Moot point though, this shouldn't exist.
>
>> +		return 0;
>> +
>> +	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>> +			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));
> Instead of sizeof(u64), do sizeof(*kvm_vmx->pid_table), that way the code is more
> self documenting and less fragile.  The PID table size obviously shouldn't change
> since it architecturally, but it's a good habit/style.


Good point, I'll follow up. Thanks.


>> +
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	kvm_vmx->pid_table = (void *)page_address(pages);
>> +	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
> No need to cache pid_last_index, it's only used in one place (initializing the
> VMCS field).  The allocation/free paths can use max_vcpu_id directly.  Actually,

In previous design, we don't forbid to change max_vcpu_id after vCPU 
creation or for other
purpose in future. Thus it's safe to decouple them and make ipiv usage 
independent. If it can
be sure that max_vcpu_id won't be modified , we can totally remove 
pid_last_index and use
max_vcpu_id directly even for initializing the VMCD field.

> for the alloc/free, add a helper to provide the order, that'll clean up both
> call sites and avoid duplicate math.  E.g.
>
> int vmx_get_pid_table_order(struct kvm_vmx *kvm_vmx)
> {
> 	return get_order(kvm_vmx->kvm.arch.max_vcpu_ids * sizeof(*kvm_vmx->pid_table));
> }
>
>> +	return 0;
>> +}
>> +
>> +bool vmx_can_use_ipiv(struct kvm *kvm)
>> +{
>> +	return irqchip_in_kernel(kvm) && enable_ipiv;
>> +}
> Move this helper to posted_intr.h (or maybe vmx.h, though I think posted_intr.h
> is a slightly better fit) and make it static inline.  This patch already exposes
> enable_ipiv, and the usage in vmx_can_use_pi_wakeup() will be frequent enough
> that making it inline is worthwhile.


I'll adapt and put it in vmx.h.


>> +
>>   #define VMX_XSS_EXIT_BITMAP 0
>>   
>>   static void init_vmcs(struct vcpu_vmx *vmx)
>> @@ -4443,6 +4489,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>   		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>>   	}
>>   
>> +	if (vmx_can_use_ipiv(vmx->vcpu.kvm)) {
>> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
> Hoist this to the top of the function, that way we don't end up with variable
> shadowing and don't have to move it if future code also needs to access kvm_vmx.
OK.
>> +
>> +		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
>> +		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
>> +	}
>> +
>>   	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
>>   		vmcs_write32(PLE_GAP, ple_gap);
>>   		vmx->ple_window = ple_window;
>> @@ -7123,6 +7176,22 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>>   			goto free_vmcs;
>>   	}
>>   
>> +	/*
>> +	 * Allocate PID-table and program this vCPU's PID-table
>> +	 * entry if IPI virtualization can be enabled.
> Please wrap comments at 80 chars.  But I'd just drop this one entirely, the code
> is self-explanatory once the allocation and setting of the vCPU's entry are split.
>
>> +	 */
>> +	if (vmx_can_use_ipiv(vcpu->kvm)) {
>> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
>> +
>> +		mutex_lock(&vcpu->kvm->lock);
>> +		err = vmx_alloc_pid_table(kvm_vmx);
>> +		mutex_unlock(&vcpu->kvm->lock);
> This belongs in vmx_vm_init(), doing it in vCPU creation is a remnant of the
> dynamic resize approach that's no longer needed.
>
>   


We cannot allocate pid table in vmx_vm_init() as userspace has no chance 
to set
max_vcpu_ids at this stage. That's the reason we do it in vCPU creation 
instead.

>> +		if (err)
>> +			goto free_vmcs;
>> +		WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
>> +			__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
> This gets to stay though.  Please align the indentation, i.e.
>
>
> 	if (vmx_can_use_ipiv(vcpu->kvm))
> 		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
> 			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
OK.
>> +	}
>> +
>> / 	return 0;
>>   
>>   free_vmcs:
>> @@ -7756,6 +7825,15 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>>   	return supported & BIT(bit);
>>   }
>>   
>> +static void vmx_vm_destroy(struct kvm *kvm)
>> +{
>> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
>> +
>> +	if (kvm_vmx->pid_table)
>> +		free_pages((unsigned long)kvm_vmx->pid_table,
>> +			get_order((kvm_vmx->pid_last_index + 1) * sizeof(u64)));
>> +}
>> +
>>   static struct kvm_x86_ops vmx_x86_ops __initdata = {
>>   	.name = "kvm_intel",
>>   
>> @@ -7768,6 +7846,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>>   
>>   	.vm_size = sizeof(struct kvm_vmx),
>>   	.vm_init = vmx_vm_init,
>> +	.vm_destroy = vmx_vm_destroy,
>>   
>>   	.vcpu_create = vmx_create_vcpu,
>>   	.vcpu_free = vmx_free_vcpu,
>> @@ -8022,6 +8101,9 @@ static __init int hardware_setup(void)
>>   	if (!enable_apicv)
>>   		vmx_x86_ops.sync_pir_to_irr = NULL;
>>   
>> +	if (!enable_apicv || !cpu_has_vmx_ipiv())
>> +		enable_ipiv = false;
>> +
>>   	if (cpu_has_vmx_tsc_scaling()) {
>>   		kvm_has_tsc_control = true;
>>   		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index d4a647d3ed4a..5b65930a750e 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -365,6 +365,9 @@ struct kvm_vmx {
>>   	unsigned int tss_addr;
>>   	bool ept_identity_pagetable_done;
>>   	gpa_t ept_identity_map_addr;
>> +	/* PID table for IPI virtualization */
> I like having a comment here since "pid_table" is ambiguous, but take the opportunity
> to explain PID, i.e.
>
> 	/* Posted Interrupt Descriptor (PID) table for IPI virtualization. */
OK.  Thanks to point it out.
>> +	u64 *pid_table;
>> +	u16 pid_last_index;
>>   };
>>   
>>   bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
>> @@ -584,4 +587,6 @@ static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
>>   	return (vmx_instr_info >> 28) & 0xf;
>>   }
>>   
>> +bool vmx_can_use_ipiv(struct kvm *kvm);
>> +
>>   #endif /* __KVM_X86_VMX_H */
>> -- 
>> 2.27.0
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-04-02 13:33     ` Zeng Guang
@ 2022-04-04 15:29       ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-04 15:29 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On Sat, Apr 02, 2022, Zeng Guang wrote:
> 
> > > -	/* TODO: optimize to just emulate side effect w/o one more write */
> > > -	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
> > > +		kvm_lapic_msr_read(apic, offset, &val);
> > > +		kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
> > This needs to clear the APIC_ICR_BUSY bit.  It'd also be nice to trace this write.
> > The easiest thing is to use kvm_x2apic_icr_write().  Kinda silly as it'll generate
> > an extra write, but on the plus side the TODO comment doesn't have to move :-D
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index c4c3155d98db..58bf296ee313 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2230,6 +2230,7 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
> >          struct kvm_lapic *apic = vcpu->arch.apic;
> >          u64 val;
> > 
> > +       /* TODO: optimize to just emulate side effect w/o one more write */
> >          if (apic_x2apic_mode(apic)) {
> >                  /*
> >                   * When guest APIC is in x2APIC mode and IPI virtualization
> > @@ -2240,10 +2241,9 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
> >                          return;
> > 
> >                  kvm_lapic_msr_read(apic, offset, &val);
> > -               kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
> > +               kvm_x2apic_icr_write(apic, val);
> 
> As SDM section 10.12.9 "ICR Operation in X2APIC mode" says "Delivery status
> bit is removed since it is not needed in x2APIC mode" , so that's not
> necessary to clear the APIC_ICR_BUSY bit here. Alternatively we can add trace
> to this write by hardware.

That same section later says 

  With the removal of the Delivery Status bit, system software no longer has a
  reason to read the ICR. It remains readable only to aid in debugging; however,
  software should not assume the value returned by reading the ICR is the last
  written value.

which means that it's at least legal for a hypervisor to clear the busy bit.  That
might be useful for debugging IPI issues?  Probably a bit of a stretch, e.g. I doubt
any kernels set the busy bit.  But, I do think the tracing would be helpful, and at
that point, the extra code should be an AND+MOV.

I don't have a super strong opinion, and I'm being somewhat hypocritical (see commit
b51818afdc1d ("KVM: SVM: Don't rewrite guest ICR on AVIC IPI virtualization failure"),
though that has dedicated tracing), so either approach works for me.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-04-03 10:17     ` Zeng Guang
@ 2022-04-04 17:25       ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-04 17:25 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On Sun, Apr 03, 2022, Zeng Guang wrote:
> 
> On 4/1/2022 10:01 AM, Sean Christopherson wrote:
> > Amusingly, I think we also need a capability to enumerate that KVM_CAP_MAX_VCPU_ID
> > is writable.
> 
> IIUC, KVM_CAP_*  has intrinsic writable attribute. KVM will return invalid
> If not implemented.

Yes, but forcing userspace to do a dummy write to detect support is rather ugly.
I'm not totally opposed to it.  Probably a Paolo question.

Paolo?

> > > +		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
> > > +			kvm->arch.max_vcpu_id = cap->args[0];
> > This needs to be rejected if kvm->created_vcpus > 0, and that check needs to be
> > done under kvm_lock, otherwise userspace can bump the max ID after KVM allocates
> > per-VM structures and trigger buffer overflow.
> 
> Is it necessary to use kvm_lock ? Seems no use case to call it from multi-threads.

There's no sane use case, but userspace is untrusted, i.e. KVM can't assume that
userspace will do the right/desired thing.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-04-03 14:38     ` Zeng Guang
@ 2022-04-04 17:57       ` Sean Christopherson
  2022-04-08 16:41         ` Zeng Guang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-04 17:57 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On Sun, Apr 03, 2022, Zeng Guang wrote:
> 
> On 4/1/2022 10:37 AM, Sean Christopherson wrote:
> > > @@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> > >   	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
> > >   	if (cpu_has_secondary_exec_ctrls()) {
> > > -		if (kvm_vcpu_apicv_active(vcpu))
> > > +		if (kvm_vcpu_apicv_active(vcpu)) {
> > >   			secondary_exec_controls_setbit(vmx,
> > >   				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > >   				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> > > -		else
> > > +			if (enable_ipiv)
> > > +				tertiary_exec_controls_setbit(vmx,
> > > +						TERTIARY_EXEC_IPI_VIRT);
> > > +		} else {
> > >   			secondary_exec_controls_clearbit(vmx,
> > >   					SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > >   					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> > > +			if (enable_ipiv)
> > > +				tertiary_exec_controls_clearbit(vmx,
> > > +						TERTIARY_EXEC_IPI_VIRT);
> > Oof.  The existing code is kludgy.  We should never reach this point without
> > enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
> > let alone seconary exec being support.
> > 
> > Unless I'm missing something, throw a prep patch earlier in the series to drop
> > the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.
> 
> cpu_has_secondary_exec_ctrls() check can avoid wrong vmcs write in case mistaken
> invocation.

KVM has far bigger problems on buggy invocation, and in that case the resulting
printk + WARN from the failed VMWRITE is a good thing.

> > > +
> > > +	if (!pages)
> > > +		return -ENOMEM;
> > > +
> > > +	kvm_vmx->pid_table = (void *)page_address(pages);
> > > +	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
> > No need to cache pid_last_index, it's only used in one place (initializing the
> > VMCS field).  The allocation/free paths can use max_vcpu_id directly.  Actually,
> 
> In previous design, we don't forbid to change max_vcpu_id after vCPU creation
> or for other purpose in future. Thus it's safe to decouple them and make ipiv
> usage independent. If it can be sure that max_vcpu_id won't be modified , we
> can totally remove pid_last_index and use max_vcpu_id directly even for
> initializing the VMCD field.

max_vcpu_id asolutely needs to be constant after the first vCPU is created.

> > > @@ -7123,6 +7176,22 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> > >   			goto free_vmcs;
> > >   	}
> > > +	/*
> > > +	 * Allocate PID-table and program this vCPU's PID-table
> > > +	 * entry if IPI virtualization can be enabled.
> > Please wrap comments at 80 chars.  But I'd just drop this one entirely, the code
> > is self-explanatory once the allocation and setting of the vCPU's entry are split.
> > 
> > > +	 */
> > > +	if (vmx_can_use_ipiv(vcpu->kvm)) {
> > > +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
> > > +
> > > +		mutex_lock(&vcpu->kvm->lock);
> > > +		err = vmx_alloc_pid_table(kvm_vmx);
> > > +		mutex_unlock(&vcpu->kvm->lock);
> > This belongs in vmx_vm_init(), doing it in vCPU creation is a remnant of the
> > dynamic resize approach that's no longer needed.
> 
> We cannot allocate pid table in vmx_vm_init() as userspace has no chance to
> set max_vcpu_ids at this stage. That's the reason we do it in vCPU creation
> instead.

Ah, right.  Hrm.  And that's going to be a recurring problem if we try to use the
dynamic kvm->max_vcpu_ids to reduce other kernel allocations.

Argh, and even kvm_arch_vcpu_precreate() isn't protected by kvm->lock.

Taking kvm->lock isn't problematic per se, I just hate doing it so deep in a
per-vCPU flow like this.

A really gross hack/idea would be to make this 64-bit only and steal the upper
32 bits of @type in kvm_create_vm() for the max ID.

I think my first choice would be to move kvm_arch_vcpu_precreate() under kvm->lock.
None of the architectures that have a non-nop implemenation (s390, arm64 and x86)
do significant work, so holding kvm->lock shouldn't harm performance.  s390 has to
acquire kvm->lock in its implementation, so we could drop that.  And looking at
arm64, I believe its logic should also be done under kvm->lock.

It'll mean adding yet another kvm_x86_ops, but I like that more than burying the
code deep in vCPU creation.

Paolo, any thoughts on this?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-04-04 17:57       ` Sean Christopherson
@ 2022-04-08 16:41         ` Zeng Guang
  2022-04-15 14:35           ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Zeng Guang @ 2022-04-08 16:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao


On 4/5/2022 1:57 AM, Sean Christopherson wrote:
> On Sun, Apr 03, 2022, Zeng Guang wrote:
>> On 4/1/2022 10:37 AM, Sean Christopherson wrote:
>>>> @@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>>>>    	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>>>>    	if (cpu_has_secondary_exec_ctrls()) {
>>>> -		if (kvm_vcpu_apicv_active(vcpu))
>>>> +		if (kvm_vcpu_apicv_active(vcpu)) {
>>>>    			secondary_exec_controls_setbit(vmx,
>>>>    				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>>>    				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>>>> -		else
>>>> +			if (enable_ipiv)
>>>> +				tertiary_exec_controls_setbit(vmx,
>>>> +						TERTIARY_EXEC_IPI_VIRT);
>>>> +		} else {
>>>>    			secondary_exec_controls_clearbit(vmx,
>>>>    					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>>>    					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>>>> +			if (enable_ipiv)
>>>> +				tertiary_exec_controls_clearbit(vmx,
>>>> +						TERTIARY_EXEC_IPI_VIRT);
>>> Oof.  The existing code is kludgy.  We should never reach this point without
>>> enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
>>> let alone seconary exec being support.
>>>
>>> Unless I'm missing something, throw a prep patch earlier in the series to drop
>>> the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.
>> cpu_has_secondary_exec_ctrls() check can avoid wrong vmcs write in case mistaken
>> invocation.
> KVM has far bigger problems on buggy invocation, and in that case the resulting
> printk + WARN from the failed VMWRITE is a good thing.


SDM doesn't define VMWRITE failure for such case. But it says the logical
processor operates as if all the secondary processor-based VM-execution
controls were 0 if "activate secondary controls" primary processor-based
VM-execution control is 0. So we may add WARN() to detect this kind of
buggy invocation instead.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 61e075e16c19..6c370b507b45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4200,22 +4200,22 @@ static void vmx_refresh_apicv_exec_ctrl(struct 
kvm_vcpu *vcpu)
         struct vcpu_vmx *vmx = to_vmx(vcpu);

         pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
-       if (cpu_has_secondary_exec_ctrls()) {
-               if (kvm_vcpu_apicv_active(vcpu)) {
-                       secondary_exec_controls_setbit(vmx,
- SECONDARY_EXEC_APIC_REGISTER_VIRT |
- SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-                       if (enable_ipiv)
-                               tertiary_exec_controls_setbit(vmx,
- TERTIARY_EXEC_IPI_VIRT);
-               } else {
-                       secondary_exec_controls_clearbit(vmx,
- SECONDARY_EXEC_APIC_REGISTER_VIRT |
- SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-                       if (enable_ipiv)
- tertiary_exec_controls_clearbit(vmx,
- TERTIARY_EXEC_IPI_VIRT);
-               }
+
+       WARN(!cpu_has_secondary_exec_ctrls(),
+                    "VMX: unexpected vmwrite with inactive secondary 
exec controls");
+
+       if (kvm_vcpu_apicv_active(vcpu)) {
+               secondary_exec_controls_setbit(vmx,
+                             SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+               if (enable_ipiv)
+                       tertiary_exec_controls_setbit(vmx, 
TERTIARY_EXEC_IPI_VIRT);
+       } else {
+               secondary_exec_controls_clearbit(vmx,
+                             SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+               if (enable_ipiv)
+                       tertiary_exec_controls_clearbit(vmx, 
TERTIARY_EXEC_IPI_VIRT);
         }

         vmx_update_msr_bitmap_x2apic(vcpu);

>
>>>> +	 */
>>>> +	if (vmx_can_use_ipiv(vcpu->kvm)) {
>>>> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
>>>> +
>>>> +		mutex_lock(&vcpu->kvm->lock);
>>>> +		err = vmx_alloc_pid_table(kvm_vmx);
>>>> +		mutex_unlock(&vcpu->kvm->lock);
>>> This belongs in vmx_vm_init(), doing it in vCPU creation is a remnant of the
>>> dynamic resize approach that's no longer needed.
>> We cannot allocate pid table in vmx_vm_init() as userspace has no chance to
>> set max_vcpu_ids at this stage. That's the reason we do it in vCPU creation
>> instead.
> Ah, right.  Hrm.  And that's going to be a recurring problem if we try to use the
> dynamic kvm->max_vcpu_ids to reduce other kernel allocations.
>
> Argh, and even kvm_arch_vcpu_precreate() isn't protected by kvm->lock.
>
> Taking kvm->lock isn't problematic per se, I just hate doing it so deep in a
> per-vCPU flow like this.
>
> A really gross hack/idea would be to make this 64-bit only and steal the upper
> 32 bits of @type in kvm_create_vm() for the max ID.
>
> I think my first choice would be to move kvm_arch_vcpu_precreate() under kvm->lock.
> None of the architectures that have a non-nop implemenation (s390, arm64 and x86)
> do significant work, so holding kvm->lock shouldn't harm performance.  s390 has to
> acquire kvm->lock in its implementation, so we could drop that.  And looking at
> arm64, I believe its logic should also be done under kvm->lock.
>
> It'll mean adding yet another kvm_x86_ops, but I like that more than burying the
> code deep in vCPU creation.
>
> Paolo, any thoughts on this?

Sounds reasonable. I will prepare patch to refactor the 
kvm_arch_vcpu_precreate()
and make pid table allocation done there.

Thanks.


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization
  2022-04-08 16:41         ` Zeng Guang
@ 2022-04-15 14:35           ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-15 14:35 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao

On Sat, Apr 09, 2022, Zeng Guang wrote:
> 
> On 4/5/2022 1:57 AM, Sean Christopherson wrote:
> > On Sun, Apr 03, 2022, Zeng Guang wrote:
> > > On 4/1/2022 10:37 AM, Sean Christopherson wrote:
> > > > > @@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> > > > >    	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
> > > > >    	if (cpu_has_secondary_exec_ctrls()) {
> > > > > -		if (kvm_vcpu_apicv_active(vcpu))
> > > > > +		if (kvm_vcpu_apicv_active(vcpu)) {
> > > > >    			secondary_exec_controls_setbit(vmx,
> > > > >    				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > > > >    				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> > > > > -		else
> > > > > +			if (enable_ipiv)
> > > > > +				tertiary_exec_controls_setbit(vmx,
> > > > > +						TERTIARY_EXEC_IPI_VIRT);
> > > > > +		} else {
> > > > >    			secondary_exec_controls_clearbit(vmx,
> > > > >    					SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > > > >    					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> > > > > +			if (enable_ipiv)
> > > > > +				tertiary_exec_controls_clearbit(vmx,
> > > > > +						TERTIARY_EXEC_IPI_VIRT);
> > > > Oof.  The existing code is kludgy.  We should never reach this point without
> > > > enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
> > > > let alone seconary exec being support.
> > > > 
> > > > Unless I'm missing something, throw a prep patch earlier in the series to drop
> > > > the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.
> > > cpu_has_secondary_exec_ctrls() check can avoid wrong vmcs write in case mistaken
> > > invocation.
> > KVM has far bigger problems on buggy invocation, and in that case the resulting
> > printk + WARN from the failed VMWRITE is a good thing.
> 
> SDM doesn't define VMWRITE failure for such case.

Yes it absolutely does.  cpu_has_secondary_exec_ctrls() checks if the VMCS field
_exists_, not if it's being used by KVM (though that's a moot point since KVM
always enables secondary controls when it's supported).  VMWRITE to non-existent
fields cause VM-Fail.

  ELSIF secondary source operand does not correspond to any VMCS field
  THEN VMfailValid(VMREAD/VMWRITE from/to unsupported VMCS component);

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-04-15 14:36 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-04  8:07 [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang
2022-03-04  8:07 ` [PATCH v7 1/8] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
2022-03-04  8:07 ` [PATCH v7 2/8] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
2022-03-31 22:27   ` Sean Christopherson
2022-04-02 12:47     ` Zeng Guang
2022-03-04  8:07 ` [PATCH v7 3/8] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
2022-03-31 22:41   ` Sean Christopherson
2022-04-02 12:58     ` Zeng Guang
2022-03-04  8:07 ` [PATCH v7 4/8] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
2022-03-31 22:46   ` Sean Christopherson
2022-04-02 13:09     ` Zeng Guang
2022-03-04  8:07 ` [PATCH v7 5/8] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
2022-03-31 23:07   ` Sean Christopherson
2022-04-02 13:33     ` Zeng Guang
2022-04-04 15:29       ` Sean Christopherson
2022-03-04  8:07 ` [PATCH v7 6/8] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
2022-03-04  8:07 ` [PATCH v7 7/8] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
2022-04-01  2:01   ` Sean Christopherson
2022-04-03 10:17     ` Zeng Guang
2022-04-04 17:25       ` Sean Christopherson
2022-03-04  8:07 ` [PATCH v7 8/8] KVM: VMX: enable IPI virtualization Zeng Guang
2022-04-01  2:37   ` Sean Christopherson
2022-04-03 14:38     ` Zeng Guang
2022-04-04 17:57       ` Sean Christopherson
2022-04-08 16:41         ` Zeng Guang
2022-04-15 14:35           ` Sean Christopherson
2022-03-18  8:15 ` [PATCH v7 0/8] IPI virtualization support for VM Zeng Guang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).