* [PATCH v8 0/9] IPI virtualization support for VM
@ 2022-04-11  9:04 Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Currently, issuing an IPI (other than a self-IPI) in a guest on an Intel
CPU always causes a VM-exit. This can lead to non-negligible overhead
for workloads that send frequent IPIs when running in VMs.

IPI virtualization is a new VT-x feature that aims to eliminate
VM-exits on source vCPUs issuing unicast, physical-addressing
IPIs. Once it is enabled, the processor virtualizes the following
IPI-sending operations without causing VM-exits:
- Memory-mapped ICR writes
- MSR-mapped ICR writes
- SENDUIPI execution
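
As a rough illustration (not part of the series), the unicast,
physical-addressing case corresponds to an x2APIC ICR write like the
following sketch. The register layout is taken from the Intel SDM; the
vector and destination APIC ID values are arbitrary examples:

```python
# Sketch of the x2APIC ICR value a guest composes for a unicast,
# physical-addressing fixed IPI -- the MSR-mapped ICR write that IPI
# virtualization handles without a VM-exit.
# Layout per the Intel SDM: bits 7:0 vector, bits 10:8 delivery mode
# (000 = fixed), bit 11 destination mode (0 = physical),
# bits 63:32 destination x2APIC ID.
X2APIC_ICR_MSR = 0x830  # MSR address of the ICR in x2APIC mode

def x2apic_icr(vector, dest_apic_id):
    return (dest_apic_id << 32) | (0 << 11) | (0 << 8) | vector

# e.g. vector 0xfd sent to the vCPU with x2APIC ID 3 (arbitrary values)
print(hex(x2apic_icr(0xfd, 3)))  # 0x3000000fd
```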

This patch series implements IPI virtualization support in KVM.

Patches 1-4 add the framework for the tertiary processor-based
VM-execution controls, which is used to enumerate IPI virtualization.

Patch 5 handles the APIC-write VM-exit caused by writes to the ICR MSR
when the guest runs in x2APIC mode. This is a new case introduced by
Intel VT-x.

Patch 6 disallows changing the APIC ID unconditionally.

Patch 7 moves kvm_arch_vcpu_precreate() under kvm->lock protection.
This prepares for allocating the IPIv PID table prior to the creation
of vCPUs.

Patch 8 provides a userspace capability to set the maximum possible
vCPU ID for the current VM. IPIv can refer to this value when
allocating memory for the PID-pointer table.
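
To sketch why the maximum vCPU ID matters here (illustrative only, not
the patch's exact code): the PID-pointer table holds one 64-bit entry
per possible vCPU ID, and the allocation is rounded up to page
granularity, so a tighter maximum directly shrinks the table.

```python
# Hypothetical sizing of the IPIv PID-pointer table from the
# userspace-provided maximum vCPU ID. The 8-byte entry size follows the
# ISE description of the PID-pointer table; the page-alignment step is
# an assumption for illustration.
PAGE_SIZE = 4096
PID_POINTER_ENTRY_SIZE = 8  # each PID-pointer table entry is 64 bits

def pid_table_bytes(max_vcpu_ids):
    raw = max_vcpu_ids * PID_POINTER_ENTRY_SIZE
    # round up to whole pages
    return (raw + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE

print(pid_table_bytes(9))     # 4096: a small VM needs a single page
print(pid_table_bytes(1024))  # 8192
```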

Patch 9 implements the IPI virtualization functionality, including
feature enabling via the tertiary processor-based VM-execution control
in various VMCS configuration scenarios, PID-table setup at vCPU
creation, and handling of vCPU blocking.

Documentation for IPI virtualization is available in the latest "Intel
Architecture Instruction Set Extensions Programming Reference".

Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

We ran an experiment with kvm-unit-tests to measure the average time
from a source vCPU sending an IPI to the target vCPU completing the
IPI handling, with and without IPI virtualization. With IPI
virtualization enabled, cycle consumption is reduced by 22.21% in
xAPIC mode and 15.98% in x2APIC mode.
--------------------------------------
KVM unittest:vmexit/ipi

2 vCPUs; the AP was modified to run in an idle loop instead of halting,
to ensure no VM-exit impact on the target vCPU.

                Cycles of IPI
                xAPIC mode              x2APIC mode
        test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
        1       6106      4816          4265      3768
        2       6244      4656          4404      3546
        3       6165      4658          4233      3474
        4       5992      4710          4363      3430
        5       6083      4741          4215      3551
        6       6238      4904          4304      3547
        7       6164      4617          4263      3709
        8       5984      4763          4518      3779
        9       5931      4712          4645      3667
        10      5955      4530          4332      3724
        11      5897      4673          4283      3569
        12      6140      4794          4178      3598
        13      6183      4728          4363      3628
        14      5991      4994          4509      3842
        15      5866      4665          4520      3739
        16      6032      4654          4229      3701
        17      6050      4653          4185      3726
        18      6004      4792          4319      3746
        19      5961      4626          4196      3392
        20      6194      4576          4433      3760

Average cycles  6059      4713.1        4337.85   3644.8
%Reduction                -22.21%                 -15.98%
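
As a quick sanity check (not part of the posting), the averages and
percentage reductions can be reproduced from the per-run cycle counts
in the table above:

```python
# Recompute the averages and %reduction from the 20 per-run cycle
# counts reported in the table above.
xapic_no  = [6106, 6244, 6165, 5992, 6083, 6238, 6164, 5984, 5931, 5955,
             5897, 6140, 6183, 5991, 5866, 6032, 6050, 6004, 5961, 6194]
xapic_yes = [4816, 4656, 4658, 4710, 4741, 4904, 4617, 4763, 4712, 4530,
             4673, 4794, 4728, 4994, 4665, 4654, 4653, 4792, 4626, 4576]
x2_no     = [4265, 4404, 4233, 4363, 4215, 4304, 4263, 4518, 4645, 4332,
             4283, 4178, 4363, 4509, 4520, 4229, 4185, 4319, 4196, 4433]
x2_yes    = [3768, 3546, 3474, 3430, 3551, 3547, 3709, 3779, 3667, 3724,
             3569, 3598, 3628, 3842, 3739, 3701, 3726, 3746, 3392, 3760]

def avg(v):
    return sum(v) / len(v)

def reduction(base, new):
    # percent cycles saved relative to the no-IPIv baseline
    return round((avg(base) - avg(new)) / avg(base) * 100, 2)

print(avg(xapic_no), avg(xapic_yes), reduction(xapic_no, xapic_yes))
print(avg(x2_no), avg(x2_yes), reduction(x2_no, x2_yes))
```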

--------------------------------------
IPI microbenchmark:
(https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)

2 vCPUs, pinned 1:1 to pCPUs; guest VM runs with idle=poll, x2APIC mode

Result with IPIv enabled:

Dry-run:                         0,             272798 ns
Self-IPI:                  5094123,           11114037 ns
Normal IPI:              131697087,          173321200 ns
Broadcast IPI:                   0,          155649075 ns
Broadcast lock:                  0,          161518031 ns

Result with IPIv disabled:

Dry-run:                         0,             272766 ns
Self-IPI:                  5091788,           11123699 ns
Normal IPI:              145215772,          174558920 ns
Broadcast IPI:                   0,          175785384 ns
Broadcast lock:                  0,          149076195 ns


Since IPIv benefits unicast IPIs to other CPUs, the Normal IPI test
case gains about 9.73% time saving on average over 15 test runs when
IPIv is enabled.

Normal IPI statistics (unit:ns):
        test    w/o IPIv        w/ IPIv
        1       153346049       140907046
        2       147218648       141660618
        3       145215772       117890672
        4       146621682       136430470
        5       144821472       136199421
        6       144704378       131676928
        7       141403224       131697087
        8       144775766       125476250
        9       140658192       137263330
        10      144768626       138593127
        11      145166679       131946752
        12      145020451       116852889
        13      148161353       131406280
        14      148378655       130174353
        15      148903652       127969674

Average time    145944306.6     131742993.1 ns
%Reduction                      -9.73%

--------------------------------------
hackbench:

8 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 100000

                w/o IPIv        w/ IPIv
Time            91.887          74.605
%Reduction                      -18.808%

96 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 1000000

                w/o IPIv        w/ IPIv
Time            287.504         235.185
%Reduction                      -18.198%

--------------------------------------
v7->v8:
1. Add a trace in kvm_apic_write_nodecode() to track
   vICR writes in APIC-write VM-exit handling
2. Move IPIv PID-table allocation to vCPU pre-creation
   (kvm_arch_vcpu_precreate()), protected by kvm->lock
3. Misc code refinement

v6->v7:
1. Revise kvm_apic_write_nodecode() on dealing with
   vICR busy bit in x2apic mode
2. Merge PID-table memory allocation with max_vcpu_id
   into IPIv enabling patch
3. Change to allocate the PID table, set up each vCPU's
   PID-table entry and the IPIv-related VMCS fields once
   IPIv can be enabled, which supports enabling IPIv at runtime.

v5->v6:
1. Adapt kvm_apic_write_nodecode() implementation based
   on Sean's fix of x2apic's ICR register process.
2. Drop the patch handling IPIv table entry setting in
   case APIC ID changed, instead applying Levitsky's patch
   to disallow setting APIC ID in any case.
3. Drop the patch resizing the PID-pointer table on demand.
   Instead, allow userspace to set the maximum vCPU ID at runtime
   so that IPIv can use the practical value when allocating memory
   for the PID-pointer table.

v4 -> v5:
1. Handle the enable_ipiv parameter following the current
   vmcs configuration rule.
2. Allocate memory for the PID-pointer table dynamically
3. Support guest runtime modification of the APIC ID in xAPIC mode
4. Add a helper to judge whether posted-interrupt blocking
   is possible in the IPIv case

v3 -> v4:
1. Refine code style of patch 2
2. Move tertiary control shadow build into patch 3
3. Make vmx_tertiary_exec_control to be static function

v2 -> v3:
1. Misc change on tertiary execution control
   definition and capability setup
2. Alternative to get tertiary execution
   control configuration

v1 -> v2:
1. Refine the IPIv enabling logic for VM.
   Remove ipiv_active definition per vCPU.

--------------------------------------

Chao Gao (1):
  KVM: VMX: enable IPI virtualization

Maxim Levitsky (1):
  KVM: x86: lapic: don't allow to change APIC ID unconditionally

Robert Hoo (4):
  x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
    variation
  KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  KVM: VMX: Report tertiary_exec_control field in dump_vmcs()

Zeng Guang (3):
  KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  KVM: Move kvm_arch_vcpu_precreate() under kvm->lock
  KVM: x86: Allow userspace set maximum VCPU id for VM

 Documentation/virt/kvm/api.rst     |  17 ++++
 arch/s390/kvm/kvm-s390.c           |   2 -
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   7 ++
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/include/asm/vmx.h         |  11 +++
 arch/x86/include/asm/vmxfeatures.h |   5 +-
 arch/x86/kernel/cpu/feat_ctl.c     |   9 +-
 arch/x86/kvm/lapic.c               |  49 ++++++++---
 arch/x86/kvm/vmx/capabilities.h    |  13 +++
 arch/x86/kvm/vmx/evmcs.c           |   2 +
 arch/x86/kvm/vmx/evmcs.h           |   1 +
 arch/x86/kvm/vmx/posted_intr.c     |  15 +++-
 arch/x86/kvm/vmx/posted_intr.h     |   2 +
 arch/x86/kvm/vmx/vmcs.h            |   1 +
 arch/x86/kvm/vmx/vmx.c             | 131 ++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.h             |  64 ++++++++------
 arch/x86/kvm/x86.c                 |  24 +++++-
 virt/kvm/kvm_main.c                |   2 +-
 19 files changed, 297 insertions(+), 60 deletions(-)

-- 
2.27.0



* [PATCH v8 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

A new 64-bit control field, "tertiary processor-based VM-execution
controls", is defined [1]. It's controlled by bit 17 of the primary
processor-based VM-execution controls.

Unlike the other VM-execution control fields, this tertiary
VM-execution controls field is 64 bits wide, so it occupies two
vmx_feature_leafs, TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.

Its companion VMX capability-reporting MSR, MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from the others: all 64 bits are
allow-1 settings, rather than 32 bits of allow-0 and 32 bits of
allow-1 [1][2]. Therefore, its handling in init_vmx_capabilities() is a
little different from the others.
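
The semantics can be sketched as follows (illustrative only; the raw
MSR value below is a made-up example, with bit 4 chosen because the ISE
assigns IPI virtualization to bit 4 of the tertiary controls):

```python
# Sketch of the capability-MSR semantics described above.  For the
# 32-bit controls (e.g. MSR_IA32_VMX_PROCBASED_CTLS2), the high 32 bits
# of the MSR are the allowed-1 settings.  For
# MSR_IA32_VMX_PROCBASED_CTLS3, all 64 bits are allowed-1 settings,
# which init_vmx_capabilities() splits into two 32-bit feature leafs.

def split_ctls3(msr_value):
    """Split the 64-bit allowed-1 mask into LOW/HIGH 32-bit leafs."""
    low = msr_value & 0xffffffff
    high = (msr_value >> 32) & 0xffffffff
    return low, high

# Hypothetical raw MSR value with bits 4 and 33 set.
low, high = split_ctls3((1 << 33) | (1 << 4))
print(hex(low), hex(high))  # 0x10 0x2
```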

[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

[2] SDM Vol3. Appendix A.3

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/msr-index.h   | 1 +
 arch/x86/include/asm/vmxfeatures.h | 3 ++-
 arch/x86/kernel/cpu/feat_ctl.c     | 9 ++++++++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0eb90d21049e..219a97098cf8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -961,6 +961,7 @@
 #define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
+#define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
 
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_VMCS_SIZE_SHIFT	32
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..ff20776dc83b 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -5,7 +5,7 @@
 /*
  * Defines VMX CPU feature bits
  */
-#define NVMXINTS			3 /* N 32-bit words worth of info */
+#define NVMXINTS			5 /* N 32-bit words worth of info */
 
 /*
  * Note: If the comment begins with a quoted string, that string is used
@@ -43,6 +43,7 @@
 #define VMX_FEATURE_RDTSC_EXITING	( 1*32+ 12) /* "" VM-Exit on RDTSC */
 #define VMX_FEATURE_CR3_LOAD_EXITING	( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
 #define VMX_FEATURE_CR3_STORE_EXITING	( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
+#define VMX_FEATURE_TERTIARY_CONTROLS	( 1*32+ 17) /* "" Enable Tertiary VM-Execution Controls */
 #define VMX_FEATURE_CR8_LOAD_EXITING	( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
 #define VMX_FEATURE_CR8_STORE_EXITING	( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
 #define VMX_FEATURE_VIRTUAL_TPR		( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index da696eb4821a..993697e71854 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -15,6 +15,8 @@ enum vmx_feature_leafs {
 	MISC_FEATURES = 0,
 	PRIMARY_CTLS,
 	SECONDARY_CTLS,
+	TERTIARY_CTLS_LOW,
+	TERTIARY_CTLS_HIGH,
 	NR_VMX_FEATURE_WORDS,
 };
 
@@ -22,7 +24,7 @@ enum vmx_feature_leafs {
 
 static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 {
-	u32 supported, funcs, ept, vpid, ign;
+	u32 supported, funcs, ept, vpid, ign, low, high;
 
 	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
 
@@ -42,6 +44,11 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
 	c->vmx_capability[SECONDARY_CTLS] = supported;
 
+	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */
+	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
 	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
 	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
 
-- 
2.27.0



* [PATCH v8 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

The tertiary VM-execution controls field, unlike the previous control
fields, is 64 bits wide. Extend BUILD_CONTROLS_SHADOW() with a 'bits'
parameter so it can build the auxiliary functions for both 32-bit and
64-bit fields.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.h | 56 +++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9c6bfcd84008..122fdbf85a02 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -455,35 +455,35 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }
 
-#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
-{									    \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
-		vmcs_write32(uname, val);				    \
-		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
-	}								    \
-}									    \
-static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
-{									    \
-	return vmcs->controls_shadow.lname;				    \
-}									    \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
-{									    \
-	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
-}									    \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
-}									    \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits)				\
+static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)	\
+{										\
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) {			\
+		vmcs_write##bits(uname, val);					\
+		vmx->loaded_vmcs->controls_shadow.lname = val;			\
+	}									\
+}										\
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)	\
+{										\
+	return vmcs->controls_shadow.lname;					\
+}										\
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)		\
+{										\
+	return __##lname##_controls_get(vmx->loaded_vmcs);			\
+}										\
+static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)	\
+{										\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);		\
+}										\
+static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
+{										\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);		\
 }
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0



* [PATCH v8 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 4/9] KVM: VMX: Report tertiary_exec_control field in dump_vmcs() Zeng Guang
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Detect the tertiary VM-execution controls when setting up the VMCS
config. The tertiary sub-features to be enabled are adjusted according
to hardware capabilities, although no sub-feature is enabled in this
patch.

EVMCSv1 doesn't support the tertiary VM-execution controls, so disable
them when EVMCSv1 is in use. Also define the auxiliary functions for
the tertiary controls field here, using the new BUILD_CONTROLS_SHADOW().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h      |  3 +++
 arch/x86/kvm/vmx/capabilities.h |  7 +++++++
 arch/x86/kvm/vmx/evmcs.c        |  2 ++
 arch/x86/kvm/vmx/evmcs.h        |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 29 ++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h          |  1 +
 7 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..8c929596a299 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -31,6 +31,7 @@
 #define CPU_BASED_RDTSC_EXITING                 VMCS_CONTROL_BIT(RDTSC_EXITING)
 #define CPU_BASED_CR3_LOAD_EXITING		VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
 #define CPU_BASED_CR3_STORE_EXITING		VMCS_CONTROL_BIT(CR3_STORE_EXITING)
+#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS	VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
 #define CPU_BASED_CR8_LOAD_EXITING              VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
 #define CPU_BASED_CR8_STORE_EXITING             VMCS_CONTROL_BIT(CR8_STORE_EXITING)
 #define CPU_BASED_TPR_SHADOW                    VMCS_CONTROL_BIT(VIRTUAL_TPR)
@@ -221,6 +222,8 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
+	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..31f3d88b3e4d 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,6 +59,7 @@ struct vmcs_config {
 	u32 pin_based_exec_ctrl;
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
+	u64 cpu_based_3rd_exec_ctrl;
 	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
 	struct nested_vmx_msrs nested;
@@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 }
 
+static inline bool cpu_has_tertiary_exec_ctrls(void)
+{
+	return vmcs_config.cpu_based_exec_ctrl &
+		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
index 87e3dc10edf4..6a61b1ae7942 100644
--- a/arch/x86/kvm/vmx/evmcs.c
+++ b/arch/x86/kvm/vmx/evmcs.c
@@ -297,8 +297,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
 #if IS_ENABLED(CONFIG_HYPERV)
 __init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
 {
+	vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
 	vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
 	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
 
 	vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
 	vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 8d70f9aea94b..f886a8ff0342 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
  */
 #define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
 				    PIN_BASED_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
 #define EVMCS1_UNSUPPORTED_2NDEXEC					\
 	(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |				\
 	 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |			\
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a816..e18dc68eeeeb 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -50,6 +50,7 @@ struct vmcs_controls_shadow {
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
+	u64 tertiary_exec;
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04d170c4b61e..961e61044341 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2410,6 +2410,15 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
 	return 0;
 }
 
+static __init u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr)
+{
+	u64 allowed;
+
+	rdmsrl(msr, allowed);
+
+	return  ctl_opt & allowed;
+}
+
 static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				    struct vmx_capability *vmx_cap)
 {
@@ -2418,6 +2427,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _pin_based_exec_control = 0;
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
+	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
@@ -2439,7 +2449,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
-	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
+	      CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
 				&_cpu_based_exec_control) < 0)
 		return -EIO;
@@ -2513,6 +2524,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				"1-setting enable VPID VM-execution control\n");
 	}
 
+	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
+		u64 opt3 = 0;
+
+		_cpu_based_3rd_exec_control = adjust_vmx_controls64(opt3,
+					      MSR_IA32_VMX_PROCBASED_CTLS3);
+	}
+
 	min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -2599,6 +2617,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
 	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
 
@@ -4215,6 +4234,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl;
+}
+
 /*
  * Adjust a single secondary execution control bit to intercept/allow an
  * instruction in the guest.  This is usually done based on whether or not a
@@ -4380,6 +4404,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
 
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
+
 	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 122fdbf85a02..85c067f2d7f2 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -484,6 +484,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0



* [PATCH v8 4/9] KVM: VMX: Report tertiary_exec_control field in dump_vmcs()
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (2 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Report the tertiary_exec_control field in dump_vmcs() and reorganize
the dump output of the VMCS Control State category as follows.

Before change:
*** Control State ***
 PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
 EntryControls=0000d1ff ExitControls=002befff

After change:
*** Control State ***
 CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
 PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 961e61044341..f439abd52bad 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5867,6 +5867,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
+	u64 tertiary_exec_control;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -5880,9 +5881,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
 	cr4 = vmcs_readl(GUEST_CR4);
-	secondary_exec_control = 0;
+
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	else
+		secondary_exec_control = 0;
+
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
+	else
+		tertiary_exec_control = 0;
 
 	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
 	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
@@ -5982,9 +5990,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
 
 	pr_err("*** Control State ***\n");
-	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
-	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
-	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
+	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
+	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
+	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
+	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v8 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (3 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 4/9] KVM: VMX: Report tertiary_exec_control field in dump_vmcs() Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
were previously impossible.

Note, x2APIC MSR writes are 64 bits wide.

Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 66b0eb0bda94..137c3a2f5180 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
 #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
 
 static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
 {
@@ -2230,10 +2231,27 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 /* emulate APIC access in a trap manner */
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 {
-	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 val;
+
+	if (apic_x2apic_mode(apic)) {
+		/*
+		 * When guest APIC is in x2APIC mode and IPI virtualization
+		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
+		 * on Intel hardware. Other offsets are not possible.
+		 */
+		if (WARN_ON_ONCE(offset != APIC_ICR))
+			return;
 
-	/* TODO: optimize to just emulate side effect w/o one more write */
-	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
+		kvm_lapic_msr_read(apic, offset, &val);
+		kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
+		trace_kvm_apic_write(APIC_ICR, val);
+	} else {
+		val = kvm_lapic_get_reg(apic, offset);
+
+		/* TODO: optimize to just emulate side effect w/o one more write */
+		kvm_lapic_reg_write(apic, offset, (u32)val);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
 
-- 
2.27.0



* [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (4 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-15 14:39   ` Sean Christopherson
  2022-04-11  9:04 ` [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock Zeng Guang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Maxim Levitsky

From: Maxim Levitsky <mlevitsk@redhat.com>

No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into the APIC acceleration code.

Moreover, recent Intel hardware simply ignores writes to APIC_ID in
xAPIC mode. More background can be found at:
https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/

There is little value in supporting a writable xAPIC ID in the
guest, beyond a few old and unusual use cases which would probably
fail on real hardware anyway. So, make the xAPIC ID read-only for
KVM guests.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 137c3a2f5180..62d5ce4dc0c5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2047,10 +2047,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
-			kvm_apic_set_xapic_id(apic, val >> 24);
-		else
+		if (apic_x2apic_mode(apic)) {
 			ret = 1;
+			break;
+		}
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((val >> 24) != apic->vcpu->vcpu_id) {
+			kvm_vm_bugged(apic->vcpu->kvm);
+			break;
+		}
+
+		kvm_apic_set_xapic_id(apic, val >> 24);
 		break;
 
 	case APIC_TASKPRI:
@@ -2635,11 +2642,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		struct kvm_lapic_state *s, bool set)
 {
-	if (apic_x2apic_mode(vcpu->arch.apic)) {
-		u32 *id = (u32 *)(s->regs + APIC_ID);
-		u32 *ldr = (u32 *)(s->regs + APIC_LDR);
-		u64 icr;
+	u32 *id = (u32 *)(s->regs + APIC_ID);
+	u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+	u64 icr;
 
+	if (!apic_x2apic_mode(vcpu->arch.apic)) {
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((*id >> 24) != vcpu->vcpu_id)
+			return -EINVAL;
+	} else {
 		if (vcpu->kvm->arch.x2apic_format) {
 			if (*id != vcpu->vcpu_id)
 				return -EINVAL;
-- 
2.27.0



* [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (5 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-15 15:00   ` Sean Christopherson
  2022-04-11  9:04 ` [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
  2022-04-11  9:04 ` [PATCH v8 9/9] KVM: VMX: enable IPI virtualization Zeng Guang
  8 siblings, 1 reply; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Arch-specific KVM data may need to be pre-allocated, or otherwise
prepared, before vCPUs are created at runtime. Invoke
kvm_arch_vcpu_precreate() under the protection of kvm->lock directly
in common code, rather than requiring each architecture's
implementation to handle the locking itself.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/s390/kvm/kvm-s390.c | 2 --
 virt/kvm/kvm_main.c      | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 156d1c25a3c1..5c795bbcf1ea 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3042,9 +3042,7 @@ static int sca_can_add_vcpu(struct kvm *kvm, unsigned int id)
 	if (!sclp.has_esca || !sclp.has_64bscao)
 		return false;
 
-	mutex_lock(&kvm->lock);
 	rc = kvm->arch.use_esca ? 0 : sca_switch_to_extended(kvm);
-	mutex_unlock(&kvm->lock);
 
 	return rc == 0 && id < KVM_S390_ESCA_CPU_SLOTS;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70e05af5ebea..a452e678a015 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3732,9 +3732,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	}
 
 	kvm->created_vcpus++;
+	r = kvm_arch_vcpu_precreate(kvm, id);
 	mutex_unlock(&kvm->lock);
 
-	r = kvm_arch_vcpu_precreate(kvm, id);
 	if (r)
 		goto vcpu_decrement;
 
-- 
2.27.0



* [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (6 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-15 15:01   ` Sean Christopherson
  2022-04-11  9:04 ` [PATCH v8 9/9] KVM: VMX: enable IPI virtualization Zeng Guang
  8 siblings, 1 reply; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Introduce a new max_vcpu_ids field in KVM for the x86 architecture.
Userspace can set the maximum possible vCPU ID for the current VM
session using the KVM_CAP_MAX_VCPU_ID capability of the
KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for the PID-pointer table, a structure needed to
enable VMX IPI virtualization.

By default, max_vcpu_ids is set to KVM_MAX_VCPU_IDS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 Documentation/virt/kvm/api.rst  | 17 +++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  6 ++++++
 arch/x86/kvm/x86.c              | 18 +++++++++++++++++-
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index d13fa6600467..bb0b0f3edefe 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7136,6 +7136,23 @@ The valid bits in cap.args[0] are:
                                     IA32_MISC_ENABLE[bit 18] is cleared.
 =================================== ============================================
 
+7.32 KVM_CAP_MAX_VCPU_ID
+------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] - maximum APIC ID value set for current VM
+:Returns: 0 on success, -EINVAL if args[0] is beyond KVM_MAX_VCPU_IDS
+          supported in KVM or if vCPU has been created.
+
+Userspace can derive the limit on APIC ID values from the designated CPU
+topology. This capability allows userspace to specify the maximum possible
+APIC ID for the current VM session prior to the creation of vCPUs, so that
+KVM can size VM-scope structures whose size depends on the maximum APIC ID.
+
+Calling KVM_CHECK_EXTENSION for this capability returns the maximum APIC ID
+that KVM supports at runtime. It defaults to KVM_MAX_VCPU_IDS at VM creation.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d23e80a56eb8..cdd14033988d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1238,6 +1238,12 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+	/*
+	 * VM-scope maximum vCPU ID. Used to determine the size of structures
+	 * that increase along with the maximum vCPU ID, in which case, using
+	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
+	 */
+	u32 max_vcpu_ids;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c0ca599a353..d1a39285deab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4320,7 +4320,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = KVM_MAX_VCPUS;
 		break;
 	case KVM_CAP_MAX_VCPU_ID:
-		r = KVM_MAX_VCPU_IDS;
+		r = kvm->arch.max_vcpu_ids;
 		break;
 	case KVM_CAP_PV_MMU:	/* obsolete */
 		r = 0;
@@ -6064,6 +6064,18 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_MAX_VCPU_ID:
+		r = -EINVAL;
+		if (cap->args[0] > KVM_MAX_VCPU_IDS)
+			break;
+
+		mutex_lock(&kvm->lock);
+		if (!kvm->created_vcpus) {
+			kvm->arch.max_vcpu_ids = cap->args[0];
+			r = 0;
+		}
+		mutex_unlock(&kvm->lock);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -11180,6 +11192,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	struct page *page;
 	int r;
 
+	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_ids)
+		return -EINVAL;
+
 	vcpu->arch.last_vmentry_cpu = -1;
 	vcpu->arch.regs_avail = ~0;
 	vcpu->arch.regs_dirty = ~0;
@@ -11704,6 +11719,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
+	kvm->arch.max_vcpu_ids = KVM_MAX_VCPU_IDS;
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
-- 
2.27.0



* [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
                   ` (7 preceding siblings ...)
  2022-04-11  9:04 ` [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-04-11  9:04 ` Zeng Guang
  2022-04-15 15:25   ` Sean Christopherson
  2022-04-15 15:45   ` Sean Christopherson
  8 siblings, 2 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-11  9:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

From: Chao Gao <chao.gao@intel.com>

With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in the target vCPU's PIR and may send a
notification (IPI) specified by the NDST and NV fields in the target
vCPU's Posted-Interrupt Descriptor (PID). This is similar to what the
IOMMU engine does when dealing with posted interrupts from devices.

A PID-pointer table is used by the processor to locate the PID of a
vCPU via the vCPU's APIC ID. The table size depends on the maximum
APIC ID assigned to the current VM session by userspace. Allocating
memory for the PID-pointer table is deferred until vCPU creation,
because the irqchip mode and the VM-scope maximum APIC ID are settled
at that point. KVM can skip PID-pointer table allocation if
!irqchip_in_kernel().

Like VT-d PI, if a vCPU goes to the blocked state, the VMM needs to
switch its notification vector to the wakeup vector. This ensures
that when an IPI for a blocked vCPU arrives, the VMM gets control and
can wake up the blocked vCPU. And if a vCPU is preempted, its posted
interrupt notification is suppressed.

Note that IPI virtualization can only virtualize physical-addressing,
flat-mode, unicast IPIs. Sending other IPIs still causes a trap-like
APIC-write VM-exit and must be handled by the VMM.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/asm/vmx.h         |  8 +++
 arch/x86/include/asm/vmxfeatures.h |  2 +
 arch/x86/kvm/vmx/capabilities.h    |  6 ++
 arch/x86/kvm/vmx/posted_intr.c     | 15 ++++-
 arch/x86/kvm/vmx/posted_intr.h     |  2 +
 arch/x86/kvm/vmx/vmx.c             | 89 ++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h             |  7 +++
 arch/x86/kvm/x86.c                 |  6 +-
 10 files changed, 124 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 3c368b639c04..357757e7227f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -126,6 +126,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
 KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP_OPTIONAL(alloc_ipiv_pid_table)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdd14033988d..6c5ec99b22d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1505,6 +1505,7 @@ struct kvm_x86_ops {
 	int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+	int (*alloc_ipiv_pid_table)(struct kvm *kvm);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8c929596a299..b79b6438acaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -76,6 +76,11 @@
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
 
+/*
+ * Definitions of Tertiary Processor-Based VM-Execution Controls.
+ */
+#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
+
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
 #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
@@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
 	POSTED_INTR_NV                  = 0x00000002,
+	LAST_PID_POINTER_INDEX		= 0x00000008,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -224,6 +230,8 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	PID_POINTER_TABLE		= 0x00002042,
+	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index ff20776dc83b..589608c157bf 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -86,4 +86,6 @@
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
 
+/* Tertiary Processor-Based VM-Execution Controls, word 3 */
+#define VMX_FEATURE_IPI_VIRT		( 3*32+  4) /* Enable IPI virtualization */
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 31f3d88b3e4d..5f656c9e33be 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
+extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
@@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
 		cpu_has_vmx_posted_intr();
 }
 
+static inline bool cpu_has_vmx_ipiv(void)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 3834bb30ce54..1b12f9cfa280 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -177,11 +177,24 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	local_irq_restore(flags);
 }
 
+static bool vmx_can_use_pi_wakeup(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * If a blocked vCPU can be the target of posted interrupts,
+	 * switching notification vector is needed so that kernel can
+	 * be informed when an interrupt is posted and get the chance
+	 * to wake up the blocked vCPU. For now, using posted interrupt
+	 * for vCPU wakeup when IPI virtualization or VT-d PI can be
+	 * enabled.
+	 */
+	return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
+}
+
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!vmx_can_use_vtd_pi(vcpu->kvm))
+	if (!vmx_can_use_pi_wakeup(vcpu))
 		return;
 
 	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 9a45d5c9f116..26992076552e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -5,6 +5,8 @@
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1
 
+#define PID_TABLE_ENTRY_VALID 1
+
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
 	u32 pir[8];     /* Posted interrupt requested */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f439abd52bad..a5ef39e53d51 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
 
 module_param(enable_apicv, bool, S_IRUGO);
 
+bool __read_mostly enable_ipiv = true;
+module_param(enable_ipiv, bool, 0444);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -2525,7 +2528,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	}
 
 	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
-		u64 opt3 = 0;
+		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
 
 		_cpu_based_3rd_exec_control = adjust_vmx_controls64(opt3,
 					      MSR_IA32_VMX_PROCBASED_CTLS3);
@@ -3872,6 +3875,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
 		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+		if (enable_ipiv)
+			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
 	}
 }
 
@@ -4194,15 +4199,19 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
-	if (cpu_has_secondary_exec_ctrls()) {
-		if (kvm_vcpu_apicv_active(vcpu))
-			secondary_exec_controls_setbit(vmx,
-				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
-				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		else
-			secondary_exec_controls_clearbit(vmx,
-					SECONDARY_EXEC_APIC_REGISTER_VIRT |
-					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+
+	if (kvm_vcpu_apicv_active(vcpu)) {
+		secondary_exec_controls_setbit(vmx,
+			      SECONDARY_EXEC_APIC_REGISTER_VIRT |
+			      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+		if (enable_ipiv)
+			tertiary_exec_controls_setbit(vmx, TERTIARY_EXEC_IPI_VIRT);
+	} else {
+		secondary_exec_controls_clearbit(vmx,
+			      SECONDARY_EXEC_APIC_REGISTER_VIRT |
+			      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+		if (enable_ipiv)
+			tertiary_exec_controls_clearbit(vmx, TERTIARY_EXEC_IPI_VIRT);
 	}
 
 	vmx_update_msr_bitmap_x2apic(vcpu);
@@ -4236,7 +4245,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 
 static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
 {
-	return vmcs_config.cpu_based_3rd_exec_ctrl;
+	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
+
+	/*
+	 * IPI virtualization relies on APICv. Disable IPI virtualization if
+	 * APICv is inhibited.
+	 */
+	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
+		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
+
+	return exec_control;
 }
 
 /*
@@ -4384,10 +4402,37 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+int vmx_get_pid_table_order(struct kvm_vmx *kvm_vmx)
+{
+	return get_order(kvm_vmx->kvm.arch.max_vcpu_ids * sizeof(*kvm_vmx->pid_table));
+}
+
+static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
+{
+	struct page *pages;
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+	if (!irqchip_in_kernel(kvm) || !enable_ipiv)
+		return 0;
+	if (kvm_vmx->pid_table)
+		return 0;
+
+	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+			    vmx_get_pid_table_order(kvm_vmx));
+
+	if (!pages)
+		return -ENOMEM;
+
+	kvm_vmx->pid_table = (void *)page_address(pages);
+	return 0;
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 
 static void init_vmcs(struct vcpu_vmx *vmx)
 {
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);
+
 	if (nested)
 		nested_vmx_set_vmcs_shadowing_bitmap();
 
@@ -4419,6 +4464,11 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
 	}
 
+	if (vmx_can_use_ipiv(&vmx->vcpu)) {
+		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->kvm.arch.max_vcpu_ids - 1);
+	}
+
 	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmx->ple_window = ple_window;
@@ -7112,6 +7162,10 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 			goto free_vmcs;
 	}
 
+	if (vmx_can_use_ipiv(vcpu))
+		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
+			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+
 	return 0;
 
 free_vmcs:
@@ -7746,6 +7800,14 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 	return supported & BIT(reason);
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+	if (kvm_vmx->pid_table)
+		free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm_vmx));
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -7757,6 +7819,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
+	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_create = vmx_vcpu_create,
 	.vcpu_free = vmx_vcpu_free,
@@ -7880,6 +7943,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+	.alloc_ipiv_pid_table = vmx_alloc_ipiv_pid_table,
 };
 
 static unsigned int vmx_handle_intel_pt_intr(void)
@@ -8011,6 +8075,9 @@ static __init int hardware_setup(void)
 	if (!enable_apicv)
 		vmx_x86_ops.sync_pir_to_irr = NULL;
 
+	if (!enable_apicv || !cpu_has_vmx_ipiv())
+		enable_ipiv = false;
+
 	if (cpu_has_vmx_tsc_scaling())
 		kvm_has_tsc_control = true;
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 85c067f2d7f2..4ab66b683624 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -365,6 +365,8 @@ struct kvm_vmx {
 	unsigned int tss_addr;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	/* Posted Interrupt Descriptor (PID) table for IPI virtualization */
+	u64 *pid_table;
 };
 
 bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
@@ -580,4 +582,9 @@ static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
 	return (vmx_instr_info >> 28) & 0xf;
 }
 
+static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
+{
+	return  lapic_in_kernel(vcpu) && enable_ipiv;
+}
+
 #endif /* __KVM_X86_VMX_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d1a39285deab..23fbf52f7bea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
 {
+	int ret = 0;
+
 	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
 		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
 			     "guest TSC will not be reliable\n");
 
-	return 0;
+	if (kvm_x86_ops.alloc_ipiv_pid_table)
+		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);
+	return ret;
 }
 
 int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
-- 
2.27.0



* Re: [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-04-11  9:04 ` [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
@ 2022-04-15 14:39   ` Sean Christopherson
  2022-04-19 14:07     ` Maxim Levitsky
  0 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 14:39 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Maxim Levitsky

On Mon, Apr 11, 2022, Zeng Guang wrote:
> From: Maxim Levitsky <mlevitsk@redhat.com>
> 
> No normal guest has any reason to change physical APIC IDs, and
> allowing this introduces bugs into APIC acceleration code.
> 
> And Intel recent hardware just ignores writes to APIC_ID in
> xAPIC mode. More background can be found at:
> https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> 
> Looks there is no much value to support writable xAPIC ID in
> guest except supporting some old and crazy use cases which
> probably would fail on real hardware. So, make xAPIC ID
> read-only for KVM guests.

AFAIK, the plan is to add a capability to let userspace opt in to a fully read-only
APIC ID[*], but I haven't seen patches...

Maxim?

[*] https://lore.kernel.org/all/c903e82ed2a1e98f66910c35b5aabdcf56e08e72.camel@redhat.com


* Re: [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock
  2022-04-11  9:04 ` [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock Zeng Guang
@ 2022-04-15 15:00   ` Sean Christopherson
  2022-04-15 15:11     ` Sean Christopherson
  0 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 15:00 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Marc Zyngier, James Morse,
	Alexandru Elisei, Suzuki K Poulose, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, David Hildenbrand

Heh, lots of people cc'd, but none of the people whose code this affects.

+s390 and arm folks

On Mon, Apr 11, 2022, Zeng Guang wrote:
> Arch specific KVM common data may require pre-allocation or other
> preprocess ready before vCPU creation at runtime.

Please provide the specific motivation for the move, i.e. explain the desire to
do per-VM allocations based on max_vcpu_ids at the first vCPU creation.

> It's safe to invoke kvm_arch_vcpu_precreate() within the protection of
> kvm->lock directly rather than take into account in the implementation for
> each architecture.

This absolutely needs to explain _why_ it's safe, e.g. only arm64, x86, and s390
have non-nop implementations and they're all simple and short with no tendrils
into other code that might take kvm->lock.

And as before, I suspect arm64 needs this protection, the vgic_initialized()
check looks racy.  Though it's hard to tell if doing the check under kvm->lock
actually fixes anything.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/s390/kvm/kvm-s390.c | 2 --
>  virt/kvm/kvm_main.c      | 2 +-

I think it's also worth changing x86's implementation to check created_vcpus
instead of online_vcpus.  That'll fix a race where userspace could never see the
pr_warn() (which is arguably useless, but whatever), e.g. if it creates a VM with
2 vCPUs and both simultaneously go through kvm_arch_vcpu_precreate().

>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 156d1c25a3c1..5c795bbcf1ea 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -3042,9 +3042,7 @@ static int sca_can_add_vcpu(struct kvm *kvm, unsigned int id)
>  	if (!sclp.has_esca || !sclp.has_64bscao)
>  		return false;
>  
> -	mutex_lock(&kvm->lock);
>  	rc = kvm->arch.use_esca ? 0 : sca_switch_to_extended(kvm);
> -	mutex_unlock(&kvm->lock);
>  
>  	return rc == 0 && id < KVM_S390_ESCA_CPU_SLOTS;
>  }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70e05af5ebea..a452e678a015 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3732,9 +3732,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  	}
>  
>  	kvm->created_vcpus++;
> +	r = kvm_arch_vcpu_precreate(kvm, id);

Hmm, so I think I'd prefer this to be invoked before bumping created_vcpus.  The
existing implementations don't reference created_vcpus, so there's no change needed
to existing code.  Logically, a pre-create helper should not see a non-zero count,
as the "pre" part strongly implies it's being called _before_ creating the first vCPU.

Then switching from online_vcpus to created_vcpus in the x86 implementation doesn't
need to have a weird change from "> 0" => "> 1".

Ah, and then it also wouldn't have goofy behavior where it drops and reacquires
kvm->lock on failure just to decrement created_vcpus.

>  	mutex_unlock(&kvm->lock);
>  
> -	r = kvm_arch_vcpu_precreate(kvm, id);
>  	if (r)
>  		goto vcpu_decrement;
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-04-11  9:04 ` [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-04-15 15:01   ` Sean Christopherson
  0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 15:01 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Mon, Apr 11, 2022, Zeng Guang wrote:
> @@ -11180,6 +11192,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	struct page *page;
>  	int r;
>  
> +	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_ids)

This belongs in pre-create.

> +		return -EINVAL;
> +
>  	vcpu->arch.last_vmentry_cpu = -1;
>  	vcpu->arch.regs_avail = ~0;
>  	vcpu->arch.regs_dirty = ~0;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock
  2022-04-15 15:00   ` Sean Christopherson
@ 2022-04-15 15:11     ` Sean Christopherson
  0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 15:11 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Marc Zyngier, James Morse,
	Alexandru Elisei, Suzuki K Poulose, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, David Hildenbrand

On Fri, Apr 15, 2022, Sean Christopherson wrote:
> > It's safe to invoke kvm_arch_vcpu_precreate() within the protection of
> > kvm->lock directly rather than take into account in the implementation for
> > each architecture.
> 
> This absolutely needs to explain _why_ it's safe, e.g. only arm64, x86, and s390
> have non-nop implementations and they're all simple and short with no tendrils
> into other code that might take kvm->lock.
> 
> And as before, I suspect arm64 needs this protection, the vgic_initialized()
> check looks racy.  Though it's hard to tell if doing the check under kvm->lock
> actually fixes anything.

Ah, I bet this code in vgic_init() provides the necessary protection.

	/* Are we also in the middle of creating a VCPU? */
	if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
		return -EBUSY;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-11  9:04 ` [PATCH v8 9/9] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2022-04-15 15:25   ` Sean Christopherson
  2022-04-18  9:25     ` Chao Gao
  2022-04-18 12:49     ` Zeng Guang
  2022-04-15 15:45   ` Sean Christopherson
  1 sibling, 2 replies; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 15:25 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Mon, Apr 11, 2022, Zeng Guang wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d1a39285deab..23fbf52f7bea 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
>  
>  int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
>  {
> +	int ret = 0;
> +
>  	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
>  		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
>  			     "guest TSC will not be reliable\n");
>  
> -	return 0;
> +	if (kvm_x86_ops.alloc_ipiv_pid_table)
> +		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);

Add a generic kvm_x86_ops.vcpu_precreate, no reason to make this so specific.
And use KVM_X86_OP_RET0 instead of KVM_X86_OP_OPTIONAL, then this can simply be

	return static_call(kvm_x86_vcpu_precreate)(kvm);

That said, there's a flaw in my genius plan.

  1. KVM_CREATE_VM
  2. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=1
  3. KVM_CREATE_VCPU, create IPIv table but ultimately fails
  4. KVM decrements created_vcpus back to '0'
  5. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=4096
  6. KVM_CREATE_VCPU w/ ID out of range

In other words, malicious userspace could trigger buffer overflow.

That could be solved by adding an arch hook to undo precreate, but that's gross
and a good indication that we're trying to solve this the wrong way.

I think it's high time we add KVM_FINALIZE_VM, though that's probably a bad name
since e.g. TDX wants to use that name for VM really, really, being finalized[*],
i.e. after all vCPUs have been created.

KVM_POST_CREATE_VM?  That's not very good either.

Paolo or anyone else, thoughts?

[*] https://lore.kernel.org/all/83768bf0f786d24f49d9b698a45ba65441ef5ef0.1646422845.git.isaku.yamahata@intel.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-11  9:04 ` [PATCH v8 9/9] KVM: VMX: enable IPI virtualization Zeng Guang
  2022-04-15 15:25   ` Sean Christopherson
@ 2022-04-15 15:45   ` Sean Christopherson
  1 sibling, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2022-04-15 15:45 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Mon, Apr 11, 2022, Zeng Guang wrote:
> @@ -4194,15 +4199,19 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
> -	if (cpu_has_secondary_exec_ctrls()) {
> -		if (kvm_vcpu_apicv_active(vcpu))
> -			secondary_exec_controls_setbit(vmx,
> -				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
> -				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> -		else
> -			secondary_exec_controls_clearbit(vmx,
> -					SECONDARY_EXEC_APIC_REGISTER_VIRT |
> -					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +
> +	if (kvm_vcpu_apicv_active(vcpu)) {
> +		secondary_exec_controls_setbit(vmx,
> +			      SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +		if (enable_ipiv)
> +			tertiary_exec_controls_setbit(vmx, TERTIARY_EXEC_IPI_VIRT);
> +	} else {
> +		secondary_exec_controls_clearbit(vmx,
> +			      SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);

Thanks for doing this, but can you move it to a separate patch?  Just in case
we're missing something and this somehow explodes.

> +		if (enable_ipiv)
> +			tertiary_exec_controls_clearbit(vmx, TERTIARY_EXEC_IPI_VIRT);
>  	}
>  
>  	vmx_update_msr_bitmap_x2apic(vcpu);
> @@ -4236,7 +4245,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>  
>  static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>  {
> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
> +
> +	/*
> +	 * IPI virtualization relies on APICv. Disable IPI virtualization if
> +	 * APICv is inhibited.
> +	 */
> +	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
> +
> +	return exec_control;
>  }
>  
>  /*
> @@ -4384,10 +4402,37 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	return exec_control;
>  }
>  
> +int vmx_get_pid_table_order(struct kvm_vmx *kvm_vmx)
> +{
> +	return get_order(kvm_vmx->kvm.arch.max_vcpu_ids * sizeof(*kvm_vmx->pid_table));

I think it's slightly less gross to take @kvm and then:

	return get_order(kvm->arch.max_vcpu_ids * sizeof(*to_kvm_vmx(kvm)->pid_table));

> +}
> +
> +static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
> +{
> +	struct page *pages;
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
> +
> +	if (!irqchip_in_kernel(kvm) || !enable_ipiv)
> +		return 0;

Newline here please.

> +	if (kvm_vmx->pid_table)

Note, this check goes away if this ends up being called from a dedicated ioctl.

> +		return 0;
> +
> +	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> +			    vmx_get_pid_table_order(kvm_vmx));
> +

But no newline here please :-)

> +	if (!pages)
> +		return -ENOMEM;
> +
> +	kvm_vmx->pid_table = (void *)page_address(pages);
> +	return 0;
> +}
> +
>  #define VMX_XSS_EXIT_BITMAP 0
>  
>  static void init_vmcs(struct vcpu_vmx *vmx)
>  {
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vmx->vcpu.kvm);

Might be worth doing:

	struct kvm *kvm = vmx->vcpu.kvm;
	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);

The kvm_vmx->kvm.arch below is kinda funky.

Ah yeah, do that, then e.g. the kvm_pause_in_guest() call doesn't need to get
'kvm' itself.

> +
>  	if (nested)
>  		nested_vmx_set_vmcs_shadowing_bitmap();
>  
> @@ -4419,6 +4464,11 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>  	}
>  
> +	if (vmx_can_use_ipiv(&vmx->vcpu)) {
> +		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
> +		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->kvm.arch.max_vcpu_ids - 1);
> +	}
> +
>  	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
>  		vmcs_write32(PLE_GAP, ple_gap);
>  		vmx->ple_window = ple_window;
> @@ -7112,6 +7162,10 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
>  			goto free_vmcs;
>  	}
>  
> +	if (vmx_can_use_ipiv(vcpu))
> +		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
> +			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
> +
>  	return 0;
>  
>  free_vmcs:
> @@ -7746,6 +7800,14 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
>  	return supported & BIT(reason);
>  }
>  
> +static void vmx_vm_destroy(struct kvm *kvm)
> +{
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
> +
> +	if (kvm_vmx->pid_table)
> +		free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm_vmx));

free_pages() does the != 0 check, no need to handle that here.  I agree it feels
weird, but it's well-established behavior.

> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  	.name = "kvm_intel",
>  

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-15 15:25   ` Sean Christopherson
@ 2022-04-18  9:25     ` Chao Gao
  2022-04-18 15:14       ` Sean Christopherson
  2022-04-18 12:49     ` Zeng Guang
  1 sibling, 1 reply; 23+ messages in thread
From: Chao Gao @ 2022-04-18  9:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu

On Fri, Apr 15, 2022 at 03:25:06PM +0000, Sean Christopherson wrote:
>On Mon, Apr 11, 2022, Zeng Guang wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index d1a39285deab..23fbf52f7bea 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
>>  
>>  int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
>>  {
>> +	int ret = 0;
>> +
>>  	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
>>  		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
>>  			     "guest TSC will not be reliable\n");
>>  
>> -	return 0;
>> +	if (kvm_x86_ops.alloc_ipiv_pid_table)
>> +		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);
>
>Add a generic kvm_x86_ops.vcpu_precreate, no reason to make this so specific.
>And use KVM_X86_OP_RET0 instead of KVM_X86_OP_OPTIONAL, then this can simply be
>
>	return static_call(kvm_x86_vcpu_precreate);
>
>That said, there's a flaw in my genius plan.
>
>  1. KVM_CREATE_VM
>  2. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=1
>  3. KVM_CREATE_VCPU, create IPIv table but ultimately fails
>  4. KVM decrements created_vcpus back to '0'
>  5. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=4096
>  6. KVM_CREATE_VCPU w/ ID out of range
>
>In other words, malicious userspace could trigger buffer overflow.

can we simply return an error (e.g., -EEXIST) on step 5 (i.e.,
max_vcpu_ids cannot be changed after being set once)?

or

can we detect the change of max_vcpu_ids in step 6 and re-allocate PID
table?

>
>That could be solved by adding an arch hook to undo precreate, but that's gross
>and a good indication that we're trying to solve this the wrong way.
>
>I think it's high time we add KVM_FINALIZE_VM, though that's probably a bad name
>since e.g. TDX wants to use that name for VM really, really, being finalized[*],
>i.e. after all vCPUs have been created.
>
>KVM_POST_CREATE_VM?  That's not very good either.
>
>Paolo or anyone else, thoughts?
>
>[*] https://lore.kernel.org/all/83768bf0f786d24f49d9b698a45ba65441ef5ef0.1646422845.git.isaku.yamahata@intel.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-15 15:25   ` Sean Christopherson
  2022-04-18  9:25     ` Chao Gao
@ 2022-04-18 12:49     ` Zeng Guang
  1 sibling, 0 replies; 23+ messages in thread
From: Zeng Guang @ 2022-04-18 12:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Luck, Tony, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Huang, Kai, x86,
	linux-kernel, Hu, Robert, Gao, Chao


On 4/15/2022 11:25 PM, Sean Christopherson wrote:
> On Mon, Apr 11, 2022, Zeng Guang wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index d1a39285deab..23fbf52f7bea 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
>>   
>>   int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
>>   {
>> +	int ret = 0;
>> +
>>   	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
>>   		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
>>   			     "guest TSC will not be reliable\n");
>>   
>> -	return 0;
>> +	if (kvm_x86_ops.alloc_ipiv_pid_table)
>> +		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);
> Add a generic kvm_x86_ops.vcpu_precreate, no reason to make this so specific.
> And use KVM_X86_OP_RET0 instead of KVM_X86_OP_OPTIONAL, then this can simply be
>
> 	return static_call(kvm_x86_vcpu_precreate);
>
> That said, there's a flaw in my genius plan.
>
>    1. KVM_CREATE_VM
>    2. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=1
>    3. KVM_CREATE_VCPU, create IPIv table but ultimately fails
>    4. KVM decrements created_vcpus back to '0'
>    5. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=4096
>    6. KVM_CREATE_VCPU w/ ID out of range
>
> In other words, malicious userspace could trigger buffer overflow.


This is the tricky exploit that updates max_vcpu_ids multiple times. I think
we can avoid this issue by checking PID-table availability during the
pre-creation of the first vCPU. If it's already allocated, free it and
re-allocate to accommodate the table size for the new max_vcpu_ids if updated.

> That could be solved by adding an arch hook to undo precreate, but that's gross
> and a good indication that we're trying to solve this the wrong way.
>
> I think it's high time we add KVM_FINALIZE_VM, though that's probably a bad name
> since e.g. TDX wants to use that name for VM really, really, being finalized[*],
> i.e. after all vCPUs have been created.
>
> KVM_POST_CREATE_VM?  That's not very good either.
>
> Paolo or anyone else, thoughts?
>
> [*] https://lore.kernel.org/all/83768bf0f786d24f49d9b698a45ba65441ef5ef0.1646422845.git.isaku.yamahata@intel.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-18  9:25     ` Chao Gao
@ 2022-04-18 15:14       ` Sean Christopherson
  2022-04-19  0:00         ` Chao Gao
  0 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2022-04-18 15:14 UTC (permalink / raw)
  To: Chao Gao
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu

On Mon, Apr 18, 2022, Chao Gao wrote:
> On Fri, Apr 15, 2022 at 03:25:06PM +0000, Sean Christopherson wrote:
> >On Mon, Apr 11, 2022, Zeng Guang wrote:
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index d1a39285deab..23fbf52f7bea 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
> >>  
> >>  int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
> >>  {
> >> +	int ret = 0;
> >> +
> >>  	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
> >>  		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
> >>  			     "guest TSC will not be reliable\n");
> >>  
> >> -	return 0;
> >> +	if (kvm_x86_ops.alloc_ipiv_pid_table)
> >> +		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);
> >
> >Add a generic kvm_x86_ops.vcpu_precreate, no reason to make this so specific.
> >And use KVM_X86_OP_RET0 instead of KVM_X86_OP_OPTIONAL, then this can simply be
> >
> >	return static_call(kvm_x86_vcpu_precreate);
> >
> >That said, there's a flaw in my genius plan.
> >
> >  1. KVM_CREATE_VM
> >  2. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=1
> >  3. KVM_CREATE_VCPU, create IPIv table but ultimately fails
> >  4. KVM decrements created_vcpus back to '0'
> >  5. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=4096
> >  6. KVM_CREATE_VCPU w/ ID out of range
> >
> >In other words, malicious userspace could trigger buffer overflow.
> 
> can we simply return an error (e.g., -EEXIST) on step 5 (i.e.,
> max_vcpu_ids cannot be changed after being set once)?
> 
> or
> 
> can we detect the change of max_vcpu_ids in step 6 and re-allocate PID
> table?

Returning an error is viable, but would be a rather odd ABI.  Re-allocating isn't
a good option because the PID table could be in active use by other vCPUs, e.g.
KVM would need to send a request and kick all vCPUs to have all vCPUs update their
VMCS.

And with both of those alternatives, I still don't like that every feature that
acts on max_vcpu_ids would need to handle this same edge case.

An alternative to another new ioctl() would be to make KVM_CAP_MAX_VCPU_ID
write-once, i.e. reject attempts to change the max once set (though we could allow
re-writing the same value).  I think I like that idea better than adding an ioctl().

It can even be done without an extra flag by zero-initializing the field and instead
waiting until vCPU pre-create to lock in the value.  That would also help detect
bad usage of max_vcpu_ids, especially if we added a wrapper to get the value, e.g.
the wrapper could WARN_ON(!kvm->arch.max_vcpu_ids).

E.g.

int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
{
	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
			     "guest TSC will not be reliable\n");

	if (!kvm->arch.max_vcpu_ids)
		kvm->arch.max_vcpu_ids = KVM_MAX_VCPU_IDS;

	return 0;
}


	case KVM_CAP_MAX_VCPU_ID:
		r = -EINVAL;
		if (cap->args[0] > KVM_MAX_VCPU_IDS)
			break;

		mutex_lock(&kvm->lock);
                if (kvm->arch.max_vcpu_ids == cap->args[0]) {
                        r = 0;
                } else if (!kvm->arch.max_vcpu_ids) {
			kvm->arch.max_vcpu_ids = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 9/9] KVM: VMX: enable IPI virtualization
  2022-04-18 15:14       ` Sean Christopherson
@ 2022-04-19  0:00         ` Chao Gao
  0 siblings, 0 replies; 23+ messages in thread
From: Chao Gao @ 2022-04-19  0:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu

On Mon, Apr 18, 2022 at 03:14:51PM +0000, Sean Christopherson wrote:
>On Mon, Apr 18, 2022, Chao Gao wrote:
>> On Fri, Apr 15, 2022 at 03:25:06PM +0000, Sean Christopherson wrote:
>> >On Mon, Apr 11, 2022, Zeng Guang wrote:
>> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> >> index d1a39285deab..23fbf52f7bea 100644
>> >> --- a/arch/x86/kvm/x86.c
>> >> +++ b/arch/x86/kvm/x86.c
>> >> @@ -11180,11 +11180,15 @@ static int sync_regs(struct kvm_vcpu *vcpu)
>> >>  
>> >>  int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
>> >>  {
>> >> +	int ret = 0;
>> >> +
>> >>  	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
>> >>  		pr_warn_once("kvm: SMP vm created on host with unstable TSC; "
>> >>  			     "guest TSC will not be reliable\n");
>> >>  
>> >> -	return 0;
>> >> +	if (kvm_x86_ops.alloc_ipiv_pid_table)
>> >> +		ret = static_call(kvm_x86_alloc_ipiv_pid_table)(kvm);
>> >
>> >Add a generic kvm_x86_ops.vcpu_precreate, no reason to make this so specific.
>> >And use KVM_X86_OP_RET0 instead of KVM_X86_OP_OPTIONAL, then this can simply be
>> >
>> >	return static_call(kvm_x86_vcpu_precreate);
>> >
>> >That said, there's a flaw in my genius plan.
>> >
>> >  1. KVM_CREATE_VM
>> >  2. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=1
>> >  3. KVM_CREATE_VCPU, create IPIv table but ultimately fails
>> >  4. KVM decrements created_vcpus back to '0'
>> >  5. KVM_CAP_MAX_VCPU_ID, set max_vcpu_ids=4096
>> >  6. KVM_CREATE_VCPU w/ ID out of range
>> >
>> >In other words, malicious userspace could trigger buffer overflow.
>> 
>> can we simply return an error (e.g., -EEXIST) on step 5 (i.e.,
>> max_vcpu_ids cannot be changed after being set once)?
>> 
>> or
>> 
>> can we detect the change of max_vcpu_ids in step 6 and re-allocate PID
>> table?
>
>Returning an error is viable, but would be a rather odd ABI.  Re-allocating isn't
>a good option because the PID table could be in active use by other vCPUs, e.g.
>KVM would need to send a request and kick all vCPUs to have all vCPUs update their
>VMCS.
>
>And with both of those alternatives, I still don't like that every feature that
>acts on max_vcpu_ids would need to handle this same edge case.
>
>An alternative to another new ioctl() would be to to make KVM_CAP_MAX_VCPU_ID
>write-once, i.e. reject attempts to change the max once set (though we could allow
>re-writing the same value).  I think I like that idea better than adding an ioctl().
>
>It can even be done without an extra flag by zero-initializing the field and instead
>waiting until vCPU pre-create to lock in the value.  That would also help detect
>bad usage of max_vcpu_ids, especially if we added a wrapper to get the value, e.g.
>the wrapper could WARN_ON(!kvm->arch.max_vcpu_ids).

Yes, it looks simpler than adding an ioctl(). We will use this approach.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-04-15 14:39   ` Sean Christopherson
@ 2022-04-19 14:07     ` Maxim Levitsky
  2022-04-26  8:14       ` Maxim Levitsky
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Levitsky @ 2022-04-19 14:07 UTC (permalink / raw)
  To: Sean Christopherson, Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-04-15 at 14:39 +0000, Sean Christopherson wrote:
> On Mon, Apr 11, 2022, Zeng Guang wrote:
> > From: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > No normal guest has any reason to change physical APIC IDs, and
> > allowing this introduces bugs into APIC acceleration code.
> > 
> > And Intel recent hardware just ignores writes to APIC_ID in
> > xAPIC mode. More background can be found at:
> > https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> > 
> > Looks there is no much value to support writable xAPIC ID in
> > guest except supporting some old and crazy use cases which
> > probably would fail on real hardware. So, make xAPIC ID
> > read-only for KVM guests.
> 
> AFAIK, the plan is to add a capability to let userspace opt-in to a fully read-only
> APIC ID[*], but I haven't seen patches...
> 
> Maxim?

Yep, I will start working on this pretty much today.

I was busy the last 3 weeks stabilizing nested AVIC
(I am getting ~600,000 IPIs/s instead of ~40,000 in an L2 VM with nested AVIC!),
with 700,000-900,000 IPIs/s native with AVIC,
almost bare-metal IPI performance in L2!

(The result is from a test, which I will soon publish, that makes all
vCPUs send IPIs in round-robin fashion; a vCPU sends an IPI only after
it received one from the previous vCPU. The number is the total
number of IPIs sent on 8 vCPUs.)


The fact that the dreadful AVIC errata dominates my testing again
supports my feeling that I have mostly fixed the nested AVIC bugs.
Tomorrow I'll send RFC v2 of the patches.


About the read-only APIC ID cap,
I have a few questions before I start implementing it:

Paolo recently gave me an idea to make the APIC ID always read-only for
guest writes, and only allow host APIC ID changes (unless the cap is set).

I am kind of torn about it: assuming that no guest writes the APIC ID, this
will work just fine in a vacuous sense, but if a guest actually writes an
APIC ID, it will migrate fine to a newer KVM, yet after a reboot it won't
be able to set its APIC ID again.

On the other hand, compared to a fully unconditional read-only APIC ID,
that approach would support the very unlikely case where the hypervisor
itself is the one that, for some reason, changes the APIC ID from the
initial value it gave.


In terms of what I need:

- In nested AVIC I strongly prefer read-only apic ids, and I can
  make nested AVIC be conditional on the new cap.
  IPI virtualization also can be made conditional on the new cap.


- I also would love to remove broken code from *non-nested* AVIC,
  which tries to support APIC ID changes.

  I can make non-nested AVIC *also* depend on the new cap,
  but that would technically be a regression, since users of an
  older qemu and a new kernel would suddenly have their AVIC inhibited.

  I don't know if that is such a big issue though because AVIC is
  (sadly) by default disabled anyway.

  If that is not possible the other way to solve this is to inhibit AVIC
  as soon as the guest tries to change APIC ID.

- I would also want to remove the ability to relocate the APIC base,
  likely depending on the new cap as well, but if there are objections
  I can drop this. I don't need this, but it is just nice to do while we
  are at it.


Paolo, Sean, and everyone else: What do you think?

Also:
Any suggestions for the name of the new cap? Is it all right if
the same cap makes both the APIC ID and the APIC base read-only
(this is also something Paolo suggested to me)?

Best regards,
	Maxim Levitsky


> 
> [*] https://lore.kernel.org/all/c903e82ed2a1e98f66910c35b5aabdcf56e08e72.camel@redhat.com
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-04-19 14:07     ` Maxim Levitsky
@ 2022-04-26  8:14       ` Maxim Levitsky
  2022-04-26 14:00         ` Chao Gao
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Levitsky @ 2022-04-26  8:14 UTC (permalink / raw)
  To: Sean Christopherson, Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao

On Tue, 2022-04-19 at 17:07 +0300, Maxim Levitsky wrote:
> On Fri, 2022-04-15 at 14:39 +0000, Sean Christopherson wrote:
> > On Mon, Apr 11, 2022, Zeng Guang wrote:
> > > From: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > No normal guest has any reason to change physical APIC IDs, and
> > > allowing this introduces bugs into APIC acceleration code.
> > > 
> > > And Intel recent hardware just ignores writes to APIC_ID in
> > > xAPIC mode. More background can be found at:
> > > https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> > > 
> > > Looks there is no much value to support writable xAPIC ID in
> > > guest except supporting some old and crazy use cases which
> > > probably would fail on real hardware. So, make xAPIC ID
> > > read-only for KVM guests.
> > 
> > AFAIK, the plan is to add a capability to let userspace opt-in to a fully read-only
> > APIC ID[*], but I haven't seen patches...
> > 
> > Maxim?
> 
> Yep, I will start working on this pretty much today.
> 
> I was busy last 3 weeks stablilizing nested AVIC
> (I am getting ~600,000 IPIs/s instead of ~40,000 in L2 VM with nested AVIC!),
> with 700,000-900,000 IPIs native with AVIC, 
> almost bare metal IPI performance in L2!
> 
> (The result is from a test, which I will publish soon, that makes all
> vCPUs send IPIs in round-robin fashion; a vCPU sends an IPI only after
> it has received one from the previous vCPU. The number is the total
> number of IPIs sent across 8 vCPUs.)
> 
> 
> The fact that the dreadful AVIC errata dominates my testing again
> supports my feeling that I have mostly fixed the nested AVIC bugs.
> Tomorrow I'll send RFC v2 of the patches.
> 
> 
> About the read-only APIC ID cap,
> I have a few questions before I start implementing it:
> 
> Paolo recently gave me an idea to make the APIC ID always read-only for 
> guest writes, and only allow host APIC ID changes (unless the cap is set).
> 
> I am kind of torn about it: assuming that no guest ever writes the APIC ID,
> this will work just fine in a vacuous sense. But if a guest actually writes
> an APIC ID, it will migrate fine to a newer KVM, yet after a reboot
> it won't be able to set its APIC ID again.
> 
> On the other hand, compared to a fully unconditional read-only APIC ID,
> that approach would still support the very unlikely case where the
> hypervisor itself is the one that, for some reason, changes the APIC ID
> from the initial value it gave.
> 
> 
> In terms of what I need:
> 
> - For nested AVIC I strongly prefer read-only APIC IDs, and I can
>   make nested AVIC conditional on the new cap.
>   IPI virtualization can also be made conditional on the new cap.
> 
> 
> - I also would love to remove the broken code from *non-nested* AVIC 
>   which tries to support APIC ID changes. 
> 
>   I can make non nested AVIC *also* depend on the new cap, 
>   but that will technically be a regression, since this way users of 
>   older qemu and new kernel will suddenly have their AVIC inhibited. 
> 
>   I don't know if that is such a big issue though because AVIC is
>   (sadly) by default disabled anyway.
> 
>   If that is not possible the other way to solve this is to inhibit AVIC
>   as soon as the guest tries to change APIC ID.
> 
> - I would also want to remove the ability to relocate the APIC base,
>   likely depending on the new cap as well, but if there are objections
>   I can drop this. I don't need it; it is just nice to do while we
>   are at it.
> 
> 
> Paolo, Sean, and everyone else: What do you think?

Paolo, Sean, any update?

After thinking more about this, I actually think I will do something
different, something that was actually proposed here and that I was against:


1. I will add a new inhibit, APICV_INHIBIT_REASON_RO_SETTINGS, which will be set
the first time any vCPU touches the APIC ID and/or APIC base, because why not...

That will take care of the non-nested case cleanly, and will take care of IPIv
for now (as long as it doesn't support nesting).
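A rough standalone sketch of how such a one-shot inhibit could behave (a toy model for illustration only, not KVM code; the struct and helper names are invented):

```c
#include <stdbool.h>

/* Hypothetical inhibit bit, named after the proposal above. */
#define APICV_INHIBIT_REASON_RO_SETTINGS (1u << 0)

struct kvm_model {
	unsigned int apicv_inhibit_reasons;
};

/*
 * Called on any guest write to the APIC ID or APIC base: the first
 * such write permanently inhibits APIC virtualization for the VM.
 */
static void vcpu_touched_ro_setting(struct kvm_model *kvm)
{
	kvm->apicv_inhibit_reasons |= APICV_INHIBIT_REASON_RO_SETTINGS;
}

static bool apicv_active(const struct kvm_model *kvm)
{
	return kvm->apicv_inhibit_reasons == 0;
}
```

The point is that the inhibit is sticky: once set it is never cleared, so no new userspace cap or migration handling is needed for the non-nested case.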

2. For my nested AVIC, I will do 2 things:

   a. My code never reads L1 APIC IDs and always uses vcpu_id. Thus,
      in theory, if I just ignore the problem and the guest changes APIC IDs,
      nested AVIC will just keep using the initial APIC IDs, so there is no danger
      of a CVE-like issue if the guest tries to change these IDs at the 'right' time.

   b. On each nested VM entry I'll just check that the APIC ID has not changed
      from the default, if AVIC is enabled for the nested guest.

      If so, the nested entry will fail (ideally with kvm_vm_bugged) to get the
      attention of the user, but I can just fail it with the standard VM exit reason of 0xFFFFFFFF.

      The chances that there is a modern VM which runs nested guests on AMD and changes APIC IDs
      are IMHO too low to bother with, plus one can always disable nested AVIC for this case by
      tweaking the nested CPUID.
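A minimal standalone sketch of the entry-time check in (b) (toy code, not actual KVM; the struct and field names are invented, only the 0xFFFFFFFF exit reason comes from the text above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Synthetic "invalid" exit reason mentioned above. */
#define NESTED_EXIT_INVALID 0xFFFFFFFFu

struct vcpu_model {
	uint32_t vcpu_id;          /* also the default APIC ID */
	uint32_t current_apic_id;  /* whatever the guest last programmed */
	bool nested_avic_enabled;
};

/*
 * On nested VM entry, refuse to enter if the guest moved its APIC ID
 * away from the default while nested AVIC is in use. Returns 0 on
 * success, or the synthetic exit reason on failure.
 */
static uint32_t nested_entry_check_apic_id(const struct vcpu_model *v)
{
	if (v->nested_avic_enabled && v->current_apic_id != v->vcpu_id)
		return NESTED_EXIT_INVALID;
	return 0;
}
```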


This way there will be no need to patch qemu, propagate the new cap to the machines, etc.

What do you think?

Best regards,
	Maxim Levitsky



> 
> Also:
> Suggestions for the name of the new cap? Is it all right if
> the same cap makes both the APIC ID and the APIC base read-only?
> (This is also something Paolo suggested to me.)
> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> > [*] https://lore.kernel.org/all/c903e82ed2a1e98f66910c35b5aabdcf56e08e72.camel@redhat.com
> > 




* Re: [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-04-26  8:14       ` Maxim Levitsky
@ 2022-04-26 14:00         ` Chao Gao
  0 siblings, 0 replies; 23+ messages in thread
From: Chao Gao @ 2022-04-26 14:00 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

>Paolo, Sean, any update?
>
>After thinking more about this, I actually think I will do something
>different, something that was actually proposed here and that I was against:
>
>
>1. I will add a new inhibit, APICV_INHIBIT_REASON_RO_SETTINGS, which will be set
>the first time any vCPU touches the APIC ID and/or APIC base, because why not...
>
>That will take care of the non-nested case cleanly, and will take care of IPIv
>for now (as long as it doesn't support nesting).

Yes. This works well with Intel IPIv.

>
>2. For my nested AVIC, I will do 2 things:
>
>   a. My code never reads L1 APIC IDs and always uses vcpu_id. Thus,
>      in theory, if I just ignore the problem and the guest changes APIC IDs,
>      nested AVIC will just keep using the initial APIC IDs, so there is no danger
>      of a CVE-like issue if the guest tries to change these IDs at the 'right' time.
>
>   b. On each nested VM entry I'll just check that the APIC ID has not changed
>      from the default, if AVIC is enabled for the nested guest.
>
>      If so, the nested entry will fail (ideally with kvm_vm_bugged) to get the
>      attention of the user, but I can just fail it with the standard VM exit reason of 0xFFFFFFFF.

For the sake of simplicity, I prefer to make the APIC ID read-only for VMs
that support (nested) AVIC or IPIv (KVM can check guest CPUID/MSRs to know
this). When the guest attempts to change the read-only APIC ID, KVM can raise
an internal error, saying it cannot emulate the action. To get rid of such
an error, users should launch the VM with nested AVIC/IPIv disabled or
upgrade the VM kernel so that it does not change the APIC ID.
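A standalone sketch of that policy (toy model only; in reality the condition would be derived from guest CPUID/MSRs as described above, and the names here are invented):

```c
#include <stdbool.h>
#include <stdint.h>

struct vm_model {
	bool avic_or_ipiv_exposed;  /* derived from guest CPUID/MSRs */
};

enum apic_write_result { WRITE_OK, WRITE_INTERNAL_ERROR };

/*
 * If the VM was created with (nested) AVIC or IPIv exposed, treat the
 * APIC ID as read-only and surface an internal error on any attempt
 * to change it; legacy VMs keep the old writable behavior.
 */
static enum apic_write_result apic_id_write(const struct vm_model *vm,
					    uint32_t old_id, uint32_t new_id)
{
	if (vm->avic_or_ipiv_exposed && new_id != old_id)
		return WRITE_INTERNAL_ERROR;  /* cannot emulate */
	return WRITE_OK;
}
```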


end of thread, other threads:[~2022-04-26 14:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-11  9:04 [PATCH v8 0/9] IPI virtualization support for VM Zeng Guang
2022-04-11  9:04 ` [PATCH v8 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
2022-04-11  9:04 ` [PATCH v8 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
2022-04-11  9:04 ` [PATCH v8 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
2022-04-11  9:04 ` [PATCH v8 4/9] KVM: VMX: Report tertiary_exec_control field in dump_vmcs() Zeng Guang
2022-04-11  9:04 ` [PATCH v8 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
2022-04-11  9:04 ` [PATCH v8 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
2022-04-15 14:39   ` Sean Christopherson
2022-04-19 14:07     ` Maxim Levitsky
2022-04-26  8:14       ` Maxim Levitsky
2022-04-26 14:00         ` Chao Gao
2022-04-11  9:04 ` [PATCH v8 7/9] KVM: Move kvm_arch_vcpu_precreate() under kvm->lock Zeng Guang
2022-04-15 15:00   ` Sean Christopherson
2022-04-15 15:11     ` Sean Christopherson
2022-04-11  9:04 ` [PATCH v8 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
2022-04-15 15:01   ` Sean Christopherson
2022-04-11  9:04 ` [PATCH v8 9/9] KVM: VMX: enable IPI virtualization Zeng Guang
2022-04-15 15:25   ` Sean Christopherson
2022-04-18  9:25     ` Chao Gao
2022-04-18 15:14       ` Sean Christopherson
2022-04-19  0:00         ` Chao Gao
2022-04-18 12:49     ` Zeng Guang
2022-04-15 15:45   ` Sean Christopherson
