* [PATCH v6 0/9] IPI virtualization support for VM
@ 2022-02-25  8:22 Zeng Guang
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Currently, issuing any IPI other than a self-IPI in a guest on an Intel
CPU always causes a VM-exit. This can lead to non-negligible overhead
for workloads that send frequent IPIs when running in VMs.

IPI virtualization is a new VT-x feature that eliminates VM-exits on
source vCPUs when issuing unicast, physical-addressing IPIs. Once it is
enabled, the processor virtualizes the following kinds of operations
that send IPIs, without causing VM-exits:
- Memory-mapped ICR writes
- MSR-mapped ICR writes
- SENDUIPI execution

This patch series implements IPI virtualization support in KVM.

Patches 1-4 add tertiary processor-based VM-execution support
framework, which is used to enumerate IPI virtualization.

Patch 5 handles the APIC-write VM exit caused by writes to the ICR MSR
when the guest works in x2APIC mode. This is a new case introduced by
Intel VT-x.

Patch 6 disallows changing the APIC ID in any case.

Patch 7 implements the IPI virtualization functionality, including
feature enabling through the tertiary processor-based VM-execution
controls in various VMCS configuration scenarios, PID table setup at
vCPU creation, and handling of vCPU blocking.

Patches 8-9 give userspace the capability to set the maximum possible
vCPU ID for the current VM. IPIv can refer to this value to allocate
only the memory needed for the PID-pointer table instead of sizing it
for KVM_MAX_VCPU_IDS, reducing the overall memory footprint.

Documentation for IPI virtualization is available in the latest "Intel
Architecture Instruction Set Extensions Programming Reference".

Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

We ran an experiment with kvm-unit-tests to measure the average time
from a source vCPU sending an IPI to the target vCPU completing the IPI
handling, with and without IPI virtualization. With IPI virtualization
enabled, cycle consumption drops by 22.21% in xAPIC mode and 15.98% in
x2APIC mode.

--------------------------------------
KVM unittest:vmexit/ipi

2 vCPUs; the AP was modified to run in an idle loop instead of halting,
to ensure the target vCPU incurs no VM-exit impact.

                Cycles of IPI
                xAPIC mode              x2APIC mode
        test    w/o IPIv  w/ IPIv       w/o IPIv  w/ IPIv
        1       6106      4816          4265      3768
        2       6244      4656          4404      3546
        3       6165      4658          4233      3474
        4       5992      4710          4363      3430
        5       6083      4741          4215      3551
        6       6238      4904          4304      3547
        7       6164      4617          4263      3709
        8       5984      4763          4518      3779
        9       5931      4712          4645      3667
        10      5955      4530          4332      3724
        11      5897      4673          4283      3569
        12      6140      4794          4178      3598
        13      6183      4728          4363      3628
        14      5991      4994          4509      3842
        15      5866      4665          4520      3739
        16      6032      4654          4229      3701
        17      6050      4653          4185      3726
        18      6004      4792          4319      3746
        19      5961      4626          4196      3392
        20      6194      4576          4433      3760

Average cycles  6059      4713.1        4337.85   3644.8
%Reduction                -22.21%                 -15.98%

--------------------------------------
IPI microbenchmark:
(https://lore.kernel.org/kvm/20171219085010.4081-1-ynorov@caviumnetworks.com)

2 vCPUs, vCPUs pinned 1:1 to pCPUs, guest VM runs with idle=poll, x2APIC mode

Result with IPIv enabled:

Dry-run:                         0,             272798 ns
Self-IPI:                  5094123,           11114037 ns
Normal IPI:              131697087,          173321200 ns
Broadcast IPI:                   0,          155649075 ns
Broadcast lock:                  0,          161518031 ns

Result with IPIv disabled:

Dry-run:                         0,             272766 ns
Self-IPI:                  5091788,           11123699 ns
Normal IPI:              145215772,          174558920 ns
Broadcast IPI:                   0,          175785384 ns
Broadcast lock:                  0,          149076195 ns


As IPIv benefits unicast IPIs to other CPUs, the Normal IPI test case
gains about 9.73% time saving on average over 15 test runs when IPIv is
enabled.

Normal IPI statistics (unit:ns):
        test    w/o IPIv        w/ IPIv
        1       153346049       140907046
        2       147218648       141660618
        3       145215772       117890672
        4       146621682       136430470
        5       144821472       136199421
        6       144704378       131676928
        7       141403224       131697087
        8       144775766       125476250
        9       140658192       137263330
        10      144768626       138593127
        11      145166679       131946752
        12      145020451       116852889
        13      148161353       131406280
        14      148378655       130174353
        15      148903652       127969674

Average time    145944306.6     131742993.1 ns
%Reduction                      -9.73%

--------------------------------------
hackbench:

8 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 100000

                w/o IPIv        w/ IPIv
Time            91.887          74.605
%Reduction                      -18.808%

96 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 1000000

                w/o IPIv        w/ IPIv
Time            287.504         235.185
%Reduction                      -18.198%

--------------------------------------
v5 -> v6:
1. Adapt the kvm_apic_write_nodecode() implementation based
   on Sean's fix of the x2APIC ICR register processing.
2. Drop the patch that updated the IPIv table entry when the
   APIC ID changed; instead apply Maxim Levitsky's patch
   disallowing APIC ID changes in any case.
3. Drop the patch resizing the PID-pointer table on demand.
   Instead, allow userspace to set the maximum vCPU ID at
   runtime so that IPIv can use the practical value to allocate
   only the necessary memory for the PID-pointer table.

v4 -> v5:
1. Handle the enable_ipiv parameter following the current
   vmcs configuration rule.
2. Allocate memory for the PID-pointer table dynamically.
3. Support guest runtime modification of the APIC ID in
   xAPIC mode.
4. Add a helper to judge whether posted-interrupt blocking
   can be used in the IPIv case.

v3 -> v4:
1. Refine code style of patch 2
2. Move tertiary control shadow build into patch 3
3. Make vmx_tertiary_exec_control to be static function

v2 -> v3:
1. Misc changes to the tertiary execution control
   definition and capability setup
2. Alternative way to get the tertiary execution
   control configuration

v1 -> v2:
1. Refine the IPIv enabling logic for the VM.
   Remove the per-vCPU ipiv_active definition.

--------------------------------------

Gao Chao (1):
  KVM: VMX: enable IPI virtualization

Maxim Levitsky (1):
  KVM: x86: lapic: don't allow to change APIC ID unconditionally

Robert Hoo (4):
  x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
    variation
  KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well

Zeng Guang (3):
  KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  KVM: x86: Allow userspace set maximum VCPU id for VM
  KVM: VMX: Optimize memory allocation for PID-pointer table

 arch/x86/include/asm/kvm_host.h    |   6 ++
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/include/asm/vmx.h         |  11 +++
 arch/x86/include/asm/vmxfeatures.h |   5 +-
 arch/x86/kernel/cpu/feat_ctl.c     |   9 +-
 arch/x86/kvm/lapic.c               |  50 ++++++++---
 arch/x86/kvm/vmx/capabilities.h    |  13 +++
 arch/x86/kvm/vmx/evmcs.c           |   2 +
 arch/x86/kvm/vmx/evmcs.h           |   1 +
 arch/x86/kvm/vmx/posted_intr.c     |  12 ++-
 arch/x86/kvm/vmx/vmcs.h            |   1 +
 arch/x86/kvm/vmx/vmx.c             | 140 ++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.h             |  63 +++++++------
 arch/x86/kvm/x86.c                 |  11 +++
 14 files changed, 274 insertions(+), 51 deletions(-)

-- 
2.27.0



* [PATCH v6 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

A new 64-bit control field, "tertiary processor-based VM-execution
controls", is defined [1]. It is activated by bit 17 of the primary
processor-based VM-execution controls.

Unlike its sibling VM-execution control fields, the tertiary
VM-execution controls field is 64 bits wide, so it occupies two
vmx_feature_leafs, TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.

Its companion VMX capability-reporting MSR, MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from the others: all 64 of its
bits are allowed-1 settings, rather than 32 allowed-0 bits and 32
allowed-1 bits [1][2]. Therefore its handling in
init_vmx_capabilities() differs slightly from the others.

[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

[2] SDM Vol3. Appendix A.3

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/msr-index.h   | 1 +
 arch/x86/include/asm/vmxfeatures.h | 3 ++-
 arch/x86/kernel/cpu/feat_ctl.c     | 9 ++++++++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3faf0f97edb1..1d180f883c32 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -938,6 +938,7 @@
 #define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
+#define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
 
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_VMCS_SIZE_SHIFT	32
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..ff20776dc83b 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -5,7 +5,7 @@
 /*
  * Defines VMX CPU feature bits
  */
-#define NVMXINTS			3 /* N 32-bit words worth of info */
+#define NVMXINTS			5 /* N 32-bit words worth of info */
 
 /*
  * Note: If the comment begins with a quoted string, that string is used
@@ -43,6 +43,7 @@
 #define VMX_FEATURE_RDTSC_EXITING	( 1*32+ 12) /* "" VM-Exit on RDTSC */
 #define VMX_FEATURE_CR3_LOAD_EXITING	( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
 #define VMX_FEATURE_CR3_STORE_EXITING	( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
+#define VMX_FEATURE_TERTIARY_CONTROLS	( 1*32+ 17) /* "" Enable Tertiary VM-Execution Controls */
 #define VMX_FEATURE_CR8_LOAD_EXITING	( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
 #define VMX_FEATURE_CR8_STORE_EXITING	( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
 #define VMX_FEATURE_VIRTUAL_TPR		( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index da696eb4821a..993697e71854 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -15,6 +15,8 @@ enum vmx_feature_leafs {
 	MISC_FEATURES = 0,
 	PRIMARY_CTLS,
 	SECONDARY_CTLS,
+	TERTIARY_CTLS_LOW,
+	TERTIARY_CTLS_HIGH,
 	NR_VMX_FEATURE_WORDS,
 };
 
@@ -22,7 +24,7 @@ enum vmx_feature_leafs {
 
 static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 {
-	u32 supported, funcs, ept, vpid, ign;
+	u32 supported, funcs, ept, vpid, ign, low, high;
 
 	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
 
@@ -42,6 +44,11 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
 	c->vmx_capability[SECONDARY_CTLS] = supported;
 
+	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */
+	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
 	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
 	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
 
-- 
2.27.0



* [PATCH v6 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

The tertiary VM-execution controls field, unlike the previous control
fields, is 64 bits wide. Extend BUILD_CONTROLS_SHADOW() with a 'bits'
parameter so it can build the auxiliary functions for both 32-bit and
64-bit fields.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7f2c82e7f38f..e07c76974fb0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -456,35 +456,38 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }
 
-#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
-{									    \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
-		vmcs_write32(uname, val);				    \
-		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
-	}								    \
-}									    \
-static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
-{									    \
-	return vmcs->controls_shadow.lname;				    \
-}									    \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
-{									    \
-	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
-}									    \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
-}									    \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{									    \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
+static inline								\
+void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
+		vmcs_write##bits(uname, val);				\
+		vmx->loaded_vmcs->controls_shadow.lname = val;		\
+	}								\
+}									\
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
+{									\
+	return vmcs->controls_shadow.lname;				\
+}									\
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
+{									\
+	return __##lname##_controls_get(vmx->loaded_vmcs);		\
+}									\
+static inline								\
+void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
+}									\
+static inline								\
+void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
+{									\
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
 }
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0



* [PATCH v6 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Check the VMX features in the tertiary execution controls during VMCS
config setup. Sub-features enabled via the tertiary execution controls
are adjusted according to hardware capabilities, although no sub-feature
is enabled in this patch.

EVMCSv1 doesn't support the tertiary VM-execution controls, so disable
them when EVMCSv1 is in use. Also define the auxiliary functions for the
tertiary controls field, using the new BUILD_CONTROLS_SHADOW().

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h      |  3 +++
 arch/x86/kvm/vmx/capabilities.h |  7 ++++++
 arch/x86/kvm/vmx/evmcs.c        |  2 ++
 arch/x86/kvm/vmx/evmcs.h        |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 38 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h          |  1 +
 7 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..8c929596a299 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -31,6 +31,7 @@
 #define CPU_BASED_RDTSC_EXITING                 VMCS_CONTROL_BIT(RDTSC_EXITING)
 #define CPU_BASED_CR3_LOAD_EXITING		VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
 #define CPU_BASED_CR3_STORE_EXITING		VMCS_CONTROL_BIT(CR3_STORE_EXITING)
+#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS	VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
 #define CPU_BASED_CR8_LOAD_EXITING              VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
 #define CPU_BASED_CR8_STORE_EXITING             VMCS_CONTROL_BIT(CR8_STORE_EXITING)
 #define CPU_BASED_TPR_SHADOW                    VMCS_CONTROL_BIT(VIRTUAL_TPR)
@@ -221,6 +222,8 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
+	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..31f3d88b3e4d 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,6 +59,7 @@ struct vmcs_config {
 	u32 pin_based_exec_ctrl;
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
+	u64 cpu_based_3rd_exec_ctrl;
 	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
 	struct nested_vmx_msrs nested;
@@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 }
 
+static inline bool cpu_has_tertiary_exec_ctrls(void)
+{
+	return vmcs_config.cpu_based_exec_ctrl &
+		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
index 87e3dc10edf4..6a61b1ae7942 100644
--- a/arch/x86/kvm/vmx/evmcs.c
+++ b/arch/x86/kvm/vmx/evmcs.c
@@ -297,8 +297,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
 #if IS_ENABLED(CONFIG_HYPERV)
 __init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
 {
+	vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
 	vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
 	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
 
 	vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
 	vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 8d70f9aea94b..f886a8ff0342 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
  */
 #define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
 				    PIN_BASED_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
 #define EVMCS1_UNSUPPORTED_2NDEXEC					\
 	(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |				\
 	 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |			\
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a816..e18dc68eeeeb 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -50,6 +50,7 @@ struct vmcs_controls_shadow {
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
+	u64 tertiary_exec;
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c569dc2b9192..8a5713d49635 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2422,6 +2422,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
 	return 0;
 }
 
+static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
+					 u32 msr, u64 *result)
+{
+	u64 allowed1;
+
+	rdmsrl(msr, allowed1);
+
+	/* Ensure minimum (required) set of control bits are supported. */
+	if (ctl_min & ~allowed1)
+		return -EIO;
+
+	*result = (ctl_min | ctl_opt) & allowed1;
+	return 0;
+}
+
 static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				    struct vmx_capability *vmx_cap)
 {
@@ -2430,6 +2445,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _pin_based_exec_control = 0;
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
+	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
@@ -2451,7 +2467,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
-	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
+	      CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
 				&_cpu_based_exec_control) < 0)
 		return -EIO;
@@ -2525,6 +2542,16 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 				"1-setting enable VPID VM-execution control\n");
 	}
 
+	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
+		u64 opt3 = 0;
+		u64 min3 = 0;
+
+		if (adjust_vmx_controls_64(min3, opt3,
+					   MSR_IA32_VMX_PROCBASED_CTLS3,
+					   &_cpu_based_3rd_exec_control))
+			return -EIO;
+	}
+
 	min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -2611,6 +2638,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
+	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
 	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
 
@@ -4230,6 +4258,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl;
+}
+
 /*
  * Adjust a single secondary execution control bit to intercept/allow an
  * instruction in the guest.  This is usually done based on whether or not a
@@ -4395,6 +4428,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
 
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
+
 	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e07c76974fb0..d4a647d3ed4a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -488,6 +488,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
 
 /*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
-- 
2.27.0



* [PATCH v6 4/9] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Robert Hoo

From: Robert Hoo <robert.hu@linux.intel.com>

Report the tertiary_exec_control field in dump_vmcs().

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8a5713d49635..7beba7a9f247 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5891,6 +5891,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
+	u64 tertiary_exec_control;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -5904,9 +5905,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
 	cr4 = vmcs_readl(GUEST_CR4);
-	secondary_exec_control = 0;
+
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	else
+		secondary_exec_control = 0;
+
+	if (cpu_has_tertiary_exec_ctrls())
+		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
+	else
+		tertiary_exec_control = 0;
 
 	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
 	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
@@ -6006,9 +6014,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
 
 	pr_err("*** Control State ***\n");
-	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
-	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
-	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
+	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
+	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
+	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
+	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
-- 
2.27.0



* [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
were previously impossible.

Note, x2APIC MSR writes are 64 bits wide.
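
Since x2APIC ICR writes are 64 bits, the nodecode handler in the patch below
splits the value into the classic ICR low word (vector, delivery mode, etc.)
and the 32-bit destination APIC ID, as in
kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32)). A minimal sketch of that
split (the helper names here are illustrative, not kernel API):

```c
#include <stdint.h>

/* Low 32 bits of the 64-bit x2APIC ICR value: vector, delivery mode,
 * level/trigger bits and destination shorthand. */
static inline uint32_t x2apic_icr_low(uint64_t icr)
{
	return (uint32_t)icr;
}

/* High 32 bits: the full 32-bit destination APIC ID (x2APIC mode has
 * no 8-bit destination limit, unlike xAPIC). */
static inline uint32_t x2apic_icr_high(uint64_t icr)
{
	return (uint32_t)(icr >> 32);
}
```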

Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 629c116b0d3e..e4bcdab1fac0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
 #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
 
 static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
 {
@@ -2227,10 +2228,28 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 /* emulate APIC access in a trap manner */
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 {
-	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 val;
+
+	if (apic_x2apic_mode(apic)) {
+		/*
+		 * When guest APIC is in x2APIC mode and IPI virtualization
+		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
+		 * on Intel hardware. Other offsets are not possible.
+		 */
+		if (WARN_ON_ONCE(offset != APIC_ICR))
+			return;
 
-	/* TODO: optimize to just emulate side effect w/o one more write */
-	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
+		kvm_lapic_msr_read(apic, offset, &val);
+		if (val & APIC_ICR_BUSY)
+			kvm_x2apic_icr_write(apic, val);
+		else
+			kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
+	} else {
+		val = kvm_lapic_get_reg(apic, offset);
+		/* TODO: optimize to just emulate side effect w/o one more write */
+		kvm_lapic_reg_write(apic, offset, (u32)val);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25  8:22 [PATCH v6 0/9] IPI virtualization support for VM Zeng Guang
                   ` (4 preceding siblings ...)
  2022-02-25  8:22 ` [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
@ 2022-02-25  8:22 ` Zeng Guang
  2022-02-25 14:46   ` Maxim Levitsky
  2022-03-08 23:04   ` Sean Christopherson
  2022-02-25  8:22 ` [PATCH v6 7/9] KVM: VMX: enable IPI virtualization Zeng Guang
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 41+ messages in thread
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang, Maxim Levitsky

From: Maxim Levitsky <mlevitsk@redhat.com>

No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into the APIC acceleration code.

Furthermore, recent Intel hardware simply ignores writes to APIC_ID
in xAPIC mode. More background can be found at:
https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/

There is not much value in supporting a writable xAPIC ID in guests
beyond some old and questionable use cases which would probably fail
on real hardware anyway. So, make the xAPIC ID read-only for KVM
guests.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e4bcdab1fac0..b38288c8a94f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
-			kvm_apic_set_xapic_id(apic, val >> 24);
-		else
+		if (apic_x2apic_mode(apic)) {
 			ret = 1;
+			break;
+		}
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((val >> 24) != apic->vcpu->vcpu_id) {
+			kvm_vm_bugged(apic->vcpu->kvm);
+			break;
+		}
+
+		kvm_apic_set_xapic_id(apic, val >> 24);
 		break;
 
 	case APIC_TASKPRI:
@@ -2631,11 +2638,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		struct kvm_lapic_state *s, bool set)
 {
-	if (apic_x2apic_mode(vcpu->arch.apic)) {
-		u32 *id = (u32 *)(s->regs + APIC_ID);
-		u32 *ldr = (u32 *)(s->regs + APIC_LDR);
-		u64 icr;
+	u32 *id = (u32 *)(s->regs + APIC_ID);
+	u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+	u64 icr;
 
+	if (!apic_x2apic_mode(vcpu->arch.apic)) {
+		/* Don't allow changing APIC ID to avoid unexpected issues */
+		if ((*id >> 24) != vcpu->vcpu_id)
+			return -EINVAL;
+	} else {
 		if (vcpu->kvm->arch.x2apic_format) {
 			if (*id != vcpu->vcpu_id)
 				return -EINVAL;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v6 7/9] KVM: VMX: enable IPI virtualization
  2022-02-25  8:22 [PATCH v6 0/9] IPI virtualization support for VM Zeng Guang
                   ` (5 preceding siblings ...)
  2022-02-25  8:22 ` [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
@ 2022-02-25  8:22 ` Zeng Guang
  2022-02-25 17:19   ` Maxim Levitsky
  2022-02-25  8:22 ` [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
  2022-02-25  8:22 ` [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table Zeng Guang
  8 siblings, 1 reply; 41+ messages in thread
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

From: Gao Chao <chao.gao@intel.com>

With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in the target vCPU's PIR and may send a
notification (IPI) specified by the NDST and NV fields in the target
vCPU's Posted-Interrupt Descriptor (PID). This is similar to what the
IOMMU engine does when dealing with posted interrupts from devices.

The processor uses a PID-pointer table to locate a vCPU's PID by the
vCPU's APIC ID.

Like VT-d PI, if a vCPU goes into the blocked state, the VMM needs to
switch its notification vector to the wakeup vector, so that when an
IPI for a blocked vCPU arrives, the VMM gets control and can wake the
vCPU up. If a vCPU is preempted, its posted-interrupt notification is
suppressed.

Note that IPI virtualization can only virtualize physical-addressing,
flat mode, unicast IPIs. Sending any other kind of IPI still causes a
trap-like APIC-write VM-exit that must be handled by the VMM.
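
The PID-pointer table described above is an array of 64-bit entries indexed
by APIC ID, each holding the physical address of that vCPU's PID with bit 0
set to mark the entry valid (the patch below encodes this as
__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID). A rough sketch of the entry
encoding, using a placeholder physical address rather than real kernel state:

```c
#include <stdint.h>

/* Bit 0 of a PID-pointer table entry is the valid bit; the PID's
 * physical address occupies the upper bits (PIDs are 64-byte aligned,
 * so the low bits of the address are zero). */
#define PID_TABLE_ENTRY_VALID 1ULL

static inline uint64_t pid_table_entry(uint64_t pid_phys_addr)
{
	return pid_phys_addr | PID_TABLE_ENTRY_VALID;
}
```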

Signed-off-by: Gao Chao <chao.gao@intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/include/asm/vmx.h         |  8 ++++
 arch/x86/include/asm/vmxfeatures.h |  2 +
 arch/x86/kvm/vmx/capabilities.h    |  6 +++
 arch/x86/kvm/vmx/posted_intr.c     | 12 ++++-
 arch/x86/kvm/vmx/vmx.c             | 74 +++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.h             |  3 ++
 6 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8c929596a299..b79b6438acaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -76,6 +76,11 @@
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
 
+/*
+ * Definitions of Tertiary Processor-Based VM-Execution Controls.
+ */
+#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
+
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
 #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
@@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
 	POSTED_INTR_NV                  = 0x00000002,
+	LAST_PID_POINTER_INDEX		= 0x00000008,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -224,6 +230,8 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	PID_POINTER_TABLE		= 0x00002042,
+	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index ff20776dc83b..7ce616af2db2 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -86,4 +86,6 @@
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
 
+/* Tertiary Processor-Based VM-Execution Controls, word 3 */
+#define VMX_FEATURE_IPI_VIRT		(3*32 +  4) /* "" Enable IPI virtualization */
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 31f3d88b3e4d..5f656c9e33be 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
+extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
@@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
 		cpu_has_vmx_posted_intr();
 }
 
+static inline bool cpu_has_vmx_ipiv(void)
+{
+	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index aa1fe9085d77..90124a30c074 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -177,11 +177,21 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	local_irq_restore(flags);
 }
 
+static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
+{
+	return irqchip_in_kernel(kvm) && enable_ipiv;
+}
+
+static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
+{
+	return vmx_can_use_ipiv_pi(kvm) || vmx_can_use_vtd_pi(kvm);
+}
+
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!vmx_can_use_vtd_pi(vcpu->kvm))
+	if (!vmx_can_use_posted_interrupts(vcpu->kvm))
 		return;
 
 	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7beba7a9f247..0cb141c277ef 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
 
 module_param(enable_apicv, bool, S_IRUGO);
 
+bool __read_mostly enable_ipiv = true;
+module_param(enable_ipiv, bool, 0444);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -227,6 +230,11 @@ static const struct {
 };
 
 #define L1D_CACHE_ORDER 4
+
+/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
+#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
+#define PID_TABLE_ENTRY_VALID 1
+
 static void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
@@ -2543,7 +2551,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	}
 
 	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
-		u64 opt3 = 0;
+		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
 		u64 min3 = 0;
 
 		if (adjust_vmx_controls_64(min3, opt3,
@@ -3898,6 +3906,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
 		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
 		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+		if (enable_ipiv)
+			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
 	}
 }
 
@@ -4219,14 +4229,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
 	if (cpu_has_secondary_exec_ctrls()) {
-		if (kvm_vcpu_apicv_active(vcpu))
+		if (kvm_vcpu_apicv_active(vcpu)) {
 			secondary_exec_controls_setbit(vmx,
 				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
 				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		else
+			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
+				tertiary_exec_controls_setbit(vmx,
+						TERTIARY_EXEC_IPI_VIRT);
+		} else {
 			secondary_exec_controls_clearbit(vmx,
 					SECONDARY_EXEC_APIC_REGISTER_VIRT |
 					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+			if (cpu_has_tertiary_exec_ctrls())
+				tertiary_exec_controls_clearbit(vmx,
+						TERTIARY_EXEC_IPI_VIRT);
+		}
 	}
 
 	vmx_update_msr_bitmap_x2apic(vcpu);
@@ -4260,7 +4277,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 
 static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
 {
-	return vmcs_config.cpu_based_3rd_exec_ctrl;
+	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
+
+	/*
+	 * IPI virtualization relies on APICv. Disable IPI
+	 * virtualization if APICv is inhibited.
+	 */
+	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
+		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
+
+	return exec_control;
 }
 
 /*
@@ -4412,6 +4438,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 
 static void init_vmcs(struct vcpu_vmx *vmx)
 {
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
 	if (nested)
 		nested_vmx_set_vmcs_shadowing_bitmap();
 
@@ -4431,7 +4460,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	if (cpu_has_tertiary_exec_ctrls())
 		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
 
-	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
+	if (kvm_vcpu_apicv_active(vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
 		vmcs_write64(EOI_EXIT_BITMAP2, 0);
@@ -4441,6 +4470,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		if (enable_ipiv) {
+			WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
+				__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+			vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+			vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+		}
 	}
 
 	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
@@ -4492,7 +4528,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
 	}
 
-	vmx_write_encls_bitmap(&vmx->vcpu, NULL);
+	vmx_write_encls_bitmap(vcpu, NULL);
 
 	if (vmx_pt_mode_is_host_guest()) {
 		memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
@@ -4508,7 +4544,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 	if (cpu_has_vmx_tpr_shadow()) {
 		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
-		if (cpu_need_tpr_shadow(&vmx->vcpu))
+		if (cpu_need_tpr_shadow(vcpu))
 			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
 				     __pa(vmx->vcpu.arch.apic->regs));
 		vmcs_write32(TPR_THRESHOLD, 0);
@@ -7165,6 +7201,18 @@ static int vmx_vm_init(struct kvm *kvm)
 			break;
 		}
 	}
+
+	if (enable_ipiv) {
+		struct page *pages;
+
+		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
+		if (!pages)
+			return -ENOMEM;
+
+		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
+		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
+	}
+
 	return 0;
 }
 
@@ -7756,6 +7804,14 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+	if (kvm_vmx->pid_table)
+		free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -7768,6 +7824,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
+	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
@@ -8022,6 +8079,9 @@ static __init int hardware_setup(void)
 	if (!enable_apicv)
 		vmx_x86_ops.sync_pir_to_irr = NULL;
 
+	if (!enable_apicv || !cpu_has_vmx_ipiv())
+		enable_ipiv = false;
+
 	if (cpu_has_vmx_tsc_scaling()) {
 		kvm_has_tsc_control = true;
 		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d4a647d3ed4a..e7b0c00c9d43 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -365,6 +365,9 @@ struct kvm_vmx {
 	unsigned int tss_addr;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	/* PID table for IPI virtualization */
+	u64 *pid_table;
+	u16 pid_last_index;
 };
 
 bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-02-25  8:22 [PATCH v6 0/9] IPI virtualization support for VM Zeng Guang
                   ` (6 preceding siblings ...)
  2022-02-25  8:22 ` [PATCH v6 7/9] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2022-02-25  8:22 ` Zeng Guang
  2022-02-25 17:22   ` Maxim Levitsky
  2022-02-25  8:22 ` [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table Zeng Guang
  8 siblings, 1 reply; 41+ messages in thread
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Introduce a new max_vcpu_id field in KVM for the x86 architecture.
Userspace can assign the maximum possible vCPU ID for the current VM
session using KVM_CAP_MAX_VCPU_ID of the KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for the PID-pointer table, a structure needed to
enable VMX IPI virtualization.

By default, max_vcpu_id is set to KVM_MAX_VCPU_IDS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
No new KVM capability is added to advertise support for a
configurable maximum vCPU ID to userspace, because max_vcpu_id is
just a hint/commitment that allows KVM to reduce the size of the
PID-pointer table. But I am not 100% sure whether it is proper to do so.
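
The acceptance check in the patch reduces to a simple range test against
KVM_MAX_VCPU_IDS. A userspace-style sketch of the same logic (the
KVM_MAX_VCPU_IDS value here is illustrative; the real constant is
arch-dependent):

```c
#include <errno.h>
#include <stdint.h>

#define KVM_MAX_VCPU_IDS 4096	/* illustrative value for this sketch */

/* Mirrors the KVM_CAP_MAX_VCPU_ID handling in the patch: accept any
 * value up to KVM_MAX_VCPU_IDS, otherwise fail with -E2BIG. */
static int set_max_vcpu_id(uint32_t *max_vcpu_id, uint64_t requested)
{
	if (requested > KVM_MAX_VCPU_IDS)
		return -E2BIG;
	*max_vcpu_id = (uint32_t)requested;
	return 0;
}
```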

 arch/x86/include/asm/kvm_host.h |  6 ++++++
 arch/x86/kvm/x86.c              | 11 +++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6dcccb304775..db16aebd946c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1233,6 +1233,12 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+	/*
+	 * VM-scope maximum vCPU ID. Used to determine the size of structures
+	 * that increase along with the maximum vCPU ID, in which case, using
+	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
+	 */
+	u32 max_vcpu_id;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f6fe9974cb5..ca17cc452bd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5994,6 +5994,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exit_on_emulation_error = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_MAX_VCPU_ID:
+		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
+			kvm->arch.max_vcpu_id = cap->args[0];
+			r = 0;
+		} else
+			r = -E2BIG;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -11067,6 +11074,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	struct page *page;
 	int r;
 
+	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_id)
+		return -E2BIG;
+
 	vcpu->arch.last_vmentry_cpu = -1;
 	vcpu->arch.regs_avail = ~0;
 	vcpu->arch.regs_dirty = ~0;
@@ -11589,6 +11599,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
+	kvm->arch.max_vcpu_id = KVM_MAX_VCPU_IDS;
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table
  2022-02-25  8:22 [PATCH v6 0/9] IPI virtualization support for VM Zeng Guang
                   ` (7 preceding siblings ...)
  2022-02-25  8:22 ` [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-02-25  8:22 ` Zeng Guang
  2022-02-25 17:29   ` Maxim Levitsky
  8 siblings, 1 reply; 41+ messages in thread
From: Zeng Guang @ 2022-02-25  8:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Zeng Guang

Currently KVM allocates 8 pages in advance for the Posted-Interrupt
Descriptor pointer (PID-pointer) table, to accommodate vCPUs with
APIC IDs up to KVM_MAX_VCPU_IDS - 1. This policy wastes memory
because most VMs have fewer than 512 vCPUs and thus need only one
page.

If the userspace hypervisor specifies the maximum practical vCPU ID
prior to vCPU creation, IPIv can allocate only the essential memory
for the PID-pointer table and reduce the memory footprint of VMs.
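
The savings follow from the table sizing arithmetic: each entry is a u64, so
512 vCPU IDs fit in a single 4 KiB page while the full KVM_MAX_VCPU_IDS
(4096 on x86 at the time of this series) needs eight pages. A sketch using a
minimal re-implementation of the kernel's get_order() helper:

```c
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Minimal re-implementation of the kernel's get_order(): the smallest
 * power-of-two page count (expressed as a log2 order) covering 'size'. */
static int get_order(unsigned long size)
{
	unsigned long pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Each PID-pointer table entry is 8 bytes (u64). */
static int pid_table_order(unsigned long max_vcpu_id)
{
	return get_order(max_vcpu_id * sizeof(uint64_t));
}
```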

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 45 ++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0cb141c277ef..22bfb4953289 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -230,9 +230,6 @@ static const struct {
 };
 
 #define L1D_CACHE_ORDER 4
-
-/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
-#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
 #define PID_TABLE_ENTRY_VALID 1
 
 static void *vmx_l1d_flush_pages;
@@ -4434,6 +4431,24 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
+{
+	struct page *pages;
+
+	if (kvm_vmx->pid_table)
+		return 0;
+
+	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));
+
+	if (!pages)
+		return -ENOMEM;
+
+	kvm_vmx->pid_table = (void *)page_address(pages);
+	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
+	return 0;
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 
 static void init_vmcs(struct vcpu_vmx *vmx)
@@ -7159,6 +7174,16 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 			goto free_vmcs;
 	}
 
+	if (enable_ipiv && kvm_vcpu_apicv_active(vcpu)) {
+		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
+		mutex_lock(&vcpu->kvm->lock);
+		err = vmx_alloc_pid_table(kvm_vmx);
+		mutex_unlock(&vcpu->kvm->lock);
+		if (err)
+			goto free_vmcs;
+	}
+
 	return 0;
 
 free_vmcs:
@@ -7202,17 +7227,6 @@ static int vmx_vm_init(struct kvm *kvm)
 		}
 	}
 
-	if (enable_ipiv) {
-		struct page *pages;
-
-		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
-		if (!pages)
-			return -ENOMEM;
-
-		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
-		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
-	}
-
 	return 0;
 }
 
@@ -7809,7 +7823,8 @@ static void vmx_vm_destroy(struct kvm *kvm)
 	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
 
 	if (kvm_vmx->pid_table)
-		free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
+		free_pages((unsigned long)kvm_vmx->pid_table,
+			get_order((kvm_vmx->pid_last_index + 1) * sizeof(u64)));
 }
 
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control
  2022-02-25  8:22 ` [PATCH v6 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
@ 2022-02-25 14:09   ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:09 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> A new 64-bit control field "tertiary processor-based VM-execution
> controls", is defined [1]. It's controlled by bit 17 of the primary
> processor-based VM-execution controls.
> 
> Different from its brother VM-execution fields, this tertiary VM-
> execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
> TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
> 
> Its companion VMX capability reporting MSR,MSR_IA32_VMX_PROCBASED_CTLS3
> (0x492), is also semantically different from its brothers, whose 64 bits
> consist of all allow-1, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
> Therefore, its init_vmx_capabilities() is a little different from others.
> 
> [1] ISE 6.2 "VMCS Changes"
> https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
> 
> [2] SDM Vol3. Appendix A.3
> 
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/msr-index.h   | 1 +
>  arch/x86/include/asm/vmxfeatures.h | 3 ++-
>  arch/x86/kernel/cpu/feat_ctl.c     | 9 ++++++++-
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3faf0f97edb1..1d180f883c32 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -938,6 +938,7 @@
>  #define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
>  #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
>  #define MSR_IA32_VMX_VMFUNC             0x00000491
> +#define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
>  
>  /* VMX_BASIC bits and bitmasks */
>  #define VMX_BASIC_VMCS_SIZE_SHIFT	32
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index d9a74681a77d..ff20776dc83b 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -5,7 +5,7 @@
>  /*
>   * Defines VMX CPU feature bits
>   */
> -#define NVMXINTS			3 /* N 32-bit words worth of info */
> +#define NVMXINTS			5 /* N 32-bit words worth of info */
>  
>  /*
>   * Note: If the comment begins with a quoted string, that string is used
> @@ -43,6 +43,7 @@
>  #define VMX_FEATURE_RDTSC_EXITING	( 1*32+ 12) /* "" VM-Exit on RDTSC */
>  #define VMX_FEATURE_CR3_LOAD_EXITING	( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
>  #define VMX_FEATURE_CR3_STORE_EXITING	( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
> +#define VMX_FEATURE_TERTIARY_CONTROLS	( 1*32+ 17) /* "" Enable Tertiary VM-Execution Controls */
>  #define VMX_FEATURE_CR8_LOAD_EXITING	( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
>  #define VMX_FEATURE_CR8_STORE_EXITING	( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
>  #define VMX_FEATURE_VIRTUAL_TPR		( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
> diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
> index da696eb4821a..993697e71854 100644
> --- a/arch/x86/kernel/cpu/feat_ctl.c
> +++ b/arch/x86/kernel/cpu/feat_ctl.c
> @@ -15,6 +15,8 @@ enum vmx_feature_leafs {
>  	MISC_FEATURES = 0,
>  	PRIMARY_CTLS,
>  	SECONDARY_CTLS,
> +	TERTIARY_CTLS_LOW,
> +	TERTIARY_CTLS_HIGH,
>  	NR_VMX_FEATURE_WORDS,
>  };
>  
> @@ -22,7 +24,7 @@ enum vmx_feature_leafs {
>  
>  static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>  {
> -	u32 supported, funcs, ept, vpid, ign;
> +	u32 supported, funcs, ept, vpid, ign, low, high;
>  
>  	BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
>  
> @@ -42,6 +44,11 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>  	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
>  	c->vmx_capability[SECONDARY_CTLS] = supported;
>  
> +	/* All 64 bits of tertiary controls MSR are allowed-1 settings. */
> +	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
> +	c->vmx_capability[TERTIARY_CTLS_LOW] = low;
> +	c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
> +
>  	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
>  	rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
>  

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
  2022-02-25  8:22 ` [PATCH v6 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
@ 2022-02-25 14:24   ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:24 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> The Tertiary VM-Exec Control, different from previous control fields, is 64
> bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
> support both 32 bit and 64 bit fields' auxiliary functions building.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
>  1 file changed, 31 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 7f2c82e7f38f..e07c76974fb0 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -456,35 +456,38 @@ static inline u8 vmx_get_rvi(void)
>  	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
>  }
>  
> -#define BUILD_CONTROLS_SHADOW(lname, uname)				    \
> -static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)	    \
> -{									    \
> -	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		    \
> -		vmcs_write32(uname, val);				    \
> -		vmx->loaded_vmcs->controls_shadow.lname = val;		    \
> -	}								    \
> -}									    \
> -static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs)	    \
> -{									    \
> -	return vmcs->controls_shadow.lname;				    \
> -}									    \
> -static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)		    \
> -{									    \
> -	return __##lname##_controls_get(vmx->loaded_vmcs);		    \
> -}									    \
> -static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val)   \
> -{									    \
> -	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	    \
> -}									    \
> -static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
> -{									    \
> -	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	    \
> +#define BUILD_CONTROLS_SHADOW(lname, uname, bits)			\
> +static inline								\
> +void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		\
> +		vmcs_write##bits(uname, val);				\
> +		vmx->loaded_vmcs->controls_shadow.lname = val;		\
> +	}								\
> +}									\
> +static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
> +{									\
> +	return vmcs->controls_shadow.lname;				\
> +}									\
> +static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	\
> +{									\
> +	return __##lname##_controls_get(vmx->loaded_vmcs);		\
> +}									\
> +static inline								\
> +void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val)		\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) | val);	\
> +}									\
> +static inline								\
> +void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val)	\
> +{									\
> +	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);	\
>  }
> -BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
> -BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
> -BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
> -BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
> +BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
> +BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
>  
>  /*
>   * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the

I must admit that this will make it a bit harder to find references in the code.
I personally would just use a pair of 32-bit capabilities, but I don't have a
strong opinion on this.
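As an aside, the caching pattern the extended macro generates can be sketched outside the kernel. The snippet below mimics what BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64) expands to, with vmcs_write64() replaced by a counting stub so the write-avoidance is observable; function names follow the macro output, but this is an illustrative sketch, not kernel code.

```c
#include <stdint.h>

/* Stub standing in for the real vmcs_write64(); counts hardware writes
 * so the caching behaviour is observable. */
static int vmwrite_count;
static void vmcs_write64_stub(uint64_t field, uint64_t val)
{
	(void)field;
	(void)val;
	vmwrite_count++;
}

/* Shadow copy of the 64-bit tertiary exec control, mirroring
 * loaded_vmcs->controls_shadow.tertiary_exec in the patch. */
static uint64_t tertiary_exec_shadow;

/* Equivalent of the macro-generated tertiary_exec_controls_set():
 * only touch the (stubbed) VMCS when the value actually changes. */
static void tertiary_exec_controls_set(uint64_t val)
{
	if (tertiary_exec_shadow != val) {
		vmcs_write64_stub(0x2034 /* TERTIARY_VM_EXEC_CONTROL */, val);
		tertiary_exec_shadow = val;
	}
}

static uint64_t tertiary_exec_controls_get(void)
{
	return tertiary_exec_shadow;
}

static void tertiary_exec_controls_setbit(uint64_t bit)
{
	tertiary_exec_controls_set(tertiary_exec_controls_get() | bit);
}
```

Setting the same value twice costs only one VMCS write, which is the point of the shadow.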

Thus:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
  2022-02-25  8:22 ` [PATCH v6 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
@ 2022-02-25 14:30   ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:30 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> Check the VMX feature for tertiary execution control during VMCS config
> setup. The sub-features to be enabled in the tertiary execution control
> are adjusted according to hardware capabilities, although no sub-feature
> is enabled in this patch.
> 
> EVMCSv1 doesn't support the tertiary VM-execution control, so disable it
> when EVMCSv1 is in use. Also define the auxiliary functions for the
> tertiary control field here, using the new BUILD_CONTROLS_SHADOW().
> 
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/include/asm/vmx.h      |  3 +++
>  arch/x86/kvm/vmx/capabilities.h |  7 ++++++
>  arch/x86/kvm/vmx/evmcs.c        |  2 ++
>  arch/x86/kvm/vmx/evmcs.h        |  1 +
>  arch/x86/kvm/vmx/vmcs.h         |  1 +
>  arch/x86/kvm/vmx/vmx.c          | 38 ++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.h          |  1 +
>  7 files changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0ffaa3156a4e..8c929596a299 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -31,6 +31,7 @@
>  #define CPU_BASED_RDTSC_EXITING                 VMCS_CONTROL_BIT(RDTSC_EXITING)
>  #define CPU_BASED_CR3_LOAD_EXITING		VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
>  #define CPU_BASED_CR3_STORE_EXITING		VMCS_CONTROL_BIT(CR3_STORE_EXITING)
> +#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS	VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
>  #define CPU_BASED_CR8_LOAD_EXITING              VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
>  #define CPU_BASED_CR8_STORE_EXITING             VMCS_CONTROL_BIT(CR8_STORE_EXITING)
>  #define CPU_BASED_TPR_SHADOW                    VMCS_CONTROL_BIT(VIRTUAL_TPR)
> @@ -221,6 +222,8 @@ enum vmcs_field {
>  	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
>  	TSC_MULTIPLIER                  = 0x00002032,
>  	TSC_MULTIPLIER_HIGH             = 0x00002033,
> +	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
> +	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 3f430e218375..31f3d88b3e4d 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -59,6 +59,7 @@ struct vmcs_config {
>  	u32 pin_based_exec_ctrl;
>  	u32 cpu_based_exec_ctrl;
>  	u32 cpu_based_2nd_exec_ctrl;
> +	u64 cpu_based_3rd_exec_ctrl;
>  	u32 vmexit_ctrl;
>  	u32 vmentry_ctrl;
>  	struct nested_vmx_msrs nested;
> @@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
>  		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
>  }
>  
> +static inline bool cpu_has_tertiary_exec_ctrls(void)
> +{
> +	return vmcs_config.cpu_based_exec_ctrl &
> +		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
> +}
> +
>  static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>  {
>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
> diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
> index 87e3dc10edf4..6a61b1ae7942 100644
> --- a/arch/x86/kvm/vmx/evmcs.c
> +++ b/arch/x86/kvm/vmx/evmcs.c
> @@ -297,8 +297,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
>  #if IS_ENABLED(CONFIG_HYPERV)
>  __init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
>  {
> +	vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
>  	vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
>  	vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
> +	vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
>  
>  	vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
>  	vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
> diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
> index 8d70f9aea94b..f886a8ff0342 100644
> --- a/arch/x86/kvm/vmx/evmcs.h
> +++ b/arch/x86/kvm/vmx/evmcs.h
> @@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
>   */
>  #define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
>  				    PIN_BASED_VMX_PREEMPTION_TIMER)
> +#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
>  #define EVMCS1_UNSUPPORTED_2NDEXEC					\
>  	(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |				\
>  	 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |			\
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index e325c290a816..e18dc68eeeeb 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -50,6 +50,7 @@ struct vmcs_controls_shadow {
>  	u32 pin;
>  	u32 exec;
>  	u32 secondary_exec;
> +	u64 tertiary_exec;
>  };
>  
>  /*
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c569dc2b9192..8a5713d49635 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2422,6 +2422,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>  	return 0;
>  }
>  
> +static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
> +					 u32 msr, u64 *result)
> +{
> +	u64 allowed1;
> +
> +	rdmsrl(msr, allowed1);
> +
> +	/* Ensure minimum (required) set of control bits are supported. */
> +	if (ctl_min & ~allowed1)
> +		return -EIO;
> +
> +	*result = (ctl_min | ctl_opt) & allowed1;
> +	return 0;
> +}
> +
>  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  				    struct vmx_capability *vmx_cap)
>  {
> @@ -2430,6 +2445,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	u32 _pin_based_exec_control = 0;
>  	u32 _cpu_based_exec_control = 0;
>  	u32 _cpu_based_2nd_exec_control = 0;
> +	u64 _cpu_based_3rd_exec_control = 0;
>  	u32 _vmexit_control = 0;
>  	u32 _vmentry_control = 0;
>  
> @@ -2451,7 +2467,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  
>  	opt = CPU_BASED_TPR_SHADOW |
>  	      CPU_BASED_USE_MSR_BITMAPS |
> -	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
> +	      CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
>  				&_cpu_based_exec_control) < 0)
>  		return -EIO;
> @@ -2525,6 +2542,16 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  				"1-setting enable VPID VM-execution control\n");
>  	}
>  
> +	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
> +		u64 opt3 = 0;
> +		u64 min3 = 0;
> +
> +		if (adjust_vmx_controls_64(min3, opt3,
> +					   MSR_IA32_VMX_PROCBASED_CTLS3,
> +					   &_cpu_based_3rd_exec_control))
> +			return -EIO;
> +	}
> +
>  	min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
>  #ifdef CONFIG_X86_64
>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
> @@ -2611,6 +2638,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
>  	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
>  	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
> +	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
>  	vmcs_conf->vmexit_ctrl         = _vmexit_control;
>  	vmcs_conf->vmentry_ctrl        = _vmentry_control;
>  
> @@ -4230,6 +4258,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>  	return exec_control;
>  }
>  
> +static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
> +{
> +	return vmcs_config.cpu_based_3rd_exec_ctrl;
> +}
> +
>  /*
>   * Adjust a single secondary execution control bit to intercept/allow an
>   * instruction in the guest.  This is usually done based on whether or not a
> @@ -4395,6 +4428,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	if (cpu_has_secondary_exec_ctrls())
>  		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
>  
> +	if (cpu_has_tertiary_exec_ctrls())
> +		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
> +
>  	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
>  		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>  		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index e07c76974fb0..d4a647d3ed4a 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -488,6 +488,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
>  BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
>  BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
>  BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
> +BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
>  
>  /*
>   * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
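For reference, the adjust_vmx_controls_64() helper added in the patch above reduces to a simple check against the allowed-1 bits of the capability MSR: required bits missing from allowed-1 are a hard error, optional bits are silently dropped. The sketch below stubs out the MSR read (the allowed1 value is passed in directly) and is otherwise a standalone restatement of the same logic:

```c
#include <stdint.h>
#include <errno.h>

/* Standalone restatement of adjust_vmx_controls_64(): allowed1 is the
 * high (allowed-1) half of e.g. MSR_IA32_VMX_PROCBASED_CTLS3, passed in
 * directly instead of being read via rdmsrl(). */
static int adjust_vmx_controls_64(uint64_t ctl_min, uint64_t ctl_opt,
				  uint64_t allowed1, uint64_t *result)
{
	/* Ensure the minimum (required) set of control bits is supported. */
	if (ctl_min & ~allowed1)
		return -EIO;

	/* Optional bits the hardware doesn't support are dropped. */
	*result = (ctl_min | ctl_opt) & allowed1;
	return 0;
}
```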


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 4/9] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
  2022-02-25  8:22 ` [PATCH v6 4/9] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
@ 2022-02-25 14:31   ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:31 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao, Robert Hoo

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> Report the tertiary_exec_control field in dump_vmcs()
> 
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8a5713d49635..7beba7a9f247 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5891,6 +5891,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 vmentry_ctl, vmexit_ctl;
>  	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> +	u64 tertiary_exec_control;
>  	unsigned long cr4;
>  	int efer_slot;
>  
> @@ -5904,9 +5905,16 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
>  	cr4 = vmcs_readl(GUEST_CR4);
> -	secondary_exec_control = 0;
> +
>  	if (cpu_has_secondary_exec_ctrls())
>  		secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	else
> +		secondary_exec_control = 0;
> +
> +	if (cpu_has_tertiary_exec_ctrls())
> +		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
> +	else
> +		tertiary_exec_control = 0;
>  
>  	pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
>  	       vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
> @@ -6006,9 +6014,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
>  
>  	pr_err("*** Control State ***\n");
> -	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
> -	       pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
> -	pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
> +	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
> +	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
> +	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
> +	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
>  	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
>  	       vmcs_read32(EXCEPTION_BITMAP),
>  	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-02-25  8:22 ` [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
@ 2022-02-25 14:44   ` Maxim Levitsky
  2022-02-25 15:29     ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:44 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
> i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
> the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
> were previously impossible.
> 
> Note, x2APIC MSR writes are 64 bits wide.
> 
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 629c116b0d3e..e4bcdab1fac0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
>  #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
>  /* step-by-step approximation to mitigate fluctuation */
>  #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
> +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
>  
>  static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
>  {
> @@ -2227,10 +2228,28 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>  /* emulate APIC access in a trap manner */
>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>  {
> -	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 val;
> +
> +	if (apic_x2apic_mode(apic)) {
> +		/*
> +		 * When guest APIC is in x2APIC mode and IPI virtualization
> +		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
> +		 * on Intel hardware. Other offsets are not possible.
> +		 */
> +		if (WARN_ON_ONCE(offset != APIC_ICR))
> +			return;
>  
> -	/* TODO: optimize to just emulate side effect w/o one more write */
> -	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
> +		kvm_lapic_msr_read(apic, offset, &val);
> +		if (val & APIC_ICR_BUSY)
> +			kvm_x2apic_icr_write(apic, val);
> +		else
> +			kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
I don't fully understand the above code.

First of all, where is kvm_x2apic_icr_write defined?

Second, I thought that the busy bit is not used in x2apic mode?
At least in Intel's SDM, section 10.12.9 'ICR Operation in x2APIC Mode',
this bit is not defined.


Best regards,
	Maxim Levitsky


> +	} else {
> +		val = kvm_lapic_get_reg(apic, offset);
> +		/* TODO: optimize to just emulate side effect w/o one more write */
> +		kvm_lapic_reg_write(apic, offset, (u32)val);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
>  



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25  8:22 ` [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
@ 2022-02-25 14:46   ` Maxim Levitsky
  2022-02-25 14:56     ` David Woodhouse
  2022-03-01  8:03     ` Chao Gao
  2022-03-08 23:04   ` Sean Christopherson
  1 sibling, 2 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 14:46 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Maxim Levitsky <mlevitsk@redhat.com>
> 
> No normal guest has any reason to change physical APIC IDs, and
> allowing this introduces bugs into APIC acceleration code.
> 
> Recent Intel hardware also just ignores writes to APIC_ID in
> xAPIC mode. More background can be found at:
> https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> 
> It looks like there is not much value in supporting a writable xAPIC
> ID in the guest, except for some old and crazy use cases which would
> probably fail on real hardware. So, make the xAPIC ID read-only for
> KVM guests.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>

Assuming that this is approved and accepted upstream,
that is even better than my proposal of doing this
when APICv is enabled.

Since the apic id is now always read-only, we should not
forget to clean up some parts of kvm, like kvm_recalculate_apic_map,
which are not needed anymore.

Best regards,
	Maxim Levitsky

> ---
>  arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e4bcdab1fac0..b38288c8a94f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  
>  	switch (reg) {
>  	case APIC_ID:		/* Local APIC ID */
> -		if (!apic_x2apic_mode(apic))
> -			kvm_apic_set_xapic_id(apic, val >> 24);
> -		else
> +		if (apic_x2apic_mode(apic)) {
>  			ret = 1;
> +			break;
> +		}
> +		/* Don't allow changing APIC ID to avoid unexpected issues */
> +		if ((val >> 24) != apic->vcpu->vcpu_id) {
> +			kvm_vm_bugged(apic->vcpu->kvm);
> +			break;
> +		}
> +
> +		kvm_apic_set_xapic_id(apic, val >> 24);
>  		break;
>  
>  	case APIC_TASKPRI:
> @@ -2631,11 +2638,15 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>  static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>  		struct kvm_lapic_state *s, bool set)
>  {
> -	if (apic_x2apic_mode(vcpu->arch.apic)) {
> -		u32 *id = (u32 *)(s->regs + APIC_ID);
> -		u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> -		u64 icr;
> +	u32 *id = (u32 *)(s->regs + APIC_ID);
> +	u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> +	u64 icr;
>  
> +	if (!apic_x2apic_mode(vcpu->arch.apic)) {
> +		/* Don't allow changing APIC ID to avoid unexpected issues */
> +		if ((*id >> 24) != vcpu->vcpu_id)
> +			return -EINVAL;
> +	} else {
>  		if (vcpu->kvm->arch.x2apic_format) {
>  			if (*id != vcpu->vcpu_id)
>  				return -EINVAL;



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25 14:46   ` Maxim Levitsky
@ 2022-02-25 14:56     ` David Woodhouse
  2022-02-25 15:11       ` Maxim Levitsky
  2022-03-01  8:03     ` Chao Gao
  1 sibling, 1 reply; 41+ messages in thread
From: David Woodhouse @ 2022-02-25 14:56 UTC (permalink / raw)
  To: Maxim Levitsky, Zeng Guang, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	Dave Hansen, Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

[-- Attachment #1: Type: text/plain, Size: 1327 bytes --]

On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> > From: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > No normal guest has any reason to change physical APIC IDs, and
> > allowing this introduces bugs into APIC acceleration code.
> > 
> > Recent Intel hardware also just ignores writes to APIC_ID in
> > xAPIC mode. More background can be found at:
> > https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> > 
> > It looks like there is not much value in supporting a writable xAPIC
> > ID in the guest, except for some old and crazy use cases which would
> > probably fail on real hardware. So, make the xAPIC ID read-only for
> > KVM guests.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> 
> Assuming that this is approved and accepted upstream,
> that is even better than my proposal of doing this
> when APICv is enabled.
> 
> Since the apic id is now always read-only, we should not
> forget to clean up some parts of kvm, like kvm_recalculate_apic_map,
> which are not needed anymore.

Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
a linear search over all the vCPUs when there isn't a 1:1
correspondence with the vCPU index?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25 14:56     ` David Woodhouse
@ 2022-02-25 15:11       ` Maxim Levitsky
  2022-02-25 15:42         ` David Woodhouse
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 15:11 UTC (permalink / raw)
  To: David Woodhouse, Zeng Guang, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	Dave Hansen, Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> > > From: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > No normal guest has any reason to change physical APIC IDs, and
> > > allowing this introduces bugs into APIC acceleration code.
> > > 
> > > Recent Intel hardware also just ignores writes to APIC_ID in
> > > xAPIC mode. More background can be found at:
> > > https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
> > > 
> > > It looks like there is not much value in supporting a writable xAPIC
> > > ID in the guest, except for some old and crazy use cases which would
> > > probably fail on real hardware. So, make the xAPIC ID read-only for
> > > KVM guests.
> > > 
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> > 
> > Assuming that this is approved and accepted upstream,
> > that is even better than my proposal of doing this
> > when APICv is enabled.
> > 
> > Since the apic id is now always read-only, we should not
> > forget to clean up some parts of kvm, like kvm_recalculate_apic_map,
> > which are not needed anymore.
> 
> Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> a linear search over all the vCPUs when there isn't a 1:1
> correspondence with the vCPU index?

I don't think so, since the vcpu id can still be set by userspace to anything,
and it is even used to encode topology.


However, a hash table could still be used there to speed it up regardless
of the read-only apic id, IMHO.

Or, even better than a hash table, I see that KVM already 
limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
bits of topology are used:

"In the worst case, we'll need less than one extra bit for the
 * Core ID, and less than one extra bit for the Package (Die) ID,
 * so ratio of 4 should be enough"

Thus, we could in theory standardize the location of those bits in the
apic_id (even with a new KVM extension, doing a linear search for legacy
userspace), and then just mask/shift the topology bits.

The KVM extension would define how many low (or high?) bits of vcpu_id
are topology bits.
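A toy sketch of the scheme described above, assuming a hypothetical extension that reserves the low two bits of vcpu_id for topology (neither the width nor the position of such a field is defined by KVM today; TOPO_BITS and both helper names are invented for illustration):

```c
#include <stdint.h>

/* Hypothetical layout: the low TOPO_BITS bits of vcpu_id carry topology
 * (e.g. thread/core) information, the remaining bits form a dense vCPU
 * index.  This matches the "ratio of 4" comment only as an example. */
#define TOPO_BITS 2

/* Dense index usable for a direct array lookup instead of a scan. */
static uint32_t vcpu_index_from_id(uint32_t vcpu_id)
{
	return vcpu_id >> TOPO_BITS;
}

/* The topology bits encoded alongside the index. */
static uint32_t vcpu_topo_from_id(uint32_t vcpu_id)
{
	return vcpu_id & ((1u << TOPO_BITS) - 1);
}
```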

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
  2022-02-25 14:44   ` Maxim Levitsky
@ 2022-02-25 15:29     ` Chao Gao
  0 siblings, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-02-25 15:29 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

On Fri, Feb 25, 2022 at 04:44:05PM +0200, Maxim Levitsky wrote:
>On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
>> Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
>> i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
>> the WRMSR.  Add support for handling "nodecode" x2APIC writes, which
>> were previously impossible.
>> 
>> Note, x2APIC MSR writes are 64 bits wide.
>> 
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>  arch/x86/kvm/lapic.c | 25 ++++++++++++++++++++++---
>>  1 file changed, 22 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 629c116b0d3e..e4bcdab1fac0 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -67,6 +67,7 @@ static bool lapic_timer_advance_dynamic __read_mostly;
>>  #define LAPIC_TIMER_ADVANCE_NS_MAX     5000
>>  /* step-by-step approximation to mitigate fluctuation */
>>  #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
>> +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
>>  
>>  static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
>>  {
>> @@ -2227,10 +2228,28 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>>  /* emulate APIC access in a trap manner */
>>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>>  {
>> -	u32 val = kvm_lapic_get_reg(vcpu->arch.apic, offset);
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	u64 val;
>> +
>> +	if (apic_x2apic_mode(apic)) {
>> +		/*
>> +		 * When guest APIC is in x2APIC mode and IPI virtualization
>> +		 * is enabled, accessing APIC_ICR may cause trap-like VM-exit
>> +		 * on Intel hardware. Other offsets are not possible.
>> +		 */
>> +		if (WARN_ON_ONCE(offset != APIC_ICR))
>> +			return;
>>  
>> -	/* TODO: optimize to just emulate side effect w/o one more write */
>> -	kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
>> +		kvm_lapic_msr_read(apic, offset, &val);
>> +		if (val & APIC_ICR_BUSY)
>> +			kvm_x2apic_icr_write(apic, val);
>> +		else
>> +			kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32));
>I don't fully understand the above code.
>
>First of all, where is kvm_x2apic_icr_write defined?

Sean introduced it in his "prep work for VMX IPI virtualization" series,
which has been merged into the kvm/queue branch.

https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?h=queue&id=7a641ca0c219e4bbe102f2634dbc7e06072fcd3c

>
>Second, I thought that the busy bit is not used in x2apic mode?
>At least in Intel's SDM, section 10.12.9 'ICR Operation in x2APIC Mode',
>this bit is not defined.

You are right. We will remove the pointless check against APIC_ICR_BUSY and
just invoke kvm_apic_send_ipi().

In that section, the SDM also says:
With the removal of the Delivery Status bit, system software no longer has a
reason to read the ICR. It remains readable only to aid in debugging; however,
***software should not assume the value returned by reading the ICR is the last
written value***.
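For clarity, the (u32)val / (u32)(val >> 32) split in the hunk quoted above follows the x2APIC ICR layout: bits 31:0 carry the delivery fields (vector, delivery mode, etc.) and bits 63:32 carry the destination ID. A standalone sketch of that split (the struct and function names are illustrative, not KVM code):

```c
#include <stdint.h>

/* In x2APIC mode the ICR is a single 64-bit MSR: bits 31:0 hold the
 * delivery fields and bits 63:32 hold the destination, which is why
 * kvm_apic_send_ipi() is passed (u32)val and (u32)(val >> 32). */
struct icr_parts {
	uint32_t icr_low;	/* vector, delivery mode, level, etc. */
	uint32_t dest_id;	/* x2APIC destination ID */
};

static struct icr_parts split_x2apic_icr(uint64_t icr)
{
	struct icr_parts p = {
		.icr_low = (uint32_t)icr,
		.dest_id = (uint32_t)(icr >> 32),
	};
	return p;
}
```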

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25 15:11       ` Maxim Levitsky
@ 2022-02-25 15:42         ` David Woodhouse
  2022-02-25 16:12           ` Maxim Levitsky
  0 siblings, 1 reply; 41+ messages in thread
From: David Woodhouse @ 2022-02-25 15:42 UTC (permalink / raw)
  To: Maxim Levitsky, Zeng Guang, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	Dave Hansen, Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

[-- Attachment #1: Type: text/plain, Size: 1421 bytes --]

On Fri, 2022-02-25 at 17:11 +0200, Maxim Levitsky wrote:
> On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> > On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > > Assuming that this is approved and accepted upstream,
> > > that is even better that my proposal of doing this
> > > when APICv is enabled.
> > > 
> > > Since now apic id is always read only, now we should not 
> > > forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> > > which are not needed anymore
> > 
> > Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> > a linear search over all the vCPUs when there isn't a 1:1
> > correspondence with the vCPU index?
> 
> I don't think so since vcpu id can still be set by userspace to anything,
> and this is even used to encode topology in it.

Yes, but it can only be set at vCPU creation time and it has to be
unique.

> However a hash table can still be used there to speed it up regardless of
> read-only apic id IMHO.
> 
> Or, even better than a hash table, I see that KVM already 
> limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
> bits of topology are used:

We already have the kvm_apic_map which provides a fast lookup. The key
point here is that the APIC ID can't *change* from vcpu->vcpu_id any
more, so we can actually use the kvm_apic_map for kvm_get_vcpu_by_id()
now, can't we?




* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25 15:42         ` David Woodhouse
@ 2022-02-25 16:12           ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 16:12 UTC (permalink / raw)
  To: David Woodhouse, Zeng Guang, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	Dave Hansen, Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 15:42 +0000, David Woodhouse wrote:
> On Fri, 2022-02-25 at 17:11 +0200, Maxim Levitsky wrote:
> > On Fri, 2022-02-25 at 14:56 +0000, David Woodhouse wrote:
> > > On Fri, 2022-02-25 at 16:46 +0200, Maxim Levitsky wrote:
> > > > Assuming that this is approved and accepted upstream,
> > > > that is even better that my proposal of doing this
> > > > when APICv is enabled.
> > > > 
> > > > Since now apic id is always read only, now we should not 
> > > > forget to clean up some parts of kvm like kvm_recalculate_apic_map,
> > > > which are not needed anymore
> > > 
> > > Can we also now optimise kvm_get_vcpu_by_id() so it doesn't have to do
> > > a linear search over all the vCPUs when there isn't a 1:1
> > > correspondence with the vCPU index?
> > 
> > I don't think so since vcpu id can still be set by userspace to anything,
> > and this is even used to encode topology in it.
> 
> Yes, but it can only be set at vCPU creation time and it has to be
> unique.
> 
> > However a hash table can still be used there to speed it up regardless of
> > read-only apic id IMHO.
> > 
> > Or, even better than a hash table, I see that KVM already 
> > limits vcpu_id to KVM_MAX_VCPUS * 4 with a comment that only two extra
> > bits of topology are used:
> 
> We already have the kvm_apic_map which provides a fast lookup. The key
> point here is that the APIC ID can't *change* from vcpu->vcpu_id any
> more, so we can actually use the kvm_apic_map for kvm_get_vcpu_by_id()
> now, can't we?
> 
Right! I wrote part of my response while still assuming that vcpu_id
could be any 32-bit number (hence the hash table),
and only later checked that it is capped at KVM_MAX_VCPUS * 4, which isn't a big number,
plus, as I now see in kvm_recalculate_apic_map,
the map is dynamically allocated up to the max APIC ID.

(Technically speaking, an array is a hash table.)

Now the map would only need to be rebuilt a few times, when new vCPUs are added,
and could be used to locate a vCPU by its APIC ID.

I very much hope this patch is accepted so that all of this can be done.

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 7/9] KVM: VMX: enable IPI virtualization
  2022-02-25  8:22 ` [PATCH v6 7/9] KVM: VMX: enable IPI virtualization Zeng Guang
@ 2022-02-25 17:19   ` Maxim Levitsky
  2022-03-01  9:21     ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 17:19 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> From: Gao Chao <chao.gao@intel.com>
> 
> With IPI virtualization enabled, the processor emulates writes to
> APIC registers that would send IPIs. The processor sets the bit
> corresponding to the vector in target vCPU's PIR and may send a
> notification (IPI) specified by NDST and NV fields in target vCPU's
> Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
> engine does when dealing with posted interrupt from devices.
> 
> A PID-pointer table is used by the processor to locate the PID of a
> vCPU with the vCPU's APIC ID.
> 
> As with VT-d PI, if a vCPU goes into the blocked state, the VMM needs to
> switch its notification vector to the wakeup vector. This ensures that when
> an IPI for a blocked vCPU arrives, the VMM gets control and can wake up the
> blocked vCPU. And if a vCPU is preempted, its posted interrupt notification
> is suppressed.
> 
> Note that IPI virtualization can only virtualize physical-addressing,
> flat-mode, unicast IPIs. Sending other IPIs still causes a
> trap-like APIC-write VM-exit that needs to be handled by the VMM.
> 
> Signed-off-by: Gao Chao <chao.gao@intel.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/include/asm/vmx.h         |  8 ++++
>  arch/x86/include/asm/vmxfeatures.h |  2 +
>  arch/x86/kvm/vmx/capabilities.h    |  6 +++
>  arch/x86/kvm/vmx/posted_intr.c     | 12 ++++-
>  arch/x86/kvm/vmx/vmx.c             | 74 +++++++++++++++++++++++++++---
>  arch/x86/kvm/vmx/vmx.h             |  3 ++
>  6 files changed, 97 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 8c929596a299..b79b6438acaa 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -76,6 +76,11 @@
>  #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
>  #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
>  
> +/*
> + * Definitions of Tertiary Processor-Based VM-Execution Controls.
> + */
> +#define TERTIARY_EXEC_IPI_VIRT			VMCS_CONTROL_BIT(IPI_VIRT)
> +
>  #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
>  #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
>  #define PIN_BASED_VIRTUAL_NMIS                  VMCS_CONTROL_BIT(VIRTUAL_NMIS)
> @@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
>  enum vmcs_field {
>  	VIRTUAL_PROCESSOR_ID            = 0x00000000,
>  	POSTED_INTR_NV                  = 0x00000002,
> +	LAST_PID_POINTER_INDEX		= 0x00000008,
>  	GUEST_ES_SELECTOR               = 0x00000800,
>  	GUEST_CS_SELECTOR               = 0x00000802,
>  	GUEST_SS_SELECTOR               = 0x00000804,
> @@ -224,6 +230,8 @@ enum vmcs_field {
>  	TSC_MULTIPLIER_HIGH             = 0x00002033,
>  	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
>  	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
> +	PID_POINTER_TABLE		= 0x00002042,
> +	PID_POINTER_TABLE_HIGH		= 0x00002043,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index ff20776dc83b..7ce616af2db2 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -86,4 +86,6 @@
>  #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
>  #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
>  
> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
> +#define VMX_FEATURE_IPI_VIRT		(3*32 +  4) /* "" Enable IPI virtualization */
>  #endif /* _ASM_X86_VMXFEATURES_H */
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 31f3d88b3e4d..5f656c9e33be 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -13,6 +13,7 @@ extern bool __read_mostly enable_ept;
>  extern bool __read_mostly enable_unrestricted_guest;
>  extern bool __read_mostly enable_ept_ad_bits;
>  extern bool __read_mostly enable_pml;
> +extern bool __read_mostly enable_ipiv;
>  extern int __read_mostly pt_mode;
>  
>  #define PT_MODE_SYSTEM		0
> @@ -283,6 +284,11 @@ static inline bool cpu_has_vmx_apicv(void)
>  		cpu_has_vmx_posted_intr();
>  }
>  
> +static inline bool cpu_has_vmx_ipiv(void)
> +{
> +	return vmcs_config.cpu_based_3rd_exec_ctrl & TERTIARY_EXEC_IPI_VIRT;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index aa1fe9085d77..90124a30c074 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -177,11 +177,21 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	local_irq_restore(flags);
>  }
>  
> +static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
> +{
> +	return irqchip_in_kernel(kvm) && enable_ipiv;
> +}
> +
> +static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
> +{
> +	return vmx_can_use_ipiv_pi(kvm) || vmx_can_use_vtd_pi(kvm);

It took me a while to figure that out.
 
vmx_can_use_vtd_pi() returns true when the VM can be targeted by posted
interrupts from the IOMMU, which leads to:
 
1. updating the NV vector and SN bit on vcpu_load/vcpu_put, to let the
IOMMU know where the vCPU really runs;
 
2. configuring the posted interrupts in vmx_pi_update_irte().
 
 
Now IPIv will also use the same NV vector and SN bit for IPI virtualization,
thus they have to be kept up to date on vcpu load/put.
 
I would appreciate a comment about this in vmx_can_use_posted_interrupts(),
because posted interrupts can mean too many things, like a host->guest
posted interrupt, which is sent as a plain interrupt.
 
Maybe also rename the function to something like
vmx_need_up_to_date_nv_sn(). That sounds silly to me, so
maybe something else.


> +}
> +
>  void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>  {
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> -	if (!vmx_can_use_vtd_pi(vcpu->kvm))
> +	if (!vmx_can_use_posted_interrupts(vcpu->kvm))
I see it is used here.
>  		return;
>  
>  	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7beba7a9f247..0cb141c277ef 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -105,6 +105,9 @@ module_param(fasteoi, bool, S_IRUGO);
>  
>  module_param(enable_apicv, bool, S_IRUGO);
>  
> +bool __read_mostly enable_ipiv = true;
> +module_param(enable_ipiv, bool, 0444);
> +
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -227,6 +230,11 @@ static const struct {
>  };
>  
>  #define L1D_CACHE_ORDER 4
> +
> +/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
> +#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
> +#define PID_TABLE_ENTRY_VALID 1
> +
>  static void *vmx_l1d_flush_pages;
>  
>  static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
> @@ -2543,7 +2551,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	}
>  
>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
> -		u64 opt3 = 0;
> +		u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
>  		u64 min3 = 0;
>  
>  		if (adjust_vmx_controls_64(min3, opt3,
> @@ -3898,6 +3906,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
>  		vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>  		vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
> +		if (enable_ipiv)
> +			vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),MSR_TYPE_RW);
>  	}
>  }
>  
> @@ -4219,14 +4229,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>  	if (cpu_has_secondary_exec_ctrls()) {
> -		if (kvm_vcpu_apicv_active(vcpu))
> +		if (kvm_vcpu_apicv_active(vcpu)) {
>  			secondary_exec_controls_setbit(vmx,
>  				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> -		else
> +			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
> +				tertiary_exec_controls_setbit(vmx,
> +						TERTIARY_EXEC_IPI_VIRT);
> +		} else {
>  			secondary_exec_controls_clearbit(vmx,
>  					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>  					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> +			if (cpu_has_tertiary_exec_ctrls())
> +				tertiary_exec_controls_clearbit(vmx,
> +						TERTIARY_EXEC_IPI_VIRT);
> +		}

Why check for cpu_has_tertiary_exec_ctrls()? Wouldn't it always be true?
(enable_ipiv has to be set to false if the CPU doesn't support IPIv,
and if it does support it, it will also support tertiary exec controls.)

I don't mind this as a precaution + for consistency with other code.


>  	}
>  
>  	vmx_update_msr_bitmap_x2apic(vcpu);
> @@ -4260,7 +4277,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>  
>  static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>  {
> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
> +
> +	/*
> +	 * IPI virtualization relies on APICv. Disable IPI
> +	 * virtualization if APICv is inhibited.
> +	 */
> +	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;

I am not 100% sure, but kvm_vcpu_apicv_active might not be the
best thing to check here, as it reflects the per-vCPU dynamic APICv inhibit.

It probably works, but it might be better to use enable_apicv
here, rely on the normal APICv inhibit, and there inhibit IPIv as well,
as you do in vmx_refresh_apicv_exec_ctrl().



> +
> +	return exec_control;
>  }
>  
>  /*
> @@ -4412,6 +4438,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  
>  static void init_vmcs(struct vcpu_vmx *vmx)
>  {
> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
> +
>  	if (nested)
>  		nested_vmx_set_vmcs_shadowing_bitmap();
>  
> @@ -4431,7 +4460,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	if (cpu_has_tertiary_exec_ctrls())
>  		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
>  
> -	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
> +	if (kvm_vcpu_apicv_active(vcpu)) {

Here too (pre-existing), I am also not 100% sure that kvm_vcpu_apicv_active
should be used. I haven't studied the APICv code enough to be certain.


>  		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>  		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>  		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> @@ -4441,6 +4470,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  
>  		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (enable_ipiv) {
> +			WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
> +				__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
> +			vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
> +			vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
> +		}
>  	}
>  
>  	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
> @@ -4492,7 +4528,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>  	}
>  
> -	vmx_write_encls_bitmap(&vmx->vcpu, NULL);
> +	vmx_write_encls_bitmap(vcpu, NULL);

I might have put the refactoring of using vcpu instead of &vmx->vcpu
into a separate patch, but I don't mind that too much.

>  
>  	if (vmx_pt_mode_is_host_guest()) {
>  		memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
> @@ -4508,7 +4544,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  
>  	if (cpu_has_vmx_tpr_shadow()) {
>  		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
> -		if (cpu_need_tpr_shadow(&vmx->vcpu))
> +		if (cpu_need_tpr_shadow(vcpu))
>  			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
>  				     __pa(vmx->vcpu.arch.apic->regs));
>  		vmcs_write32(TPR_THRESHOLD, 0);
> @@ -7165,6 +7201,18 @@ static int vmx_vm_init(struct kvm *kvm)
>  			break;
>  		}
>  	}
> +
> +	if (enable_ipiv) {
> +		struct page *pages;
> +
> +		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
> +		if (!pages)
> +			return -ENOMEM;
> +
> +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
> +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -7756,6 +7804,14 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>  	return supported & BIT(bit);
>  }
>  
> +static void vmx_vm_destroy(struct kvm *kvm)
> +{
> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
> +
> +	if (kvm_vmx->pid_table)
> +		free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);

Maybe add a warning checking that IPIv was actually enabled.
Then again, maybe that's overkill.


> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  	.name = "kvm_intel",
>  
> @@ -7768,6 +7824,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  
>  	.vm_size = sizeof(struct kvm_vmx),
>  	.vm_init = vmx_vm_init,
> +	.vm_destroy = vmx_vm_destroy,
>  
>  	.vcpu_create = vmx_create_vcpu,
>  	.vcpu_free = vmx_free_vcpu,
> @@ -8022,6 +8079,9 @@ static __init int hardware_setup(void)
>  	if (!enable_apicv)
>  		vmx_x86_ops.sync_pir_to_irr = NULL;
>  
> +	if (!enable_apicv || !cpu_has_vmx_ipiv())
> +		enable_ipiv = false;
> +
>  	if (cpu_has_vmx_tsc_scaling()) {
>  		kvm_has_tsc_control = true;
>  		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index d4a647d3ed4a..e7b0c00c9d43 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -365,6 +365,9 @@ struct kvm_vmx {
>  	unsigned int tss_addr;
>  	bool ept_identity_pagetable_done;
>  	gpa_t ept_identity_map_addr;
> +	/* PID table for IPI virtualization */
> +	u64 *pid_table;
> +	u16 pid_last_index;
>  };
>  
>  bool nested_vmx_allowed(struct kvm_vcpu *vcpu);


I might have missed something, but overall looks good.

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM
  2022-02-25  8:22 ` [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
@ 2022-02-25 17:22   ` Maxim Levitsky
  0 siblings, 0 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 17:22 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> Introduce a new max_vcpu_id in KVM for the x86 architecture. Userspace
> can assign the maximum possible vCPU ID for the current VM session using
> KVM_CAP_MAX_VCPU_ID of the KVM_ENABLE_CAP ioctl().
> 
> This is done for x86 only because the sole use case is to guide
> memory allocation for PID-pointer table, a structure needed to
> enable VMX IPI.
> 
> By default, max_vcpu_id is set to KVM_MAX_VCPU_IDS.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
> No new KVM capability is added to advertise support for a
> configurable maximum vCPU ID to userspace, because max_vcpu_id is
> just a hint/commitment allowing KVM to reduce the size of the
> PID-pointer table. But I am not 100% sure whether it is proper to do so.
> 
>  arch/x86/include/asm/kvm_host.h |  6 ++++++
>  arch/x86/kvm/x86.c              | 11 +++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6dcccb304775..db16aebd946c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1233,6 +1233,12 @@ struct kvm_arch {
>  	hpa_t	hv_root_tdp;
>  	spinlock_t hv_root_tdp_lock;
>  #endif
> +	/*
> +	 * VM-scope maximum vCPU ID. Used to determine the size of structures
> +	 * that increase along with the maximum vCPU ID, in which case, using
> +	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
> +	 */
> +	u32 max_vcpu_id;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f6fe9974cb5..ca17cc452bd3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5994,6 +5994,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		kvm->arch.exit_on_emulation_error = cap->args[0];
>  		r = 0;
>  		break;
> +	case KVM_CAP_MAX_VCPU_ID:
> +		if (cap->args[0] <= KVM_MAX_VCPU_IDS) {
> +			kvm->arch.max_vcpu_id = cap->args[0];
> +			r = 0;
> +		} else
> +			r = -E2BIG;
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> @@ -11067,6 +11074,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	struct page *page;
>  	int r;
>  
> +	if (vcpu->vcpu_id >= vcpu->kvm->arch.max_vcpu_id)
> +		return -E2BIG;
> +
>  	vcpu->arch.last_vmentry_cpu = -1;
>  	vcpu->arch.regs_avail = ~0;
>  	vcpu->arch.regs_dirty = ~0;
> @@ -11589,6 +11599,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
>  	kvm->arch.hv_root_tdp = INVALID_PAGE;
>  #endif
> +	kvm->arch.max_vcpu_id = KVM_MAX_VCPU_IDS;
>  
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

+ Reminder for myself to use this in the AVIC code as well, to limit the size of the physid table.
(The max size of the table (for now...) is 1 page (or 1/2 page for non-x2AVIC),
but no doubt it still takes some effort for the microcode to scan all the unused entries in it for nothing
(and soon my nested AVIC code will have to scan it as well).)

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table
  2022-02-25  8:22 ` [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table Zeng Guang
@ 2022-02-25 17:29   ` Maxim Levitsky
  2022-03-01  9:23     ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-02-25 17:29 UTC (permalink / raw)
  To: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang
  Cc: x86, linux-kernel, Robert Hu, Gao Chao

On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
> Currently KVM allocates 8 pages in advance for the Posted Interrupt
> Descriptor pointer (PID-pointer) table, to accommodate vCPUs with APIC IDs
> up to KVM_MAX_VCPU_IDS - 1. This policy wastes some memory because most
> VMs have fewer than 512 vCPUs and thus need just one page.
> 
> If the user hypervisor specifies the maximum practical vCPU ID prior to
> vCPU creation, IPIv can allocate only the essential memory for the
> PID-pointer table and reduce the memory footprint of VMs.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 45 ++++++++++++++++++++++++++++--------------
>  1 file changed, 30 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0cb141c277ef..22bfb4953289 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -230,9 +230,6 @@ static const struct {
>  };
>  
>  #define L1D_CACHE_ORDER 4
> -
> -/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
> -#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
>  #define PID_TABLE_ENTRY_VALID 1
>  
>  static void *vmx_l1d_flush_pages;
> @@ -4434,6 +4431,24 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	return exec_control;
>  }
>  
> +static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
> +{
> +	struct page *pages;
> +
> +	if(kvm_vmx->pid_table)
> +		return 0;
> +
> +	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> +			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));
> +
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	kvm_vmx->pid_table = (void *)page_address(pages);
> +	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
> +	return 0;
> +}
> +
>  #define VMX_XSS_EXIT_BITMAP 0
>  
>  static void init_vmcs(struct vcpu_vmx *vmx)
> @@ -7159,6 +7174,16 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>  			goto free_vmcs;
>  	}
>  
> +	if (enable_ipiv && kvm_vcpu_apicv_active(vcpu)) {
> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
> +
> +		mutex_lock(&vcpu->kvm->lock);
> +		err = vmx_alloc_pid_table(kvm_vmx);
> +		mutex_unlock(&vcpu->kvm->lock);
> +		if (err)
> +			goto free_vmcs;
> +	}

This could be dangerous. If APICv is temporarily inhibited,
this code won't run and we will end up without a PID table.

I think kvm_vcpu_apicv_active should just be dropped
from this condition.

Best regards,
	Maxim Levitsky


> +
>  	return 0;
>  
>  free_vmcs:
> @@ -7202,17 +7227,6 @@ static int vmx_vm_init(struct kvm *kvm)
>  		}
>  	}
>  
> -	if (enable_ipiv) {
> -		struct page *pages;
> -
> -		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
> -		if (!pages)
> -			return -ENOMEM;
> -
> -		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
> -		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
> -	}
> -
>  	return 0;
>  }
>  
> @@ -7809,7 +7823,8 @@ static void vmx_vm_destroy(struct kvm *kvm)
>  	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
>  
>  	if (kvm_vmx->pid_table)
> -		free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
> +		free_pages((unsigned long)kvm_vmx->pid_table,
> +			get_order((kvm_vmx->pid_last_index + 1) * sizeof(u64)));
>  }
>  
>  static struct kvm_x86_ops vmx_x86_ops __initdata = {






* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25 14:46   ` Maxim Levitsky
  2022-02-25 14:56     ` David Woodhouse
@ 2022-03-01  8:03     ` Chao Gao
  1 sibling, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-03-01  8:03 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

On Fri, Feb 25, 2022 at 04:46:31PM +0200, Maxim Levitsky wrote:
>On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
>> From: Maxim Levitsky <mlevitsk@redhat.com>
>> 
>> No normal guest has any reason to change physical APIC IDs, and
>> allowing this introduces bugs into APIC acceleration code.
>> 
>> And recent Intel hardware just ignores writes to APIC_ID in
>> xAPIC mode. More background can be found at:
>> https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/
>> 
>> There seems to be little value in supporting a writable xAPIC ID in
>> the guest, except for some old and crazy use cases which would
>> probably fail on real hardware anyway. So, make the xAPIC ID
>> read-only for KVM guests.
>> 
>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>
>Assuming that this is approved and accepted upstream,
>that is even better that my proposal of doing this
>when APICv is enabled.

Sean & Paolo

What's your opinion? Shall we make the xAPIC ID read-only unconditionally,
only when enable_apicv is enabled, or use a parameter to control
it, as Sean suggested at:

https://lore.kernel.org/lkml/Yfw5ddGNOnDqxMLs@google.com/

Intel SDM Vol3 10.4.6 "Local APIC ID" says:

	In MP systems, the local APIC ID is also used as a processor ID by the
	BIOS and the operating system. Some processors permit software to
	modify the APIC ID. However, the ability of software to modify the APIC
	ID is processor model specific. Because of this, operating system
	software should avoid writing to the local APIC ID register.

So, we think it is fine to make the xAPIC ID always read-only. Instead of
supporting a writable xAPIC ID in KVM guests, it may be better to fix software
that modifies the local APIC ID.

Intel IPI virtualization and AVIC are two cases that require special
handling if the xAPIC ID is writable. But it isn't worth the effort and
is error-prone (e.g., AVIC is broken if the guest changes its APIC ID).

>
>Since now apic id is always read only, now we should not 
>forget to clean up some parts of kvm like kvm_recalculate_apic_map,
>which are not needed anymore.

Do you mean marking apic_map as DIRTY isn't needed in some cases?
Or that some cleanup should be done inside kvm_recalculate_apic_map()?

For the former, I think we can ...

>
>Best regards,
>	Maxim Levitsky
>
>> ---
>>  arch/x86/kvm/lapic.c | 25 ++++++++++++++++++-------
>>  1 file changed, 18 insertions(+), 7 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index e4bcdab1fac0..b38288c8a94f 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2044,10 +2044,17 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>>  
>>  	switch (reg) {
>>  	case APIC_ID:		/* Local APIC ID */
>> -		if (!apic_x2apic_mode(apic))
>> -			kvm_apic_set_xapic_id(apic, val >> 24);
>> -		else
>> +		if (apic_x2apic_mode(apic)) {
>>  			ret = 1;
>> +			break;
>> +		}
>> +		/* Don't allow changing APIC ID to avoid unexpected issues */
>> +		if ((val >> 24) != apic->vcpu->vcpu_id) {
>> +			kvm_vm_bugged(apic->vcpu->kvm);
>> +			break;
>> +		}
>> +
>> +		kvm_apic_set_xapic_id(apic, val >> 24);

... drop the kvm_apic_set_xapic_id().


* Re: [PATCH v6 7/9] KVM: VMX: enable IPI virtualization
  2022-02-25 17:19   ` Maxim Levitsky
@ 2022-03-01  9:21     ` Chao Gao
  2022-03-02  6:45       ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Chao Gao @ 2022-03-01  9:21 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

>> +static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
>> +{
>> +	return irqchip_in_kernel(kvm) && enable_ipiv;
>> +}
>> +
>> +static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
>> +{
>> +	return vmx_can_use_ipiv_pi(kvm) || vmx_can_use_vtd_pi(kvm);
>
>It took me a while to figure that out.
> 
>vmx_can_use_vtd_pi returns true when the VM can be targeted by posted
>interrupts from the IOMMU, which leads to
> 
>1. update of the NV vector and SN bit on vcpu_load/vcpu_put to let the
>IOMMU know where the vCPU really runs.
> 
>2. in vmx_pi_update_irte to configure the posted interrupts.
> 
> 
>Now IPIv will also use the same NV vector and SN bit for IPI virtualization,
>thus they have to be kept up to date on vcpu load/put.
> 
>I would appreciate a comment about this in vmx_can_use_posted_interrupts,
>because "posted interrupts" can mean too many things, e.g. a host->guest
>posted interrupt, which is delivered as an ordinary interrupt.
> 
>Maybe also rename the function to something like
> 
>vmx_need_up_to_date_nv_sn(). Sounds silly to me so
>maybe something else.

It makes sense.

Will add a comment and rename the function.
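For illustration, a minimal standalone sketch of what the renamed, commented
helper could look like. The name vmx_need_pi_nv_sn_update() and the plain-bool
modelling of the VM state are assumptions for this sketch, not actual KVM code:

```c
#include <stdbool.h>

/* Whether this VM can be a target of IPI virtualization. */
static bool vmx_can_use_ipiv_pi(bool irqchip_in_kernel, bool enable_ipiv)
{
	return irqchip_in_kernel && enable_ipiv;
}

/*
 * True when the posted-interrupt descriptor's notification vector (NV)
 * and suppress-notification (SN) bit must be kept up to date across
 * vcpu_load()/vcpu_put(): with either IPIv or VT-d posted interrupts,
 * hardware can post to the descriptor while KVM is not running the vCPU.
 */
static bool vmx_need_pi_nv_sn_update(bool irqchip_in_kernel,
				     bool enable_ipiv, bool vtd_pi)
{
	return vmx_can_use_ipiv_pi(irqchip_in_kernel, enable_ipiv) || vtd_pi;
}
```

Either source (IPIv or VT-d PI) alone is enough to require the update, which
is the point of the comment requested above.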

>
>> @@ -4219,14 +4229,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>>  
>>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>>  	if (cpu_has_secondary_exec_ctrls()) {
>> -		if (kvm_vcpu_apicv_active(vcpu))
>> +		if (kvm_vcpu_apicv_active(vcpu)) {
>>  			secondary_exec_controls_setbit(vmx,
>>  				      SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>  				      SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> -		else
>> +			if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
>> +				tertiary_exec_controls_setbit(vmx,
>> +						TERTIARY_EXEC_IPI_VIRT);
>> +		} else {
>>  			secondary_exec_controls_clearbit(vmx,
>>  					SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>  					SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> +			if (cpu_has_tertiary_exec_ctrls())
>> +				tertiary_exec_controls_clearbit(vmx,
>> +						TERTIARY_EXEC_IPI_VIRT);
>> +		}
>
>Why check for cpu_has_tertiary_exec_ctrls()? Wouldn't it always be true?
>(enable_ipiv has to be forced to false if the CPU doesn't support IPIv,
>and if it does support IPIv, it will support tertiary exec controls.)

Yes. Checking enable_ipiv in the two if()s above is enough.

>
>I don't mind this as a precaution + consistency with other code.
>
>
>>  	}
>>  
>>  	vmx_update_msr_bitmap_x2apic(vcpu);
>> @@ -4260,7 +4277,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
>>  
>>  static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
>>  {
>> -	return vmcs_config.cpu_based_3rd_exec_ctrl;
>> +	u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
>> +
>> +	/*
>> +	 * IPI virtualization relies on APICv. Disable IPI
>> +	 * virtualization if APICv is inhibited.
>> +	 */
>> +	if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
>> +		exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
>
>I am not 100% sure, but kvm_vcpu_apicv_active might not be the
>best thing to check here, as it reflects per-cpu dynamic APICv inhibit.
>
>It probably works, but it might be better to use enable_apicv
>here and rely on the normal APICv inhibit, and there inhibit IPIv as well,
>as you do in vmx_refresh_apicv_exec_ctrl().
>

What's the difference between normal and per-cpu dynamic APICv inhibit?

>
>
>> +
>> +	return exec_control;
>>  }
>>  
>>  /*
>> @@ -4412,6 +4438,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>  
>>  static void init_vmcs(struct vcpu_vmx *vmx)
>>  {
>> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
>> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
>> +
>>  	if (nested)
>>  		nested_vmx_set_vmcs_shadowing_bitmap();
>>  
>> @@ -4431,7 +4460,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>  	if (cpu_has_tertiary_exec_ctrls())
>>  		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
>>  
>> -	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
>> +	if (kvm_vcpu_apicv_active(vcpu)) {
>
>here too (pre-existing), I am also not 100% sure that kvm_vcpu_apicv_active
>should be used. I haven't studied the APICv code enough to be 100% sure.

I think kvm_vcpu_apicv_active is better.

The question is: if the CPU supports a VMX feature (APICv) but it isn't
enabled now, is it allowed to configure the VMCS fields defined by the
feature?  Would the CPU ignore the values written to those fields, or retain
them after the feature is enabled later?

Personally, I think KVM shouldn't rely on the CPU's behavior in this case, so
it is better for KVM to write the VMCS fields below only if APICv is enabled.

>
>
>>  		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>>  		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>>  		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>> @@ -4441,6 +4470,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>  
>>  		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>> +
>> +		if (enable_ipiv) {
>> +			WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
>> +				__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
>> +			vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
>> +			vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
>> +		}
>>  	}
>>  
>>  	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
>> @@ -4492,7 +4528,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>  		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>>  	}
>>  
>> -	vmx_write_encls_bitmap(&vmx->vcpu, NULL);
>> +	vmx_write_encls_bitmap(vcpu, NULL);
>
>I might have separated the refactoring of using vcpu instead of &vmx->vcpu
>in a separate patch, but I don't mind that that much.
>
>>  
>>  	if (vmx_pt_mode_is_host_guest()) {
>>  		memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
>> @@ -4508,7 +4544,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>  
>>  	if (cpu_has_vmx_tpr_shadow()) {
>>  		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
>> -		if (cpu_need_tpr_shadow(&vmx->vcpu))
>> +		if (cpu_need_tpr_shadow(vcpu))
>>  			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
>>  				     __pa(vmx->vcpu.arch.apic->regs));
>>  		vmcs_write32(TPR_THRESHOLD, 0);
>> @@ -7165,6 +7201,18 @@ static int vmx_vm_init(struct kvm *kvm)
>>  			break;
>>  		}
>>  	}
>> +
>> +	if (enable_ipiv) {
>> +		struct page *pages;
>> +
>> +		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
>> +		if (!pages)
>> +			return -ENOMEM;
>> +
>> +		to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
>> +		to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
>> +	}
>> +
>>  	return 0;
>>  }
>>  
>> @@ -7756,6 +7804,14 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>>  	return supported & BIT(bit);
>>  }
>>  
>> +static void vmx_vm_destroy(struct kvm *kvm)
>> +{
>> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
>> +
>> +	if (kvm_vmx->pid_table)
>> +		free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
>
>Maybe add a warning checking that ipiv was actually enabled.

Do you mean IPIv was enabled on one of the vCPUs, or just that enable_ipiv is
true?

The former would lead to false positives if QEMU creates a VM and destroys it
without creating any vCPUs.

>Maybe this is overkill.
>
>
>> +}
>> +
>>  static struct kvm_x86_ops vmx_x86_ops __initdata = {
>>  	.name = "kvm_intel",
>>  
>> @@ -7768,6 +7824,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>>  
>>  	.vm_size = sizeof(struct kvm_vmx),
>>  	.vm_init = vmx_vm_init,
>> +	.vm_destroy = vmx_vm_destroy,
>>  
>>  	.vcpu_create = vmx_create_vcpu,
>>  	.vcpu_free = vmx_free_vcpu,
>> @@ -8022,6 +8079,9 @@ static __init int hardware_setup(void)
>>  	if (!enable_apicv)
>>  		vmx_x86_ops.sync_pir_to_irr = NULL;
>>  
>> +	if (!enable_apicv || !cpu_has_vmx_ipiv())
>> +		enable_ipiv = false;
>> +
>>  	if (cpu_has_vmx_tsc_scaling()) {
>>  		kvm_has_tsc_control = true;
>>  		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index d4a647d3ed4a..e7b0c00c9d43 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -365,6 +365,9 @@ struct kvm_vmx {
>>  	unsigned int tss_addr;
>>  	bool ept_identity_pagetable_done;
>>  	gpa_t ept_identity_map_addr;
>> +	/* PID table for IPI virtualization */
>> +	u64 *pid_table;
>> +	u16 pid_last_index;
>>  };
>>  
>>  bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
>
>
>I might have missed something, but overall looks good.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table
  2022-02-25 17:29   ` Maxim Levitsky
@ 2022-03-01  9:23     ` Chao Gao
  0 siblings, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-03-01  9:23 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

On Fri, Feb 25, 2022 at 07:29:39PM +0200, Maxim Levitsky wrote:
>On Fri, 2022-02-25 at 16:22 +0800, Zeng Guang wrote:
>> Currently KVM allocates 8 pages in advance for the Posted-Interrupt
>> Descriptor pointer (PID-pointer) table, to accommodate vCPUs with APIC IDs
>> up to KVM_MAX_VCPU_IDS - 1. This policy wastes memory because most VMs
>> have fewer than 512 vCPUs and thus need just one page.
>> 
>> If the userspace hypervisor specifies the maximum practical vCPU ID prior
>> to vCPU creation, IPIv can allocate only the essential memory for the
>> PID-pointer table and reduce the memory footprint of VMs.
>> 
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
>> ---
>>  arch/x86/kvm/vmx/vmx.c | 45 ++++++++++++++++++++++++++++--------------
>>  1 file changed, 30 insertions(+), 15 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 0cb141c277ef..22bfb4953289 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -230,9 +230,6 @@ static const struct {
>>  };
>>  
>>  #define L1D_CACHE_ORDER 4
>> -
>> -/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
>> -#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
>>  #define PID_TABLE_ENTRY_VALID 1
>>  
>>  static void *vmx_l1d_flush_pages;
>> @@ -4434,6 +4431,24 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>  	return exec_control;
>>  }
>>  
>> +static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx)
>> +{
>> +	struct page *pages;
>> +
>> +	if (kvm_vmx->pid_table)
>> +		return 0;
>> +
>> +	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>> +			get_order(kvm_vmx->kvm.arch.max_vcpu_id * sizeof(u64)));
>> +
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	kvm_vmx->pid_table = (void *)page_address(pages);
>> +	kvm_vmx->pid_last_index = kvm_vmx->kvm.arch.max_vcpu_id - 1;
>> +	return 0;
>> +}
>> +
>>  #define VMX_XSS_EXIT_BITMAP 0
>>  
>>  static void init_vmcs(struct vcpu_vmx *vmx)
>> @@ -7159,6 +7174,16 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>>  			goto free_vmcs;
>>  	}
>>  
>> +	if (enable_ipiv && kvm_vcpu_apicv_active(vcpu)) {
>> +		struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
>> +
>> +		mutex_lock(&vcpu->kvm->lock);
>> +		err = vmx_alloc_pid_table(kvm_vmx);
>> +		mutex_unlock(&vcpu->kvm->lock);
>> +		if (err)
>> +			goto free_vmcs;
>> +	}
>
>This could be dangerous. If APICv is temporarily inhibited,
>this code won't run and we will end up without a PID table.
>
>I think that kvm_vcpu_apicv_active should be just dropped
>from this condition.

Agreed. Will fix it.
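The sizing math behind this patch can be sanity-checked with a toy user-space
model of the kernel's get_order(); the helper names here are illustrative.
Each PID-pointer table entry is a u64, so a VM capped at 512 vCPU IDs fits in
a single 4 KiB page (order 0), while sizing for 4096 IDs (the figure implied
by the 8-page allocation in the commit message) needs order 3, i.e. eight
pages:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Toy stand-in for the kernel's get_order(): smallest order such that
 * (PAGE_SIZE << order) >= size. */
static int get_order_model(size_t size)
{
	int order = 0;
	size_t span = PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Allocation order for a PID-pointer table of max_vcpu_ids 8-byte entries. */
static int pid_table_order(unsigned int max_vcpu_ids)
{
	return get_order_model((size_t)max_vcpu_ids * sizeof(uint64_t));
}
```

Note the rounding: a cap of 513 IDs already spills into an order-1 (two-page)
allocation, which is why sizing to the userspace-provided maximum, rather
than KVM_MAX_VCPU_IDS, is where the savings come from.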

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 7/9] KVM: VMX: enable IPI virtualization
  2022-03-01  9:21     ` Chao Gao
@ 2022-03-02  6:45       ` Chao Gao
  0 siblings, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-03-02  6:45 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

>>>  static void init_vmcs(struct vcpu_vmx *vmx)
>>>  {
>>> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
>>> +	struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
>>> +
>>>  	if (nested)
>>>  		nested_vmx_set_vmcs_shadowing_bitmap();
>>>  
>>> @@ -4431,7 +4460,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>>  	if (cpu_has_tertiary_exec_ctrls())
>>>  		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
>>>  
>>> -	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
>>> +	if (kvm_vcpu_apicv_active(vcpu)) {
>>
>>here too (pre-existing), I am also not 100% sure that kvm_vcpu_apicv_active
>>should be used. I haven't studied the APICv code enough to be 100% sure.
>

On second thought, I think you are correct. The VMCS fields below
(i.e., EOI_EXIT_BITMAP0/1/2, POSTED_INTR_NV/DESC_ADDR) should be configured
as long as the VM can enable APICv, particularly since
vmx_refresh_apicv_exec_ctrl() doesn't configure these VMCS fields when APICv
gets activated.

This is a latent bug in KVM. We will fix it with a separate patch.

>I think kvm_vcpu_apicv_active is better.
>
>The question is: if the CPU supports a VMX feature (APICv) but it isn't
>enabled now, is it allowed to configure the VMCS fields defined by the
>feature?  Would the CPU ignore the values written to those fields, or retain
>them after the feature is enabled later?

This concern is invalid. The SDM doesn't mention any ordering requirement
between configuring a feature's VM-execution control bit and the other VMCS
fields introduced for the feature. Please disregard my original remark.

>
>Personally, I think KVM shouldn't rely on the CPU's behavior in this case, so
>it is better for KVM to write the VMCS fields below only if APICv is enabled.
>
>>
>>
>>>  		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>>>  		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>>>  		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>>> @@ -4441,6 +4470,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>>  
>>>  		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>>>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
>>> +
>>> +		if (enable_ipiv) {
>>> +			WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
>>> +				__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
>>> +			vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
>>> +			vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
>>> +		}
>>>  	}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-02-25  8:22 ` [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
  2022-02-25 14:46   ` Maxim Levitsky
@ 2022-03-08 23:04   ` Sean Christopherson
  2022-03-09  5:21     ` Chao Gao
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2022-03-08 23:04 UTC (permalink / raw)
  To: Zeng Guang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, Dave Hansen, Tony Luck, Kan Liang,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Kim Phillips, Jarkko Sakkinen, Jethro Beekman, Kai Huang, x86,
	linux-kernel, Robert Hu, Gao Chao, Maxim Levitsky

On Fri, Feb 25, 2022, Zeng Guang wrote:
> From: Maxim Levitsky <mlevitsk@redhat.com>
> 
> No normal guest has any reason to change physical APIC IDs,

I don't think we can reasonably assume this, my analysis in the link (that I just
realized I deleted from context here) shows it's at least plausible that an existing
guest could rely on the APIC ID being writable.  And that's just one kernel, who
knows what else is out there, especially given that people use KVM to emulate really
old stuff, often on really old hardware.

Practically speaking, anyone that wants to deploy IPIv is going to have to make
the switch at some point, but that doesn't help people running legacy crud that
don't care about IPIv.

I was thinking a module param would be trivial, and it is (see below) if the
param is off by default.  A module param will also provide a convenient opportunity
to resolve the loophole reported by Maxim[1][2], though it's a bit funky.

Anyways, with an off-by-default module param, we can just do:

	if (!enable_apicv || !cpu_has_vmx_ipiv() || !xapic_id_readonly)
		enable_ipiv = false;

Forcing userspace to take advantage of IPIv is rather annoying, but it's not the
end of the world.

Having the param on by default is a mess.  Either we break userspace (above), or
we only kinda break userspace by having it on iff IPIv is on, but then we end up
with cyclical dependency hell.  E.g. userspace makes xAPIC ID writable and forces
on IPIv, which one "wins"?  And if it's on by default, we can't fix the loophole
in KVM_SET_LAPIC.

If we really wanted to have it on by default, we could have a Kconfig and make
_that_ off by default, e.g.

  static bool __read_mostly xapic_id_readonly = IS_ENABLED(CONFIG_KVM_XAPIC_ID_RO);

but that seems like overkill.  If a kernel owner knows they want the param on,
it should be easy enough to force it without a Kconfig.

So I think my vote would be for something like this?  Compile tested only...

---
 arch/x86/kvm/lapic.c    | 14 +++++++++-----
 arch/x86/kvm/svm/avic.c |  5 +++++
 arch/x86/kvm/x86.c      |  4 ++++
 arch/x86/kvm/x86.h      |  1 +
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c4c3155d98db..2c01cd45fb18 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2043,7 +2043,7 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)

 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
+		if (!apic_x2apic_mode(apic) && !xapic_id_readonly)
 			kvm_apic_set_xapic_id(apic, val >> 24);
 		else
 			ret = 1;
@@ -2634,10 +2634,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		u32 *ldr = (u32 *)(s->regs + APIC_LDR);
 		u64 icr;

-		if (vcpu->kvm->arch.x2apic_format) {
-			if (*id != vcpu->vcpu_id)
-				return -EINVAL;
-		} else {
+		if (!vcpu->kvm->arch.x2apic_format) {
 			if (set)
 				*id >>= 24;
 			else
@@ -2650,6 +2647,10 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		 * split to ICR+ICR2 in userspace for backwards compatibility.
 		 */
 		if (set) {
+			if ((vcpu->kvm->arch.x2apic_format || xapic_id_readonly) &&
+			    (*id != vcpu->vcpu_id))
+				return -EINVAL;
+
 			*ldr = kvm_apic_calc_x2apic_ldr(*id);

 			icr = __kvm_lapic_get_reg(s->regs, APIC_ICR) |
@@ -2659,6 +2660,9 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 			icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
 			__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
 		}
+	} else if (set && xapic_id_readonly &&
+		   (__kvm_lapic_get_reg(s->regs, APIC_ID) >> 24) != vcpu->vcpu_id) {
+		return -EINVAL;
 	}

 	return 0;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index b37b353ec086..4a031d9686c2 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -442,6 +442,11 @@ static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 id = kvm_xapic_id(vcpu->arch.apic);

+	if (xapic_id_readonly && id != vcpu->vcpu_id) {
+		kvm_prepare_emulation_failure_exit(vcpu);
+		return 0;
+	}
+
 	if (vcpu->vcpu_id == id)
 		return 0;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4fa4d8269e5b..67706d468ed3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -177,6 +177,10 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 static int __read_mostly lapic_timer_advance_ns = -1;
 module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);

+bool __read_mostly xapic_id_readonly;
+module_param(xapic_id_readonly, bool, 0444);
+EXPORT_SYMBOL_GPL(xapic_id_readonly);
+
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index aa86abad914d..89f40c921c08 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -302,6 +302,7 @@ static inline bool kvm_mpx_supported(void)
 extern unsigned int min_timer_period_us;

 extern bool enable_vmware_backdoor;
+extern bool xapic_id_readonly;

 extern int pi_inject_timer;


base-commit: 1e147f6f90668f2c2b57406d451f0cfcd2ba19d0
--


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-08 23:04   ` Sean Christopherson
@ 2022-03-09  5:21     ` Chao Gao
  2022-03-09  6:01       ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Chao Gao @ 2022-03-09  5:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu, Maxim Levitsky

On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
>On Fri, Feb 25, 2022, Zeng Guang wrote:
>> From: Maxim Levitsky <mlevitsk@redhat.com>
>> 
>> No normal guest has any reason to change physical APIC IDs,
>
>I don't think we can reasonably assume this, my analysis in the link (that I just
>realized I deleted from context here) shows it's at least plausible that an existing
>guest could rely on the APIC ID being writable.  And that's just one kernel, who
>know what else is out there, especially given that people use KVM to emulate really
>old stuff, often on really old hardware.

Making the xAPIC ID read-only is based not only on your analysis, but also on
the Intel SDM clearly saying that a writable xAPIC ID is processor-model
specific and that ***software should avoid writing to the xAPIC ID***.

If writable xAPIC ID support should be retained and is tied to a module param,
live migration would depend on KVM's module params: e.g., migrating a VM with
a modified xAPIC ID (xapic_id_readonly off on the source) to a host with
xapic_id_readonly on would fail, right? Is this failure desired? If not, we
need a VM-scope control: e.g., add an inhibitor of APICv (XAPIC_ID_MODIFIED)
and disable APICv forever for this VM if its vCPUs or QEMU modify the xAPIC ID.
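A rough sketch of the VM-scope inhibitor idea. The bitmask shape mirrors
KVM's existing APICv-inhibit style, but the reason names and all helpers here
are hypothetical; XAPIC_ID_MODIFIED is the reason being proposed, not
something merged at the time of this thread:

```c
#include <stdbool.h>

/* Hypothetical inhibit reasons; XAPIC_ID_MODIFIED is the one proposed
 * above for VMs whose vCPUs (or userspace) rewrite the xAPIC ID. */
enum apicv_inhibit {
	APICV_INHIBIT_DISABLED		= 0,
	APICV_INHIBIT_XAPIC_ID_MODIFIED	= 1,
};

struct vm_model {
	unsigned long apicv_inhibit_reasons;	/* bitmask of set reasons */
};

/* Mark a reason; once set here it is never cleared, i.e. "forever". */
static void apicv_set_inhibit(struct vm_model *vm, enum apicv_inhibit reason)
{
	vm->apicv_inhibit_reasons |= 1UL << reason;
}

/* APICv (and hence IPIv) stays active only while no reason is set. */
static bool apicv_active(const struct vm_model *vm)
{
	return vm->apicv_inhibit_reasons == 0;
}
```

Because the control is per-VM rather than a module param, a VM that never
touches its xAPIC ID keeps APICv/IPIv, and migration no longer depends on the
destination host's module parameters.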

>
>Practically speaking, anyone that wants to deploy IPIv is going to have to make
>the switch at some point, but that doesn't help people running legacy crud that
>don't care about IPIv.
>
>I was thinking a module param would be trivial, and it is (see below) if the
>param is off by default.  A module param will also provide a convenient opportunity
>to resolve the loophole reported by Maxim[1][2], though it's a bit funky.

Could you share the links?

>
>Anyways, with an off-by-default module param, we can just do:
>
>	if (!enable_apicv || !cpu_has_vmx_ipiv() || !xapic_id_readonly)
>		enable_ipiv = false;
>
>Forcing userspace to take advantage of IPIv is rather annoying, but it's not the
>end of the world.
>
>Having the param on by default is a mess.  Either we break userspace (above), or
>we only kinda break userspace by having it on iff IPIv is on, but then we end up
>with cyclical dependency hell.  E.g. userspace makes xAPIC ID writable and forces
>on IPIv, which one "wins"? And if it's on by default, we can't fix the loophole
>in KVM_SET_LAPIC.

We are fine with having this param off by default.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-09  5:21     ` Chao Gao
@ 2022-03-09  6:01       ` Sean Christopherson
  2022-03-09 12:59         ` Maxim Levitsky
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2022-03-09  6:01 UTC (permalink / raw)
  To: Chao Gao
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu, Maxim Levitsky

TL;DR: Maxim, any objection to yet another inhibit?  Any potential issues you can think of?

On Wed, Mar 09, 2022, Chao Gao wrote:
> On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
> >On Fri, Feb 25, 2022, Zeng Guang wrote:
> >> From: Maxim Levitsky <mlevitsk@redhat.com>
> >> 
> >> No normal guest has any reason to change physical APIC IDs,
> >
> >I don't think we can reasonably assume this, my analysis in the link (that I just
> >realized I deleted from context here) shows it's at least plausible that an existing
> >guest could rely on the APIC ID being writable.  And that's just one kernel, who
> >know what else is out there, especially given that people use KVM to emulate really
> >old stuff, often on really old hardware.
> 
> Making the xAPIC ID read-only is based not only on your analysis, but also on
> the Intel SDM clearly saying that a writable xAPIC ID is processor-model
> specific and that ***software should avoid writing to the xAPIC ID***.

Intel isn't the only vendor KVM supports, and the xAPIC ID is fully writable
according to AMD's docs and AMD's hardware.  The x2APIC ID is even
(indirectly) writable, but luckily KVM has never modeled that...

Don't get me wrong, I would love to make xAPIC ID read-only, and I fully agree
that the probability of breaking someone's setup is very low, I just don't think
the benefits of forcing it are worth the risk of breaking userspace.

> If writable xAPIC ID support should be retained and is tied to a module param,
> live migration would depend on KVM's module params: e.g., migrating a VM with
> a modified xAPIC ID (xapic_id_readonly off on the source) to a host with
> xapic_id_readonly on would fail, right? Is this failure desired?

Hrm, I was originally thinking it's not a terrible outcome, but I was assuming
that userspace would gracefully handle migration failure.  That's a bad assumption.

> If not, we need a VM-scope control: e.g., add an inhibitor of APICv
> (XAPIC_ID_MODIFIED) and disable APICv forever for this VM if its vCPUs or
> QEMU modify the xAPIC ID.

Inhibiting APICv if IPIv is enabled (implied for AMD's AVIC) is probably a better
option than a module param.  I was worried about ending up with silently degraded
VM performance, but that's easily solved by adding a stat to track APICv inhibitions,
which would be useful for other cases too (getting AMD's AVIC enabled is comically
difficult).

That would also let us drop the buggy avic_handle_apic_id_update().

And it wouldn't necessarily have to be forever, though I agree that's a perfectly
fine approach until we have data that shows anything fancier is necessary.

> >Practically speaking, anyone that wants to deploy IPIv is going to have to make
> >the switch at some point, but that doesn't help people running legacy crud that
> >don't care about IPIv.
> >
> >I was thinking a module param would be trivial, and it is (see below) if the
> >param is off by default.  A module param will also provide a convenient opportunity
> >to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
> 
> Could you share the links?

Doh, sorry (they're both in this one).

https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-09  6:01       ` Sean Christopherson
@ 2022-03-09 12:59         ` Maxim Levitsky
  2022-03-11  4:26           ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-03-09 12:59 UTC (permalink / raw)
  To: Sean Christopherson, Chao Gao
  Cc: Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Tony Luck,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Kai Huang, x86, linux-kernel, Robert Hu

On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> TL;DR: Maxim, any objection to yet another inhibit?  Any potential issues you can think of?
> 
> On Wed, Mar 09, 2022, Chao Gao wrote:
> > On Tue, Mar 08, 2022 at 11:04:01PM +0000, Sean Christopherson wrote:
> > > On Fri, Feb 25, 2022, Zeng Guang wrote:
> > > > From: Maxim Levitsky <mlevitsk@redhat.com>
> > > > 
> > > > No normal guest has any reason to change physical APIC IDs,
> > > 
> > > I don't think we can reasonably assume this, my analysis in the link (that I just
> > > realized I deleted from context here) shows it's at least plausible that an existing
> > > guest could rely on the APIC ID being writable.  And that's just one kernel, who
> > > knows what else is out there, especially given that people use KVM to emulate really
> > > old stuff, often on really old hardware.
> > 
> > Making the xAPIC ID read-only is based not only on your analysis, but also
> > on the Intel SDM clearly saying that a writable xAPIC ID is processor-model
> > specific and that ***software should avoid writing to the xAPIC ID***.
> 
> Intel isn't the only vendor KVM supports, and the xAPIC ID is fully writable
> according to AMD's docs and AMD's hardware.  The x2APIC ID is even
> (indirectly) writable, but luckily KVM has never modeled that...
> 
> Don't get me wrong, I would love to make xAPIC ID read-only, and I fully agree
> that the probability of breaking someone's setup is very low, I just don't think
> the benefits of forcing it are worth the risk of breaking userspace.
> 
> > If writable xAPIC ID support should be retained and is tied to a module param,
> > live migration would depend on KVM's module params: e.g., migrating a VM with
> > a modified xAPIC ID (xapic_id_readonly off on the source) to a host with
> > xapic_id_readonly on would fail, right? Is this failure desired?
> 
> Hrm, I was originally thinking it's not a terrible outcome, but I was assuming
> that userspace would gracefully handle migration failure.  That's a bad assumption.
> 
> > If not, we need a VM-scope control: e.g., add an inhibitor of APICv
> > (XAPIC_ID_MODIFIED) and disable APICv forever for this VM if its vCPUs or
> > QEMU modify the xAPIC ID.
> 
> Inhibiting APICv if IPIv is enabled (implied for AMD's AVIC) is probably a better
> option than a module param.  I was worried about ending up with silently degraded
> VM performance, but that's easily solved by adding a stat to track APICv inhibitions,
> which would be useful for other cases too (getting AMD's AVIC enabled is comically
> difficult).
> 
> That would also let us drop the buggy avic_handle_apic_id_update().
> 
> And it wouldn't necessarily have to be forever, though I agree that's a perfectly
> fine approach until we have data that shows anything fancier is necessary.
> 
> > > Practically speaking, anyone that wants to deploy IPIv is going to have to make
> > > the switch at some point, but that doesn't help people running legacy crud that
> > > don't care about IPIv.
> > > 
> > > I was thinking a module param would be trivial, and it is (see below) if the
> > > param is off by default.  A module param will also provide a convenient opportunity
> > > to resolve the loophole reported by Maxim[1][2], though it's a bit funky.
> > 
> > Could you share the links?
> 
> Doh, sorry (they're both in this one).
> 
> https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
> 

My opinion on this subject is very simple: we need to draw the line somewhere.
 
There is a balance between poorly supporting unused hardware features and
not supporting them at all.
 
A writable APIC ID is not just some legacy feature like task switches, but a
feature that is frowned upon in both the Intel and AMD manuals:
 
Yes, look at AMD's APM:
 
"16.3.3 Local APIC ID
 
Unique local APIC IDs are assigned to each CPU core in the system. The value is determined by
hardware, based on the number of CPU cores on the processor and the node ID of the processor.
 
The APIC ID is located in the APIC ID register at APIC offset 20h. See Figure 16-3. It is model
dependent, whether software can modify the APIC ID Register. The initial value of the APIC ID (after
a reset) is the value returned in CPUID function 0000_0001h_EBX[31:24]."
 
 
Also, in section '16.12 x2APIC_ID' of the APM it is mentioned:
 
 
RDMSR. An RDMSR of MSR 0802h returns the x2APIC_ID in EAX[31:0]. The x2APIC_ID is a
read-only register. 
 
Attempting to write MSR 802h or attempting to read this MSR when not in x2APIC
mode causes a #GP(0) exception. See 16.11 “Accessing x2APIC Registers”.
 
 
CPUID. The x2APIC ID is reported by the following CPUID functions Fn0000_000B (Extended
Topology Enumeration) and CPUID Fn8000_001E (Extended APIC ID) as follows:
 
 
From this you can also infer that the x2APIC ID is assigned once at boot, and the
same value is reported by CPUID and by MSR 0x802.
 
 
The fact that one can outsmart the microcode, change the APIC ID, and then switch
to x2APIC mode is more like a microcode bug than a feature IMHO, one that nobody
bothered to fix since nobody uses it.
 
 
Sean, please don't get me wrong, I used to think differently on this - 
I implemented PDPTR migration in KVM, although in retrospect I probably shouldn't have.

I also understand your concerns, and I am not going to fight over this; a module
param for a read-only APIC ID will work for me.
 
All I wanted to do was make KVM better by simplifying it - KVM is already as complex
as it can get, and anything that makes it simpler is welcome IMHO.
 
 
Best regards,
	Maxim Levitsky
 
 
 
 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-09 12:59         ` Maxim Levitsky
@ 2022-03-11  4:26           ` Sean Christopherson
       [not found]             ` <29c76393-4884-94a8-f224-08d313b73f71@intel.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2022-03-11  4:26 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Chao Gao, Zeng Guang, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen,
	Tony Luck, Kan Liang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Kim Phillips, Jarkko Sakkinen,
	Jethro Beekman, Kai Huang, x86, linux-kernel, Robert Hu

On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > Could you share the links?
> > 
> > Doh, sorry (they're both in this one).
> > 
> > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
> > 
> 
> My opinion on this subject is very simple: we need to draw the line somewhere.

...

> I also understand your concerns - and I am not going to fight over this, a module
> param for read only apic id, will work for me.

Sadly, I don't think a module param would actually help.  I was thinking it would
avoid breakage by allowing for graceful fallback on migration failure, but that
was wishful thinking.  An inhibit seems like the least awful idea if we don't end
up making it unconditionally readonly.

> All I wanted to do is to make KVM better by simplifying it - KVM is already
> as complex as it can get, anything to make it simpler is welcome IMHO.

I agree that simplifying KVM is a goal, and that we need to decide when enough is
enough.  But we also can't break userspace or existing deployments; that's a very
clearly drawn line in Linux.

My biggest worry is that, unlike the KVM_SET_CPUID2 breakage, which was obvious
and came relatively quick, this could cause breakage at the worst possible time
(migration) months or years down the road.

Since the goal is to simplify KVM, can we try the inhibit route and see what the
code looks like before making a decision?  I think it might actually yield a less
awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
we don't try to remove the inhibit on subsequent changes.

Killing the VM, as proposed, is very user-unfriendly, as the user will have no idea
why the VM was killed.  WARN is out of the question because this is user-triggerable.
Returning an emulation error would be ideal, but getting that result up through
apic_mmio_write() could be annoying and end up being more complex.

The touchpoints will all be the same; unless I'm missing something, the only
difference should be a call to set an inhibit instead of killing the VM.


* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
       [not found]             ` <29c76393-4884-94a8-f224-08d313b73f71@intel.com>
@ 2022-03-13  9:19               ` Maxim Levitsky
  2022-03-13 10:59                 ` Maxim Levitsky
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-03-13  9:19 UTC (permalink / raw)
  To: Zeng Guang, Sean Christopherson
  Cc: Gao, Chao, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck, Tony,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
> 
> On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > Could you share the links?
> > > > 
> > > > Doh, sorry (they're both in this one).
> > > > 
> > > > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
> > > > 
> > > > 
> > > 
> > > My opinion on this subject is very simple: we need to draw the line somewhere.
> > 
> > ...
> > 
> > 
> > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > code looks like before making a decision?  I think it might actually yield a less
> > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > we don't try to remove the inhibit on subsequent changes.
> > 
> > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > why the VM was killed.  WARN is out of the question because this is user triggerable.
> > Returning an emulation error would be ideal, but getting that result up through
> > apic_mmio_write() could be annoying and end up being more complex.
> > 
> > The touchpoints will all be the same, unless I'm missing something the difference
> > should only be a call to set an inhibit instead killing the VM.
> 
> Introduce an inhibition, APICV_INHIBIT_REASON_APICID_CHG, to deactivate
> APICv once the guest tries to change the APIC ID in xAPIC mode, with the same
> sanity check in KVM_{SET,GET}_LAPIC for live migration. The VM will stay
> alive but obviously loses the benefit of hardware acceleration this way.
> 
> So what do you think of a proposal like this?
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6dcccb304775..30d825c069be 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
>  #define APICV_INHIBIT_REASON_X2APIC    5
>  #define APICV_INHIBIT_REASON_BLOCKIRQ  6
>  #define APICV_INHIBIT_REASON_ABSENT    7
> +#define APICV_INHIBIT_REASON_APICID_CHG 8
> 
>  struct kvm_arch {
>         unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 22929b5b3f9b..66cd54fa4515 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> 
>         switch (reg) {
>         case APIC_ID:           /* Local APIC ID */
> -               if (!apic_x2apic_mode(apic))
> -                       kvm_apic_set_xapic_id(apic, val >> 24);
> -               else
> +               if (apic_x2apic_mode(apic)) {
>                         ret = 1;
> +                       break;
> +               }
> +               /*
> +                * If changing APIC ID with any APIC acceleration enabled,
> +                * deactivate APICv to avoid unexpected issues.
> +                */
> +               if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> +                       kvm_request_apicv_update(apic->vcpu->kvm,
> +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> +
> +               kvm_apic_set_xapic_id(apic, val >> 24);
>                 break;
> 
>         case APIC_TASKPRI:
> @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>  static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>                 struct kvm_lapic_state *s, bool set)
>  {
> -       if (apic_x2apic_mode(vcpu->arch.apic)) {
> -               u32 *id = (u32 *)(s->regs + APIC_ID);
> -               u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> -               u64 icr;
> +       u32 *id = (u32 *)(s->regs + APIC_ID);
> +       u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> +       u64 icr;
> +       if (!apic_x2apic_mode(vcpu->arch.apic)) {
> +               /*
> +                * If APIC ID changed with any APIC acceleration enabled,
> +                * deactivate APICv to avoid unexpected issues.
> +                */
> +               if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> +                       kvm_request_apicv_update(vcpu->kvm,
> +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> +       } else {
>                 if (vcpu->kvm->arch.x2apic_format) {
>                         if (*id != vcpu->vcpu_id)
>                                 return -EINVAL;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 82d56f8055de..f78754bdc1d0 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
>                           BIT(APICV_INHIBIT_REASON_IRQWIN) |
>                           BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
>                           BIT(APICV_INHIBIT_REASON_X2APIC) |
> -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> 
>         return supported & BIT(bit);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7beba7a9f247..91265f0784bd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>         ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>                           BIT(APICV_INHIBIT_REASON_ABSENT) |
>                           BIT(APICV_INHIBIT_REASON_HYPERV) |
> -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> 
>         return supported & BIT(bit);
>  }
> 
> 
> 

This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
because migration happens.

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-13  9:19               ` Maxim Levitsky
@ 2022-03-13 10:59                 ` Maxim Levitsky
  2022-03-13 13:53                   ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-03-13 10:59 UTC (permalink / raw)
  To: Zeng Guang, Sean Christopherson
  Cc: Gao, Chao, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck, Tony,
	Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
> On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
> > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > > Could you share the links?
> > > > > 
> > > > > Doh, sorry (they're both in this one).
> > > > > 
> > > > > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
> > > > > 
> > > > > 
> > > > 
> > > > My opinion on this subject is very simple: we need to draw the line somewhere.
> > > 
> > > ...
> > > 
> > > 
> > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > > code looks like before making a decision?  I think it might actually yield a less
> > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > > we don't try to remove the inhibit on subsequent changes.
> > > 
> > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > > why the VM was killed.  WARN is out of the question because this is user triggerable.
> > > Returning an emulation error would be ideal, but getting that result up through
> > > apic_mmio_write() could be annoying and end up being more complex.
> > > 
> > > The touchpoints will all be the same, unless I'm missing something the difference
> > > should only be a call to set an inhibit instead killing the VM.
> > 
> > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
> > APICv once KVM guest would try to change APIC ID in xapic mode, and same
> > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
> > alive but obviously lose benefit from hardware acceleration in this way.
> > 
> > So how do you think the proposal like this ?
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6dcccb304775..30d825c069be 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
> >  #define APICV_INHIBIT_REASON_X2APIC    5
> >  #define APICV_INHIBIT_REASON_BLOCKIRQ  6
> >  #define APICV_INHIBIT_REASON_ABSENT    7
> > +#define APICV_INHIBIT_REASON_APICID_CHG 8
> > 
> >  struct kvm_arch {
> >         unsigned long n_used_mmu_pages;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 22929b5b3f9b..66cd54fa4515 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > 
> >         switch (reg) {
> >         case APIC_ID:           /* Local APIC ID */
> > -               if (!apic_x2apic_mode(apic))
> > -                       kvm_apic_set_xapic_id(apic, val >> 24);
> > -               else
> > +               if (apic_x2apic_mode(apic)) {
> >                         ret = 1;
> > +                       break;
> > +               }
> > +               /*
> > +                * If changing APIC ID with any APIC acceleration enabled,
> > +                * deactivate APICv to avoid unexpected issues.
> > +                */
> > +               if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> > +                       kvm_request_apicv_update(apic->vcpu->kvm,
> > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> > +
> > +               kvm_apic_set_xapic_id(apic, val >> 24);
> >                 break;
> > 
> >         case APIC_TASKPRI:
> > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> >  static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> >                 struct kvm_lapic_state *s, bool set)
> >  {
> > -       if (apic_x2apic_mode(vcpu->arch.apic)) {
> > -               u32 *id = (u32 *)(s->regs + APIC_ID);
> > -               u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > -               u64 icr;
> > +       u32 *id = (u32 *)(s->regs + APIC_ID);
> > +       u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > +       u64 icr;
> > +       if (!apic_x2apic_mode(vcpu->arch.apic)) {
> > +               /*
> > +                * If APIC ID changed with any APIC acceleration enabled,
> > +                * deactivate APICv to avoid unexpected issues.
> > +                */
> > +               if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> > +                       kvm_request_apicv_update(vcpu->kvm,
> > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> > +       } else {
> >                 if (vcpu->kvm->arch.x2apic_format) {
> >                         if (*id != vcpu->vcpu_id)
> >                                 return -EINVAL;
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 82d56f8055de..f78754bdc1d0 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
> >                           BIT(APICV_INHIBIT_REASON_IRQWIN) |
> >                           BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
> >                           BIT(APICV_INHIBIT_REASON_X2APIC) |
> > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > 
> >         return supported & BIT(bit);
> >  }
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7beba7a9f247..91265f0784bd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> >         ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> >                           BIT(APICV_INHIBIT_REASON_ABSENT) |
> >                           BIT(APICV_INHIBIT_REASON_HYPERV) |
> > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > 
> >         return supported & BIT(bit);
> >  }
> > 
> > 
> > 
> 
> This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> because migration happens.

I mean because the L1 guest (the nested guest's host) decided to change its APIC ID,
which it can in theory do at any time, even after the nested guest has started.
Seriously, the only reason a guest has to change its APIC ID is to try to exploit
some security hole.
 
Best regards,
	Maxim Levitsky

> 
> Best regards,
> 	Maxim Levitsky




* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-13 10:59                 ` Maxim Levitsky
@ 2022-03-13 13:53                   ` Chao Gao
  2022-03-13 15:09                     ` Maxim Levitsky
  0 siblings, 1 reply; 41+ messages in thread
From: Chao Gao @ 2022-03-13 13:53 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Sun, Mar 13, 2022 at 12:59:36PM +0200, Maxim Levitsky wrote:
>On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
>> On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
>> > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
>> > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
>> > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
>> > > > > > Could you share the links?
>> > > > > 
>> > > > > Doh, sorry (they're both in this one).
>> > > > > 
>> > > > > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
>> > > > > 
>> > > > > 
>> > > > 
>> > > > My opinion on this subject is very simple: we need to draw the line somewhere.
>> > > 
>> > > ...
>> > > 
>> > > 
>> > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
>> > > code looks like before making a decision?  I think it might actually yield a less
>> > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
>> > > we don't try to remove the inhibit on subsequent changes.
>> > > 
>> > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
>> > > why the VM was killed.  WARN is out of the question because this is user triggerable.
>> > > Returning an emulation error would be ideal, but getting that result up through
>> > > apic_mmio_write() could be annoying and end up being more complex.
>> > > 
>> > > The touchpoints will all be the same, unless I'm missing something the difference
>> > > should only be a call to set an inhibit instead killing the VM.
>> > 
>> > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
>> > APICv once KVM guest would try to change APIC ID in xapic mode, and same
>> > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
>> > alive but obviously lose benefit from hardware acceleration in this way.
>> > 
>> > So how do you think the proposal like this ?
>> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> > index 6dcccb304775..30d825c069be 100644
>> > --- a/arch/x86/include/asm/kvm_host.h
>> > +++ b/arch/x86/include/asm/kvm_host.h
>> > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
>> >  #define APICV_INHIBIT_REASON_X2APIC    5
>> >  #define APICV_INHIBIT_REASON_BLOCKIRQ  6
>> >  #define APICV_INHIBIT_REASON_ABSENT    7
>> > +#define APICV_INHIBIT_REASON_APICID_CHG 8
>> > 
>> >  struct kvm_arch {
>> >         unsigned long n_used_mmu_pages;
>> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> > index 22929b5b3f9b..66cd54fa4515 100644
>> > --- a/arch/x86/kvm/lapic.c
>> > +++ b/arch/x86/kvm/lapic.c
>> > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>> > 
>> >         switch (reg) {
>> >         case APIC_ID:           /* Local APIC ID */
>> > -               if (!apic_x2apic_mode(apic))
>> > -                       kvm_apic_set_xapic_id(apic, val >> 24);
>> > -               else
>> > +               if (apic_x2apic_mode(apic)) {
>> >                         ret = 1;
>> > +                       break;
>> > +               }
>> > +               /*
>> > +                * If changing APIC ID with any APIC acceleration enabled,
>> > +                * deactivate APICv to avoid unexpected issues.
>> > +                */
>> > +               if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
>> > +                       kvm_request_apicv_update(apic->vcpu->kvm,
>> > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
>> > +
>> > +               kvm_apic_set_xapic_id(apic, val >> 24);
>> >                 break;
>> > 
>> >         case APIC_TASKPRI:
>> > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>> >  static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>> >                 struct kvm_lapic_state *s, bool set)
>> >  {
>> > -       if (apic_x2apic_mode(vcpu->arch.apic)) {
>> > -               u32 *id = (u32 *)(s->regs + APIC_ID);
>> > -               u32 *ldr = (u32 *)(s->regs + APIC_LDR);
>> > -               u64 icr;
>> > +       u32 *id = (u32 *)(s->regs + APIC_ID);
>> > +       u32 *ldr = (u32 *)(s->regs + APIC_LDR);
>> > +       u64 icr;
>> > +       if (!apic_x2apic_mode(vcpu->arch.apic)) {
>> > +               /*
>> > +                * If APIC ID changed with any APIC acceleration enabled,
>> > +                * deactivate APICv to avoid unexpected issues.
>> > +                */
>> > +               if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
>> > +                       kvm_request_apicv_update(vcpu->kvm,
>> > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
>> > +       } else {
>> >                 if (vcpu->kvm->arch.x2apic_format) {
>> >                         if (*id != vcpu->vcpu_id)
>> >                                 return -EINVAL;
>> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
>> > index 82d56f8055de..f78754bdc1d0 100644
>> > --- a/arch/x86/kvm/svm/avic.c
>> > +++ b/arch/x86/kvm/svm/avic.c
>> > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
>> >                           BIT(APICV_INHIBIT_REASON_IRQWIN) |
>> >                           BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
>> >                           BIT(APICV_INHIBIT_REASON_X2APIC) |
>> > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
>> > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
>> > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
>> > 
>> >         return supported & BIT(bit);
>> >  }
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index 7beba7a9f247..91265f0784bd 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>> >         ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>> >                           BIT(APICV_INHIBIT_REASON_ABSENT) |
>> >                           BIT(APICV_INHIBIT_REASON_HYPERV) |
>> > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
>> > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
>> > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
>> > 
>> >         return supported & BIT(bit);
>> >  }
>> > 
>> > 
>> > 
>> 
>> This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> because migration happens.
>
>I mean because host decided to change its apic id, which it can in theory do any time,
>even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>is to try to exploit some security hole.

Hi

Thanks for the information.  

IIUC, you mean KVM applies the APICv inhibition only to the L1 VM, leaving APICv
enabled for the L2 VM. Shouldn't KVM disable APICv for the L2 VM in this case?
It looks like a generic issue in the dynamic APICv toggling scheme;
e.g., QEMU can set KVM_GUESTDBG_BLOCKIRQ after a nested guest has started.


* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-13 13:53                   ` Chao Gao
@ 2022-03-13 15:09                     ` Maxim Levitsky
  2022-03-14  4:09                       ` Chao Gao
  2022-03-15 15:10                       ` Chao Gao
  0 siblings, 2 replies; 41+ messages in thread
From: Maxim Levitsky @ 2022-03-13 15:09 UTC (permalink / raw)
  To: Chao Gao
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Sun, 2022-03-13 at 21:53 +0800, Chao Gao wrote:
> On Sun, Mar 13, 2022 at 12:59:36PM +0200, Maxim Levitsky wrote:
> > On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote:
> > > On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote:
> > > > On 3/11/2022 12:26 PM, Sean Christopherson wrote:
> > > > > On Wed, Mar 09, 2022, Maxim Levitsky wrote:
> > > > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote:
> > > > > > > > Could you share the links?
> > > > > > > 
> > > > > > > Doh, sorry (they're both in this one).
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@redhat.com
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > My opinion on this subject is very simple: we need to draw the line somewhere.
> > > > > 
> > > > > ...
> > > > > 
> > > > > 
> > > > > Since the goal is to simplify KVM, can we try the inhibit route and see what the
> > > > > code looks like before making a decision?  I think it might actually yield a less
> > > > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e.
> > > > > we don't try to remove the inhibit on subsequent changes.
> > > > > 
> > > > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea
> > > > > why the VM was killed.  WARN is out of the question because this is user triggerable.
> > > > > Returning an emulation error would be ideal, but getting that result up through
> > > > > apic_mmio_write() could be annoying and end up being more complex.
> > > > > 
> > > > > The touchpoints will all be the same, unless I'm missing something the difference
> > > > > should only be a call to set an inhibit instead killing the VM.
> > > > 
> > > > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate
> > > > APICv once KVM guest would try to change APIC ID in xapic mode, and same
> > > > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep
> > > > alive but obviously lose benefit from hardware acceleration in this way.
> > > > 
> > > > So how do you think the proposal like this ?
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6dcccb304775..30d825c069be 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter {
> > > >  #define APICV_INHIBIT_REASON_X2APIC    5
> > > >  #define APICV_INHIBIT_REASON_BLOCKIRQ  6
> > > >  #define APICV_INHIBIT_REASON_ABSENT    7
> > > > +#define APICV_INHIBIT_REASON_APICID_CHG 8
> > > > 
> > > >  struct kvm_arch {
> > > >         unsigned long n_used_mmu_pages;
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 22929b5b3f9b..66cd54fa4515 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > > > 
> > > >         switch (reg) {
> > > >         case APIC_ID:           /* Local APIC ID */
> > > > -               if (!apic_x2apic_mode(apic))
> > > > -                       kvm_apic_set_xapic_id(apic, val >> 24);
> > > > -               else
> > > > +               if (apic_x2apic_mode(apic)) {
> > > >                         ret = 1;
> > > > +                       break;
> > > > +               }
> > > > +               /*
> > > > +                * If changing APIC ID with any APIC acceleration enabled,
> > > > +                * deactivate APICv to avoid unexpected issues.
> > > > +                */
> > > > +               if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id)
> > > > +                       kvm_request_apicv_update(apic->vcpu->kvm,
> > > > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> > > > +
> > > > +               kvm_apic_set_xapic_id(apic, val >> 24);
> > > >                 break;
> > > > 
> > > >         case APIC_TASKPRI:
> > > > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
> > > >  static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> > > >                 struct kvm_lapic_state *s, bool set)
> > > >  {
> > > > -       if (apic_x2apic_mode(vcpu->arch.apic)) {
> > > > -               u32 *id = (u32 *)(s->regs + APIC_ID);
> > > > -               u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > > > -               u64 icr;
> > > > +       u32 *id = (u32 *)(s->regs + APIC_ID);
> > > > +       u32 *ldr = (u32 *)(s->regs + APIC_LDR);
> > > > +       u64 icr;
> > > > +       if (!apic_x2apic_mode(vcpu->arch.apic)) {
> > > > +               /*
> > > > +                * If APIC ID changed with any APIC acceleration enabled,
> > > > +                * deactivate APICv to avoid unexpected issues.
> > > > +                */
> > > > +               if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
> > > > +                       kvm_request_apicv_update(vcpu->kvm,
> > > > +                               false, APICV_INHIBIT_REASON_APICID_CHG);
> > > > +       } else {
> > > >                 if (vcpu->kvm->arch.x2apic_format) {
> > > >                         if (*id != vcpu->vcpu_id)
> > > >                                 return -EINVAL;
> > > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > > index 82d56f8055de..f78754bdc1d0 100644
> > > > --- a/arch/x86/kvm/svm/avic.c
> > > > +++ b/arch/x86/kvm/svm/avic.c
> > > > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
> > > >                           BIT(APICV_INHIBIT_REASON_IRQWIN) |
> > > >                           BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
> > > >                           BIT(APICV_INHIBIT_REASON_X2APIC) |
> > > > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > > > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > > > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > > > 
> > > >         return supported & BIT(bit);
> > > >  }
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 7beba7a9f247..91265f0784bd 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> > > >         ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> > > >                           BIT(APICV_INHIBIT_REASON_ABSENT) |
> > > >                           BIT(APICV_INHIBIT_REASON_HYPERV) |
> > > > -                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> > > > +                         BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> > > > +                         BIT(APICV_INHIBIT_REASON_APICID_CHG);
> > > > 
> > > >         return supported & BIT(bit);
> > > >  }
> > > > 
> > > > 
> > > > 
> > > 
> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> > > because migration happens.
> > 
> > I mean because host decided to change its apic id, which it can in theory do any time,
> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
> > is to try to exploit some security hole.
> 
> Hi
> 
> Thanks for the information.  
> 
> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
> It looks like a generic issue in dynamically toggling APICv scheme,
> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
> 

That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
Or in other words, when a nested guest uses a hardware feature, you can't at some point say to it:
sorry buddy - the hardware feature disappeared.

It is *currently* not a problem for APICv because it doesn't do IPI virtualization,
and even with these patches, it doesn't do it for nesting.
It does become a problem when you allow the nested guest to use it, which I did in the nested AVIC code.


And writable APIC IDs do pose a large problem, since nested AVIC will target L1's APIC IDs,
and when they can change under you without any notice, and even worse be duplicated,
it is just a nightmare.

About KVM_GUESTDBG_BLOCKIRQ - yes, but that is a best-effort hack anyway,
which is not supposed to work 100% - running gdbstub with nested guests is broken in many ways anyway.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-13 15:09                     ` Maxim Levitsky
@ 2022-03-14  4:09                       ` Chao Gao
  2022-03-15 15:10                       ` Chao Gao
  1 sibling, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-03-14  4:09 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

>> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> > > because migration happens.
>> > 
>> > I mean because host decided to change its apic id, which it can in theory do any time,
>> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>> > is to try to exploit some security hole.
>> 
>> Hi
>> 
>> Thanks for the information.  
>> 
>> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
>> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
>> It looks like a generic issue in dynamically toggling APICv scheme,
>> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
>> 
>
>That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
>Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
>sorry buddy - hardware feature disappeared.

Agreed. I missed this.

>
>It is *currently* not a problem for APICv because it doesn't do IPI virtualization,
>and even with these patches, it doesn't do this for nesting.
>It does become when you allow nested guest to use this which I did in the nested AVIC code.
>
>
>and writable apic ids do pose a large problem, since nested AVIC, will target L1's apic ids,
>and when they can change under you without any notice, and even worse be duplicate,
>it is just nightmare.

OK. So the problem is: if we choose to disable APICv instead of making the APIC ID
read-only, although that works perfectly for VMX IPIv, it effectively makes future
cleanups to AVIC difficult/impossible, because nested AVIC is practically impossible
to implement without assuming that L1's APIC IDs are immutable.

Sean & Maxim

How about going back to a module parameter to opt in to a read-only APIC ID?
Migration may fail in some cases, but that shouldn't be a big issue, as migrating
VMs from a KVM with nested=on to a KVM with nested=off may also fail.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-13 15:09                     ` Maxim Levitsky
  2022-03-14  4:09                       ` Chao Gao
@ 2022-03-15 15:10                       ` Chao Gao
  2022-03-15 15:30                         ` Maxim Levitsky
  1 sibling, 1 reply; 41+ messages in thread
From: Chao Gao @ 2022-03-15 15:10 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Sun, Mar 13, 2022 at 05:09:08PM +0200, Maxim Levitsky wrote:
>> > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
>> > > because migration happens.
>> > 
>> > I mean because host decided to change its apic id, which it can in theory do any time,
>> > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
>> > is to try to exploit some security hole.
>> 
>> Hi
>> 
>> Thanks for the information.  
>> 
>> IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
>> enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
>> It looks like a generic issue in dynamically toggling APICv scheme,
>> e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
>> 
>
>That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
>Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
>sorry buddy - hardware feature disappeared.

Hi Maxim,

I may be missing something. When reading Sean's APICv inhibition cleanups, I
found that AVIC is disabled for L1 when nested is enabled (i.e., SVM is advertised
to L1). So I think the new inhibition introduced for a changed xAPIC ID
shouldn't be a problem for an L2 VM. Or do you plan to remove
APICV_INHIBIT_REASON_NESTED and expose AVIC to L1?

svm_vcpu_after_set_cpuid:
                /*
                 * Currently, AVIC does not work with nested virtualization.
                 * So, we disable AVIC when cpuid for SVM is set in the L1 guest.
                 */
                if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
                        kvm_request_apicv_update(vcpu->kvm, false,
                                                 APICV_INHIBIT_REASON_NESTED);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-15 15:10                       ` Chao Gao
@ 2022-03-15 15:30                         ` Maxim Levitsky
  2022-03-16 11:50                           ` Chao Gao
  0 siblings, 1 reply; 41+ messages in thread
From: Maxim Levitsky @ 2022-03-15 15:30 UTC (permalink / raw)
  To: Chao Gao
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Tue, 2022-03-15 at 23:10 +0800, Chao Gao wrote:
> On Sun, Mar 13, 2022 at 05:09:08PM +0200, Maxim Levitsky wrote:
> > > > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC,
> > > > > because migration happens.
> > > > 
> > > > I mean because host decided to change its apic id, which it can in theory do any time,
> > > > even after the nested guest has started. Seriously, the only reason guest has to change apic id,
> > > > is to try to exploit some security hole.
> > > 
> > > Hi
> > > 
> > > Thanks for the information.  
> > > 
> > > IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv
> > > enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case?
> > > It looks like a generic issue in dynamically toggling APICv scheme,
> > > e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started.
> > > 
> > 
> > That is the problem - you can't disable it for L2, unless you are willing to emulate it in software.
> > Or in other words, when nested guest uses a hardware feature, you can't at some point say to it:
> > sorry buddy - hardware feature disappeared.
> 
> Hi Maxim,
> 
> I may miss something. When reading Sean's APICv inhibition cleanups, I
> find AVIC is disabled for L1 when nested is enabled (SVM is advertised
> to L1). Then, I think the new inhibition introduced for changed xAPIC ID
> shouldn't be a problem for L2 VM. Or, you plan to remove
> APICV_INHIBIT_REASON_NESTED and expose AVIC to L1?

Yep, I have a patch for this, which I hope will be accepted really soon:
(KVM: x86: SVM: allow AVIC to co-exist with a nested guest running)

I also implemented working support for nested AVIC, which includes support for IPIs without VM exits
between L2's vCPUs. I sent an RFC for that.

With all patches applied, both L1 and L2 switch hands on AVIC: L1's AVIC is inhibited
(only locally) on the vCPU which runs nested, and while it runs nested, L2 uses AVIC
to target other vCPUs which also run nested.
 
Paolo and I talked about this, and we reached a very promising conclusion.

I will add a new KVM cap, say KVM_CAP_READ_ONLY_APIC, which userspace will set
prior to creating a vCPU, and which will make the APIC ID fully read-only when set.

As a bonus, if you don't object, I will also have this cap make the APIC base read-only,
since that feature is also broken in KVM, optional in the x86 spec, and not really
used by guests, just like the writable APIC ID.

I hope to have patches for this in a day or two.

When this cap is not set, it is fair to disable both IPIv and my nested AVIC,
or even better to inhibit AVIC completely, including any nested support.
 
Best regards,
	Maxim Levitsky

> 
> svm_vcpu_after_set_cpuid:
>                 /*
>                  * Currently, AVIC does not work with nested virtualization.
>                  * So, we disable AVIC when cpuid for SVM is set in the L1 guest.
>                  */
>                 if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
>                         kvm_request_apicv_update(vcpu->kvm, false,
>                                                  APICV_INHIBIT_REASON_NESTED);
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally
  2022-03-15 15:30                         ` Maxim Levitsky
@ 2022-03-16 11:50                           ` Chao Gao
  0 siblings, 0 replies; 41+ messages in thread
From: Chao Gao @ 2022-03-16 11:50 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Zeng Guang, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, Dave Hansen, Luck,
	Tony, Kan Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Kim Phillips, Jarkko Sakkinen, Jethro Beekman,
	Huang, Kai, x86, linux-kernel, Hu, Robert

On Tue, Mar 15, 2022 at 05:30:32PM +0200, Maxim Levitsky wrote:
>Yep, I  have a patch for this ( which I hope to be accepted really soon
>(KVM: x86: SVM: allow AVIC to co-exist with a nested guest running)
> 
>I also implemented working support for nested AVIC, which includes support for IPI without vm exits
>between L2's vCPUs. I had sent an RFC for that.
> 
>With all patches applied both L1 and L2 switch hands on AVIC, L1's avic is inhibited
>(only locally) on the vCPU which runs nested, and while it runs nested, L2 uses AVIC
>to target other vCPUs which also run nested.
> 
>I and Paolo talked about this, and we reached a very promising conclusion.
>
>I will add new KVM cap, say KVM_CAP_READ_ONLY_APIC, which userspace will set
>prior to creating a vCPU, and which will make APIC ID fully readonly when set.

Will KVM report violations to QEMU? Then QEMU can know that the VM attempted
to change the APIC ID and report an error to the admin, who can relaunch
the VM without setting the new cap.

> 
>As a bonus, if you don't object, I will also make this cap, make APIC base read-only,
>since this feature is also broken in kvm, optional in x86 spec, and not really
>used by guests just like writable apic id.
>
>I hope to have patches in day or two for this.

Great. And no objection to making APIC base read-only.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2022-03-16 11:51 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-25  8:22 [PATCH v6 0/9] IPI virtualization support for VM Zeng Guang
2022-02-25  8:22 ` [PATCH v6 1/9] x86/cpu: Add new VMX feature, Tertiary VM-Execution control Zeng Guang
2022-02-25 14:09   ` Maxim Levitsky
2022-02-25  8:22 ` [PATCH v6 2/9] KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation Zeng Guang
2022-02-25 14:24   ` Maxim Levitsky
2022-02-25  8:22 ` [PATCH v6 3/9] KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config Zeng Guang
2022-02-25 14:30   ` Maxim Levitsky
2022-02-25  8:22 ` [PATCH v6 4/9] KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well Zeng Guang
2022-02-25 14:31   ` Maxim Levitsky
2022-02-25  8:22 ` [PATCH v6 5/9] KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode Zeng Guang
2022-02-25 14:44   ` Maxim Levitsky
2022-02-25 15:29     ` Chao Gao
2022-02-25  8:22 ` [PATCH v6 6/9] KVM: x86: lapic: don't allow to change APIC ID unconditionally Zeng Guang
2022-02-25 14:46   ` Maxim Levitsky
2022-02-25 14:56     ` David Woodhouse
2022-02-25 15:11       ` Maxim Levitsky
2022-02-25 15:42         ` David Woodhouse
2022-02-25 16:12           ` Maxim Levitsky
2022-03-01  8:03     ` Chao Gao
2022-03-08 23:04   ` Sean Christopherson
2022-03-09  5:21     ` Chao Gao
2022-03-09  6:01       ` Sean Christopherson
2022-03-09 12:59         ` Maxim Levitsky
2022-03-11  4:26           ` Sean Christopherson
     [not found]             ` <29c76393-4884-94a8-f224-08d313b73f71@intel.com>
2022-03-13  9:19               ` Maxim Levitsky
2022-03-13 10:59                 ` Maxim Levitsky
2022-03-13 13:53                   ` Chao Gao
2022-03-13 15:09                     ` Maxim Levitsky
2022-03-14  4:09                       ` Chao Gao
2022-03-15 15:10                       ` Chao Gao
2022-03-15 15:30                         ` Maxim Levitsky
2022-03-16 11:50                           ` Chao Gao
2022-02-25  8:22 ` [PATCH v6 7/9] KVM: VMX: enable IPI virtualization Zeng Guang
2022-02-25 17:19   ` Maxim Levitsky
2022-03-01  9:21     ` Chao Gao
2022-03-02  6:45       ` Chao Gao
2022-02-25  8:22 ` [PATCH v6 8/9] KVM: x86: Allow userspace set maximum VCPU id for VM Zeng Guang
2022-02-25 17:22   ` Maxim Levitsky
2022-02-25  8:22 ` [PATCH v6 9/9] KVM: VMX: Optimize memory allocation for PID-pointer table Zeng Guang
2022-02-25 17:29   ` Maxim Levitsky
2022-03-01  9:23     ` Chao Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).