stable.vger.kernel.org archive mirror
* [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries
@ 2022-06-14  2:11 Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 2/6] KVM: x86: do not set st->preempted when going back to user space Sasha Levin
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Paolo Bonzini, Jann Horn, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, kvm

From: Paolo Bonzini <pbonzini@redhat.com>

[ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]

If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access.  A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.

To avoid this, only report a vCPU as preempted if it is certain that the
guest is at an instruction boundary.  A rescheduling request will be
delivered to the host physical CPU as an external interrupt, so for
simplicity consider any vmexit *not* to be on an instruction boundary
except for external interrupts.

It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed.  However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses.  Even
though the TLB flush issue only applies to virtual memory addresses,
it's very much preferable to be conservative.
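
For stable reviewers, a condensed sketch of the resulting flow; the helper
in the second fragment is illustrative only, the real code lives in the
vendor handle_exit_irqoff() hooks and in kvm_steal_time_set_preempted()
as shown in the diff below:

	/* vcpu_run(): every iteration may end up emulating an instruction */
	vcpu->arch.at_instruction_boundary = false;

	/* vendor ->handle_exit_irqoff(): only a host external-interrupt
	 * exit is known to be on an instruction boundary
	 */
	if (exit_is_host_external_interrupt(vcpu))	/* illustrative helper */
		vcpu->arch.at_instruction_boundary = true;

	/* kvm_steal_time_set_preempted(): otherwise stay silent, so a PV
	 * TLB flush from another vCPU cannot race with a half-done access
	 */
	if (!vcpu->arch.at_instruction_boundary) {
		vcpu->stat.preemption_other++;
		return;
	}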

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/vmx/vmx.c          |  1 +
 arch/x86/kvm/x86.c              | 22 ++++++++++++++++++++++
 4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4ff36610af6a..9fdaa847d4b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -651,6 +651,7 @@ struct kvm_vcpu_arch {
 	u64 ia32_misc_enable_msr;
 	u64 smbase;
 	u64 smi_count;
+	bool at_instruction_boundary;
 	bool tpr_access_reporting;
 	bool xsaves_enabled;
 	bool xfd_no_write_intercept;
@@ -1289,6 +1290,8 @@ struct kvm_vcpu_stat {
 	u64 nested_run;
 	u64 directed_yield_attempted;
 	u64 directed_yield_successful;
+	u64 preemption_reported;
+	u64 preemption_other;
 	u64 guest_mode;
 };
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7e45d03cd018..5842abf1eac4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4165,6 +4165,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
 
 static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
+	if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
+		vcpu->arch.at_instruction_boundary = true;
 }
 
 static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 982df9c000d3..c44f8e1d30c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6549,6 +6549,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+	vcpu->arch.at_instruction_boundary = true;
 }
 
 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c571224ac2..36453517e847 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -291,6 +291,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
 	STATS_DESC_COUNTER(VCPU, nested_run),
 	STATS_DESC_COUNTER(VCPU, directed_yield_attempted),
 	STATS_DESC_COUNTER(VCPU, directed_yield_successful),
+	STATS_DESC_COUNTER(VCPU, preemption_reported),
+	STATS_DESC_COUNTER(VCPU, preemption_other),
 	STATS_DESC_ICOUNTER(VCPU, guest_mode)
 };
 
@@ -4604,6 +4606,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	struct kvm_memslots *slots;
 	static const u8 preempted = KVM_VCPU_PREEMPTED;
 
+	/*
+	 * The vCPU can be marked preempted if and only if the VM-Exit was on
+	 * an instruction boundary and will not trigger guest emulation of any
+	 * kind (see vcpu_run).  Vendor specific code controls (conservatively)
+	 * when this is true, for example allowing the vCPU to be marked
+	 * preempted if and only if the VM-Exit was due to a host interrupt.
+	 */
+	if (!vcpu->arch.at_instruction_boundary) {
+		vcpu->stat.preemption_other++;
+		return;
+	}
+
+	vcpu->stat.preemption_reported++;
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
@@ -10358,6 +10373,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
+		/*
+		 * If another guest vCPU requests a PV TLB flush in the middle
+		 * of instruction emulation, the rest of the emulation could
+		 * use a stale page translation. Assume that any code after
+		 * this point can start executing an instruction.
+		 */
+		vcpu->arch.at_instruction_boundary = false;
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {
-- 
2.35.1



* [PATCH MANUALSEL 5.18 2/6] KVM: x86: do not set st->preempted when going back to user space
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
@ 2022-06-14  2:11 ` Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 3/6] KVM: selftests: Make hyperv_clock selftest more stable Sasha Levin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Paolo Bonzini, Sean Christopherson, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, kvm

From: Paolo Bonzini <pbonzini@redhat.com>

[ Upstream commit 54aa83c90198e68eee8b0850c749bc70efb548da ]

Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted.  The reason for KVM's current behavior is that, for
example, optimistic spinning might not be a good idea if the guest is doing
repeated exits to userspace; however, the existing behavior is confusing and
unlikely to make a difference, because well-tuned guests will hardly ever
exit KVM_RUN in the first place.
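
Condensed, the sched-out path after this change looks roughly as follows (a
sketch only; the actual diff below also keeps the srcu locking and the
Xen-vs-steal-time split):

	if (vcpu->preempted) {
		/* involuntary preemption: report it to the guest */
		kvm_steal_time_set_preempted(vcpu);
	}
	/* a voluntary return to userspace (vcpu->preempted == false) is no
	 * longer reported, matching the existing Xen runstate behavior
	 */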

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kvm/x86.c | 26 ++++++++++++++------------
 arch/x86/kvm/xen.h |  6 ++++--
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36453517e847..165e2912995f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4648,19 +4648,21 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
-	if (vcpu->preempted && !vcpu->arch.guest_state_protected)
-		vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
+	if (vcpu->preempted) {
+		if (!vcpu->arch.guest_state_protected)
+			vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
 
-	/*
-	 * Take the srcu lock as memslots will be accessed to check the gfn
-	 * cache generation against the memslots generation.
-	 */
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-	if (kvm_xen_msr_enabled(vcpu->kvm))
-		kvm_xen_runstate_set_preempted(vcpu);
-	else
-		kvm_steal_time_set_preempted(vcpu);
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		/*
+		 * Take the srcu lock as memslots will be accessed to check the gfn
+		 * cache generation against the memslots generation.
+		 */
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
+		if (kvm_xen_msr_enabled(vcpu->kvm))
+			kvm_xen_runstate_set_preempted(vcpu);
+		else
+			kvm_steal_time_set_preempted(vcpu);
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	}
 
 	static_call(kvm_x86_vcpu_put)(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index adbcc9ed59db..fda1413f8af9 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -103,8 +103,10 @@ static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
 	 * behalf of the vCPU. Only if the VMM does actually block
 	 * does it need to enter RUNSTATE_blocked.
 	 */
-	if (vcpu->preempted)
-		kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
+	if (WARN_ON_ONCE(!vcpu->preempted))
+		return;
+
+	kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
 }
 
 /* 32-bit compatibility definitions, also used natively in 32-bit build */
-- 
2.35.1



* [PATCH MANUALSEL 5.18 3/6] KVM: selftests: Make hyperv_clock selftest more stable
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 2/6] KVM: x86: do not set st->preempted when going back to user space Sasha Levin
@ 2022-06-14  2:11 ` Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 4/6] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging Sasha Levin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Vitaly Kuznetsov, Maxim Levitsky, Paolo Bonzini, Sasha Levin,
	shuah, seanjc, kvm, linux-kselftest

From: Vitaly Kuznetsov <vkuznets@redhat.com>

[ Upstream commit eae260be3a0111a28fe95923e117a55dddec0384 ]

hyperv_clock doesn't always give a stable test result, especially with
AMD CPUs. The test compares the Hyper-V MSR clocksource (acquired either
with rdmsr() from within the guest or KVM_GET_MSRS from the host)
against rdtsc(). To increase the accuracy, increase the measured delay
(done with a nop loop) by two orders of magnitude and take the mean of the
rdtsc() values sampled before and after rdmsr()/KVM_GET_MSRS.
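
On the guest side the averaging looks like this (taken almost verbatim from
the diff below; the host side does the same around the KVM_GET_MSRS ioctl):

	r1 = rdtsc();
	t1 = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
	r1 = (r1 + rdtsc()) / 2;	/* midpoint estimate of when the MSR
					 * value was actually sampled */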

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220601144322.1968742-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 tools/testing/selftests/kvm/x86_64/hyperv_clock.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
index e0b2bb1339b1..3330fb183c68 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
@@ -44,7 +44,7 @@ static inline void nop_loop(void)
 {
 	int i;
 
-	for (i = 0; i < 1000000; i++)
+	for (i = 0; i < 100000000; i++)
 		asm volatile("nop");
 }
 
@@ -56,12 +56,14 @@ static inline void check_tsc_msr_rdtsc(void)
 	tsc_freq = rdmsr(HV_X64_MSR_TSC_FREQUENCY);
 	GUEST_ASSERT(tsc_freq > 0);
 
-	/* First, check MSR-based clocksource */
+	/* For increased accuracy, take the mean rdtsc() before and after rdmsr() */
 	r1 = rdtsc();
 	t1 = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
+	r1 = (r1 + rdtsc()) / 2;
 	nop_loop();
 	r2 = rdtsc();
 	t2 = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
+	r2 = (r2 + rdtsc()) / 2;
 
 	GUEST_ASSERT(r2 > r1 && t2 > t1);
 
@@ -181,12 +183,14 @@ static void host_check_tsc_msr_rdtsc(struct kvm_vm *vm)
 	tsc_freq = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TSC_FREQUENCY);
 	TEST_ASSERT(tsc_freq > 0, "TSC frequency must be nonzero");
 
-	/* First, check MSR-based clocksource */
+	/* For increased accuracy, take the mean rdtsc() before and after the ioctl */
 	r1 = rdtsc();
 	t1 = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TIME_REF_COUNT);
+	r1 = (r1 + rdtsc()) / 2;
 	nop_loop();
 	r2 = rdtsc();
 	t2 = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TIME_REF_COUNT);
+	r2 = (r2 + rdtsc()) / 2;
 
 	TEST_ASSERT(t2 > t1, "Time reference MSR is not monotonic (%ld <= %ld)", t1, t2);
 
-- 
2.35.1



* [PATCH MANUALSEL 5.18 4/6] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 2/6] KVM: x86: do not set st->preempted when going back to user space Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 3/6] KVM: selftests: Make hyperv_clock selftest more stable Sasha Levin
@ 2022-06-14  2:11 ` Sasha Levin
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 5/6] entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set Sasha Levin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Ben Gardon, Paolo Bonzini, Sasha Levin, tglx, mingo, bp,
	dave.hansen, x86, kvm

From: Ben Gardon <bgardon@google.com>

[ Upstream commit 5ba7c4c6d1c7af47a916f728bb5940669684a087 ]

Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
the shadow MMU.

When disabling dirty logging, zap non-leaf parent entries to allow
replacement with huge pages instead of recursing and zapping all of the
child, leaf entries. This reduces the number of TLB flushes required
and reduces the disable dirty log time with the TDP MMU to ~3 seconds.

Opportunistically add a WARN() to catch GFNs that are mapped at a
higher level than their max level.
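
The core of the reworked loop, condensed from zap_collapsible_spte_range()
in the diff below:

	/* leaf SPTE that could be mapped at a higher level: climb up to
	 * the parent covering max_mapping_level and zap that single entry
	 */
	while (max_mapping_level > iter.level)
		tdp_iter_step_up(&iter);

	tdp_mmu_zap_spte_atomic(kvm, &iter);
	/* if the atomic zap fails, the iterator recurses back into the
	 * same subtree and retries
	 */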

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220525230904.1584480-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kvm/mmu/tdp_iter.c |  9 +++++++++
 arch/x86/kvm/mmu/tdp_iter.h |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c  | 38 +++++++++++++++++++++++++++++++------
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 6d3b3e5a5533..ee4802d7b36c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -145,6 +145,15 @@ static bool try_step_up(struct tdp_iter *iter)
 	return true;
 }
 
+/*
+ * Step the iterator back up a level in the paging structure. Should only be
+ * used when the iterator is below the root level.
+ */
+void tdp_iter_step_up(struct tdp_iter *iter)
+{
+	WARN_ON(!try_step_up(iter));
+}
+
 /*
  * Step to the next SPTE in a pre-order traversal of the paging structure.
  * To get to the next SPTE, the iterator either steps down towards the goal
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..adfca0cf94d3 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -114,5 +114,6 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
+void tdp_iter_step_up(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 922b06bf4b94..b61a11d462cc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1748,12 +1748,12 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	gfn_t start = slot->base_gfn;
 	gfn_t end = start + slot->npages;
 	struct tdp_iter iter;
+	int max_mapping_level;
 	kvm_pfn_t pfn;
 
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
@@ -1761,15 +1761,41 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		/*
+		 * This is a leaf SPTE. Check if the PFN it maps can
+		 * be mapped at a higher level.
+		 */
 		pfn = spte_to_pfn(iter.old_spte);
-		if (kvm_is_reserved_pfn(pfn) ||
-		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
-							    pfn, PG_LEVEL_NUM))
+
+		if (kvm_is_reserved_pfn(pfn))
 			continue;
 
+		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
+				iter.gfn, pfn, PG_LEVEL_NUM);
+
+		WARN_ON(max_mapping_level < iter.level);
+
+		/*
+		 * If this page is already mapped at the highest
+		 * viable level, there's nothing more to do.
+		 */
+		if (max_mapping_level == iter.level)
+			continue;
+
+		/*
+		 * The page can be remapped at a higher level, so step
+		 * up to zap the parent SPTE.
+		 */
+		while (max_mapping_level > iter.level)
+			tdp_iter_step_up(&iter);
+
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
-			goto retry;
+		tdp_mmu_zap_spte_atomic(kvm, &iter);
+
+		/*
+		 * If the atomic zap fails, the iter will recurse back into
+		 * the same subtree to retry.
+		 */
 	}
 
 	rcu_read_unlock();
-- 
2.35.1



* [PATCH MANUALSEL 5.18 5/6] entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
                   ` (2 preceding siblings ...)
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 4/6] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging Sasha Levin
@ 2022-06-14  2:11 ` Sasha Levin
  2022-06-14  4:38   ` Eric W. Biederman
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 6/6] KVM: Don't null dereference ops->destroy Sasha Levin
  2022-06-21 21:23 ` [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
  5 siblings, 1 reply; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Seth Forshee, Eric W. Biederman, Petr Mladek, Paolo Bonzini,
	Sasha Levin, tglx, peterz, luto

From: Seth Forshee <sforshee@digitalocean.com>

[ Upstream commit 3e684903a8574ffc9475fdf13c4780a7adb506ad ]

A livepatch transition may stall indefinitely when a kvm vCPU is heavily
loaded. To the host, the vCPU task is a user thread which is spending a
very long time in the ioctl(KVM_RUN) syscall. During livepatch
transition, set_notify_signal() will be called on such tasks to
interrupt the syscall so that the task can be transitioned. This
interrupts guest execution, but when xfer_to_guest_mode_work() sees that
TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
exit to user mode is unnecessary, and guest execution is resumed without
transitioning the task for the livepatch.

This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
is expected to break tasks out of interruptible kernel loops and cause
them to return to userspace. Change xfer_to_guest_mode_work() to handle
TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
loop that an exit to userspace is needed. Any pending task_work will be
run when get_signal() is called from exit_to_user_mode_loop(), so there
is no longer any need to run task work from xfer_to_guest_mode_work().
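
After the change, the work loop treats both flags identically (condensed
from the diff below):

	if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
		kvm_handle_signal_exit(vcpu);
		return -EINTR;	/* exit to userspace; get_signal() will
				 * run any pending task_work from there */
	}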

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/entry/kvm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 9d09f489b60e..2e0f75bcb7fd 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -9,12 +9,6 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		int ret;
 
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
-			clear_notify_signal();
-			if (task_work_pending(current))
-				task_work_run();
-		}
-
-		if (ti_work & _TIF_SIGPENDING) {
 			kvm_handle_signal_exit(vcpu);
 			return -EINTR;
 		}
-- 
2.35.1



* [PATCH MANUALSEL 5.18 6/6] KVM: Don't null dereference ops->destroy
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
                   ` (3 preceding siblings ...)
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 5/6] entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set Sasha Levin
@ 2022-06-14  2:11 ` Sasha Levin
  2022-06-21 21:23 ` [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
  5 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-14  2:11 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Alexey Kardashevskiy, Paolo Bonzini, Sasha Levin, kvm

From: Alexey Kardashevskiy <aik@ozlabs.ru>

[ Upstream commit e8bc2427018826e02add7b0ed0fc625a60390ae5 ]

A KVM device cleanup happens in either of two callbacks:
1) destroy() which is called when the VM is being destroyed;
2) release() which is called when a device fd is closed.

Most KVM devices use 1) but Book3s's interrupt controller KVM devices
(XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
the machine execution. The error handling in kvm_ioctl_create_device()
assumes destroy() is always defined which leads to NULL dereference as
discovered by Syzkaller.

This adds a check for destroy != NULL and adds a missing release() call.

This does not change kvm_destroy_devices(), as devices with a defined
release() should have been removed from the KVM devices list by then.
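
The error path in kvm_ioctl_create_device() then becomes (condensed from
the diff below):

	if (ops->release)
		ops->release(dev);	/* e.g. Book3s XICS/XIVE devices */
	mutex_unlock(&kvm->lock);
	if (ops->destroy)
		ops->destroy(dev);	/* everything else */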

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 virt/kvm/kvm_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5ab12214e18d..24cb37d19c63 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4299,8 +4299,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 		kvm_put_kvm_no_destroy(kvm);
 		mutex_lock(&kvm->lock);
 		list_del(&dev->vm_node);
+		if (ops->release)
+			ops->release(dev);
 		mutex_unlock(&kvm->lock);
-		ops->destroy(dev);
+		if (ops->destroy)
+			ops->destroy(dev);
 		return ret;
 	}
 
-- 
2.35.1



* Re: [PATCH MANUALSEL 5.18 5/6] entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 5/6] entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set Sasha Levin
@ 2022-06-14  4:38   ` Eric W. Biederman
  0 siblings, 0 replies; 11+ messages in thread
From: Eric W. Biederman @ 2022-06-14  4:38 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Seth Forshee, Petr Mladek, Paolo Bonzini,
	tglx, peterz, luto


Starting with v5.18 TIF_NOTIFY_SIGNAL and task_work_run are separated,
so backporting this to v5.18 looks safe.

For anyone considering backporting this any further, please be aware
that TIF_NOTIFY_SIGNAL and task_work were previously mixed together, so
backporting this would likely also require backporting my cleanups that
separated them.

Eric


Sasha Levin <sashal@kernel.org> writes:

> From: Seth Forshee <sforshee@digitalocean.com>
>
> [ Upstream commit 3e684903a8574ffc9475fdf13c4780a7adb506ad ]
>
> A livepatch transition may stall indefinitely when a kvm vCPU is heavily
> loaded. To the host, the vCPU task is a user thread which is spending a
> very long time in the ioctl(KVM_RUN) syscall. During livepatch
> transition, set_notify_signal() will be called on such tasks to
> interrupt the syscall so that the task can be transitioned. This
> interrupts guest execution, but when xfer_to_guest_mode_work() sees that
> TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
> exit to user mode is unnecessary, and guest execution is resumed without
> transitioning the task for the livepatch.
>
> This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
> is expected to break tasks out of interruptible kernel loops and cause
> them to return to userspace. Change xfer_to_guest_mode_work() to handle
> TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
> loop that an exit to userpsace is needed. Any pending task_work will be
> run when get_signal() is called from exit_to_user_mode_loop(), so there
> is no longer any need to run task work from xfer_to_guest_mode_work().
>
> Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
> Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  kernel/entry/kvm.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 9d09f489b60e..2e0f75bcb7fd 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -9,12 +9,6 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
>  		int ret;
>  
>  		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
> -			clear_notify_signal();
> -			if (task_work_pending(current))
> -				task_work_run();
> -		}
> -
> -		if (ti_work & _TIF_SIGPENDING) {
>  			kvm_handle_signal_exit(vcpu);
>  			return -EINTR;
>  		}


* Re: [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  2022-06-14  2:11 [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
                   ` (4 preceding siblings ...)
  2022-06-14  2:11 ` [PATCH MANUALSEL 5.18 6/6] KVM: Don't null dereference ops->destroy Sasha Levin
@ 2022-06-21 21:23 ` Sasha Levin
  2022-08-03 16:02   ` Jann Horn
  2022-08-03 16:50   ` Paolo Bonzini
  5 siblings, 2 replies; 11+ messages in thread
From: Sasha Levin @ 2022-06-21 21:23 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Paolo Bonzini, Jann Horn, tglx, mingo, bp, dave.hansen, x86, kvm

Paolo, ping?

On Mon, Jun 13, 2022 at 10:11:10PM -0400, Sasha Levin wrote:
>From: Paolo Bonzini <pbonzini@redhat.com>
>
>[ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]
>
>If a vCPU is outside guest mode and is scheduled out, it might be in the
>process of making a memory access.  A problem occurs if another vCPU uses
>the PV TLB flush feature during the period when the vCPU is scheduled
>out, and a virtual address has already been translated but has not yet
>been accessed, because this is equivalent to using a stale TLB entry.
>
>To avoid this, only report a vCPU as preempted if it is certain that the
>guest is at an instruction boundary.  A rescheduling request will be
>delivered to the host physical CPU as an external interrupt, so for
>simplicity consider any vmexit *not* to be on an instruction boundary
>except for external interrupts.
>
>It would in principle be okay to report the vCPU as preempted also
>if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
>vmentry/vmexit overhead unnecessarily, and optimistic spinning is
>also unlikely to succeed.  However, leave it for later because right
>now kvm_vcpu_check_block() is doing memory accesses.  Even
>though the TLB flush issue only applies to virtual memory addresses,
>it's very much preferable to be conservative.
>
>Reported-by: Jann Horn <jannh@google.com>
>Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>Signed-off-by: Sasha Levin <sashal@kernel.org>
>---
> arch/x86/include/asm/kvm_host.h |  3 +++
> arch/x86/kvm/svm/svm.c          |  2 ++
> arch/x86/kvm/vmx/vmx.c          |  1 +
> arch/x86/kvm/x86.c              | 22 ++++++++++++++++++++++
> 4 files changed, 28 insertions(+)
>
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 4ff36610af6a..9fdaa847d4b6 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -651,6 +651,7 @@ struct kvm_vcpu_arch {
> 	u64 ia32_misc_enable_msr;
> 	u64 smbase;
> 	u64 smi_count;
>+	bool at_instruction_boundary;
> 	bool tpr_access_reporting;
> 	bool xsaves_enabled;
> 	bool xfd_no_write_intercept;
>@@ -1289,6 +1290,8 @@ struct kvm_vcpu_stat {
> 	u64 nested_run;
> 	u64 directed_yield_attempted;
> 	u64 directed_yield_successful;
>+	u64 preemption_reported;
>+	u64 preemption_other;
> 	u64 guest_mode;
> };
>
>diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>index 7e45d03cd018..5842abf1eac4 100644
>--- a/arch/x86/kvm/svm/svm.c
>+++ b/arch/x86/kvm/svm/svm.c
>@@ -4165,6 +4165,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
>
> static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> {
>+	if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
>+		vcpu->arch.at_instruction_boundary = true;
> }
>
> static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 982df9c000d3..c44f8e1d30c8 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -6549,6 +6549,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> 		return;
>
> 	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
>+	vcpu->arch.at_instruction_boundary = true;
> }
>
> static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 39c571224ac2..36453517e847 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -291,6 +291,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
> 	STATS_DESC_COUNTER(VCPU, nested_run),
> 	STATS_DESC_COUNTER(VCPU, directed_yield_attempted),
> 	STATS_DESC_COUNTER(VCPU, directed_yield_successful),
>+	STATS_DESC_COUNTER(VCPU, preemption_reported),
>+	STATS_DESC_COUNTER(VCPU, preemption_other),
> 	STATS_DESC_ICOUNTER(VCPU, guest_mode)
> };
>
>@@ -4604,6 +4606,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
> 	struct kvm_memslots *slots;
> 	static const u8 preempted = KVM_VCPU_PREEMPTED;
>
>+	/*
>+	 * The vCPU can be marked preempted if and only if the VM-Exit was on
>+	 * an instruction boundary and will not trigger guest emulation of any
>+	 * kind (see vcpu_run).  Vendor specific code controls (conservatively)
>+	 * when this is true, for example allowing the vCPU to be marked
>+	 * preempted if and only if the VM-Exit was due to a host interrupt.
>+	 */
>+	if (!vcpu->arch.at_instruction_boundary) {
>+		vcpu->stat.preemption_other++;
>+		return;
>+	}
>+
>+	vcpu->stat.preemption_reported++;
> 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> 		return;
>
>@@ -10358,6 +10373,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> 	vcpu->arch.l1tf_flush_l1d = true;
>
> 	for (;;) {
>+		/*
>+		 * If another guest vCPU requests a PV TLB flush in the middle
>+		 * of instruction emulation, the rest of the emulation could
>+		 * use a stale page translation. Assume that any code after
>+		 * this point can start executing an instruction.
>+		 */
>+		vcpu->arch.at_instruction_boundary = false;
> 		if (kvm_vcpu_running(vcpu)) {
> 			r = vcpu_enter_guest(vcpu);
> 		} else {
>-- 
>2.35.1
>

-- 
Thanks,
Sasha


* Re: [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  2022-06-21 21:23 ` [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
@ 2022-08-03 16:02   ` Jann Horn
  2022-08-03 16:50   ` Paolo Bonzini
  1 sibling, 0 replies; 11+ messages in thread
From: Jann Horn @ 2022-08-03 16:02 UTC (permalink / raw)
  To: Sasha Levin, Paolo Bonzini
  Cc: linux-kernel, stable, tglx, mingo, bp, dave.hansen, x86, kvm

On Tue, Jun 21, 2022 at 11:23 PM Sasha Levin <sashal@kernel.org> wrote:
>
> Paolo, ping?

What happened here? From what I can tell, even the backports that
Sasha already wrote didn't get applied?


* Re: [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  2022-06-21 21:23 ` [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries Sasha Levin
  2022-08-03 16:02   ` Jann Horn
@ 2022-08-03 16:50   ` Paolo Bonzini
  2022-08-06 14:24     ` Sasha Levin
  1 sibling, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2022-08-03 16:50 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable
  Cc: Jann Horn, tglx, mingo, bp, dave.hansen, x86, kvm

On 6/21/22 23:23, Sasha Levin wrote:
> Paolo, ping?

Sorry I missed the whole bunch of MANUALSEL patches.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

for all six.

Paolo



* Re: [PATCH MANUALSEL 5.18 1/6] KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  2022-08-03 16:50   ` Paolo Bonzini
@ 2022-08-06 14:24     ` Sasha Levin
  0 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2022-08-06 14:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, stable, Jann Horn, tglx, mingo, bp, dave.hansen, x86, kvm

On Wed, Aug 03, 2022 at 06:50:25PM +0200, Paolo Bonzini wrote:
>On 6/21/22 23:23, Sasha Levin wrote:
>>Paolo, ping?
>
>Sorry I missed the whole bunch of MANUALSEL patches.
>
>Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>
>for all six.

Thanks Paolo! I'll queue these up along with the missing ones pointed out
by Jann.

-- 
Thanks,
Sasha


