linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug
@ 2021-08-18  0:12 Sean Christopherson
  2021-08-18  0:12 ` [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest Sean Christopherson
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME,
e.g. for task migration, clears the flag without informing rseq and leads
to stale data in userspace's rseq struct.

Patch 2 is a cleanup to try and make future bugs less likely.  It's also
a baby step towards moving and renaming tracehook_notify_resume() since
it has nothing to do with tracing.  It kills me to not do the move/rename
as part of this series, but having a dedicated series/discussion seems
more appropriate given the sheer number of architectures that call
tracehook_notify_resume() and the lack of an obvious home for the code.

Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when
the include path (intentionally) omits tools' uapi headers.  KVM's
selftests do exactly that so that they can pick up the uapi headers from
the installed kernel headers, and still use various tools/ headers that
mirror kernel code, e.g. linux/types.h.  This allows the new test in
patch 4 to reference __NR_rseq without having to manually define it.

Patch 4 is a regression test for the KVM+rseq bug.

Patch 5 is a cleanup made possible by patch 3.


Sean Christopherson (5):
  KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM
    guest
  entry: rseq: Call rseq_handle_notify_resume() in
    tracehook_notify_resume()
  tools: Move x86 syscall number fallbacks to .../uapi/
  KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration
    bugs
  KVM: selftests: Remove __NR_userfaultfd syscall fallback

 arch/arm/kernel/signal.c                      |   1 -
 arch/arm64/kernel/signal.c                    |   1 -
 arch/csky/kernel/signal.c                     |   4 +-
 arch/mips/kernel/signal.c                     |   4 +-
 arch/powerpc/kernel/signal.c                  |   4 +-
 arch/s390/kernel/signal.c                     |   1 -
 include/linux/tracehook.h                     |   2 +
 kernel/entry/common.c                         |   4 +-
 kernel/rseq.c                                 |   4 +-
 .../x86/include/{ => uapi}/asm/unistd_32.h    |   0
 .../x86/include/{ => uapi}/asm/unistd_64.h    |   3 -
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 tools/testing/selftests/kvm/rseq_test.c       | 131 ++++++++++++++++++
 14 files changed, 143 insertions(+), 20 deletions(-)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c

-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
@ 2021-08-18  0:12 ` Sean Christopherson
  2021-08-19 21:39   ` Mathieu Desnoyers
  2021-08-18  0:12 ` [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() Sean Christopherson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
transferring to a KVM guest, which is roughly equivalent to an exit to
userspace and processes many of the same pending actions.  While the task
cannot be in an rseq critical section as the KVM path is reachable only
via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
critical section still apply, e.g. the CPU ID needs to be updated if the
task is migrated.

Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
and other badness in userspace VMMs that use rseq in combination with KVM,
e.g. due to the CPU ID being stale after task migration.

Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Peter Foley <pefoley@google.com>
Bisected-by: Doug Evans <dje@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 kernel/entry/kvm.c | 4 +++-
 kernel/rseq.c      | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..049fd06b4c3d 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		if (ti_work & _TIF_NEED_RESCHED)
 			schedule();
 
-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
 			tracehook_notify_resume(NULL);
+			rseq_handle_notify_resume(NULL, NULL);
+		}
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..58c79a7918cd 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
 
 static int rseq_ip_fixup(struct pt_regs *regs)
 {
-	unsigned long ip = instruction_pointer(regs);
+	unsigned long ip = regs ? instruction_pointer(regs) : 0;
 	struct task_struct *t = current;
 	struct rseq_cs rseq_cs;
 	int ret;
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!regs || !in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t);
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
  2021-08-18  0:12 ` [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest Sean Christopherson
@ 2021-08-18  0:12 ` Sean Christopherson
  2021-08-19 21:41   ` Mathieu Desnoyers
  2021-08-18  0:12 ` [PATCH 3/5] tools: Move x86 syscall number fallbacks to .../uapi/ Sean Christopherson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
that the two function are always called back-to-back by architectures
that have rseq.  The rseq helper is stubbed out for architectures that
don't support rseq, i.e. this is a nop across the board.

Note, tracehook_notify_resume() is horribly named and arguably does not
belong in tracehook.h as literally every line of code in it has nothing
to do with tracing.  But, that's been true since commit a42c6ded827d
("move key_repace_session_keyring() into tracehook_notify_resume()")
first usurped tracehook_notify_resume() back in 2012.  Punt cleaning that
mess up to future patches.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm/kernel/signal.c     | 1 -
 arch/arm64/kernel/signal.c   | 1 -
 arch/csky/kernel/signal.c    | 4 +---
 arch/mips/kernel/signal.c    | 4 +---
 arch/powerpc/kernel/signal.c | 4 +---
 arch/s390/kernel/signal.c    | 1 -
 include/linux/tracehook.h    | 2 ++
 kernel/entry/common.c        | 4 +---
 kernel/entry/kvm.c           | 4 +---
 9 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index a3a38d0a4c85..9df68d139965 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 				uprobe_notify_resume(regs);
 			} else {
 				tracehook_notify_resume(regs);
-				rseq_handle_notify_resume(NULL, regs);
 			}
 		}
 		local_irq_disable();
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 23036334f4dc..22b55db13da6 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 			if (thread_flags & _TIF_NOTIFY_RESUME) {
 				tracehook_notify_resume(regs);
-				rseq_handle_notify_resume(NULL, regs);
 
 				/*
 				 * If we reschedule after checking the affinity
diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
index 312f046d452d..bc4238b9f709 100644
--- a/arch/csky/kernel/signal.c
+++ b/arch/csky/kernel/signal.c
@@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 		do_signal(regs);
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 }
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index f1e985109da0..c9b2a75563e1 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, void *unused,
 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 		do_signal(regs);
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 
 	user_enter();
 }
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index e600764a926c..b93b87df499d 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -293,10 +293,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 		do_signal(current);
 	}
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 }
 
 static unsigned long get_tm_stackpointer(struct task_struct *tsk)
diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
index 78ef53b29958..b307db26bf2d 100644
--- a/arch/s390/kernel/signal.c
+++ b/arch/s390/kernel/signal.c
@@ -537,5 +537,4 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
 void do_notify_resume(struct pt_regs *regs)
 {
 	tracehook_notify_resume(regs);
-	rseq_handle_notify_resume(NULL, regs);
 }
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 3e80c4bc66f7..2564b7434b4d 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -197,6 +197,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 
 	mem_cgroup_handle_over_high();
 	blkcg_maybe_throttle_current();
+
+	rseq_handle_notify_resume(NULL, regs);
 }
 
 /*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..d5a61d565ad5 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,10 +171,8 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
-		if (ti_work & _TIF_NOTIFY_RESUME) {
+		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(regs);
-			rseq_handle_notify_resume(NULL, regs);
-		}
 
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 049fd06b4c3d..49972ee99aff 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -19,10 +19,8 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		if (ti_work & _TIF_NEED_RESCHED)
 			schedule();
 
-		if (ti_work & _TIF_NOTIFY_RESUME) {
+		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(NULL);
-			rseq_handle_notify_resume(NULL, NULL);
-		}
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5] tools: Move x86 syscall number fallbacks to .../uapi/
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
  2021-08-18  0:12 ` [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest Sean Christopherson
  2021-08-18  0:12 ` [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() Sean Christopherson
@ 2021-08-18  0:12 ` Sean Christopherson
  2021-08-18  0:12 ` [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs Sean Christopherson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Move unistd_{32,64}.h from x86/include/asm to x86/include/uapi/asm so
that tools/selftests that install kernel headers, e.g. KVM selftests, can
include non-uapi tools headers, e.g. to get 'struct list_head', without
effectively overriding the installed non-tool uapi headers.

Swapping KVM's search order, e.g. to search the kernel headers before
tool headers, is not a viable option as doing results in linux/type.h and
other core headers getting pulled from the kernel headers, which do not
have the kernel-internal typedefs that are used through tools, including
many files outside of selftests/kvm's control.

Prior to commit cec07f53c398 ("perf tools: Move syscall number fallbacks
from perf-sys.h to tools/arch/x86/include/asm/"), the handcoded numbers
were actual fallbacks, i.e. overriding unistd_{32,64}.h from the kernel
headers was unintentional.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/arch/x86/include/{ => uapi}/asm/unistd_32.h | 0
 tools/arch/x86/include/{ => uapi}/asm/unistd_64.h | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (100%)

diff --git a/tools/arch/x86/include/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h
similarity index 100%
rename from tools/arch/x86/include/asm/unistd_32.h
rename to tools/arch/x86/include/uapi/asm/unistd_32.h
diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
similarity index 100%
rename from tools/arch/x86/include/asm/unistd_64.h
rename to tools/arch/x86/include/uapi/asm/unistd_64.h
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
                   ` (2 preceding siblings ...)
  2021-08-18  0:12 ` [PATCH 3/5] tools: Move x86 syscall number fallbacks to .../uapi/ Sean Christopherson
@ 2021-08-18  0:12 ` Sean Christopherson
  2021-08-19 21:52   ` Mathieu Desnoyers
  2021-08-18  0:12 ` [PATCH 5/5] KVM: selftests: Remove __NR_userfaultfd syscall fallback Sean Christopherson
  2021-09-22 14:12 ` [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Paolo Bonzini
  5 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Add a test to verify an rseq's CPU ID is updated correctly if the task is
migrated while the kernel is handling KVM_RUN.  This is a regression test
for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
without updating rseq, leading to a stale CPU ID and other badness.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/.gitignore  |   1 +
 tools/testing/selftests/kvm/Makefile    |   3 +
 tools/testing/selftests/kvm/rseq_test.c | 131 ++++++++++++++++++++++++
 3 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 0709af0144c8..6d031ff6b68e 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -47,6 +47,7 @@
 /kvm_page_table_test
 /memslot_modification_stress_test
 /memslot_perf_test
+/rseq_test
 /set_memory_region_test
 /steal_time
 /kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 5832f510a16c..0756e79cb513 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_perf_test
+TEST_GEN_PROGS_x86_64 += rseq_test
 TEST_GEN_PROGS_x86_64 += set_memory_region_test
 TEST_GEN_PROGS_x86_64 += steal_time
 TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
@@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += rseq_test
 TEST_GEN_PROGS_aarch64 += set_memory_region_test
 TEST_GEN_PROGS_aarch64 += steal_time
 TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
@@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
 TEST_GEN_PROGS_s390x += dirty_log_test
 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
 TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += rseq_test
 TEST_GEN_PROGS_s390x += set_memory_region_test
 TEST_GEN_PROGS_s390x += kvm_binary_stats_test
 
diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c
new file mode 100644
index 000000000000..90ed535eded7
--- /dev/null
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <linux/rseq.h>
+#include <linux/unistd.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+#define VCPU_ID 0
+
+static __thread volatile struct rseq __rseq = {
+	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+#define RSEQ_SIG 0xdeadbeef
+
+static pthread_t migration_thread;
+static cpu_set_t possible_mask;
+static bool done;
+
+static void guest_code(void)
+{
+	for (;;)
+		GUEST_SYNC(0);
+}
+
+static void sys_rseq(int flags)
+{
+	int r;
+
+	r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG);
+	TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno));
+}
+
+static void *migration_worker(void *ign)
+{
+	cpu_set_t allowed_mask;
+	int r, i, nr_cpus, cpu;
+
+	CPU_ZERO(&allowed_mask);
+
+	nr_cpus = CPU_COUNT(&possible_mask);
+
+	for (i = 0; i < 20000; i++) {
+		cpu = i % nr_cpus;
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		CPU_SET(cpu, &allowed_mask);
+
+		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno,
+			    strerror(errno));
+
+		CPU_CLR(cpu, &allowed_mask);
+
+		usleep(10);
+	}
+	done = true;
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	u32 cpu, rseq_cpu;
+	int r;
+
+	/* Tell stdout not to buffer its content */
+	setbuf(stdout, NULL);
+
+	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
+	if (CPU_COUNT(&possible_mask) < 2) {
+		print_skip("Only one CPU, task migration not possible\n");
+		exit(KSFT_SKIP);
+	}
+
+	sys_rseq(0);
+
+	/*
+	 * Create and run a dummy VM that immediately exits to userspace via
+	 * GUEST_SYNC, while concurrently migrating the process by setting its
+	 * CPU affinity.
+	 */
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+	pthread_create(&migration_thread, NULL, migration_worker, 0);
+
+	while (!done) {
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			    "Guest failed?");
+
+		cpu = sched_getcpu();
+		rseq_cpu = READ_ONCE(__rseq.cpu_id);
+
+		/*
+		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
+		 * is stable.  This doesn't handle the case where the task is
+		 * migrated between sched_getcpu() and reading rseq, and again
+		 * between reading rseq and sched_getcpu(), but in practice no
+		 * false positives have been observed, while on the other hand
+		 * blocking migration while this thread reads CPUs messes with
+		 * the timing and prevents hitting failures on a buggy kernel.
+		 */
+		TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(),
+			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
+	}
+
+	pthread_join(migration_thread, NULL);
+
+	kvm_vm_free(vm);
+
+	sys_rseq(RSEQ_FLAG_UNREGISTER);
+
+	return 0;
+}
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/5] KVM: selftests: Remove __NR_userfaultfd syscall fallback
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
                   ` (3 preceding siblings ...)
  2021-08-18  0:12 ` [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs Sean Christopherson
@ 2021-08-18  0:12 ` Sean Christopherson
  2021-09-22 14:12 ` [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Paolo Bonzini
  5 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-08-18  0:12 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt,
	Sean Christopherson, Ben Gardon

Revert the __NR_userfaultfd syscall fallback added for KVM selftests now
that x86's unistd_{32,63}.h overrides are under uapi/ and thus not in
KVM sefltests' search path, i.e. now that KVM gets x86 syscall numbers
from the installed kernel headers.

No functional change intended.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/arch/x86/include/uapi/asm/unistd_64.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
index 4205ed4158bf..cb52a3a8b8fc 100644
--- a/tools/arch/x86/include/uapi/asm/unistd_64.h
+++ b/tools/arch/x86/include/uapi/asm/unistd_64.h
@@ -1,7 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_userfaultfd
-#define __NR_userfaultfd 282
-#endif
 #ifndef __NR_perf_event_open
 # define __NR_perf_event_open 298
 #endif
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-18  0:12 ` [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest Sean Christopherson
@ 2021-08-19 21:39   ` Mathieu Desnoyers
  2021-08-19 23:48     ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2021-08-19 21:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
> transferring to a KVM guest, which is roughly equivalent to an exit to
> userspace and processes many of the same pending actions.  While the task
> cannot be in an rseq critical section as the KVM path is reachable only
> via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
> critical section still apply, e.g. the CPU ID needs to be updated if the
> task is migrated.
> 
> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
> and other badness in userspace VMMs that use rseq in combination with KVM,
> e.g. due to the CPU ID being stale after task migration.

I agree with the problem assessment, but I would recommend a small change
to this fix.

> 
> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
> Reported-by: Peter Foley <pefoley@google.com>
> Bisected-by: Doug Evans <dje@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> kernel/entry/kvm.c | 4 +++-
> kernel/rseq.c      | 4 ++--
> 2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 49972ee99aff..049fd06b4c3d 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
> 		if (ti_work & _TIF_NEED_RESCHED)
> 			schedule();
> 
> -		if (ti_work & _TIF_NOTIFY_RESUME)
> +		if (ti_work & _TIF_NOTIFY_RESUME) {
> 			tracehook_notify_resume(NULL);
> +			rseq_handle_notify_resume(NULL, NULL);
> +		}
> 
> 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
> 		if (ret)
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..58c79a7918cd 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs
> *rseq_cs)
> 
> static int rseq_ip_fixup(struct pt_regs *regs)
> {
> -	unsigned long ip = instruction_pointer(regs);
> +	unsigned long ip = regs ? instruction_pointer(regs) : 0;
> 	struct task_struct *t = current;
> 	struct rseq_cs rseq_cs;
> 	int ret;
> @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> 	 * If not nested over a rseq critical section, restart is useless.
> 	 * Clear the rseq_cs pointer and return.
> 	 */
> -	if (!in_rseq_cs(ip, &rseq_cs))
> +	if (!regs || !in_rseq_cs(ip, &rseq_cs))

I think clearing the thread's rseq_cs unconditionally here when regs is NULL
is not the behavior we want when this is called from xfer_to_guest_mode_work.

If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
kill this application in the rseq_syscall handler when exiting back to usermode
when the ioctl eventually returns.

However, clearing the thread's rseq_cs will prevent this from happening.

So I would favor an approach where we simply do:

if (!regs)
     return 0;

Immediately at the beginning of rseq_ip_fixup, before getting the instruction
pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
is not relevant to do any fixup here, because it is nested in a ioctl system
call.

Effectively, this would preserve the SIGSEGV behavior when this ioctl is
erroneously called by user-space from a rseq critical section.

Thanks for looking into this !

Mathieu

> 		return clear_rseq_cs(t);
> 	ret = rseq_need_restart(t, rseq_cs.flags);
> 	if (ret <= 0)
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()
  2021-08-18  0:12 ` [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() Sean Christopherson
@ 2021-08-19 21:41   ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2021-08-19 21:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

> Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
> that the two function are always called back-to-back by architectures
> that have rseq.  The rseq helper is stubbed out for architectures that
> don't support rseq, i.e. this is a nop across the board.
> 
> Note, tracehook_notify_resume() is horribly named and arguably does not
> belong in tracehook.h as literally every line of code in it has nothing
> to do with tracing.  But, that's been true since commit a42c6ded827d
> ("move key_repace_session_keyring() into tracehook_notify_resume()")
> first usurped tracehook_notify_resume() back in 2012.  Punt cleaning that
> mess up to future patches.
> 
> No functional change intended.

This will make it harder to introduce new code paths which consume the
NOTIFY_RESUME without calling the rseq callback, which introduces issues.
Agreed.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/arm/kernel/signal.c     | 1 -
> arch/arm64/kernel/signal.c   | 1 -
> arch/csky/kernel/signal.c    | 4 +---
> arch/mips/kernel/signal.c    | 4 +---
> arch/powerpc/kernel/signal.c | 4 +---
> arch/s390/kernel/signal.c    | 1 -
> include/linux/tracehook.h    | 2 ++
> kernel/entry/common.c        | 4 +---
> kernel/entry/kvm.c           | 4 +---
> 9 files changed, 7 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
> index a3a38d0a4c85..9df68d139965 100644
> --- a/arch/arm/kernel/signal.c
> +++ b/arch/arm/kernel/signal.c
> @@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int
> thread_flags, int syscall)
> 				uprobe_notify_resume(regs);
> 			} else {
> 				tracehook_notify_resume(regs);
> -				rseq_handle_notify_resume(NULL, regs);
> 			}
> 		}
> 		local_irq_disable();
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 23036334f4dc..22b55db13da6 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
> 
> 			if (thread_flags & _TIF_NOTIFY_RESUME) {
> 				tracehook_notify_resume(regs);
> -				rseq_handle_notify_resume(NULL, regs);
> 
> 				/*
> 				 * If we reschedule after checking the affinity
> diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
> index 312f046d452d..bc4238b9f709 100644
> --- a/arch/csky/kernel/signal.c
> +++ b/arch/csky/kernel/signal.c
> @@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
> 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> 		do_signal(regs);
> 
> -	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> +	if (thread_info_flags & _TIF_NOTIFY_RESUME)
> 		tracehook_notify_resume(regs);
> -		rseq_handle_notify_resume(NULL, regs);
> -	}
> }
> diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
> index f1e985109da0..c9b2a75563e1 100644
> --- a/arch/mips/kernel/signal.c
> +++ b/arch/mips/kernel/signal.c
> @@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, void
> *unused,
> 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> 		do_signal(regs);
> 
> -	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> +	if (thread_info_flags & _TIF_NOTIFY_RESUME)
> 		tracehook_notify_resume(regs);
> -		rseq_handle_notify_resume(NULL, regs);
> -	}
> 
> 	user_enter();
> }
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index e600764a926c..b93b87df499d 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -293,10 +293,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long
> thread_info_flags)
> 		do_signal(current);
> 	}
> 
> -	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> +	if (thread_info_flags & _TIF_NOTIFY_RESUME)
> 		tracehook_notify_resume(regs);
> -		rseq_handle_notify_resume(NULL, regs);
> -	}
> }
> 
> static unsigned long get_tm_stackpointer(struct task_struct *tsk)
> diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
> index 78ef53b29958..b307db26bf2d 100644
> --- a/arch/s390/kernel/signal.c
> +++ b/arch/s390/kernel/signal.c
> @@ -537,5 +537,4 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool
> has_signal)
> void do_notify_resume(struct pt_regs *regs)
> {
> 	tracehook_notify_resume(regs);
> -	rseq_handle_notify_resume(NULL, regs);
> }
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index 3e80c4bc66f7..2564b7434b4d 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -197,6 +197,8 @@ static inline void tracehook_notify_resume(struct pt_regs
> *regs)
> 
> 	mem_cgroup_handle_over_high();
> 	blkcg_maybe_throttle_current();
> +
> +	rseq_handle_notify_resume(NULL, regs);
> }
> 
> /*
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..d5a61d565ad5 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -171,10 +171,8 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs
> *regs,
> 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> 			handle_signal_work(regs, ti_work);
> 
> -		if (ti_work & _TIF_NOTIFY_RESUME) {
> +		if (ti_work & _TIF_NOTIFY_RESUME)
> 			tracehook_notify_resume(regs);
> -			rseq_handle_notify_resume(NULL, regs);
> -		}
> 
> 		/* Architecture specific TIF work */
> 		arch_exit_to_user_mode_work(regs, ti_work);
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 049fd06b4c3d..49972ee99aff 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,10 +19,8 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
> 		if (ti_work & _TIF_NEED_RESCHED)
> 			schedule();
> 
> -		if (ti_work & _TIF_NOTIFY_RESUME) {
> +		if (ti_work & _TIF_NOTIFY_RESUME)
> 			tracehook_notify_resume(NULL);
> -			rseq_handle_notify_resume(NULL, NULL);
> -		}
> 
> 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
> 		if (ret)
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  2021-08-18  0:12 ` [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs Sean Christopherson
@ 2021-08-19 21:52   ` Mathieu Desnoyers
  2021-08-19 23:33     ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2021-08-19 21:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> migrated while the kernel is handling KVM_RUN.  This is a regression test
> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> without updating rseq, leading to a stale CPU ID and other badness.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

[...]

> +
> +static void *migration_worker(void *ign)
> +{
> +	cpu_set_t allowed_mask;
> +	int r, i, nr_cpus, cpu;
> +
> +	CPU_ZERO(&allowed_mask);
> +
> +	nr_cpus = CPU_COUNT(&possible_mask);
> +
> +	for (i = 0; i < 20000; i++) {
> +		cpu = i % nr_cpus;
> +		if (!CPU_ISSET(cpu, &possible_mask))
> +			continue;
> +
> +		CPU_SET(cpu, &allowed_mask);
> +
> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno,
> +			    strerror(errno));
> +
> +		CPU_CLR(cpu, &allowed_mask);
> +
> +		usleep(10);
> +	}
> +	done = true;
> +	return NULL;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_vm *vm;
> +	u32 cpu, rseq_cpu;
> +	int r;
> +
> +	/* Tell stdout not to buffer its content */
> +	setbuf(stdout, NULL);
> +
> +	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
> +	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
> +		    strerror(errno));
> +
> +	if (CPU_COUNT(&possible_mask) < 2) {
> +		print_skip("Only one CPU, task migration not possible\n");
> +		exit(KSFT_SKIP);
> +	}
> +
> +	sys_rseq(0);
> +
> +	/*
> +	 * Create and run a dummy VM that immediately exits to userspace via
> +	 * GUEST_SYNC, while concurrently migrating the process by setting its
> +	 * CPU affinity.
> +	 */
> +	vm = vm_create_default(VCPU_ID, 0, guest_code);
> +
> +	pthread_create(&migration_thread, NULL, migration_worker, 0);
> +
> +	while (!done) {
> +		vcpu_run(vm, VCPU_ID);
> +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> +			    "Guest failed?");
> +
> +		cpu = sched_getcpu();
> +		rseq_cpu = READ_ONCE(__rseq.cpu_id);
> +
> +		/*
> +		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
> +		 * is stable.  This doesn't handle the case where the task is
> +		 * migrated between sched_getcpu() and reading rseq, and again
> +		 * between reading rseq and sched_getcpu(), but in practice no
> +		 * false positives have been observed, while on the other hand
> +		 * blocking migration while this thread reads CPUs messes with
> +		 * the timing and prevents hitting failures on a buggy kernel.
> +		 */

I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id
if you add a pthread mutex to protect:

sched_getcpu and __rseq_abi.cpu_id  reads

vs

sched_setaffinity calls within the migration thread.

Thoughts ?

Thanks,

Mathieu

> +		TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(),
> +			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
> +	}
> +
> +	pthread_join(migration_thread, NULL);
> +
> +	kvm_vm_free(vm);
> +
> +	sys_rseq(RSEQ_FLAG_UNREGISTER);
> +
> +	return 0;
> +}
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  2021-08-19 21:52   ` Mathieu Desnoyers
@ 2021-08-19 23:33     ` Sean Christopherson
  2021-08-20 18:31       ` Mathieu Desnoyers
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-08-19 23:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
> 
> > Add a test to verify an rseq's CPU ID is updated correctly if the task is
> > migrated while the kernel is handling KVM_RUN.  This is a regression test
> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> > without updating rseq, leading to a stale CPU ID and other badness.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> [...]
> 
> > +	while (!done) {
> > +		vcpu_run(vm, VCPU_ID);
> > +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> > +			    "Guest failed?");
> > +
> > +		cpu = sched_getcpu();
> > +		rseq_cpu = READ_ONCE(__rseq.cpu_id);
> > +
> > +		/*
> > +		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
> > +		 * is stable.  This doesn't handle the case where the task is
> > +		 * migrated between sched_getcpu() and reading rseq, and again
> > +		 * between reading rseq and sched_getcpu(), but in practice no
> > +		 * false positives have been observed, while on the other hand
> > +		 * blocking migration while this thread reads CPUs messes with
> > +		 * the timing and prevents hitting failures on a buggy kernel.
> > +		 */
> 
> I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id
> if you add a pthread mutex to protect:
> 
> sched_getcpu and __rseq_abi.cpu_id  reads
> 
> vs
> 
> sched_setaffinity calls within the migration thread.
> 
> Thoughts ?

I tried that and couldn't reproduce the bug.  That's what I attempted to call out
in the blurb "blocking migration while this thread reads CPUs ... prevents hitting
failures on a buggy kernel".

I considered adding arbitrary delays around the mutex to try and hit the bug, but
I was worried that even if I got it "working" for this bug, the test would be too
tailored to this bug and potentially miss future regression.  Letting the two
threads run wild seemed like it would provide the best coverage, at the cost of
potentially causing to false failures.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-19 21:39   ` Mathieu Desnoyers
@ 2021-08-19 23:48     ` Sean Christopherson
  2021-08-20 18:51       ` Mathieu Desnoyers
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-08-19 23:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> > 	 * If not nested over a rseq critical section, restart is useless.
> > 	 * Clear the rseq_cs pointer and return.
> > 	 */
> > -	if (!in_rseq_cs(ip, &rseq_cs))
> > +	if (!regs || !in_rseq_cs(ip, &rseq_cs))
> 
> I think clearing the thread's rseq_cs unconditionally here when regs is NULL
> is not the behavior we want when this is called from xfer_to_guest_mode_work.
> 
> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
> kill this application in the rseq_syscall handler when exiting back to usermode
> when the ioctl eventually returns.
> 
> However, clearing the thread's rseq_cs will prevent this from happening.
> 
> So I would favor an approach where we simply do:
> 
> if (!regs)
>      return 0;
> 
> Immediately at the beginning of rseq_ip_fixup, before getting the instruction
> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
> is not relevant to do any fixup here, because it is nested in a ioctl system
> call.
> 
> Effectively, this would preserve the SIGSEGV behavior when this ioctl is
> erroneously called by user-space from a rseq critical section.

Ha, that's effectively what I implemented first, but I changed it because of the
comment in clear_rseq_cs() that says:

  The rseq_cs field is set to NULL on preemption or signal delivery ... as well
  as well as on top of code outside of the rseq assembly block.

Which makes it sound like something might rely on clearing rseq_cs?

Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
rseq critical section, and because syscalls in critical sections are illegal, by
definition clearing rseq_cs is a nop unless userspace is misbehaving.

If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
not worth the extra code to detect an error that will likely be caught anyways?

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..28b8342290b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)

        if (unlikely(t->flags & PF_EXITING))
                return;
+       if (!regs) {
+#ifdef CONFIG_DEBUG_RSEQ
+               if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
+                       goto error;
+#endif
+               return;
+       }
        ret = rseq_ip_fixup(regs);
        if (unlikely(ret < 0))
                goto error;

> Thanks for looking into this !
> 
> Mathieu
> 
> > 		return clear_rseq_cs(t);
> > 	ret = rseq_need_restart(t, rseq_cs.flags);
> > 	if (ret <= 0)
> > --
> > 2.33.0.rc1.237.g0d66db33f3-goog
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  2021-08-19 23:33     ` Sean Christopherson
@ 2021-08-20 18:31       ` Mathieu Desnoyers
  2021-08-20 22:25         ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2021-08-20 18:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

----- On Aug 19, 2021, at 7:33 PM, Sean Christopherson seanjc@google.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
>> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
>> 
>> > Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> > migrated while the kernel is handling KVM_RUN.  This is a regression test
>> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> > without updating rseq, leading to a stale CPU ID and other badness.
>> > 
>> > Signed-off-by: Sean Christopherson <seanjc@google.com>
>> > ---
>> 
>> [...]
>> 
>> > +	while (!done) {
>> > +		vcpu_run(vm, VCPU_ID);
>> > +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> > +			    "Guest failed?");
>> > +
>> > +		cpu = sched_getcpu();
>> > +		rseq_cpu = READ_ONCE(__rseq.cpu_id);
>> > +
>> > +		/*
>> > +		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
>> > +		 * is stable.  This doesn't handle the case where the task is
>> > +		 * migrated between sched_getcpu() and reading rseq, and again
>> > +		 * between reading rseq and sched_getcpu(), but in practice no
>> > +		 * false positives have been observed, while on the other hand
>> > +		 * blocking migration while this thread reads CPUs messes with
>> > +		 * the timing and prevents hitting failures on a buggy kernel.
>> > +		 */
>> 
>> I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id
>> if you add a pthread mutex to protect:
>> 
>> sched_getcpu and __rseq_abi.cpu_id  reads
>> 
>> vs
>> 
>> sched_setaffinity calls within the migration thread.
>> 
>> Thoughts ?
> 
> I tried that and couldn't reproduce the bug.  That's what I attempted to call
> out
> in the blurb "blocking migration while this thread reads CPUs ... prevents
> hitting
> failures on a buggy kernel".
> 
> I considered adding arbitrary delays around the mutex to try and hit the bug,
> but
> I was worried that even if I got it "working" for this bug, the test would be
> too
> tailored to this bug and potentially miss future regression.  Letting the two
> threads run wild seemed like it would provide the best coverage, at the cost of
> potentially causing to false failures.

OK, so your point is that using mutual exclusion to ensure stability of the cpu id
changes the timings too much, to a point where the issues don't reproduce. I understand
that this mutex ties the migration thread timings to the vcpu thread's use of the mutex,
which will reduce timings randomness, which is unwanted here.

I still really hate flakiness in tests, because then people stop caring when they
fail once in a while. And with the nature of rseq, a once-in-a-while failure is a
big deal. Let's see if we can use other tricks to ensure stability of the cpu id
without changing timings too much.

One idea would be to use a seqcount lock. But even if we use that, I'm concerned that
the very long writer critical section calling sched_setaffinity would need to be
alternated with a sleep to ensure the read-side progresses. The sleep delay could be
relatively small compared to the duration of the sched_setaffinity call, e.g. ratio
1:10.

static volatile uint64_t seqcnt;

The thread responsible for setting the affinity would do something like:

for (;;) {
  atomic_inc_seq_cst(&seqcnt);
  sched_setaffinity(..., n++ % nr_cpus);
  atomic_inc_seq_cst(&seqcnt);
  usleep(1);  /* this is where read-side is allowed to progress. */
}

And the thread reading the rseq cpu id and calling sched_getcpu():

uint64_t snapshot;

do {
  snapshot = atomic_load(&seqcnt) & ~1; /* force retry if odd */
  smp_rmb();
  cpu = sched_getcpu();
  rseq_cpu = READ_ONCE(__rseq.cpu_id);
  smp_rmb();
} while (snapshot != atomic_load(&seqcnt));

So the reader retry the cpu id reads whenever sched_setaffinity is being
called by the migration thread, and whenever it is preempted for more
than one migration thread loop.

That should achieve our goal of providing cpu id stability without significantly
changing the timings of the migration thread, given that it never blocks waiting
for the reader.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-19 23:48     ` Sean Christopherson
@ 2021-08-20 18:51       ` Mathieu Desnoyers
  2021-08-20 22:26         ` Sean Christopherson
  2021-09-06 10:28         ` Paolo Bonzini
  0 siblings, 2 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2021-08-20 18:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

----- On Aug 19, 2021, at 7:48 PM, Sean Christopherson seanjc@google.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
>> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
>> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
>> > 	 * If not nested over a rseq critical section, restart is useless.
>> > 	 * Clear the rseq_cs pointer and return.
>> > 	 */
>> > -	if (!in_rseq_cs(ip, &rseq_cs))
>> > +	if (!regs || !in_rseq_cs(ip, &rseq_cs))
>> 
>> I think clearing the thread's rseq_cs unconditionally here when regs is NULL
>> is not the behavior we want when this is called from xfer_to_guest_mode_work.
>> 
>> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
>> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
>> kill this application in the rseq_syscall handler when exiting back to usermode
>> when the ioctl eventually returns.
>> 
>> However, clearing the thread's rseq_cs will prevent this from happening.
>> 
>> So I would favor an approach where we simply do:
>> 
>> if (!regs)
>>      return 0;
>> 
>> Immediately at the beginning of rseq_ip_fixup, before getting the instruction
>> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
>> is not relevant to do any fixup here, because it is nested in a ioctl system
>> call.
>> 
>> Effectively, this would preserve the SIGSEGV behavior when this ioctl is
>> erroneously called by user-space from a rseq critical section.
> 
> Ha, that's effectively what I implemented first, but I changed it because of the
> comment in clear_rseq_cs() that says:
> 
>  The rseq_cs field is set to NULL on preemption or signal delivery ... as well
>  as well as on top of code outside of the rseq assembly block.
> 
> Which makes it sound like something might rely on clearing rseq_cs?

This comment is describing succinctly the lazy clear scheme for rseq_cs.

Without the lazy clear scheme, a rseq c.s. would look like:

 *                     init(rseq_cs)
 *                     cpu = TLS->rseq::cpu_id_start
 *   [1]               TLS->rseq::rseq_cs = rseq_cs
 *   [start_ip]        ----------------------------
 *   [2]               if (cpu != TLS->rseq::cpu_id)
 *                             goto abort_ip;
 *   [3]               <last_instruction_in_cs>
 *   [post_commit_ip]  ----------------------------
 *   [4]               TLS->rseq::rseq_cs = NULL

But as a fast-path optimization, [4] is not entirely needed because the rseq_cs
descriptor contains information about the instruction pointer range of the critical
section. Therefore, userspace can omit [4], but if the kernel never clears it, it
means that it will have to re-read the rseq_cs descriptor's content each time it
needs to check it to confirm that it is not nested over a rseq c.s..

So making the kernel lazily clear the rseq_cs pointer is just an optimization which
ensures that the kernel won't do useless work the next time it needs to check
rseq_cs, given that it has already validated that the userspace code is currently
not within the rseq c.s. currently advertised by the rseq_cs field.

> 
> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
> rseq critical section, and because syscalls in critical sections are illegal, by
> definition clearing rseq_cs is a nop unless userspace is misbehaving.

Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ
code executed when returning from ioctl to userspace will be able to validate that
it is not nested within a rseq critical section.

> 
> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
> not worth the extra code to detect an error that will likely be caught anyways?

The error will indeed already be caught on return from ioctl to userspace, so I
don't see any added value in duplicating this check.

Thanks,

Mathieu

> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..28b8342290b0 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> struct pt_regs *regs)
> 
>        if (unlikely(t->flags & PF_EXITING))
>                return;
> +       if (!regs) {
> +#ifdef CONFIG_DEBUG_RSEQ
> +               if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
> +                       goto error;
> +#endif
> +               return;
> +       }
>        ret = rseq_ip_fixup(regs);
>        if (unlikely(ret < 0))
>                goto error;
> 
>> Thanks for looking into this !
>> 
>> Mathieu
>> 
>> > 		return clear_rseq_cs(t);
>> > 	ret = rseq_need_restart(t, rseq_cs.flags);
>> > 	if (ret <= 0)
>> > --
>> > 2.33.0.rc1.237.g0d66db33f3-goog
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  2021-08-20 18:31       ` Mathieu Desnoyers
@ 2021-08-20 22:25         ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-08-20 22:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

On Fri, Aug 20, 2021, Mathieu Desnoyers wrote:
> I still really hate flakiness in tests, because then people stop caring when they
> fail once in a while. And with the nature of rseq, a once-in-a-while failure is a
> big deal. Let's see if we can use other tricks to ensure stability of the cpu id
> without changing timings too much.

Yeah, zero agrument regarding flaky tests.

> One idea would be to use a seqcount lock.

A sequence counter did the trick!  Thanks much!

> But even if we use that, I'm concerned that the very long writer critical
> section calling sched_setaffinity would need to be alternated with a sleep to
> ensure the read-side progresses. The sleep delay could be relatively small
> compared to the duration of the sched_setaffinity call, e.g. ratio 1:10.

I already had an arbitrary usleep(10) to let the reader make progress between
sched_setaffinity() calls.  Dropping it down to 1us didn't affect reproducibility,
so I went with that to shave those precious cycles :-)  Eliminating the delay
entirely did result in no repro, which was a nice confirmation that it's needed
to let the reader get back into KVM_RUN.

Thanks again!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-20 18:51       ` Mathieu Desnoyers
@ 2021-08-20 22:26         ` Sean Christopherson
  2021-09-06 10:28         ` Paolo Bonzini
  1 sibling, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-08-20 22:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, Paolo Bonzini, shuah, Benjamin Herrenschmidt,
	Paul Mackerras, linux-arm-kernel, linux-kernel, linux-csky,
	linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
	Peter Foley, Shakeel Butt, Ben Gardon

On Fri, Aug 20, 2021, Mathieu Desnoyers wrote:
> Without the lazy clear scheme, a rseq c.s. would look like:
> 
>  *                     init(rseq_cs)
>  *                     cpu = TLS->rseq::cpu_id_start
>  *   [1]               TLS->rseq::rseq_cs = rseq_cs
>  *   [start_ip]        ----------------------------
>  *   [2]               if (cpu != TLS->rseq::cpu_id)
>  *                             goto abort_ip;
>  *   [3]               <last_instruction_in_cs>
>  *   [post_commit_ip]  ----------------------------
>  *   [4]               TLS->rseq::rseq_cs = NULL
> 
> But as a fast-path optimization, [4] is not entirely needed because the rseq_cs
> descriptor contains information about the instruction pointer range of the critical
> section. Therefore, userspace can omit [4], but if the kernel never clears it, it
> means that it will have to re-read the rseq_cs descriptor's content each time it
> needs to check it to confirm that it is not nested over a rseq c.s..
> 
> So making the kernel lazily clear the rseq_cs pointer is just an optimization which
> ensures that the kernel won't do useless work the next time it needs to check
> rseq_cs, given that it has already validated that the userspace code is currently
> not within the rseq c.s. currently advertised by the rseq_cs field.

Thanks for the explanation, much appreciated!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-08-20 18:51       ` Mathieu Desnoyers
  2021-08-20 22:26         ` Sean Christopherson
@ 2021-09-06 10:28         ` Paolo Bonzini
  2021-09-07 14:38           ` Sean Christopherson
  1 sibling, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2021-09-06 10:28 UTC (permalink / raw)
  To: Mathieu Desnoyers, Sean Christopherson
  Cc: Russell King, ARM Linux, Catalin Marinas, Will Deacon, Guo Ren,
	Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
	Christian Borntraeger, Oleg Nesterov, rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, paulmck,
	Boqun Feng, shuah, Benjamin Herrenschmidt, Paul Mackerras,
	linux-arm-kernel, linux-kernel, linux-csky, linux-mips,
	linuxppc-dev, linux-s390, KVM list, linux-kselftest, Peter Foley,
	Shakeel Butt, Ben Gardon

On 20/08/21 20:51, Mathieu Desnoyers wrote:
>> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
>> rseq critical section, and because syscalls in critical sections are illegal, by
>> definition clearing rseq_cs is a nop unless userspace is misbehaving.
> Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ
> code executed when returning from ioctl to userspace will be able to validate that
> it is not nested within a rseq critical section.
> 
>> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
>> not worth the extra code to detect an error that will likely be caught anyways?
> The error will indeed already be caught on return from ioctl to userspace, so I
> don't see any added value in duplicating this check.

Sean, can you send a v2 (even for this patch only would be okay)?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  2021-09-06 10:28         ` Paolo Bonzini
@ 2021-09-07 14:38           ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-09-07 14:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mathieu Desnoyers, Russell King, ARM Linux, Catalin Marinas,
	Will Deacon, Guo Ren, Thomas Bogendoerfer, Michael Ellerman,
	Heiko Carstens, gor, Christian Borntraeger, Oleg Nesterov,
	rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, paulmck, Boqun Feng, shuah,
	Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	KVM list, linux-kselftest, Peter Foley, Shakeel Butt, Ben Gardon

On Mon, Sep 06, 2021, Paolo Bonzini wrote:
> On 20/08/21 20:51, Mathieu Desnoyers wrote:
> > > Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
> > > rseq critical section, and because syscalls in critical sections are illegal, by
> > > definition clearing rseq_cs is a nop unless userspace is misbehaving.
> > Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ
> > code executed when returning from ioctl to userspace will be able to validate that
> > it is not nested within a rseq critical section.
> > 
> > > If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
> > > not worth the extra code to detect an error that will likely be caught anyways?
> > The error will indeed already be caught on return from ioctl to userspace, so I
> > don't see any added value in duplicating this check.
> 
> Sean, can you send a v2 (even for this patch only would be okay)?

Made it all the way to v3 while you were out :-)

https://lkml.kernel.org/r/20210901203030.1292304-1-seanjc@google.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug
  2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
                   ` (4 preceding siblings ...)
  2021-08-18  0:12 ` [PATCH 5/5] KVM: selftests: Remove __NR_userfaultfd syscall fallback Sean Christopherson
@ 2021-09-22 14:12 ` Paolo Bonzini
  5 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2021-09-22 14:12 UTC (permalink / raw)
  To: Sean Christopherson, Russell King, Catalin Marinas, Will Deacon,
	Guo Ren, Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Oleg Nesterov,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andy Lutomirski, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Shuah Khan
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel,
	linux-kernel, linux-csky, linux-mips, linuxppc-dev, linux-s390,
	kvm, linux-kselftest, Peter Foley, Shakeel Butt, Ben Gardon

On 18/08/21 02:12, Sean Christopherson wrote:
> Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME,
> e.g. for task migration, clears the flag without informing rseq and leads
> to stale data in userspace's rseq struct.
> 
> Patch 2 is a cleanup to try and make future bugs less likely.  It's also
> a baby step towards moving and renaming tracehook_notify_resume() since
> it has nothing to do with tracing.  It kills me to not do the move/rename
> as part of this series, but having a dedicated series/discussion seems
> more appropriate given the sheer number of architectures that call
> tracehook_notify_resume() and the lack of an obvious home for the code.
> 
> Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when
> the include path (intentionally) omits tools' uapi headers.  KVM's
> selftests do exactly that so that they can pick up the uapi headers from
> the installed kernel headers, and still use various tools/ headers that
> mirror kernel code, e.g. linux/types.h.  This allows the new test in
> patch 4 to reference __NR_rseq without having to manually define it.
> 
> Patch 4 is a regression test for the KVM+rseq bug.
> 
> Patch 5 is a cleanup made possible by patch 3.
> 
> 
> Sean Christopherson (5):
>    KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM
>      guest
>    entry: rseq: Call rseq_handle_notify_resume() in
>      tracehook_notify_resume()
>    tools: Move x86 syscall number fallbacks to .../uapi/
>    KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration
>      bugs
>    KVM: selftests: Remove __NR_userfaultfd syscall fallback
> 
>   arch/arm/kernel/signal.c                      |   1 -
>   arch/arm64/kernel/signal.c                    |   1 -
>   arch/csky/kernel/signal.c                     |   4 +-
>   arch/mips/kernel/signal.c                     |   4 +-
>   arch/powerpc/kernel/signal.c                  |   4 +-
>   arch/s390/kernel/signal.c                     |   1 -
>   include/linux/tracehook.h                     |   2 +
>   kernel/entry/common.c                         |   4 +-
>   kernel/rseq.c                                 |   4 +-
>   .../x86/include/{ => uapi}/asm/unistd_32.h    |   0
>   .../x86/include/{ => uapi}/asm/unistd_64.h    |   3 -
>   tools/testing/selftests/kvm/.gitignore        |   1 +
>   tools/testing/selftests/kvm/Makefile          |   3 +
>   tools/testing/selftests/kvm/rseq_test.c       | 131 ++++++++++++++++++
>   14 files changed, 143 insertions(+), 20 deletions(-)
>   rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
>   rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%)
>   create mode 100644 tools/testing/selftests/kvm/rseq_test.c
> 

Queued v3, thanks.  I'll send it in a separate pull request to Linus 
since it touches stuff outside my usual turf.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-09-22 14:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18  0:12 [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Sean Christopherson
2021-08-18  0:12 ` [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest Sean Christopherson
2021-08-19 21:39   ` Mathieu Desnoyers
2021-08-19 23:48     ` Sean Christopherson
2021-08-20 18:51       ` Mathieu Desnoyers
2021-08-20 22:26         ` Sean Christopherson
2021-09-06 10:28         ` Paolo Bonzini
2021-09-07 14:38           ` Sean Christopherson
2021-08-18  0:12 ` [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() Sean Christopherson
2021-08-19 21:41   ` Mathieu Desnoyers
2021-08-18  0:12 ` [PATCH 3/5] tools: Move x86 syscall number fallbacks to .../uapi/ Sean Christopherson
2021-08-18  0:12 ` [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs Sean Christopherson
2021-08-19 21:52   ` Mathieu Desnoyers
2021-08-19 23:33     ` Sean Christopherson
2021-08-20 18:31       ` Mathieu Desnoyers
2021-08-20 22:25         ` Sean Christopherson
2021-08-18  0:12 ` [PATCH 5/5] KVM: selftests: Remove __NR_userfaultfd syscall fallback Sean Christopherson
2021-09-22 14:12 ` [PATCH 0/5] KVM: rseq: Fix and a test for a KVM+rseq bug Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).