* [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
@ 2022-07-22 23:02 Andrei Vagin
  2022-07-22 23:02 ` [PATCH 1/5] kernel: add a new helper to execute system calls from kernel code Andrei Vagin
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

There is a class of applications that use KVM to manage multiple address
spaces rather than use it as an isolation boundary. In all other respects,
they are normal processes that execute system calls, handle signals,
etc. Currently, each time such a process needs to interact with the
operating system, it has to switch to the host and back to the guest.
These full switches are expensive and significantly increase the overhead
of system calls. The new hypercall reduces this overhead by more than a
factor of two.

The new hypercall runs system calls on the host. As with native system
calls, seccomp filters are applied before the system call is executed.
The hypercall takes one argument, a pointer to a pt_regs structure in the
host address space, which provides the registers for executing the system
call according to the calling convention: arguments are passed in %rdi,
%rsi, %rdx, %r10, %r8 and %r9, and the return code is stored in %rax.

The hypercall returns 0 if a system call has been executed. Otherwise,
it returns an error code.
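
To make the calling convention concrete, here is a minimal guest-side
sketch (not part of this series). kvm_hypercall1() and the assumption
that the guest can compute the host virtual address of "regs" (an
identity mapping here) are illustrative:

#include <asm/ptrace.h>		/* uapi struct pt_regs: rax, rdi, ... */
#include <asm/unistd.h>
#include <uapi/linux/kvm_para.h>

/*
 * Execute write(fd, buf, len) on the host. The pointer passed in a0
 * must be valid in the *host* address space.
 */
static long host_write(int fd, const void *buf, unsigned long len)
{
	struct pt_regs regs = {
		.rax = __NR_write,		/* syscall number */
		.rdi = fd,
		.rsi = (unsigned long)buf,
		.rdx = len,
	};
	long ret = kvm_hypercall1(KVM_HC_HOST_SYSCALL, (unsigned long)&regs);

	if (ret)
		return ret;	/* hypercall failed, the syscall did not run */
	return regs.rax;	/* the syscall's own return value */
}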

This series introduces a new capability that has to be set to enable the
hypercall. The new hypercall is a backdoor for regular virtual machines,
so it is disabled by default. There is a standard way to advertise
hypercalls via CPUID, but it is not used here: a common way to manage
CPUID features is to request all available ones and enable them all at
once, whereas it is a hard requirement that this hypercall can be enabled
only intentionally.

= Background =

gVisor is one such application. It is an application kernel written in
Go that implements a substantial portion of the Linux system call
interface. gVisor intercepts application system calls and acts as the
guest kernel. It has a platform abstraction that implements interception
of syscalls, basic context switching, and memory mapping functionality.
Currently, it has two platforms: ptrace and KVM.

The ptrace platform uses PTRACE_SYSEMU to execute user code without
allowing it to perform host system calls, and it creates stub processes
to manage user address spaces. This platform is used primarily for
testing due to its poor performance.

Another option is the KVM platform. In this case, the Sentry (gVisor
kernel) can run in a guest ring0 and create/manage multiple address
spaces. Its performance is much better than the ptrace one, but it is
still not great compared with native performance. This change optimizes
the most critical part, which is the syscall overhead. The idea of using
vmcall to execute system calls isn’t new. Two large users of gVisor
(Google and Ant Financial) have out-of-tree code to implement such
hypercalls.

In the Google kernel, we have a kvm-like subsystem designed especially
for gVisor. This change is the first step of integrating it into the KVM
code base and making it available to all Linux users.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jianfeng Tan <henry.tjf@antfin.com>
Cc: Adin Scannell <ascannell@google.com>
Cc: Konstantin Bogomolov <bogomolov@google.com>
Cc: Etienne Perot <eperot@google.com>

Andrei Vagin (5):
  kernel: add a new helper to execute system calls from kernel code
  kvm: add controls to enable/disable paravirtualized system calls
  KVM/x86: add a new hypercall to execute host system calls.
  selftests/kvm/x86_64: set rax before vmcall
  selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL

 Documentation/virt/kvm/x86/hypercalls.rst     |  15 ++
 arch/x86/entry/common.c                       |  48 ++++++
 arch/x86/include/asm/syscall.h                |   1 +
 arch/x86/include/uapi/asm/kvm_para.h          |   2 +
 arch/x86/kvm/cpuid.c                          |  25 +++
 arch/x86/kvm/cpuid.h                          |   8 +-
 arch/x86/kvm/x86.c                            |  37 +++++
 include/uapi/linux/kvm.h                      |   1 +
 include/uapi/linux/kvm_para.h                 |   1 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |   4 +
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +-
 .../kvm/x86_64/kvm_pv_syscall_test.c          | 145 ++++++++++++++++++
 14 files changed, 289 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c

-- 
2.37.0.rc0.161.g10f37bed90-goog



* [PATCH 1/5] kernel: add a new helper to execute system calls from kernel code
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
@ 2022-07-22 23:02 ` Andrei Vagin
  2022-07-22 23:02 ` [PATCH 2/5] kvm/x86: add controls to enable/disable paravirtualized system calls Andrei Vagin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

This helper will be used to implement a kvm hypercall to call host
system calls.

The new helper executes seccomp rules and calls the
trace_sys_{enter,exit} hooks. It intentionally doesn't call ptrace
hooks, because the executed system calls are not linked with the current
process state.

Signed-off-by: Andrei Vagin <avagin@google.com>
---
 arch/x86/entry/common.c        | 50 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/syscall.h |  1 +
 2 files changed, 51 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..7f4c172a9a4e 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <trace/events/syscalls.h>
 
 #ifdef CONFIG_XEN_PV
 #include <xen/xen-ops.h>
@@ -37,6 +38,55 @@
 
 #ifdef CONFIG_X86_64
 
+/*
+ * do_ksyscall_64 - execute a system call on behalf of kernel code.
+ * @nr:   the system call number.
+ * @regs: register state with arguments per the x86_64 calling
+ *        convention; the return value is stored in regs->ax.
+ *
+ * Returns true if the system call was executed, false otherwise.
+ */
+bool do_ksyscall_64(int nr, struct pt_regs *regs)
+{
+	struct task_struct *task = current;
+	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+	/*
+	 * Convert negative numbers to very high and thus out of range
+	 * numbers for comparisons.
+	 */
+	unsigned int unr = nr;
+
+#ifdef CONFIG_IA32_EMULATION
+	if (task->thread_info.status & TS_COMPAT)
+		return false;
+#endif
+
+	if (work & SYSCALL_WORK_SECCOMP) {
+		struct seccomp_data sd;
+		unsigned long args[6];
+
+		sd.nr = nr;
+		sd.arch = AUDIT_ARCH_X86_64;
+		syscall_get_arguments(task, regs, args);
+		sd.args[0] = args[0];
+		sd.args[1] = args[1];
+		sd.args[2] = args[2];
+		sd.args[3] = args[3];
+		sd.args[4] = args[4];
+		sd.args[5] = args[5];
+		sd.instruction_pointer = regs->ip;
+		if (__secure_computing(&sd) == -1)
+			return false;
+	}
+
+	if (unlikely(unr >= NR_syscalls))
+		return false;
+
+	unr = array_index_nospec(unr, NR_syscalls);
+
+	trace_sys_enter(regs, unr);
+	regs->ax = sys_call_table[unr](regs);
+	trace_sys_exit(regs, syscall_get_return_value(task, regs));
+	return true;
+}
+EXPORT_SYMBOL_GPL(do_ksyscall_64);
+
 static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
 {
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 5b85987a5e97..6cde1ddeb50b 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -126,6 +126,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
+bool do_ksyscall_64(int nr, struct pt_regs *regs);
 void do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
-- 
2.37.1.359.gd136c6c3e2-goog



* [PATCH 2/5] kvm/x86: add controls to enable/disable paravirtualized system calls
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
  2022-07-22 23:02 ` [PATCH 1/5] kernel: add a new helper to execute system calls from kernel code Andrei Vagin
@ 2022-07-22 23:02 ` Andrei Vagin
  2022-07-22 23:02 ` [PATCH 3/5] KVM/x86: add a new hypercall to execute host " Andrei Vagin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

The following change will add a new hypercall to execute host syscalls.

This hypercall is helpful for user-mode kernel solutions such as gVisor
that need to manage multiple address spaces.

The new hypercall is a backdoor for most KVM users, so it must be
disabled by default. This change introduces a new capability that has to
be set to enable the hypercall. There is a standard way to allow
hypercalls by using KVM_SET_CPUID2, but it isn't suitable in this case:
a common way of using it is to request all available features
(KVM_GET_SUPPORTED_CPUID) and enable them all at once, whereas it is a
hard requirement that the new hypercall can be enabled only
intentionally.
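
As a hedged illustration of the intended userspace flow (the vcpu_fd
variable and the error handling are assumptions; the ordering mirrors
the selftest added later in this series):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void enable_pv_host_syscall(int vcpu_fd)
{
	struct kvm_enable_cap cap = { 0 };

	/* Required first: kvm_vcpu_pv_set_host_syscall() rejects the
	 * request with -EINVAL unless PV feature enforcement is on. */
	cap.cap = KVM_CAP_ENFORCE_PV_FEATURE_CPUID;
	cap.args[0] = 1;
	if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_CAP_ENFORCE_PV_FEATURE_CPUID");

	cap.cap = KVM_CAP_PV_HOST_SYSCALL;
	cap.args[0] = 1;	/* 1 == enable, 0 == disable */
	if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_CAP_PV_HOST_SYSCALL");
}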

Signed-off-by: Andrei Vagin <avagin@google.com>
---
 arch/x86/include/uapi/asm/kvm_para.h |  3 +++
 arch/x86/kvm/cpuid.c                 | 25 +++++++++++++++++++++++++
 arch/x86/kvm/cpuid.h                 |  8 +++++++-
 arch/x86/kvm/x86.c                   |  4 ++++
 include/uapi/linux/kvm.h             |  1 +
 5 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..84ad13ffc23c 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -37,6 +37,9 @@
 #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
 #define KVM_FEATURE_MIGRATION_CONTROL	17
 
+/* Features that are not controlled by KVM_SET_CPUID2. */
+#define KVM_FEATURE_PV_HOST_SYSCALL	31
+
 #define KVM_HINTS_REALTIME      0
 
 /* The last 8 bits are used to indicate how to interpret the flags field
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index de6d44e07e34..4fdfe9409506 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -104,6 +104,10 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
 			return -EINVAL;
 	}
 
+	best = cpuid_entry2_find(entries, nent, KVM_CPUID_FEATURES, 0);
+	if (best && (best->eax & (1<<KVM_FEATURE_PV_HOST_SYSCALL)))
+		return -EINVAL;
+
 	/*
 	 * Exposing dynamic xfeatures to the guest requires additional
 	 * enabling in the FPU, e.g. to expand the guest XSAVE state size.
@@ -273,6 +277,27 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
 	}
 }
 
+int kvm_vcpu_pv_set_host_syscall(struct kvm_vcpu *vcpu, bool set)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	if (!vcpu->arch.pv_cpuid.enforce)
+		return -EINVAL;
+
+	best = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0);
+	if (!best)
+		return -EINVAL;
+
+	if (set)
+		best->eax |= 1 << KVM_FEATURE_PV_HOST_SYSCALL;
+	else
+		best->eax &= ~(1 << KVM_FEATURE_PV_HOST_SYSCALL);
+
+	kvm_update_pv_runtime(vcpu);
+
+	return 0;
+}
+
 void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
 {
 	__kvm_update_cpuid_runtime(vcpu, vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 8a770b481d9d..80721093b82b 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -219,10 +219,16 @@ static __always_inline void kvm_cpu_cap_check_and_set(unsigned int x86_feature)
 static __always_inline bool guest_pv_has(struct kvm_vcpu *vcpu,
 					 unsigned int kvm_feature)
 {
-	if (!vcpu->arch.pv_cpuid.enforce)
+	if (!vcpu->arch.pv_cpuid.enforce) {
+		if (kvm_feature == KVM_FEATURE_PV_HOST_SYSCALL)
+			return false;
+
 		return true;
+	}
 
 	return vcpu->arch.pv_cpuid.features & (1u << kvm_feature);
 }
 
+int kvm_vcpu_pv_set_host_syscall(struct kvm_vcpu *vcpu, bool set);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5fa335a4ea7..19e634768161 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5306,6 +5306,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 			kvm_update_pv_runtime(vcpu);
 
 		return 0;
+
+	case KVM_CAP_PV_HOST_SYSCALL:
+		return kvm_vcpu_pv_set_host_syscall(vcpu, cap->args[0]);
+
 	default:
 		return -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 860f867c50c0..89ed59d13877 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1157,6 +1157,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_TSC_CONTROL 214
 #define KVM_CAP_SYSTEM_EVENT_DATA 215
 #define KVM_CAP_ARM_SYSTEM_SUSPEND 216
+#define KVM_CAP_PV_HOST_SYSCALL 217
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.37.1.359.gd136c6c3e2-goog



* [PATCH 3/5] KVM/x86: add a new hypercall to execute host system calls.
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
  2022-07-22 23:02 ` [PATCH 1/5] kernel: add a new helper to execute system calls from kernel code Andrei Vagin
  2022-07-22 23:02 ` [PATCH 2/5] kvm/x86: add controls to enable/disable paravirtualized system calls Andrei Vagin
@ 2022-07-22 23:02 ` Andrei Vagin
  2022-07-22 23:02 ` [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall Andrei Vagin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

There is a class of applications that use KVM to manage multiple address
spaces rather than use it as an isolation boundary. In all other respects,
they are normal processes that execute system calls, handle signals,
etc. Currently, each time such a process needs to interact with the
operating system, it has to switch to the host and back to the guest.
These full switches are expensive and significantly increase the overhead
of system calls. The new hypercall reduces this overhead by more than a
factor of two.

The new hypercall allows the guest to execute host system calls. As with
native system calls, seccomp filters are applied before execution. The
hypercall takes one argument, a pointer to a pt_regs structure in the
host address space, which provides the registers for executing the system
call according to the calling convention: arguments are passed in %rdi,
%rsi, %rdx, %r10, %r8 and %r9, and the return code is stored in %rax.

The hypercall returns 0 if a system call has been executed. Otherwise,
it returns an error code.
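
Note that there are two levels of status: the hypercall return value and
the syscall return value in regs->rax. A hedged guest-side sketch of
telling them apart (kvm_hypercall1() and the handle_*() helpers are
placeholders):

	struct pt_regs regs = { .rax = __NR_dup, .rdi = -1 };
	long ret = kvm_hypercall1(KVM_HC_HOST_SYSCALL, (unsigned long)&regs);

	/* Hypercall-level failure: e.g. -KVM_ENOSYS (feature disabled). */
	if (ret)
		handle_hypercall_error(ret);
	/* Syscall-level failure: e.g. -EBADF from dup(-1). */
	else if ((long)regs.rax < 0)
		handle_syscall_error(regs.rax);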

Signed-off-by: Andrei Vagin <avagin@google.com>
---
 Documentation/virt/kvm/x86/hypercalls.rst | 18 +++++++++++++
 arch/x86/kvm/x86.c                        | 33 +++++++++++++++++++++++
 include/uapi/linux/kvm_para.h             |  1 +
 3 files changed, 52 insertions(+)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index e56fa8b9cfca..eb18f2128bfe 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -190,3 +190,21 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
 before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID.  In
 addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
 must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+
+9. KVM_HC_HOST_SYSCALL
+----------------------
+:Architecture: x86
+:Status: active
+:Purpose: Execute a specified system call.
+
+- a0: pointer to a pt_regs structure in the host address space.
+
+This hypercall lets a guest execute host system calls. The first and only
+argument points to process registers that are used as both input and output
+parameters.
+
+Returns 0 if the requested syscall has been executed. Otherwise, it returns an
+error code.
+
+**Implementation note**: The KVM_CAP_PV_HOST_SYSCALL capability has to be set
+to use this hypercall.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 19e634768161..aa54e180c9d4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -81,6 +81,7 @@
 #include <asm/emulate_prefix.h>
 #include <asm/sgx.h>
 #include <clocksource/hyperv_timer.h>
+#include <asm/syscall.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -9253,6 +9254,27 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
+static int kvm_pv_host_syscall(unsigned long a0)
+{
+	struct pt_regs pt_regs = {};
+	unsigned long sysno;
+
+	if (copy_from_user(&pt_regs, (void __user *)a0, sizeof(pt_regs)))
+		return -EFAULT;
+
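+	/*
+	 * Mirror the kernel entry convention: the syscall number arrives
+	 * in ax and is preserved in orig_ax, while ax is preset to -ENOSYS
+	 * in case the syscall is filtered out or the number is invalid.
+	 */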
+	sysno = pt_regs.ax;
+	pt_regs.orig_ax = pt_regs.ax;
+	pt_regs.ax = -ENOSYS;
+
+	do_ksyscall_64(sysno, &pt_regs);
+
+	pt_regs.orig_ax = -1;
+	if (copy_to_user((void __user *)a0, &pt_regs, sizeof(pt_regs)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
@@ -9318,6 +9340,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu, a0);
 		ret = 0;
 		break;
+
 	case KVM_HC_MAP_GPA_RANGE: {
 		u64 gpa = a0, npages = a1, attrs = a2;
 
@@ -9340,6 +9363,16 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		vcpu->arch.complete_userspace_io = complete_hypercall_exit;
 		return 0;
 	}
+
+	case KVM_HC_HOST_SYSCALL:
+		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_HOST_SYSCALL))
+			break;
+
+		kvm_vcpu_srcu_read_unlock(vcpu);
+		ret = kvm_pv_host_syscall(a0);
+		kvm_vcpu_srcu_read_lock(vcpu);
+		break;
+
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 960c7e93d1a9..3fcfb3241f35 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
 #define KVM_HC_SEND_IPI		10
 #define KVM_HC_SCHED_YIELD		11
 #define KVM_HC_MAP_GPA_RANGE		12
+#define KVM_HC_HOST_SYSCALL		13
 
 /*
  * hypercalls use architecture specific
-- 
2.37.1.359.gd136c6c3e2-goog



* [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
                   ` (2 preceding siblings ...)
  2022-07-22 23:02 ` [PATCH 3/5] KVM/x86: add a new hypercall to execute host " Andrei Vagin
@ 2022-07-22 23:02 ` Andrei Vagin
  2022-08-01 11:32   ` Vitaly Kuznetsov
  2022-07-22 23:02 ` [PATCH 5/5] selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL Andrei Vagin
  2022-07-22 23:41 ` [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Sean Christopherson
  5 siblings, 1 reply; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

kvm_hypercall() has to place the hypercall number in %rax. Without the
"a"(nr) input constraint, vmcall is executed with whatever value happens
to be in %rax, so the hypercall number is never passed to KVM.

Trace events show that kvm_pv_test doesn't work properly:
     kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0
     kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0
     kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0

With this change, it starts working as expected:
     kvm_pv_test-54285: kvm_hypercall: nr 0x5 a0 0x0 a1 0x0 a2 0x0 a3 0x0
     kvm_pv_test-54285: kvm_hypercall: nr 0xa a0 0x0 a1 0x0 a2 0x0 a3 0x0
     kvm_pv_test-54285: kvm_hypercall: nr 0xb a0 0x0 a1 0x0 a2 0x0 a3 0x0

Signed-off-by: Andrei Vagin <avagin@google.com>
---
 tools/testing/selftests/kvm/lib/x86_64/processor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index ead7011ee8f6..5d85e1c021da 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -1422,7 +1422,7 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
 
 	asm volatile("vmcall"
 		     : "=a"(r)
-		     : "b"(a0), "c"(a1), "d"(a2), "S"(a3));
+		     : "a"(nr), "b"(a0), "c"(a1), "d"(a2), "S"(a3));
 	return r;
 }
 
-- 
2.37.1.359.gd136c6c3e2-goog



* [PATCH 5/5] selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
                   ` (3 preceding siblings ...)
  2022-07-22 23:02 ` [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall Andrei Vagin
@ 2022-07-22 23:02 ` Andrei Vagin
  2022-07-22 23:41 ` [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Sean Christopherson
  5 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-22 23:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot

* check that syscalls are executed.
* check that non-existing syscalls return ENOSYS.

Signed-off-by: Andrei Vagin <avagin@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |   4 +
 .../kvm/x86_64/kvm_pv_syscall_test.c          | 145 ++++++++++++++++++
 4 files changed, 151 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 4509a3a7eeae..57d39fec6fdd 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -21,6 +21,7 @@
 /x86_64/get_msr_index_features
 /x86_64/kvm_clock_test
 /x86_64/kvm_pv_test
+/x86_64/kvm_pv_syscall_test
 /x86_64/hyperv_clock
 /x86_64/hyperv_cpuid
 /x86_64/hyperv_features
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 22423c871ed6..e6459f3e5318 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
+TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_syscall_test
 TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
 TEST_GEN_PROGS_x86_64 += x86_64/mmu_role_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 6ce185449259..4503e9556279 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -17,6 +17,7 @@
 
 #include "../kvm_util.h"
 
+#ifndef X86_EFLAGS_FIXED
 #define X86_EFLAGS_FIXED	 (1u << 1)
 
 #define X86_CR4_VME		(1ul << 0)
@@ -40,6 +41,7 @@
 #define X86_CR4_SMEP		(1ul << 20)
 #define X86_CR4_SMAP		(1ul << 21)
 #define X86_CR4_PKE		(1ul << 22)
+#endif
 
 /* CPUID.1.ECX */
 #define CPUID_VMX		(1ul << 5)
@@ -503,6 +505,7 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level)
 /*
  * Basic CPU control in CR0
  */
+#ifndef X86_CR0_PE
 #define X86_CR0_PE          (1UL<<0) /* Protection Enable */
 #define X86_CR0_MP          (1UL<<1) /* Monitor Coprocessor */
 #define X86_CR0_EM          (1UL<<2) /* Emulation */
@@ -514,6 +517,7 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level)
 #define X86_CR0_NW          (1UL<<29) /* Not Write-through */
 #define X86_CR0_CD          (1UL<<30) /* Cache Disable */
 #define X86_CR0_PG          (1UL<<31) /* Paging */
+#endif
 
 #define XSTATE_XTILE_CFG_BIT		17
 #define XSTATE_XTILE_DATA_BIT		18
diff --git a/tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c b/tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c
new file mode 100644
index 000000000000..601f84b11711
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020, Google LLC.
+ *
+ * Tests for the KVM_HC_HOST_SYSCALL hypercall
+ */
+#include <asm/kvm_para.h>
+#include <asm/ptrace.h>
+#include <linux/kvm_para.h>
+#include <stdint.h>
+
+#include <linux/unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
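+/* dup(-1) executed on the host must fail with -EBADF. */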
+struct pt_regs regs_dup = {
+	.rax = __NR_dup,
+	.rdi = -1,
+};
+
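+/* An invalid syscall number must fail with -ENOSYS. */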
+struct pt_regs regs_nosys = {
+	.rax = -1,
+};
+
+struct hcall_data {
+	const char *name;
+	struct pt_regs *regs;
+	long ret;
+};
+
+#define UCALL_PR_HCALL 0xdeadc0de
+#define PR_HCALL(hc) ucall(UCALL_PR_HCALL, 1, hc)
+
+/*
+ * Host system calls to test via KVM_HC_HOST_SYSCALL. The expected
+ * syscall return value is checked in regs->rax after the guest is done.
+ */
+static struct hcall_data hcalls_to_test[] = {
+	{.name = "dup",    .regs = &regs_dup,   .ret = -EBADF},
+	{.name = "enosys", .regs = &regs_nosys, .ret = -ENOSYS},
+};
+
+static void test_hcall(struct hcall_data *hc)
+{
+	uint64_t r;
+
+	PR_HCALL(hc);
+	r = kvm_hypercall(KVM_HC_HOST_SYSCALL, (unsigned long)hc->regs, 0, 0, 0);
+	GUEST_ASSERT(r == 0);
+}
+
+static void guest_main(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(hcalls_to_test); i++)
+		test_hcall(&hcalls_to_test[i]);
+
+	GUEST_DONE();
+}
+
+static void pr_hcall(struct ucall *uc)
+{
+	struct hcall_data *hc = (struct hcall_data *)uc->args[0];
+
+	pr_info("testing hcall: %s\n", hc->name);
+}
+
+static void handle_abort(struct ucall *uc)
+{
+	TEST_FAIL("%s at %s:%ld", (const char *)uc->args[0],
+		  __FILE__, uc->args[1]);
+}
+
+#define VCPU_ID 0
+
+static void enter_guest(struct kvm_vm *vm)
+{
+	struct kvm_run *run;
+	struct ucall uc;
+	int r, i;
+
+	run = vcpu_state(vm, VCPU_ID);
+
+	while (true) {
+		r = _vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(!r, "vcpu_run failed: %d\n", r);
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "unexpected exit reason: %u (%s)",
+			    run->exit_reason, exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_PR_HCALL:
+			pr_hcall(&uc);
+			break;
+		case UCALL_ABORT:
+			handle_abort(&uc);
+			return;
+		case UCALL_DONE:
+			goto out;
+		}
+	}
+
+out:
+	for (i = 0; i < ARRAY_SIZE(hcalls_to_test); i++) {
+		struct hcall_data *hc = &hcalls_to_test[i];
+
+		TEST_ASSERT(hc->ret == hc->regs->rax, "%s: ret %ld (expected %ld)",
+				hc->name, (long)hc->regs->rax, hc->ret);
+	}
+}
+
+int main(void)
+{
+	struct kvm_enable_cap cap = {0};
+	struct kvm_cpuid2 *best;
+	struct kvm_vm *vm;
+
+	if (!kvm_check_cap(KVM_CAP_ENFORCE_PV_FEATURE_CPUID)) {
+		pr_info("will skip kvm paravirt restriction tests.\n");
+		return 0;
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, guest_main);
+
+	cap.cap = KVM_CAP_ENFORCE_PV_FEATURE_CPUID;
+	cap.args[0] = 1;
+	vcpu_enable_cap(vm, VCPU_ID, &cap);
+
+	best = kvm_get_supported_cpuid();
+	vcpu_set_cpuid(vm, VCPU_ID, best);
+
+	cap.cap = KVM_CAP_PV_HOST_SYSCALL;
+	cap.args[0] = 1;
+	vcpu_enable_cap(vm, VCPU_ID, &cap);
+
+	vm_init_descriptor_tables(vm);
+	vcpu_init_descriptor_tables(vm, VCPU_ID);
+
+	enter_guest(vm);
+	kvm_vm_free(vm);
+}
-- 
2.37.1.359.gd136c6c3e2-goog



* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
                   ` (4 preceding siblings ...)
  2022-07-22 23:02 ` [PATCH 5/5] selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL Andrei Vagin
@ 2022-07-22 23:41 ` Sean Christopherson
  2022-07-26  8:33   ` Andrei Vagin
  2022-07-26 21:27   ` Thomas Gleixner
  5 siblings, 2 replies; 18+ messages in thread
From: Sean Christopherson @ 2022-07-22 23:41 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Paolo Bonzini, linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

+x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.

On Fri, Jul 22, 2022, Andrei Vagin wrote:
> Another option is the KVM platform. In this case, the Sentry (gVisor
> kernel) can run in a guest ring0 and create/manage multiple address
> spaces. Its performance is much better than the ptrace one, but it is
> still not great compared with the native performance. This change
> optimizes the most critical part, which is the syscall overhead.

What exactly is the source of the syscall overhead, and what alternatives have
been explored?  Making arbitrary syscalls from within KVM is mildly terrifying.

> The idea of using vmcall to execute system calls isn’t new. Two large users
> of gVisor (Google and Ant Financial) have out-of-tree code to implement such
> hypercalls.
>
> In the Google kernel, we have a kvm-like subsystem designed especially
> for gVisor. This change is the first step of integrating it into the KVM
> code base and making it available to all Linux users.

Can you please lay out the complete set of changes that you will be proposing?
Doesn't have to be gory details, but at a minimum there needs to be a high level
description that very clearly defines the scope of what changes you want to make
and what the end result will look like.

It's practically impossible to review this series without first understanding the
bigger picture, e.g. if KVM_HC_HOST_SYSCALL is ultimately useless without the other
bits you plan to upstream, then merging it without a high level of confidence that
the other bits are acceptable is a bad idea since it commits KVM to supporting
unused ABI.


* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-22 23:41 ` [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Sean Christopherson
@ 2022-07-26  8:33   ` Andrei Vagin
  2022-07-26 10:27     ` Paolo Bonzini
  2022-07-26 15:10     ` Sean Christopherson
  2022-07-26 21:27   ` Thomas Gleixner
  1 sibling, 2 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-26  8:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
>
> On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > Another option is the KVM platform. In this case, the Sentry (gVisor
> > kernel) can run in a guest ring0 and create/manage multiple address
> > spaces. Its performance is much better than the ptrace one, but it is
> > still not great compared with the native performance. This change
> > optimizes the most critical part, which is the syscall overhead.
>
> What exactly is the source of the syscall overhead,

Here are perf traces for two cases: when "guest" syscalls are executed via
hypercalls and when syscalls are executed by the user-space VMM:
https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e

And here are two tests that I use to collect these traces:
https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99

If we compare these traces, we can find that in the second case, we spend extra
time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
syscall_exit_to_user_mode.

> and what alternatives have been explored?  Making arbitrary syscalls from
> within KVM is mildly terrifying.

"mildly terrifying" is a good sentence in this case:). If I were in your place,
I would think about it similarly.

I understand these concerns about calling syscalls from the KVM code, and this
is why I hide this feature under a separate capability that can be enabled
explicitly.

We can think about restricting the list of system calls that this hypercall can
execute. In the user-space changes for gVisor, we have a list of system calls
that are not executed via this hypercall. For example, sigprocmask is never
executed by this hypercall, because the kvm vcpu has its own signal mask.
Another example is the ioctl syscall, because it can be one of the kvm
ioctls.

As for alternatives, we explored different ways:

== Host Ring3/Guest ring0 mixed mode ==

This is how the gVisor KVM platform works right now. We don’t have a separate
hypervisor; the Sentry performs that role itself. The Sentry creates a KVM virtual
machine instance, sets it up, and handles VMEXITs. As a result, the Sentry runs
in the host ring3 and the guest ring0 and can transparently switch between
these two contexts.

When the Sentry starts, it creates a new kernel VM instance and maps its memory
to the guest physical. Then it makes a set of page tables for the Sentry that
mirrors the host virtual address space. When host and guest address spaces are
identical, the Sentry can switch between these two contexts.

The bluepill function switches the Sentry into guest ring0. It calls a
privileged instruction (CLI) that is a no-op in the guest (by design, since we
ensure interrupts are disabled for guest ring 0 execution) and triggers a
signal on the host. The signal is handled by the bluepillHandler that takes a
virtual CPU and executes it with the current thread state grabbed from a signal
frame.

As for regular VMs, user processes have their own address spaces (page tables)
and run in guest ring3. So when the Sentry is going to execute a user process,
it needs to be sure that it is running inside the VM; that is exactly the
point when it calls bluepill(). Then it executes the user process with its
page tables
before it triggers an exception or a system call. All such events are trapped
and handled in the Sentry.

The Sentry is a normal Linux process that can trigger faults and execute
system calls. To handle these events, the Sentry returns to the host mode. If
the ring0 sysenter or exception entry point detects an event from the Sentry,
it saves the current thread state in a per-CPU structure and triggers VMEXIT.
This returns us to bluepillHandler, where we set the thread state on a signal
frame and exit from the signal handler, so the Sentry resumes from the point
where it was in the VM.

In this scheme, the sentry syscall time is 3600ns. This is for the case when a
system call is issued from gr0.

The benefit of this approach is that only the first system call triggers a
vmexit; all subsequent syscalls are executed on the host natively.

But it has downsides:
* Each sentry system call triggers a full exit to hr3.
* Each vmenter/vmexit requires triggering a signal, which is expensive.
* It doesn't allow supporting Confidential Computing (SEV-ES/SGX). The Sentry
  has to be fully enclosed in a VM to be able to support these technologies.

== Execute system calls from a user-space VMM ==

In this case, the Sentry is always running in the VM, and a syscall handler in
GR0 triggers a vmexit to transfer control to the VMM (a user process running
in hr3); the VMM executes the required system call and transfers control back
to the Sentry. One can say that this implements the suggested hypercall in
user space.

The sentry syscall time is 2100ns in this case.

The new hypercall does the same but without switching to the host ring 3. It
reduces the sentry syscall time to 1000ns.


== A new BPF hook to handle vmexit-s  ==

https://github.com/avagin/linux-task-diag/commits/kvm-bpf

This approach allows us to reach the same performance numbers, but it gives
less control over who uses this functionality and how. Second, it requires
adding a few questionable BPF helpers, such as one to call syscalls from BPF
hooks.

== Non-KVM platforms ==

We are experimenting with non-KVM platforms. We have the ptrace platform, but
it is useful almost only for experiments due to the slowness of the ptrace
interface.

Another idea was to add the process_vm_exec system call:
https://lwn.net/Articles/852662/

This system call can significantly increase performance compared with the
ptrace platform, but it is still slower than the KVM platform in its current
form (without the new hypercall). But this is true only if we run the KVM
platform on bare metal. Under nested virtualization, the KVM platform becomes
much slower, which is expected.

We have another idea to use seccomp notify to trap system calls, but it
requires some kernel changes to reach reasonable performance. I am working on
these changes and will present them soon.

I want to emphasize that non-KVM platforms don't allow us to implement the
confidential computing concept in gVisor, which is one of our main goals for
the KVM platform.

All of the numbers above were collected on the same host (Xeon(R) Gold 6268CL,
5.19-rc5).

>
> > The idea of using vmcall to execute system calls isn’t new. Two large users
> > of gVisor (Google and AntFinacial) have out-of-tree code to implement such
> > hypercalls.
> >
> > In the Google kernel, we have a kvm-like subsystem designed especially
> > for gVisor. This change is the first step of integrating it into the KVM
> > code base and making it available to all Linux users.
>
> Can you please lay out the complete set of changes that you will be proposing?
> Doesn't have to be gory details, but at a minimum there needs to be a high level
> description that very clearly defines the scope of what changes you want to make
> and what the end result will look like.
>
> It's practically impossible to review this series without first understanding the
> bigger picture, e.g. if KVM_HC_HOST_SYSCALL is ultimately useless without the other
> bits you plan to upstream, then merging it without a high level of confidence that
> the other bits are acceptable is a bad idea since it commits KVM to supporting
> unused ABI.

I was not precise in my description. This is the only change that we
need right now.
The gVisor KVM platform is the real thing that exists today and works
on the upstream kernels:
https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/

This hypercall improves its performance and makes it comparable with
the google-internal platform.

Thanks,
Andrei


* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26  8:33   ` Andrei Vagin
@ 2022-07-26 10:27     ` Paolo Bonzini
  2022-07-27  6:44       ` Andrei Vagin
  2022-07-26 15:10     ` Sean Christopherson
  1 sibling, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2022-07-26 10:27 UTC (permalink / raw)
  To: Andrei Vagin, Sean Christopherson
  Cc: linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov, Jianfeng Tan,
	Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On 7/26/22 10:33, Andrei Vagin wrote:
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall. For example, sigprocmask is never
> executed by this hypercall, because the kvm vcpu has its signal mask.  Another
> example is the ioctl syscall, because it can be one of kvm ioctl-s.

The main issue I have is that the system call addresses are not translated.

On one hand, I understand why it's done like this; it's pretty much 
impossible to do it without duplicating half of the sentry in the host 
kernel.  And the KVM API you're adding is certainly sensible.

On the other hand this makes the hypercall even more specialized, as it 
depends on the guest's memslot layout, and not self-sufficient, in the 
sense that the sandbox isn't secure without prior copying and validation 
of arguments in guest ring0.

> == Host Ring3/Guest ring0 mixed mode ==
> 
> This is how the gVisor KVM platform works right now. We don’t have a separate
> hypervisor, and the Sentry does its functions. The Sentry creates a KVM virtual
> machine instance, sets it up, and handles VMEXITs. As a result, the Sentry runs
> in the host ring3 and the guest ring0 and can transparently switch between
> these two contexts.  In this scheme, the sentry syscall time is 3600ns.
> This is for the case when a system call is called from gr0.
> 
> The benefit of this way is that only a first system call triggers vmexit and
> all subsequent syscalls are executed on the host natively.
> 
> But it has downsides:
> * Each sentry system call trigger the full exit to hr3.
> * Each vmenter/vmexit requires to trigger a signal but it is expensive.
> * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The Sentry
>    has to be fully enclosed in a VM to be able to support these technologies.
> 
> == Execute system calls from a user-space VMM ==
> 
> In this case, the Sentry is always running in VM, and a syscall handler in GR0
> triggers vmexit to transfer control to VMM (user process that is running in
> hr3), VMM executes a required system call, and transfers control back to the
> Sentry. We can say that it implements the suggested hypercall in the
> user-space.
> 
> The sentry syscall time is 2100ns in this case.
> 
> The new hypercall does the same but without switching to the host ring 3. It
> reduces the sentry syscall time to 1000ns.

Yeah, ~3000 clock cycles is what I would expect.

What does it translate to in terms of benchmarks?  For example a simple 
netperf/UDP_RR benchmark.

Paolo



* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26  8:33   ` Andrei Vagin
  2022-07-26 10:27     ` Paolo Bonzini
@ 2022-07-26 15:10     ` Sean Christopherson
  2022-07-26 22:10       ` Thomas Gleixner
  2022-07-27  0:25       ` Andrei Vagin
  1 sibling, 2 replies; 18+ messages in thread
From: Sean Christopherson @ 2022-07-26 15:10 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Paolo Bonzini, linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, Jul 26, 2022, Andrei Vagin wrote:
> On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> >
> > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > kernel) can run in a guest ring0 and create/manage multiple address
> > > spaces. Its performance is much better than the ptrace one, but it is
> > > still not great compared with the native performance. This change
> > > optimizes the most critical part, which is the syscall overhead.
> >
> > What exactly is the source of the syscall overhead,
> 
> Here are perf traces for two cases: when "guest" syscalls are executed via
> hypercalls and when syscalls are executed by the user-space VMM:
> https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
> 
> And here are two tests that I use to collect these traces:
> https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
> 
> If we compare these traces, we can find that in the second case, we spend extra
> time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> syscall_exit_to_user_mode.

So of those, I think the only path a robust implementation can actually avoid,
without significantly whittling down the allowed set of syscalls, is
syscall_exit_to_user_mode().

The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
through that before calling out of KVM.  E.g. prctrl(ARCH_GET_GS) will read the
wrong GS.base if MSR_KERNEL_GS_BASE isn't restored.  And that necessitates
calling vmx_prepare_switch_to_guest() when resuming the vCPU.

FPU state, i.e. fpu_swap_kvm_fpstate() is likely a similar story, there's bound
to be a syscall that accesses user FPU state and will do the wrong thing if guest
state is loaded.

For gVisor, that's all presumably a non-issue because it uses a small set of
syscalls (or has guest==host state?), but for a common KVM feature it's problematic.

> > and what alternatives have been explored?  Making arbitrary syscalls from
> > within KVM is mildly terrifying.
> 
> "mildly terrifying" is a good sentence in this case:). If I were in your place,
> I would think about it similarly.
> 
> I understand these concerns about calling syscalls from the KVM code, and this
> is why I hide this feature under a separate capability that can be enabled
> explicitly.
> 
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall.

Can you provide that list?

> But it has downsides:
> * Each sentry system call trigger the full exit to hr3.
> * Each vmenter/vmexit requires to trigger a signal but it is expensive.

Can you explain this one?  I didn't quite follow what this is referring to.

> * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The Sentry
>   has to be fully enclosed in a VM to be able to support these technologies.

Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
tackled the "syscalls are crazy expensive" problem by using a message queue and
a dedicated task outside of the enclave to handle syscalls.  Would something like
that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
non-starter?


* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-22 23:41 ` [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Sean Christopherson
  2022-07-26  8:33   ` Andrei Vagin
@ 2022-07-26 21:27   ` Thomas Gleixner
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2022-07-26 21:27 UTC (permalink / raw)
  To: Sean Christopherson, Andrei Vagin
  Cc: Paolo Bonzini, linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Jul 22 2022 at 23:41, Sean Christopherson wrote:
> +x86 maintainers, patch 1 most definitely needs acceptance from folks
> beyond KVM.

Thanks for putting us on CC. It seems to be incredibly hard to CC the
relevant maintainers and to get the prefix in the subject straight.

> On Fri, Jul 22, 2022, Andrei Vagin wrote:
>> Another option is the KVM platform. In this case, the Sentry (gVisor
>> kernel) can run in a guest ring0 and create/manage multiple address
>> spaces. Its performance is much better than the ptrace one, but it is
>> still not great compared with the native performance. This change
>> optimizes the most critical part, which is the syscall overhead.
>
> What exactly is the source of the syscall overhead, and what alternatives have
> been explored?  Making arbitrary syscalls from within KVM is mildly terrifying.

What's even worse is that this exposes a magic kernel syscall interface
to random driver writers. Seriously no.

This approach is certainly a clever idea, but exposing this outside of a
very restricted and controlled environment is a patently bad idea.

I skimmed the documentation on the project page:

  sudo modprobe kvm-intel && sudo chmod a+rw /dev/kvm

Can you spot the fail?

I gave up reading further as shortly after that gem the page failed to
render sensibly in Firefox. Hint: Graphics

What's completely missing from the cover letter _and_ from the project
documentation is which subset of KVM functionality this is actually
using and what the actual content of the "guest" looks like. It's all
blurry handwaving and lots of marketing to me.

Thanks,

        tglx


* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26 15:10     ` Sean Christopherson
@ 2022-07-26 22:10       ` Thomas Gleixner
  2022-07-27  1:03         ` Andrei Vagin
  2022-07-27  0:25       ` Andrei Vagin
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2022-07-26 22:10 UTC (permalink / raw)
  To: Sean Christopherson, Andrei Vagin
  Cc: Paolo Bonzini, linux-kernel, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot,
	Andy Lutomirski, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Andrei Vagin wrote:
>> * It doesn't allow to support Confidential Computing (SEV-ES/SGX). The Sentry
>>   has to be fully enclosed in a VM to be able to support these technologies.
>
> Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> tackled the "syscalls are crazy expensive" problem by using a message queue and
> a dedicated task outside of the enclave to handle syscalls.  Would something like
> that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> non-starter?

Let's put VMs aside for a moment. The problem you are trying to solve is
ptrace overhead because that requires context switching, right?

Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires
signals, which are not cheap either, but we certainly could come up with
a lightweight signal implementation for that particular use case.
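
For reference, a minimal sketch of the mechanism being suggested (the
prctl() call follows the Syscall User Dispatch API added in Linux 5.11;
allowed_start, len, and the handler body are illustrative):

#include <signal.h>
#include <sys/prctl.h>
#include <linux/prctl.h>	/* PR_SET_SYSCALL_USER_DISPATCH */

static char sud_selector = 1;	/* SYSCALL_DISPATCH_FILTER_BLOCK */

static void sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
	/* Emulate the trapped syscall; info->si_syscall holds its number. */
}

static void enable_sud(unsigned long allowed_start, unsigned long len)
{
	struct sigaction sa = {
		.sa_sigaction = sigsys_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGSYS, &sa, NULL);
	/* Syscalls issued outside [allowed_start, allowed_start + len)
	 * now raise SIGSYS in this thread instead of entering the kernel. */
	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
	      allowed_start, len, &sud_selector);
}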

Thanks,

        tglx



* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26 15:10     ` Sean Christopherson
  2022-07-26 22:10       ` Thomas Gleixner
@ 2022-07-27  0:25       ` Andrei Vagin
  1 sibling, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-27  0:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andrei Vagin, Paolo Bonzini, linux-kernel, kvm, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Tue, Jul 26, 2022 at 03:10:34PM +0000, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Andrei Vagin wrote:
> > On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> > >
> > > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > > kernel) can run in a guest ring0 and create/manage multiple address
> > > > spaces. Its performance is much better than the ptrace one, but it is
> > > > still not great compared with the native performance. This change
> > > > optimizes the most critical part, which is the syscall overhead.
> > >
> > > What exactly is the source of the syscall overhead,
> > 
> > Here are perf traces for two cases: when "guest" syscalls are executed via
> > hypercalls and when syscalls are executed by the user-space VMM:
> > https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
> > 
> > And here are two tests that I use to collect these traces:
> > https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
> > 
> > If we compare these traces, we can find that in the second case, we spend extra
> > time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> > syscall_exit_to_user_mode.
> 
> So of those, I think the only path a robust implementation can actually avoid,
> without significantly whittling down the allowed set of syscalls, is
> syscall_exit_to_user_mode().
> 
> The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
> through that before calling out of KVM.  E.g. prctrl(ARCH_GET_GS) will read the
> wrong GS.base if MSR_KERNEL_GS_BASE isn't restored.  And that necessitates
> calling vmx_prepare_switch_to_guest() when resuming the vCPU.
> 
> FPU state, i.e. fpu_swap_kvm_fpstate() is likely a similar story, there's bound
> to be a syscall that accesses user FPU state and will do the wrong thing if guest
> state is loaded.
> 
> For gVisor, that's all presumably a non-issue because it uses a small set of
> syscalls (or has guest==host state?), but for a common KVM feature it's problematic.


I think the number of system calls that touch state shared with KVM is
very limited, and we can blocklist all of them. Another option is to
have an allowlist of system calls to be sure that we don't miss
anything.

> 
> > > and what alternatives have been explored?  Making arbitrary syscalls from
> > > within KVM is mildly terrifying.
> > 
> > "mildly terrifying" is a good sentence in this case:). If I were in your place,
> > I would think about it similarly.
> > 
> > I understand these concerns about calling syscalls from the KVM code, and this
> > is why I hide this feature under a separate capability that can be enabled
> > explicitly.
> > 
> > We can think about restricting the list of system calls that this hypercall can
> > execute. In the user-space changes for gVisor, we have a list of system calls
> > that are not executed via this hypercall.
> 
> Can you provide that list?

Here is the list of syscalls that are not executed via this hypercall:
clone, exit, exit_group, ioctl, rt_sigreturn, mmap, arch_prctl,
sigprocmask.

And here is the list of all system calls that we allow for the Sentry:
clock_gettime, close, dup, dup3, epoll_create1, epoll_ctl, epoll_pwait,
eventfd2, exit, exit_group, fallocate, fchmod, fcntl, fstat, fsync,
ftruncate, futex, getcpu, getpid, getrandom, getsockopt, gettid,
gettimeofday, ioctl, lseek, madvise, membarrier, mincore, mmap,
mprotect, munmap, nanosleep, ppoll, pread64, preadv, preadv2, pwrite64,
pwritev, pwritev2, read, recvmsg, recvmmsg, sendmsg, sendmmsg,
restart_syscall, rt_sigaction, rt_sigprocmask, rt_sigreturn,
sched_yield, setitimer, shutdown, sigaltstack, statx, sync_file_range,
tee, timer_create, timer_delete, timer_settime, tgkill, utimensat,
write, writev.

> 
> > But it has downsides:
> > * Each sentry system call trigger the full exit to hr3.
> > * Each vmenter/vmexit requires to trigger a signal but it is expensive.
> 
> Can you explain this one?  I didn't quite follow what this is referring to.

In my previous message, I explained how the gVisor KVM platform works
right now; here are the two reasons why it is slow.

Each time the Sentry triggers a system call, it has to switch to host
ring 3.

When the Sentry wants to switch to guest ring 0, it triggers a signal in
order to fall into a signal handler. There, we have a sigcontext that we
use to capture the current thread state and resume execution in gr0.
Then, when the Sentry needs to switch back to hr3, we write the Sentry
state from gr0 into the sigcontext and return from the signal handler.
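
In plain C, the sigcontext round-trip looks roughly like this (a
userspace sketch, not the actual gVisor code):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <ucontext.h>

	/* The handler receives the interrupted thread state in
	 * uc_mcontext; whatever it writes back there is the state the
	 * thread resumes with after sigreturn. */
	static void handler(int sig, siginfo_t *info, void *ctx)
	{
		ucontext_t *uc = ctx;

		/* Read (or rewrite) the interrupted state, e.g. the
		 * instruction pointer on x86-64. */
		greg_t rip = uc->uc_mcontext.gregs[REG_RIP];
		(void)rip;
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = handler,
			.sa_flags = SA_SIGINFO,
		};

		sigaction(SIGUSR1, &sa, NULL);
		raise(SIGUSR1);
		return 0;
	}

Rewriting uc_mcontext in the handler is exactly the primitive the
platform builds its gr0<->hr3 world switch on.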

> 
> > * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> >   has to be fully enclosed in a VM to be able to support these technologies.
> 
> Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> tackled the "syscalls are crazy expensive" problem by using a message queue and
> a dedicated task outside of the enclave to handle syscalls.  Would something like
> that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> non-starter?

Context switching is expensive... There have been a few attempts to
implement synchronous context switching ([1], [2]) that could help in
this case, but even with this sort of optimization, it is still too
expensive.

1. https://lwn.net/Articles/824409/
2. https://www.spinics.net/lists/linux-api/msg50417.html

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26 22:10       ` Thomas Gleixner
@ 2022-07-27  1:03         ` Andrei Vagin
  2022-08-22 20:26           ` Andrei Vagin
  0 siblings, 1 reply; 18+ messages in thread
From: Andrei Vagin @ 2022-07-27  1:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sean Christopherson, Paolo Bonzini, linux-kernel, kvm,
	Wanpeng Li, Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

On Tue, Jul 26, 2022 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> > On Tue, Jul 26, 2022, Andrei Vagin wrote:
> >> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> >>   has to be fully enclosed in a VM to be able to support these technologies.
> >
> > Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> > tackled the "syscalls are crazy expensive" problem by using a message queue and
> > a dedicated task outside of the enclave to handle syscalls.  Would something like
> > that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> > non-starter?
>
> Let's put VMs aside for a moment. The problem you are trying to solve is
> ptrace overhead because that requires context switching, right?

Yes, you are right.

>
> Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires
> signals, which are not cheap either, but we certainly could come up with
> a lightweight signal implementation for that particular use case.

We thought about this interface and how it could be used for gVisor's needs. I
think the main question is how to manage guest address spaces.  gVisor can run
multiple processes in one sandbox. Each process must have its address space
isolated from other address spaces. The gVisor kernel (Sentry) has to run in a
separate address space that guest processes don't have access to, but the
Sentry has to be able to access all other address spaces.
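
For completeness, trapping the syscalls themselves with
SYSCALL_USER_DISPATCH is straightforward (a minimal demo below, assuming
Linux >= 5.11 and matching kernel headers); it is the address-space
management that it does not cover:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	/* While @selector is "block", every syscall raises SIGSYS instead
	 * of entering the kernel; the handler can then emulate it, which
	 * is what the Sentry would do. */
	static volatile char selector = SYSCALL_DISPATCH_FILTER_ALLOW;

	static void sigsys(int sig, siginfo_t *si, void *ctx)
	{
		/* Re-allow real syscalls so the handler itself can run. */
		selector = SYSCALL_DISPATCH_FILTER_ALLOW;
		printf("trapped syscall %d\n", si->si_syscall);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = sigsys,
			.sa_flags = SA_SIGINFO,
		};

		sigaction(SIGSYS, &sa, NULL);
		/* No exempt region (start = 0, len = 0); the selector
		 * alone decides whether syscalls trap. */
		if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
			  0, 0, &selector))
			return 1;
		selector = SYSCALL_DISPATCH_FILTER_BLOCK;
		getpid();	/* traps into sigsys() instead of the kernel */
		return 0;
	}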

>
> Thanks,
>
>         tglx
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-26 10:27     ` Paolo Bonzini
@ 2022-07-27  6:44       ` Andrei Vagin
  0 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-07-27  6:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Wanpeng Li,
	Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Tue, Jul 26, 2022 at 3:27 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 7/26/22 10:33, Andrei Vagin wrote:
...
> > == Execute system calls from a user-space VMM ==
> >
> > In this case, the Sentry is always running in a VM, and a syscall handler in
> > GR0 triggers a vmexit to transfer control to the VMM (a user process running
> > in hr3); the VMM executes the required system call and transfers control back
> > to the Sentry. We can say that this implements the suggested hypercall in
> > user space.
> >
> > The Sentry syscall time is 2100ns in this case.
> >
> > The new hypercall does the same but without switching to host ring 3. It
> > reduces the Sentry syscall time to 1000ns.
>
> Yeah, ~3000 clock cycles is what I would expect.
>
> What does it translate to in terms of benchmarks?  For example a simple
> netperf/UDP_RR benchmark.

* netperf in gVisor with the syscall fast path:
$  ./runsc --platform kvm --network host --rootless do netperf -H ::1
-p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to
::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    95965.18
212992 212992

* netperf in gVisor without syscall fast path:
$  ./runsc.orig --platform kvm --network host --rootless do netperf -H
::1 -p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to
::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    58709.17
212992 212992

* netperf executed on the host without gVisor
$ netperf -H ::1 -p 12865 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to
::1 (::1) port 0 AF_INET6 : interval : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    146460.80
212992 212992
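
So on this microbenchmark, the syscall fast path takes UDP_RR from
~58.7K to ~96.0K transactions/sec, a ~63% improvement, which brings
gVisor to about two thirds of the native host rate.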

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall
  2022-07-22 23:02 ` [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall Andrei Vagin
@ 2022-08-01 11:32   ` Vitaly Kuznetsov
  2022-08-01 12:43     ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: Vitaly Kuznetsov @ 2022-08-01 11:32 UTC (permalink / raw)
  To: Andrei Vagin, Paolo Bonzini
  Cc: linux-kernel, kvm, Andrei Vagin, Sean Christopherson, Wanpeng Li,
	Jianfeng Tan, Adin Scannell, Konstantin Bogomolov, Etienne Perot

Andrei Vagin <avagin@google.com> writes:

> kvm_hypercall has to place the hypercall number in rax.
>
> Trace events show that kvm_pv_test doesn't work properly:
>      kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0
>      kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0
>      kvm_pv_test-53132: kvm_hypercall: nr 0x0 a0 0x0 a1 0x0 a2 0x0 a3 0x0
>
> With this change, it starts working as expected:
>      kvm_pv_test-54285: kvm_hypercall: nr 0x5 a0 0x0 a1 0x0 a2 0x0 a3 0x0
>      kvm_pv_test-54285: kvm_hypercall: nr 0xa a0 0x0 a1 0x0 a2 0x0 a3 0x0
>      kvm_pv_test-54285: kvm_hypercall: nr 0xb a0 0x0 a1 0x0 a2 0x0 a3 0x0
>

Fixes: ac4a4d6de22e ("selftests: kvm: test enforcement of paravirtual cpuid features")

> Signed-off-by: Andrei Vagin <avagin@google.com>
> ---
>  tools/testing/selftests/kvm/lib/x86_64/processor.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
> index ead7011ee8f6..5d85e1c021da 100644
> --- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
> @@ -1422,7 +1422,7 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
>  
>  	asm volatile("vmcall"
>  		     : "=a"(r)
> -		     : "b"(a0), "c"(a1), "d"(a2), "S"(a3));
> +		     : "a"(nr), "b"(a0), "c"(a1), "d"(a2), "S"(a3));

Wouldn't '"+a"(r)' instead of '"=a"(r)' suffice (assuming we also assing
'r' to 'nr' in the beginning, something like

diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index ead7011ee8f6..fdd6554b94a1 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -1418,10 +1418,10 @@ bool set_cpuid(struct kvm_cpuid2 *cpuid,
 uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
                       uint64_t a3)
 {
-       uint64_t r;
+       uint64_t r = nr;
 
        asm volatile("vmcall"
-                    : "=a"(r)
+                    : "+a"(r)
                     : "b"(a0), "c"(a1), "d"(a2), "S"(a3));
        return r;
 }
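
For reference, the "+a" constraint makes %rax both an input and an
output operand, so the compiler loads 'r' (initialized to 'nr') into
%rax before the vmcall and reads the return value back from it, which
is equivalent to the separate "a"(nr) input with one operand fewer.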

-- 
Vitaly


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall
  2022-08-01 11:32   ` Vitaly Kuznetsov
@ 2022-08-01 12:43     ` Paolo Bonzini
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2022-08-01 12:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Andrei Vagin
  Cc: linux-kernel, kvm, Sean Christopherson, Wanpeng Li, Jianfeng Tan,
	Adin Scannell, Konstantin Bogomolov, Etienne Perot

On 8/1/22 13:32, Vitaly Kuznetsov wrote:
> Fixes: ac4a4d6de22e ("selftests: kvm: test enforcement of paravirtual cpuid features")

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
  2022-07-27  1:03         ` Andrei Vagin
@ 2022-08-22 20:26           ` Andrei Vagin
  0 siblings, 0 replies; 18+ messages in thread
From: Andrei Vagin @ 2022-08-22 20:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sean Christopherson, Paolo Bonzini, linux-kernel, kvm,
	Wanpeng Li, Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
	Konstantin Bogomolov, Etienne Perot, Andy Lutomirski,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

On Tue, Jul 26, 2022 at 6:03 PM Andrei Vagin <avagin@google.com> wrote:
>
> On Tue, Jul 26, 2022 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, Jul 26 2022 at 15:10, Sean Christopherson wrote:
> > > On Tue, Jul 26, 2022, Andrei Vagin wrote:
> > >> * It doesn't allow us to support Confidential Computing (SEV-ES/SGX). The Sentry
> > >>   has to be fully enclosed in a VM to be able to support these technologies.
> > >
> > > Speaking of SGX, this reminds me a lot of Graphene, SCONEs, etc..., which IIRC
> > > tackled the "syscalls are crazy expensive" problem by using a message queue and
> > > a dedicated task outside of the enclave to handle syscalls.  Would something like
> > > that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
> > > non-starter?
> >
> > Let's put VMs aside for a moment. The problem you are trying to solve is
> > ptrace overhead because that requires context switching, right?
>
> Yes, you are right.
>
> >
> > Did you ever try to solve this with SYSCALL_USER_DISPATCH? That requires
> > signals, which are not cheap either, but we certainly could come up with
> > a lightweight signal implementation for that particular use case.

Thomas,

I found that the idea of a lightweight signal implementation could be
interesting in a slightly different context. I have a prototype of a
gVisor platform that uses seccomp to trap guest system calls. Guest
threads run in stub processes that are used to manage guest address
spaces. Each stub process sets seccomp filters to trap all system calls,
and it has a signal handler for SIGTRAP, SIGSEGV, SIGFPE, SIGILL, and
SIGBUS. Each time one of these signals is triggered, the signal handler
notifies the Sentry about it. This platform has two obvious problems:

* It requires context switching.
* Signals are expensive.
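
To make the trap setup concrete, each stub process installs something
like the following (a sketch, not the actual gVisor filter; a real
filter must exempt the syscalls the signal handler itself needs, e.g.
by matching the instruction pointer):

	#include <linux/filter.h>
	#include <linux/seccomp.h>
	#include <sys/prctl.h>

	/* A one-instruction BPF program that turns every syscall into
	 * SIGSYS (SECCOMP_RET_TRAP), which the stub's signal handler
	 * then forwards to the Sentry. */
	static int trap_all_syscalls(void)
	{
		struct sock_filter insns[] = {
			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		};
		struct sock_fprog prog = {
			.len = sizeof(insns) / sizeof(insns[0]),
			.filter = insns,
		};

		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			return -1;
		return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
	}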

The first one can be solved with umcg, which allows synchronous context
switches between processes. A lightweight signal implementation could
solve the second problem.

Do you have any concrete ideas on how to do that?

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-08-22 20:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-22 23:02 [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Andrei Vagin
2022-07-22 23:02 ` [PATCH 1/5] kernel: add a new helper to execute system calls from kernel code Andrei Vagin
2022-07-22 23:02 ` [PATCH 2/5] kvm/x86: add controls to enable/disable paravirtualized system calls Andrei Vagin
2022-07-22 23:02 ` [PATCH 3/5] KVM/x86: add a new hypercall to execute host " Andrei Vagin
2022-07-22 23:02 ` [PATCH 4/5] selftests/kvm/x86_64: set rax before vmcall Andrei Vagin
2022-08-01 11:32   ` Vitaly Kuznetsov
2022-08-01 12:43     ` Paolo Bonzini
2022-07-22 23:02 ` [PATCH 5/5] selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL Andrei Vagin
2022-07-22 23:41 ` [PATCH 0/5] KVM/x86: add a new hypercall to execute host system Sean Christopherson
2022-07-26  8:33   ` Andrei Vagin
2022-07-26 10:27     ` Paolo Bonzini
2022-07-27  6:44       ` Andrei Vagin
2022-07-26 15:10     ` Sean Christopherson
2022-07-26 22:10       ` Thomas Gleixner
2022-07-27  1:03         ` Andrei Vagin
2022-08-22 20:26           ` Andrei Vagin
2022-07-27  0:25       ` Andrei Vagin
2022-07-26 21:27   ` Thomas Gleixner
