All of lore.kernel.org
 help / color / mirror / Atom feed
From: "tip-bot2 for Gabriel Krisman Bertazi" <tip-bot2@linutronix.de>
To: linux-tip-commits@vger.kernel.org
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Kees Cook <keescook@chromium.org>,
	Andy Lutomirski <luto@kernel.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: core/entry] docs: Document Syscall User Dispatch
Date: Wed, 02 Dec 2020 14:12:02 -0000	[thread overview]
Message-ID: <160691832261.3364.17061203625721748275.tip-bot2@tip-bot2> (raw)
In-Reply-To: <20201127193238.821364-8-krisman@collabora.com>

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     a4452e671c6770e1bb80764f39995934067f70a0
Gitweb:        https://git.kernel.org/tip/a4452e671c6770e1bb80764f39995934067f70a0
Author:        Gabriel Krisman Bertazi <krisman@collabora.com>
AuthorDate:    Fri, 27 Nov 2020 14:32:38 -05:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

docs: Document Syscall User Dispatch

Explain the interface, provide some background and security notes.

[ tglx: Add note about non-visibility, add it to the index and fix the
  	kerneldoc warning ] 

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-8-krisman@collabora.com
---
 Documentation/admin-guide/index.rst                 |  1 +-
 Documentation/admin-guide/syscall-user-dispatch.rst | 90 ++++++++++++-
 2 files changed, 91 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4e0c4ae..b29d3c1 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -111,6 +111,7 @@ configure specific aspects of kernel behavior to your liking.
    rtc
    serial-console
    svga
+   syscall-user-dispatch
    sysrq
    thunderbolt
    ufs
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 0000000..a380d65
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process.  Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters.  Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace.  The application is in control of a flip
+switch, indicating the current personality of the process.  A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.  Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux.  Syscall User Dispatch, therefore
+doesn't rely on any of the syscall ABI to make the filtering.  It uses
+only the syscall dispatcher address and the userspace key.
+
+As the ABI of these intercepted syscalls is unknown to Linux, these
+syscalls are not instrumentable via ptrace or the syscall tracepoints.
+
+Interface
+---------
+
+A thread can setup this mechanism on supported kernels by executing the
+following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread.  When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+[<offset>, <offset>+<length>) delimit a memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector.  This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable disable syscall redirection
+thread-wide, without the need to invoke the kernel directly.  selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process.  It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value.  If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.

  parent reply	other threads:[~2020-12-02 14:13 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-27 19:32 [PATCH v8 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 2/7] signal: Expose SYS_USER_DISPATCH si_code type Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
2020-12-01 22:57   ` Kees Cook
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2021-06-30 21:44   ` [PATCH v8 3/7] " Eric W. Biederman
2021-07-01 17:09     ` Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 4/7] entry: Support Syscall User Dispatch on common syscall entry Gabriel Krisman Bertazi
2020-12-01 22:57   ` Kees Cook
2020-12-02  0:04   ` Andy Lutomirski
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 5/7] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 6/7] selftests: Add benchmark " Gabriel Krisman Bertazi
2020-12-01 22:58   ` Kees Cook
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
2020-12-01 22:21   ` Jonathan Corbet
2020-12-01 23:46     ` Thomas Gleixner
2020-12-01 22:53   ` Thomas Gleixner
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi [this message]
2020-12-02  0:04 ` [PATCH v8 0/7] " Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=160691832261.3364.17061203625721748275.tip-bot2@tip-bot2 \
    --to=tip-bot2@linutronix.de \
    --cc=keescook@chromium.org \
    --cc=krisman@collabora.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.