linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "tip-bot2 for Gabriel Krisman Bertazi" <tip-bot2@linutronix.de>
To: linux-tip-commits@vger.kernel.org
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Kees Cook <keescook@chromium.org>,
	Andy Lutomirski <luto@kernel.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: core/entry] docs: Document Syscall User Dispatch
Date: Wed, 02 Dec 2020 14:12:02 -0000	[thread overview]
Message-ID: <160691832261.3364.17061203625721748275.tip-bot2@tip-bot2> (raw)
In-Reply-To: <20201127193238.821364-8-krisman@collabora.com>

The following commit has been merged into the core/entry branch of tip:

Commit-ID:     a4452e671c6770e1bb80764f39995934067f70a0
Gitweb:        https://git.kernel.org/tip/a4452e671c6770e1bb80764f39995934067f70a0
Author:        Gabriel Krisman Bertazi <krisman@collabora.com>
AuthorDate:    Fri, 27 Nov 2020 14:32:38 -05:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

docs: Document Syscall User Dispatch

Explain the interface, provide some background and security notes.

[ tglx: Add note about non-visibility, add it to the index and fix the
  	kerneldoc warning ] 

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-8-krisman@collabora.com
---
 Documentation/admin-guide/index.rst                 |  1 +-
 Documentation/admin-guide/syscall-user-dispatch.rst | 90 ++++++++++++-
 2 files changed, 91 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4e0c4ae..b29d3c1 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -111,6 +111,7 @@ configure specific aspects of kernel behavior to your liking.
    rtc
    serial-console
    svga
+   syscall-user-dispatch
    sysrq
    thunderbolt
    ufs
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 0000000..a380d65
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process.  Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters.  Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace.  The application is in control of a flip
+switch, indicating the current personality of the process.  A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.  Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux.  Syscall User Dispatch, therefore
+doesn't rely on any of the syscall ABI to make the filtering.  It uses
+only the syscall dispatcher address and the userspace key.
+
+As the ABI of these intercepted syscalls is unknown to Linux, these
+syscalls are not instrumentable via ptrace or the syscall tracepoints.
+
+Interface
+---------
+
+A thread can setup this mechanism on supported kernels by executing the
+following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread.  When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+[<offset>, <offset>+<length>) delimit a memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector.  This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable disable syscall redirection
+thread-wide, without the need to invoke the kernel directly.  selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process.  It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value.  If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.

  parent reply	other threads:[~2020-12-02 14:13 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-27 19:32 [PATCH v8 0/7] Syscall User Dispatch Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 1/7] x86: vdso: Expose sigreturn address on vdso to the kernel Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 2/7] signal: Expose SYS_USER_DISPATCH si_code type Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 3/7] kernel: Implement selective syscall userspace redirection Gabriel Krisman Bertazi
2020-12-01 22:57   ` Kees Cook
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2021-06-30 21:44   ` [PATCH v8 3/7] " Eric W. Biederman
2021-07-01 17:09     ` Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 4/7] entry: Support Syscall User Dispatch on common syscall entry Gabriel Krisman Bertazi
2020-12-01 22:57   ` Kees Cook
2020-12-02  0:04   ` Andy Lutomirski
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 5/7] selftests: Add kselftest for syscall user dispatch Gabriel Krisman Bertazi
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 6/7] selftests: Add benchmark " Gabriel Krisman Bertazi
2020-12-01 22:58   ` Kees Cook
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi
2020-11-27 19:32 ` [PATCH v8 7/7] docs: Document Syscall User Dispatch Gabriel Krisman Bertazi
2020-12-01 22:21   ` Jonathan Corbet
2020-12-01 23:46     ` Thomas Gleixner
2020-12-01 22:53   ` Thomas Gleixner
2020-12-02  9:38   ` [tip: core/entry] " tip-bot2 for Gabriel Krisman Bertazi
2020-12-02 14:12   ` tip-bot2 for Gabriel Krisman Bertazi [this message]
2020-12-02  0:04 ` [PATCH v8 0/7] " Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=160691832261.3364.17061203625721748275.tip-bot2@tip-bot2 \
    --to=tip-bot2@linutronix.de \
    --cc=keescook@chromium.org \
    --cc=krisman@collabora.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).