From: ebiederm@xmission.com (Eric W. Biederman)
To: Peter Collingbourne
Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Thomas Gleixner, Andy Lutomirski, Kees Cook,
 Andrew Morton, Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Colin Ian King,
 Mark Rutland, Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
 Gabriel Krisman Bertazi, Balbir Singh, Chris Hyser, Daniel Vetter,
 Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
 Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
 Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 Evgenii Stepanov
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)
Date: Wed, 22 Sep 2021 09:23:10 -0500
Message-ID: <87k0j8zo35.fsf@disp2133>
In-Reply-To: <20210922061809.736124-1-pcc@google.com> (Peter Collingbourne's message of "Tue, 21 Sep 2021 23:18:09 -0700")
References: <20210922061809.736124-1-pcc@google.com>

Peter Collingbourne writes:

> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl().
> The prctl() is a request for the kernel to log any uaccesses made during
> the next syscall to the uaccess buffer. When the next syscall returns, the
> address one past the end of the logged uaccess buffer entries is written
> to the location specified by the third argument to the prctl(). In this
> way, the userspace program may enumerate the uaccesses logged to the
> access buffer to determine which accesses occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
>    making use of the ARM Memory Tagging Extension (MTE) need to monitor
>    all memory accesses in a program so that they can detect memory
>    errors. For accesses made purely in userspace, this is achieved
>    via compiler instrumentation, or for MTE, via direct hardware
>    support. However, accesses made by the kernel on behalf of the
>    user program via syscalls (i.e. uaccesses) are invisible to these
>    tools. With MTE there is some level of error detection possible in
>    the kernel (in synchronous mode, bad accesses generally result in
>    returning -EFAULT from the syscall), but by the time we get back to
>    userspace we've lost the information about the address and size of the
>    failed access, which makes it harder to produce a useful error report.
>
>    With the current versions of the sanitizers, we address this by
>    interposing the libc syscall stubs with a wrapper that checks the
>    memory based on what we believe the uaccesses will be. However, this
>    creates a maintenance burden: each syscall must be annotated with
>    its uaccesses in order to be recognized by the sanitizer, and these
>    annotations must be continuously updated as the kernel changes. This
>    is especially burdensome for syscalls such as ioctl(2) which have a
>    large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses.
>    This can be achieved in conjunction with the userspace memory safety
>    tools mentioned in (1). Even a sanitizer whose syscall wrappers have
>    complete knowledge of the kernel's intended API may still diverge from
>    the kernel's actual uaccesses due to kernel bugs. A sanitizer with
>    knowledge of the kernel's actual uaccesses may produce more accurate
>    error reports that reveal such bugs.
>
>    An example of such a bug, which was found by an earlier version of this
>    patch together with a prototype client of the API in HWASan, was fixed
>    by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
>    a struct ifreq for socket ioctls"). Although this bug turned out to be
>    relatively harmless, it was a bug nonetheless and it's always possible
>    that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
>    guide a kernel fuzzing tool such as syzkaller (so that it knows which
>    parts of user memory to fuzz), as an alternative to providing the tool
>    with a list of syscalls and their uaccesses (which again thanks to
>    (2) may not be accurate).

How is logging the kernel's activity like this not a significant
information leak? How is this safe for unprivileged users?

Eric

> All signals except SIGKILL and SIGSTOP are masked for the interval
> between the prctl() and the next syscall in order to prevent handlers
> for intervening asynchronous signals from issuing syscalls that may
> cause uaccesses from the wrong syscall to be logged.
>
> The format of a uaccess buffer entry is defined as follows:
>
>   struct uaccess_buffer_entry {
>           u64 addr, size, flags;
>   };
>
> The meaning of addr and size should be obvious. On arm64, tag bits
> are preserved in the addr field. The current meaning of the flags
> field is that bit 0 indicates whether the access was a read (clear)
> or a write (set). The meaning of all other flag bits is reserved.
> All fields are of type u64 in order to avoid compat concerns.
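
A minimal sketch of how a tool might decode one entry in the format
described above. This is not part of the patch: struct uaccess_entry and
UACCESS_FLAG_WRITE are hypothetical userspace mirrors of the quoted
struct and of UACCESS_BUFFER_FLAG_WRITE, and both helper names are
invented for illustration. It assumes only what the description states:
three 64-bit fields, with bit 0 of flags distinguishing reads from writes
and all other bits reserved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace mirror of the entry layout quoted above. */
struct uaccess_entry {
	uint64_t addr, size, flags;
};

/* Three u64 fields, so the layout is the same for compat tasks. */
_Static_assert(sizeof(struct uaccess_entry) == 24, "unexpected padding");

/* Bit 0 of flags: clear means read, set means write (assumed name). */
#define UACCESS_FLAG_WRITE 1ULL

/* Classify an entry; reserved flag bits are deliberately ignored. */
static const char *uaccess_entry_kind(const struct uaccess_entry *e)
{
	assert(e != NULL);
	return (e->flags & UACCESS_FLAG_WRITE) ? "WRITE" : "READ";
}

/* Format one entry into buf; returns what snprintf() returns. */
static int uaccess_entry_format(const struct uaccess_entry *e,
				char *buf, size_t len)
{
	return snprintf(buf, len, "%s at 0x%llx size 0x%llx",
			uaccess_entry_kind(e),
			(unsigned long long)e->addr,
			(unsigned long long)e->size);
}
```

Masking with bit 0 rather than comparing flags against 0 or 1 keeps such
a decoder working if the reserved bits are given a meaning later.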
>
> Here is an example of a code snippet that will enumerate the accesses
> performed by a uname(2) syscall:
>
>   struct uaccess_buffer_entry entries[64];
>   /* Updated by the kernel to point one past the last logged entry. */
>   uint64_t entries_end64 = (uint64_t)&entries;
>   struct utsname un;
>   prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
>   uname(&un);
>   struct uaccess_buffer_entry *entries_end = (struct uaccess_buffer_entry *)entries_end64;
>   /* Walk the logged entries by pointer, up to the reported end. */
>   for (struct uaccess_buffer_entry *i = entries; i != entries_end; ++i) {
>           printf("%s at 0x%lx size 0x%lx\n",
>                  i->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
>                  (unsigned long)i->addr, (unsigned long)i->size);
>   }
>
> Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> course, not all of the accesses may fit in the buffer, but aside from
> that, there are syscalls such as async I/O that are currently missed due
> to the uaccesses occurring on a different kernel task (this is analogous
> to how async I/O accesses are exempt from userspace MTE checks). We
> view this as acceptable, as the access buffer can be sized sufficiently
> large to handle syscalls that make a reasonable number of uaccesses,
> and syscalls that use a different task for uaccesses are rare. In
> many cases, the sanitizer does not need to see every memory access,
> so it's fine if we miss the odd uaccess here and there. Even for those
> sanitizers that do need to see every memory access it still represents
> a much lower maintenance burden if we just have to handle the unusual
> syscalls specially.
>
> Because we don't have a common kernel entry/exit code path that is used
> on all architectures, uaccess logging is only implemented for arm64 and
> architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
>
> One downside of this ABI is that it involves making two syscalls per
> "real" syscall, which can harm performance.
> One possible way to avoid
> this may be to have the prctl() register the uaccess buffer location
> once at thread startup and use the same location for all syscalls in
> the thread. However, because the program may be making syscalls very
> early, before TLS is available, this may not always work. Furthermore,
> because of the same asynchronous signal concerns that prompted temporarily
> masking signals after the prctl(), the syscall stub would need to be made
> reentrant, and it is unclear whether this is feasible without manually
> masking asynchronous signals using rt_sigprocmask(2) while reading the
> uaccess buffer, defeating the purpose of avoiding the extra syscall.
>
> One idea that we considered involved using the stack pointer address as
> a unique identifier for the syscall, but this would need to be
> arch-specific as we currently do not appear to have an arch-generic way
> of retrieving the stack pointer; the userspace side would also need some
> arch-specific code for this to work. It's also possible that a longjmp()
> past the signal handler would make the stack pointer address not unique
> enough for this purpose.
>
> On the other hand, by allocating the uaccess log on the stack and blocking
> asynchronous signals for the interval between the prctl() and the "real"
> syscall, we can avoid any reentrancy and TLS concerns.
>
> Another way to avoid the overhead may be to use an architecture-specific
> calling convention to pass the address of the uaccess buffer to the kernel
> at syscall time in registers currently unused for syscall arguments. For
> example, one arm64-specific scheme that was used in a previous iteration
> of the patch was:
>
> - Bit 0 of the immediate argument to the SVC instruction must be set.
> - Register X6 contains the address of the access buffer.
> - Register X7 contains the size of the access buffer in bytes.
> - On return, X6 will contain the address of the memory location following
>   any access buffer entries written by the kernel.
>
> However, this would need to be implemented separately for each
> architecture (and some of them don't have enough registers anyway),
> whereas the prctl() is (at least in theory) architecture-generic.
>
> We also evaluated implementing this on top of the existing tracepoint
> facility, but concluded that it is not suitable for this purpose:
>
> - Tracepoints have a per-task granularity at best, whereas we really want
>   to trace per-syscall. This is so that we can exclude syscalls that
>   should not be traced, such as syscalls that make up part of the
>   sanitizer implementation (to avoid infinite recursion when e.g. printing
>   an error report).
>
> - Tracing would need to be synchronous in order to produce useful
>   stack traces. For example this could be achieved using the new SIGTRAP
>   on perf events mechanism. However, this would require logging each
>   access to the stack (in the form of a sigcontext) and this is more
>   likely to overflow the stack due to being much larger than a uaccess
>   buffer entry as well as being unbounded, in contrast to the bounded
>   buffer size passed to prctl(). An approach based on signal handlers is
>   also likely to fall foul of the asynchronous signal issues mentioned
>   previously, together with needing sigreturn to be handled specially
>   (because it copies a sigcontext from userspace) otherwise we could
>   never return from the signal handler. Furthermore, arguments to the
>   trace events are not available to SIGTRAP. (This on its own wouldn't
>   be insurmountable though -- we could add the arguments as fields
>   to siginfo.)
>
> - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
>   -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
>   I don't think it's useable because it's per-CPU and not per-task.
>
> - Tracepoints can be used by eBPF programs, but eBPF programs may
>   only be loaded as root, among other potential headaches.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne
> ---
>  arch/Kconfig                        |   6 ++
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/kernel/syscall.c         |   2 +
>  include/linux/instrumented.h        |   5 +-
>  include/linux/sched.h               |   3 +
>  include/linux/uaccess_buffer.h      |  43 ++++++++++
>  include/linux/uaccess_buffer_info.h |  23 ++++++
>  include/uapi/linux/prctl.h          |   9 +++
>  kernel/Makefile                     |   1 +
>  kernel/entry/common.c               |   3 +
>  kernel/sys.c                        |   6 ++
>  kernel/uaccess_buffer.c             | 118 ++++++++++++++++++++++++++++
>  12 files changed, 219 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/uaccess_buffer.h
>  create mode 100644 include/linux/uaccess_buffer_info.h
>  create mode 100644 kernel/uaccess_buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8df1c7102643..a427f6440cc9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
>  	bool
>
>  config GENERIC_ENTRY
> +	select UACCESS_BUFFER
>  	bool
>
>  config KPROBES
> @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
>  config ARCH_HAS_PARANOID_L1D_FLUSH
>  	bool
>
> +config UACCESS_BUFFER
> +	bool
> +	help
> +	  Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..4764e5fd7ba9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,6 +221,7 @@ config ARM64
>  	select THREAD_INFO_IN_TASK
>  	select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
>  	select TRACE_IRQFLAGS_SUPPORT
> +	select UACCESS_BUFFER
>  	help
>  	  ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..c3f8652d84a5 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
>  		goto trace_exit;
>  	}
>
> +	uaccess_buffer_syscall_entry();
>  	invoke_syscall(regs, scno, sc_nr, syscall_table);
> +	uaccess_buffer_syscall_exit();
>
>  	/*
>  	 * The tracing status may have changed under our feet, so we have to
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..9144936edcb1 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
>  /*
>   * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_H
>  #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /**
>   * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>  	kasan_check_read(from, n);
>  	kcsan_check_read(from, n);
> +	uaccess_buffer_log_write(to, n);
>  }
>
>  /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>  	kasan_check_write(to, n);
>  	kcsan_check_write(to, n);
> +	uaccess_buffer_log_read(from, n);
>  }
>
>  #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
>  	struct callback_head		l1d_flush_kill;
>  #endif
>
> +	struct uaccess_buffer_info	uaccess_buffer;
> +
>  	/*
>  	 * New fields for task_struct should be added above here, so that
>  	 * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> new file mode 100644
> index 000000000000..3b81f2a192a4
> --- /dev/null
> +++ b/include/linux/uaccess_buffer.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_H
> +#define _LINUX_ACCESS_BUFFER_H
> +
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +
> +void uaccess_buffer_syscall_entry(void);
> +void uaccess_buffer_syscall_exit(void);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> +			       unsigned long store_end_addr);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +					   unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_logging(unsigned long addr,
> +					     unsigned long size,
> +					     unsigned long store_end_addr)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_H */
> diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> new file mode 100644
> index 000000000000..a6cefe6e73b5
> --- /dev/null
> +++ b/include/linux/uaccess_buffer_info.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> +#define _LINUX_ACCESS_BUFFER_INFO_H
> +
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +	unsigned long addr, size;
> +	unsigned long store_end_addr;
> +	sigset_t saved_sigmask;
> +	u8 state;
> +};
> +
> +#else
> +
> +struct uaccess_buffer_info {
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43bd7f713c39..d8baacaef800 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,13 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
>  # define PR_SCHED_CORE_MAX		4
>
> +/* Log uaccesses to a user-provided buffer */
> +#define PR_LOG_UACCESS 63
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +	__u64 addr, size, flags;
> +};
> +# define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..75a5d95ce9c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
>
>  obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..c7e7ff8cbab3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
>  	if (work & SYSCALL_WORK_ENTER)
>  		syscall = syscall_trace_enter(regs, syscall, work);
>
> +	uaccess_buffer_syscall_entry();
> +
>  	return syscall;
>  }
>
> @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
>  		local_irq_enable();
>  	}
>
> +	uaccess_buffer_syscall_exit();
>  	rseq_syscall(regs);
>
>  	/*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..df487600773c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>  		break;
>  #endif
> +	case PR_LOG_UACCESS:
> +		if (arg5)
> +			return -EINVAL;
> +		error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> new file mode 100644
> index 000000000000..b9da89887c4b
> --- /dev/null
> +++ b/kernel/uaccess_buffer.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * We use a separate implementation of copy_to_user() that avoids the call
> + * to instrument_copy_to_user() as this would otherwise lead to infinite
> + * recursion.
> + */
> +static unsigned long
> +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> +	if (!access_ok(to, n))
> +		return n;
> +	return raw_copy_to_user(to, from, n);
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +			      unsigned long flags)
> +{
> +	struct uaccess_buffer_entry entry;
> +
> +	if (current->uaccess_buffer.size < sizeof(entry) ||
> +	    unlikely(uaccess_kernel()))
> +		return;
> +	entry.addr = addr;
> +	entry.size = size;
> +	entry.flags = flags;
> +
> +	/*
> +	 * If our uaccess fails, abort the log so that the end address writeback
> +	 * does not occur and userspace sees zero accesses.
> +	 */
> +	if (uaccess_buffer_copy_to_user(
> +		    (void __user *)current->uaccess_buffer.addr, &entry,
> +		    sizeof(entry))) {
> +		current->uaccess_buffer.state = 0;
> +		current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> +	}
> +
> +	current->uaccess_buffer.addr += sizeof(entry);
> +	current->uaccess_buffer.size -= sizeof(entry);
> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> +			       unsigned long store_end_addr)
> +{
> +	sigset_t temp_sigmask;
> +
> +	current->uaccess_buffer.addr = addr;
> +	current->uaccess_buffer.size = size;
> +	current->uaccess_buffer.store_end_addr = store_end_addr;
> +
> +	/*
> +	 * Allow 2 syscalls before resetting the state: the current one (i.e.
> +	 * prctl) and the next one, whose accesses we want to log.
> +	 */
> +	current->uaccess_buffer.state = 2;
> +
> +	/*
> +	 * Temporarily mask signals so that an intervening asynchronous signal
> +	 * will not interfere with the logging.
> +	 */
> +	current->uaccess_buffer.saved_sigmask = current->blocked;
> +	sigfillset(&temp_sigmask);
> +	sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> +	__set_current_blocked(&temp_sigmask);
> +
> +	return 0;
> +}
> +
> +void uaccess_buffer_syscall_entry(void)
> +{
> +	/*
> +	 * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> +	 * to reset the mask before the syscall and not after, so that our
> +	 * temporary mask is unobservable.
> +	 */
> +	if (current->uaccess_buffer.state == 1)
> +		__set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> +}
> +
> +void uaccess_buffer_syscall_exit(void)
> +{
> +	if (current->uaccess_buffer.state > 0) {
> +		--current->uaccess_buffer.state;
> +		if (current->uaccess_buffer.state == 0) {
> +			u64 addr64 = current->uaccess_buffer.addr;
> +
> +			uaccess_buffer_copy_to_user(
> +				(void __user *)
> +					current->uaccess_buffer.store_end_addr,
> +				&addr64, sizeof(addr64));
> +			current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> +		}
> +	}
> +}
> +
> +#endif
In-Reply-To:Date:References:Cc:To:From:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=6SvQIE3gWoCGjuthAWLnjyS+Awze7BIC8pj5tVt1wr8=; b=2ABnjH0LxUxNzXgZ2lcs7CMERh T644B4pAztAv/Ms6n2RlTJx47q0N+zf/+CChivtDFITHUpMMCU672fVE90DGogrgxl8i3z2kqfFsI rOEPc682mQWLYJLJ+i2xFesuA/3DAA4VdrHR1N6FbPb6jXIrC/RLmwlvcqXKLcGxtFaA8JjbU1gp2 UTfEqqdEm0jiLCduTFHmOmQ7zlq9vDQqq5X2U9TBNQcENrcOzW1/dJkjIl7nl4mkDHSRTTEZ9QJaC DT1edK0zUorhziH2V4y/Eab/kliNP7O+POxcy0SpcO+kcEO/drp1yauvM1ZLXlfkBU+m7awBTCkFF vr/C4upw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1mT3Af-008ozh-Jc; Wed, 22 Sep 2021 14:24:26 +0000 Received: from out01.mta.xmission.com ([166.70.13.231]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1mT39y-008oiO-79 for linux-arm-kernel@lists.infradead.org; Wed, 22 Sep 2021 14:23:45 +0000 Received: from in01.mta.xmission.com ([166.70.13.51]:36930) by out01.mta.xmission.com with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1mT39d-001fDd-Ov; Wed, 22 Sep 2021 08:23:21 -0600 Received: from ip68-227-160-95.om.om.cox.net ([68.227.160.95]:45652 helo=email.xmission.com) by in01.mta.xmission.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1mT39a-009l0S-W4; Wed, 22 Sep 2021 08:23:21 -0600 From: ebiederm@xmission.com (Eric W. 
Biederman) To: Peter Collingbourne Cc: Catalin Marinas , Will Deacon , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Thomas Gleixner , Andy Lutomirski , Kees Cook , Andrew Morton , Masahiro Yamada , Sami Tolvanen , YiFei Zhu , Colin Ian King , Mark Rutland , Frederic Weisbecker , Viresh Kumar , Andrey Konovalov , Gabriel Krisman Bertazi , Balbir Singh , Chris Hyser , Daniel Vetter , Chris Wilson , Arnd Bergmann , Dmitry Vyukov , Christian Brauner , Alexey Gladkov , Ran Xiaokai , David Hildenbrand , Xiaofeng Cao , Cyrill Gorcunov , Thomas Cedeno , Marco Elver , Alexander Potapenko , linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Evgenii Stepanov References: <20210922061809.736124-1-pcc@google.com> Date: Wed, 22 Sep 2021 09:23:10 -0500 In-Reply-To: <20210922061809.736124-1-pcc@google.com> (Peter Collingbourne's message of "Tue, 21 Sep 2021 23:18:09 -0700") Message-ID: <87k0j8zo35.fsf@disp2133> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 X-XM-SPF: eid=1mT39a-009l0S-W4; ; ; mid=<87k0j8zo35.fsf@disp2133>; ; ; hst=in01.mta.xmission.com; ; ; ip=68.227.160.95; ; ; frm=ebiederm@xmission.com; ; ; spf=neutral X-XM-AID: U2FsdGVkX1+//FTHUR0wOG0wPPwTcbYaiVVeRSruPSs= X-SA-Exim-Connect-IP: 68.227.160.95 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS) X-SA-Exim-Version: 4.2.1 (built Sat, 08 Feb 2020 21:53:50 +0000) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20210922_072342_395287_EB4733B8 X-CRM114-Status: GOOD ( 66.06 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit 
Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Peter Collingbourne writes: > This patch introduces a kernel feature known as uaccess logging. > With uaccess logging, the userspace program passes the address and size > of a so-called uaccess buffer to the kernel via a prctl(). The prctl() > is a request for the kernel to log any uaccesses made during the next > syscall to the uaccess buffer. When the next syscall returns, the address > one past the end of the logged uaccess buffer entries is written to the > location specified by the third argument to the prctl(). In this way, > the userspace program may enumerate the uaccesses logged to the access > buffer to determine which accesses occurred. > > Uaccess logging has several use cases focused around bug detection > tools: > > 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools > making use of the ARM Memory Tagging Extension (MTE) need to monitor > all memory accesses in a program so that they can detect memory > errors. For accesses made purely in userspace, this is achieved > via compiler instrumentation, or for MTE, via direct hardware > support. However, accesses made by the kernel on behalf of the > user program via syscalls (i.e. uaccesses) are invisible to these > tools. With MTE there is some level of error detection possible in > the kernel (in synchronous mode, bad accesses generally result in > returning -EFAULT from the syscall), but by the time we get back to > userspace we've lost the information about the address and size of the > failed access, which makes it harder to produce a useful error report. > > With the current versions of the sanitizers, we address this by > interposing the libc syscall stubs with a wrapper that checks the > memory based on what we believe the uaccesses will be. 
However, this > creates a maintenance burden: each syscall must be annotated with > its uaccesses in order to be recognized by the sanitizer, and these > annotations must be continuously updated as the kernel changes. This > is especially burdensome for syscalls such as ioctl(2) which have a > large surface area of possible uaccesses. > > 2) Verifying the validity of kernel accesses. This can be achieved in > conjunction with the userspace memory safety tools mentioned in (1). > Even a sanitizer whose syscall wrappers have complete knowledge of > the kernel's intended API may vary from the kernel's actual uaccesses > due to kernel bugs. A sanitizer with knowledge of the kernel's actual > uaccesses may produce more accurate error reports that reveal such > bugs. > > An example of such a bug, which was found by an earlier version of this > patch together with a prototype client of the API in HWASan, was fixed > by commit d0efb16294d1 ("net: don't unconditionally copy_from_user > a struct ifreq for socket ioctls"). Although this bug turned out to > relatively harmless, it was a bug nonetheless and it's always possible > that more serious bugs of this sort may be introduced in the future. > > 3) Kernel fuzzing. We may use the list of reported kernel accesses to > guide a kernel fuzzing tool such as syzkaller (so that it knows which > parts of user memory to fuzz), as an alternative to providing the tool > with a list of syscalls and their uaccesses (which again thanks to > (2) may not be accurate). How is logging the kernel's activity like this not a significant information leak? How is this safe for unprivileged users? Eric > All signals except SIGKILL and SIGSTOP are masked for the interval > between the prctl() and the next syscall in order to prevent handlers > for intervening asynchronous signals from issuing syscalls that may > cause uaccesses from the wrong syscall to be logged. 
>
> The format of a uaccess buffer entry is defined as follows:
>
>   struct uaccess_buffer_entry {
>     u64 addr, size, flags;
>   };
>
> The meaning of addr and size should be obvious. On arm64, tag bits
> are preserved in the addr field. The current meaning of the flags
> field is that bit 0 indicates whether the access was a read (clear)
> or a write (set). All other flag bits are reserved.
> All fields are of type u64 in order to avoid compat concerns.
>
> Here is an example of a code snippet that will enumerate the accesses
> performed by a uname(2) syscall:
>
>   struct uaccess_buffer_entry entries[64];
>   uint64_t entries_end64 = (uint64_t)&entries;
>   struct utsname un;
>   prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
>   uname(&un);
>   struct uaccess_buffer_entry *entries_end =
>       (struct uaccess_buffer_entry *)entries_end64;
>   for (struct uaccess_buffer_entry *i = entries; i != entries_end; ++i) {
>     printf("%s at 0x%lx size 0x%lx\n",
>            i->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
>            (unsigned long)i->addr, (unsigned long)i->size);
>   }
>
> Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> course, not all of the accesses may fit in the buffer, but aside from
> that, there are syscalls such as async I/O that are currently missed due
> to the uaccesses occurring on a different kernel task (this is analogous
> to how async I/O accesses are exempt from userspace MTE checks). We
> view this as acceptable, as the access buffer can be sized sufficiently
> large to handle syscalls that make a reasonable number of uaccesses,
> and syscalls that use a different task for uaccesses are rare. In
> many cases, the sanitizer does not need to see every memory access,
> so it's fine if we miss the odd uaccess here and there.
> Even for those sanitizers that do need to see every memory access it
> still represents a much lower maintenance burden if we just have to
> handle the unusual syscalls specially.
>
> Because we don't have a common kernel entry/exit code path that is used
> on all architectures, uaccess logging is only implemented for arm64 and
> architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
>
> One downside of this ABI is that it involves making two syscalls per
> "real" syscall, which can harm performance. One possible way to avoid
> this may be to have the prctl() register the uaccess buffer location
> once at thread startup and use the same location for all syscalls in
> the thread. However, because the program may be making syscalls very
> early, before TLS is available, this may not always work. Furthermore,
> because of the same asynchronous signal concerns that prompted temporarily
> masking signals after the prctl(), the syscall stub would need to be made
> reentrant, and it is unclear whether this is feasible without manually
> masking asynchronous signals using rt_sigprocmask(2) while reading the
> uaccess buffer, defeating the purpose of avoiding the extra syscall.
>
> One idea that we considered involved using the stack pointer address as
> a unique identifier for the syscall, but this currently would need to be
> arch-specific as we currently do not appear to have an arch-generic way
> of retrieving the stack pointer; the userspace side would also need some
> arch-specific code for this to work. It's also possible that a longjmp()
> past the signal handler would make the stack pointer address not unique
> enough for this purpose.
>
> On the other hand, by allocating the uaccess log on the stack and blocking
> asynchronous signals for the interval between the prctl() and the "real"
> syscall, we can avoid any reentrancy and TLS concerns.
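
To make the entry format and the end-address convention from the quoted
description concrete, here is a small userspace sketch of how a consumer
might walk a stack-allocated log after the logged syscall returns. It
assumes the caller initialised the end slot to the start of the buffer, as
the quoted uname(2) example does. The struct is a local copy of the layout
from the patch's uapi header, and count_logged_writes() is a hypothetical
helper name, not something the patch provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of the entry layout proposed in the patch's uapi header. */
struct uaccess_buffer_entry {
	uint64_t addr, size, flags;
};
#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */

/*
 * Hypothetical helper: count the write entries in [entries, end), where
 * `end` is the value stored back by the kernel (one past the last entry).
 * Returns (size_t)-1 if `end` lies before the buffer, is not a whole
 * number of entries past its start, or exceeds `capacity` entries.
 */
size_t count_logged_writes(const struct uaccess_buffer_entry *entries,
			   size_t capacity, uint64_t end)
{
	uint64_t start = (uint64_t)(uintptr_t)entries;
	uint64_t bytes;
	size_t n, i, writes = 0;

	if (end < start)
		return (size_t)-1;
	bytes = end - start;
	if (bytes % sizeof(*entries) != 0 ||
	    bytes / sizeof(*entries) > capacity)
		return (size_t)-1;

	n = (size_t)(bytes / sizeof(*entries));
	for (i = 0; i < n; i++)
		if (entries[i].flags & UACCESS_BUFFER_FLAG_WRITE)
			writes++;
	return writes;
}
```

Validating the end address before walking the buffer is a defensive choice
of this sketch; the quoted example simply trusts the value written back.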
>
> Another way to avoid the overhead may be to use an architecture-specific
> calling convention to pass the address of the uaccess buffer to the kernel
> at syscall time in registers currently unused for syscall arguments. For
> example, one arm64-specific scheme that was used in a previous iteration
> of the patch was:
>
> - Bit 0 of the immediate argument to the SVC instruction must be set.
> - Register X6 contains the address of the access buffer.
> - Register X7 contains the size of the access buffer in bytes.
> - On return, X6 will contain the address of the memory location following
>   any access buffer entries written by the kernel.
>
> However, this would need to be implemented separately for each
> architecture (and some of them don't have enough registers anyway),
> whereas the prctl() is (at least in theory) architecture-generic.
>
> We also evaluated implementing this on top of the existing tracepoint
> facility, but concluded that it is not suitable for this purpose:
>
> - Tracepoints have a per-task granularity at best, whereas we really want
>   to trace per-syscall. This is so that we can exclude syscalls that
>   should not be traced, such as syscalls that make up part of the
>   sanitizer implementation (to avoid infinite recursion when e.g. printing
>   an error report).
>
> - Tracing would need to be synchronous in order to produce useful
>   stack traces. For example this could be achieved using the new SIGTRAP
>   on perf events mechanism. However, this would require logging each
>   access to the stack (in the form of a sigcontext) and this is more
>   likely to overflow the stack due to being much larger than a uaccess
>   buffer entry as well as being unbounded, in contrast to the bounded
>   buffer size passed to prctl().
>   An approach based on signal handlers is also likely to fall foul of
>   the asynchronous signal issues mentioned previously, together with
>   needing sigreturn to be handled specially (because it copies a
>   sigcontext from userspace) otherwise we could never return from the
>   signal handler. Furthermore, arguments to the trace events are not
>   available to SIGTRAP. (This on its own wouldn't be insurmountable
>   though -- we could add the arguments as fields to siginfo.)
>
> - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
>   -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
>   I don't think it's useable because it's per-CPU and not per-task.
>
> - Tracepoints can be used by eBPF programs, but eBPF programs may
>   only be loaded as root, among other potential headaches.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne
> ---
>  arch/Kconfig                        |   6 ++
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/kernel/syscall.c         |   2 +
>  include/linux/instrumented.h        |   5 +-
>  include/linux/sched.h               |   3 +
>  include/linux/uaccess_buffer.h      |  43 ++++++++++
>  include/linux/uaccess_buffer_info.h |  23 ++++++
>  include/uapi/linux/prctl.h          |   9 +++
>  kernel/Makefile                     |   1 +
>  kernel/entry/common.c               |   3 +
>  kernel/sys.c                        |   6 ++
>  kernel/uaccess_buffer.c             | 118 ++++++++++++++++++++++++++++
>  12 files changed, 219 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/uaccess_buffer.h
>  create mode 100644 include/linux/uaccess_buffer_info.h
>  create mode 100644 kernel/uaccess_buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8df1c7102643..a427f6440cc9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
>  	bool
>
>  config GENERIC_ENTRY
> +	select UACCESS_BUFFER
>  	bool
>
>  config KPROBES
> @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
>  config ARCH_HAS_PARANOID_L1D_FLUSH
>  	bool
>
> +config UACCESS_BUFFER
> +	bool
> +	help
> +	  Select if the
> +	  architecture's syscall entry/exit code supports uaccess buffers.
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..4764e5fd7ba9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,6 +221,7 @@ config ARM64
>  	select THREAD_INFO_IN_TASK
>  	select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
>  	select TRACE_IRQFLAGS_SUPPORT
> +	select UACCESS_BUFFER
>  	help
>  	  ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..c3f8652d84a5 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
>  		goto trace_exit;
>  	}
>
> +	uaccess_buffer_syscall_entry();
>  	invoke_syscall(regs, scno, sc_nr, syscall_table);
> +	uaccess_buffer_syscall_exit();
>
>  	/*
>  	 * The tracing status may have changed under our feet, so we have to
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..9144936edcb1 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
>  /*
>   * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_H
>  #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /**
>   * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>  	kasan_check_read(from, n);
>  	kcsan_check_read(from, n);
> +	uaccess_buffer_log_write(to, n);
>  }
>
>  /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>  	kasan_check_write(to, n);
>  	kcsan_check_write(to, n);
> +	uaccess_buffer_log_read(from, n);
>  }
>
>  #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
>  	struct callback_head		l1d_flush_kill;
>  #endif
>
> +	struct uaccess_buffer_info	uaccess_buffer;
> +
>  	/*
>  	 * New fields for task_struct should be added above here, so that
>  	 * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> new file mode 100644
> index 000000000000..3b81f2a192a4
> --- /dev/null
> +++ b/include/linux/uaccess_buffer.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_H
> +#define _LINUX_ACCESS_BUFFER_H
> +
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +
> +void uaccess_buffer_syscall_entry(void);
> +void uaccess_buffer_syscall_exit(void);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> +			       unsigned long store_end_addr);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +					   unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_logging(unsigned long addr,
> +					      unsigned long size,
> +					      unsigned long store_end_addr)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_H */
> diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> new file mode 100644
> index 000000000000..a6cefe6e73b5
> --- /dev/null
> +++ b/include/linux/uaccess_buffer_info.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> +#define _LINUX_ACCESS_BUFFER_INFO_H
> +
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +	unsigned long addr, size;
> +	unsigned long store_end_addr;
> +	sigset_t saved_sigmask;
> +	u8 state;
> +};
> +
> +#else
> +
> +struct uaccess_buffer_info {
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> diff --git a/include/uapi/linux/prctl.h
>   b/include/uapi/linux/prctl.h
> index 43bd7f713c39..d8baacaef800 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,13 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
>  # define PR_SCHED_CORE_MAX		4
>
> +/* Log uaccesses to a user-provided buffer */
> +#define PR_LOG_UACCESS		63
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +	__u64 addr, size, flags;
> +};
> +# define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..75a5d95ce9c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
>
>  obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..c7e7ff8cbab3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
>  	if (work & SYSCALL_WORK_ENTER)
>  		syscall = syscall_trace_enter(regs, syscall, work);
>
> +	uaccess_buffer_syscall_entry();
> +
>  	return syscall;
>  }
>
> @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
>  		local_irq_enable();
>  	}
>
> +	uaccess_buffer_syscall_exit();
>  	rseq_syscall(regs);
>
>  	/*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..df487600773c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>  		break;
>  #endif
> +
> +	case PR_LOG_UACCESS:
> +		if (arg5)
> +			return -EINVAL;
> +		error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> new file mode 100644
> index 000000000000..b9da89887c4b
> --- /dev/null
> +++ b/kernel/uaccess_buffer.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * We use a separate implementation of copy_to_user() that avoids the call
> + * to instrument_copy_to_user() as this would otherwise lead to infinite
> + * recursion.
> + */
> +static unsigned long
> +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> +	if (!access_ok(to, n))
> +		return n;
> +	return raw_copy_to_user(to, from, n);
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +			       unsigned long flags)
> +{
> +	struct uaccess_buffer_entry entry;
> +
> +	if (current->uaccess_buffer.size < sizeof(entry) ||
> +	    unlikely(uaccess_kernel()))
> +		return;
> +	entry.addr = addr;
> +	entry.size = size;
> +	entry.flags = flags;
> +
> +	/*
> +	 * If our uaccess fails, abort the log so that the end address writeback
> +	 * does not occur and userspace sees zero accesses.
> +	 */
> +	if (uaccess_buffer_copy_to_user(
> +		    (void __user *)current->uaccess_buffer.addr, &entry,
> +		    sizeof(entry))) {
> +		current->uaccess_buffer.state = 0;
> +		current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> +	}
> +
> +	current->uaccess_buffer.addr += sizeof(entry);
> +	current->uaccess_buffer.size -= sizeof(entry);
> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> +			       unsigned long store_end_addr)
> +{
> +	sigset_t temp_sigmask;
> +
> +	current->uaccess_buffer.addr = addr;
> +	current->uaccess_buffer.size = size;
> +	current->uaccess_buffer.store_end_addr = store_end_addr;
> +
> +	/*
> +	 * Allow 2 syscalls before resetting the state: the current one (i.e.
> +	 * prctl) and the next one, whose accesses we want to log.
> +	 */
> +	current->uaccess_buffer.state = 2;
> +
> +	/*
> +	 * Temporarily mask signals so that an intervening asynchronous signal
> +	 * will not interfere with the logging.
> +	 */
> +	current->uaccess_buffer.saved_sigmask = current->blocked;
> +	sigfillset(&temp_sigmask);
> +	sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> +	__set_current_blocked(&temp_sigmask);
> +
> +	return 0;
> +}
> +
> +void uaccess_buffer_syscall_entry(void)
> +{
> +	/*
> +	 * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> +	 * to reset the mask before the syscall and not after, so that our
> +	 * temporary mask is unobservable.
> +	 */
> +	if (current->uaccess_buffer.state == 1)
> +		__set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> +}
> +
> +void uaccess_buffer_syscall_exit(void)
> +{
> +	if (current->uaccess_buffer.state > 0) {
> +		--current->uaccess_buffer.state;
> +		if (current->uaccess_buffer.state == 0) {
> +			u64 addr64 = current->uaccess_buffer.addr;
> +
> +			uaccess_buffer_copy_to_user(
> +				(void __user *)
> +					current->uaccess_buffer.store_end_addr,
> +				&addr64, sizeof(addr64));
> +			current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> +		}
> +	}
> +}
> +
> +#endif
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
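
As a reading aid for the state handling in kernel/uaccess_buffer.c, the
"arm for two syscall exits, then write back the end address" logic can be
modelled in plain userspace C. This is a sketch under assumptions, not the
kernel code: struct model_info and the model_*() functions are hypothetical
names, memcpy() stands in for the raw user copy (so the failed-copy abort
path is left out), and the signal mask handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the entry layout proposed in the patch's uapi header. */
struct uaccess_buffer_entry {
	uint64_t addr, size, flags;
};
#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */

/* Hypothetical userspace stand-in for the per-task logging state. */
struct model_info {
	unsigned char *cur;  /* next free slot in the log */
	size_t size;         /* bytes still available in the log */
	uint64_t *store_end; /* where the end address is written back */
	unsigned state;      /* 2 after arming, 1 while logging, 0 idle */
};

/* Models the prctl(): remember the buffer and arm for two syscall exits. */
void model_set_logging(struct model_info *t, void *buf, size_t size,
		       uint64_t *store_end)
{
	t->cur = buf;
	t->size = size;
	t->store_end = store_end;
	t->state = 2;
}

/* Models one logged access; entries that no longer fit are dropped. */
void model_log(struct model_info *t, uint64_t addr, uint64_t size,
	       uint64_t flags)
{
	struct uaccess_buffer_entry e = { addr, size, flags };

	if (t->size < sizeof(e))
		return;
	memcpy(t->cur, &e, sizeof(e));
	t->cur += sizeof(e);
	t->size -= sizeof(e);
}

/* Models syscall exit: the second exit after arming publishes the end. */
void model_syscall_exit(struct model_info *t)
{
	if (t->state == 0)
		return;
	if (--t->state == 0) {
		*t->store_end = (uint64_t)(uintptr_t)t->cur;
		t->cur = NULL;
		t->size = 0;
	}
}
```

In this model the first model_syscall_exit() call corresponds to the
prctl() returning and the second to the logged syscall returning, which is
when the end address becomes visible to the caller.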