All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-09 22:15 ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

This patch series introduces a kernel feature known as uaccess
logging, which allows userspace programs to be made aware of the
address and size of uaccesses performed by the kernel during
the servicing of a syscall. More details on the motivation
for and interface to this feature are available in the file
Documentation/admin-guide/uaccess-logging.rst added by the final
patch in the series.

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64
and architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

The proposed interface is the result of numerous iterations and
prototyping and is based on a proposal by Dmitry Vyukov. The interface
preserves the correspondence between uaccess log identity and syscall
identity while tolerating incoming asynchronous signals in the interval
between setting up the logging and the actual syscall. We considered
a number of alternative designs but rejected them for various reasons:

- The design from v1 of this patch [1] proposed notifying the kernel
  of the address and size of the uaccess buffer via a prctl that
  would also automatically mask and unmask asynchronous signals as
  needed, but this would require multiple syscalls per "real" syscall,
  harming performance.

- We considered extending the syscall calling convention to
  designate currently-unused registers to be used to pass the
  location of the uaccess buffer, but this was rejected for being
  architecture-specific.

- One idea that we considered involved using the stack pointer address
  as a unique identifier for the syscall, but this currently would
  need to be arch-specific as we currently do not appear to have an
  arch-generic way of retrieving the stack pointer; the userspace
  side would also need some arch-specific code for this to work. It's
  also possible that a longjmp() past the signal handler would make
  the stack pointer address not unique enough for this purpose.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
  to trace per-syscall. This is so that we can exclude syscalls that
  should not be traced, such as syscalls that make up part of the
  sanitizer implementation (to avoid infinite recursion when e.g. printing
  an error report).

- Tracing would need to be synchronous in order to produce useful
  stack traces. For example this could be achieved using the new SIGTRAP
  on perf events mechanism. However, this would require logging each
  access to the stack (in the form of a sigcontext) and this is more
  likely to overflow the stack due to being much larger than a uaccess
  buffer entry as well as being unbounded, in contrast to the bounded
  buffer size passed to prctl(). An approach based on signal handlers is
  also likely to fall foul of the asynchronous signal issues mentioned
  previously, together with needing sigreturn to be handled specially
  (because it copies a sigcontext from userspace) otherwise we could
  never return from the signal handler. Furthermore, arguments to the
  trace events are not available to SIGTRAP. (This on its own wouldn't
  be insurmountable though -- we could add the arguments as fields
  to siginfo.)

- The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
  -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
  I don't think it's usable because it's per-CPU and not per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs may
  only be loaded as root, among other potential headaches.

[1] https://lore.kernel.org/all/20210922061809.736124-1-pcc@google.com/

Peter Collingbourne (7):
  include: split out uaccess instrumentation into a separate header
  uaccess-buffer: add core code
  fs: use copy_from_user_nolog() to copy mount() data
  uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  arm64: add support for uaccess logging
  Documentation: document uaccess logging
  selftests: test uaccess logging

 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 151 +++++++++++++++++
 arch/Kconfig                                  |  14 ++
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/thread_info.h          |   7 +-
 arch/arm64/kernel/ptrace.c                    |   7 +
 arch/arm64/kernel/signal.c                    |   5 +
 fs/exec.c                                     |   3 +
 fs/namespace.c                                |   8 +-
 include/linux/entry-common.h                  |   2 +
 include/linux/instrumented-uaccess.h          |  53 ++++++
 include/linux/instrumented.h                  |  34 ----
 include/linux/sched.h                         |   5 +
 include/linux/thread_info.h                   |   4 +
 include/linux/uaccess-buffer-info.h           |  46 ++++++
 include/linux/uaccess-buffer.h                | 152 ++++++++++++++++++
 include/linux/uaccess.h                       |   2 +-
 include/uapi/linux/prctl.h                    |   3 +
 include/uapi/linux/uaccess-buffer.h           |  27 ++++
 kernel/Makefile                               |   1 +
 kernel/bpf/helpers.c                          |   7 +-
 kernel/entry/common.c                         |  14 +-
 kernel/fork.c                                 |   4 +
 kernel/signal.c                               |   9 +-
 kernel/sys.c                                  |   6 +
 kernel/uaccess-buffer.c                       | 145 +++++++++++++++++
 lib/iov_iter.c                                |   2 +-
 lib/usercopy.c                                |   2 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/uaccess_buffer/Makefile |   4 +
 .../uaccess_buffer/uaccess_buffer_test.c      | 126 +++++++++++++++
 31 files changed, 802 insertions(+), 44 deletions(-)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst
 create mode 100644 include/linux/instrumented-uaccess.h
 create mode 100644 include/linux/uaccess-buffer-info.h
 create mode 100644 include/linux/uaccess-buffer.h
 create mode 100644 include/uapi/linux/uaccess-buffer.h
 create mode 100644 kernel/uaccess-buffer.c
 create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
 create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c

-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-09 22:15 ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

This patch series introduces a kernel feature known as uaccess
logging, which allows userspace programs to be made aware of the
address and size of uaccesses performed by the kernel during
the servicing of a syscall. More details on the motivation
for and interface to this feature are available in the file
Documentation/admin-guide/uaccess-logging.rst added by the final
patch in the series.

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64
and architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

The proposed interface is the result of numerous iterations and
prototyping and is based on a proposal by Dmitry Vyukov. The interface
preserves the correspondence between uaccess log identity and syscall
identity while tolerating incoming asynchronous signals in the interval
between setting up the logging and the actual syscall. We considered
a number of alternative designs but rejected them for various reasons:

- The design from v1 of this patch [1] proposed notifying the kernel
  of the address and size of the uaccess buffer via a prctl that
  would also automatically mask and unmask asynchronous signals as
  needed, but this would require multiple syscalls per "real" syscall,
  harming performance.

- We considered extending the syscall calling convention to
  designate currently-unused registers to be used to pass the
  location of the uaccess buffer, but this was rejected for being
  architecture-specific.

- One idea that we considered involved using the stack pointer address
  as a unique identifier for the syscall, but this currently would
  need to be arch-specific as we currently do not appear to have an
  arch-generic way of retrieving the stack pointer; the userspace
  side would also need some arch-specific code for this to work. It's
  also possible that a longjmp() past the signal handler would make
  the stack pointer address not unique enough for this purpose.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
  to trace per-syscall. This is so that we can exclude syscalls that
  should not be traced, such as syscalls that make up part of the
  sanitizer implementation (to avoid infinite recursion when e.g. printing
  an error report).

- Tracing would need to be synchronous in order to produce useful
  stack traces. For example this could be achieved using the new SIGTRAP
  on perf events mechanism. However, this would require logging each
  access to the stack (in the form of a sigcontext) and this is more
  likely to overflow the stack due to being much larger than a uaccess
  buffer entry as well as being unbounded, in contrast to the bounded
  buffer size passed to prctl(). An approach based on signal handlers is
  also likely to fall foul of the asynchronous signal issues mentioned
  previously, together with needing sigreturn to be handled specially
  (because it copies a sigcontext from userspace) otherwise we could
  never return from the signal handler. Furthermore, arguments to the
  trace events are not available to SIGTRAP. (This on its own wouldn't
  be insurmountable though -- we could add the arguments as fields
  to siginfo.)

- The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
  -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
  I don't think it's usable because it's per-CPU and not per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs may
  only be loaded as root, among other potential headaches.

[1] https://lore.kernel.org/all/20210922061809.736124-1-pcc@google.com/

Peter Collingbourne (7):
  include: split out uaccess instrumentation into a separate header
  uaccess-buffer: add core code
  fs: use copy_from_user_nolog() to copy mount() data
  uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  arm64: add support for uaccess logging
  Documentation: document uaccess logging
  selftests: test uaccess logging

 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 151 +++++++++++++++++
 arch/Kconfig                                  |  14 ++
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/thread_info.h          |   7 +-
 arch/arm64/kernel/ptrace.c                    |   7 +
 arch/arm64/kernel/signal.c                    |   5 +
 fs/exec.c                                     |   3 +
 fs/namespace.c                                |   8 +-
 include/linux/entry-common.h                  |   2 +
 include/linux/instrumented-uaccess.h          |  53 ++++++
 include/linux/instrumented.h                  |  34 ----
 include/linux/sched.h                         |   5 +
 include/linux/thread_info.h                   |   4 +
 include/linux/uaccess-buffer-info.h           |  46 ++++++
 include/linux/uaccess-buffer.h                | 152 ++++++++++++++++++
 include/linux/uaccess.h                       |   2 +-
 include/uapi/linux/prctl.h                    |   3 +
 include/uapi/linux/uaccess-buffer.h           |  27 ++++
 kernel/Makefile                               |   1 +
 kernel/bpf/helpers.c                          |   7 +-
 kernel/entry/common.c                         |  14 +-
 kernel/fork.c                                 |   4 +
 kernel/signal.c                               |   9 +-
 kernel/sys.c                                  |   6 +
 kernel/uaccess-buffer.c                       | 145 +++++++++++++++++
 lib/iov_iter.c                                |   2 +-
 lib/usercopy.c                                |   2 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/uaccess_buffer/Makefile |   4 +
 .../uaccess_buffer/uaccess_buffer_test.c      | 126 +++++++++++++++
 31 files changed, 802 insertions(+), 44 deletions(-)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst
 create mode 100644 include/linux/instrumented-uaccess.h
 create mode 100644 include/linux/uaccess-buffer-info.h
 create mode 100644 include/linux/uaccess-buffer.h
 create mode 100644 include/uapi/linux/uaccess-buffer.h
 create mode 100644 kernel/uaccess-buffer.c
 create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
 create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c

-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v4 1/7] include: split out uaccess instrumentation into a separate header
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

In an upcoming change we are going to add uaccess instrumentation
that uses inline access to struct task_struct from the
instrumentation routines. Because instrumentation.h is included
from many places including (recursively) from sched.h this would
otherwise lead to a circular dependency. Break the dependency by
moving uaccess instrumentation routines into a separate header,
instrumentation-uaccess.h.

Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
---
 include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
 include/linux/instrumented.h         | 34 -------------------
 include/linux/uaccess.h              |  2 +-
 lib/iov_iter.c                       |  2 +-
 lib/usercopy.c                       |  2 +-
 5 files changed, 52 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/instrumented-uaccess.h

diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
new file mode 100644
index 000000000000..ece549088e50
--- /dev/null
+++ b/include/linux/instrumented-uaccess.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This header provides generic wrappers for memory access instrumentation for
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ */
+#ifndef _LINUX_INSTRUMENTED_UACCESS_H
+#define _LINUX_INSTRUMENTED_UACCESS_H
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kcsan-checks.h>
+#include <linux/types.h>
+
+/**
+ * instrument_copy_to_user - instrument reads of copy_to_user
+ *
+ * Instrument reads from kernel memory, that are due to copy_to_user (and
+ * variants). The instrumentation must be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
+{
+	kasan_check_read(from, n);
+	kcsan_check_read(from, n);
+}
+
+/**
+ * instrument_copy_from_user - instrument writes of copy_from_user
+ *
+ * Instrument writes to kernel memory, that are due to copy_from_user (and
+ * variants). The instrumentation should be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
+{
+	kasan_check_write(to, n);
+	kcsan_check_write(to, n);
+}
+
+#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
index 42faebbaa202..b68f415510c7 100644
--- a/include/linux/instrumented.h
+++ b/include/linux/instrumented.h
@@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
 	kcsan_check_atomic_read_write(v, size);
 }
 
-/**
- * instrument_copy_to_user - instrument reads of copy_to_user
- *
- * Instrument reads from kernel memory, that are due to copy_to_user (and
- * variants). The instrumentation must be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
-{
-	kasan_check_read(from, n);
-	kcsan_check_read(from, n);
-}
-
-/**
- * instrument_copy_from_user - instrument writes of copy_from_user
- *
- * Instrument writes to kernel memory, that are due to copy_from_user (and
- * variants). The instrumentation should be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
-{
-	kasan_check_write(to, n);
-	kcsan_check_write(to, n);
-}
-
 #endif /* _LINUX_INSTRUMENTED_H */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac0394087f7d..c0c467e39657 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -3,7 +3,7 @@
 #define __LINUX_UACCESS_H__
 
 #include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 #include <linux/minmax.h>
 #include <linux/sched.h>
 #include <linux/thread_info.h>
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..3f9dc6df7102 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -12,7 +12,7 @@
 #include <linux/compat.h>
 #include <net/checksum.h>
 #include <linux/scatterlist.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 
 #define PIPE_PARANOIA /* for now */
 
diff --git a/lib/usercopy.c b/lib/usercopy.c
index 7413dd300516..1cd188e62d06 100644
--- a/lib/usercopy.c
+++ b/lib/usercopy.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bitops.h>
 #include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 #include <linux/uaccess.h>
 
 /* out-of-line parts */
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 1/7] include: split out uaccess instrumentation into a separate header
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

In an upcoming change we are going to add uaccess instrumentation
that uses inline access to struct task_struct from the
instrumentation routines. Because instrumentation.h is included
from many places including (recursively) from sched.h this would
otherwise lead to a circular dependency. Break the dependency by
moving uaccess instrumentation routines into a separate header,
instrumentation-uaccess.h.

Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
---
 include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
 include/linux/instrumented.h         | 34 -------------------
 include/linux/uaccess.h              |  2 +-
 lib/iov_iter.c                       |  2 +-
 lib/usercopy.c                       |  2 +-
 5 files changed, 52 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/instrumented-uaccess.h

diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
new file mode 100644
index 000000000000..ece549088e50
--- /dev/null
+++ b/include/linux/instrumented-uaccess.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This header provides generic wrappers for memory access instrumentation for
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ */
+#ifndef _LINUX_INSTRUMENTED_UACCESS_H
+#define _LINUX_INSTRUMENTED_UACCESS_H
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kcsan-checks.h>
+#include <linux/types.h>
+
+/**
+ * instrument_copy_to_user - instrument reads of copy_to_user
+ *
+ * Instrument reads from kernel memory, that are due to copy_to_user (and
+ * variants). The instrumentation must be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
+{
+	kasan_check_read(from, n);
+	kcsan_check_read(from, n);
+}
+
+/**
+ * instrument_copy_from_user - instrument writes of copy_from_user
+ *
+ * Instrument writes to kernel memory, that are due to copy_from_user (and
+ * variants). The instrumentation should be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
+{
+	kasan_check_write(to, n);
+	kcsan_check_write(to, n);
+}
+
+#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
index 42faebbaa202..b68f415510c7 100644
--- a/include/linux/instrumented.h
+++ b/include/linux/instrumented.h
@@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
 	kcsan_check_atomic_read_write(v, size);
 }
 
-/**
- * instrument_copy_to_user - instrument reads of copy_to_user
- *
- * Instrument reads from kernel memory, that are due to copy_to_user (and
- * variants). The instrumentation must be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
-{
-	kasan_check_read(from, n);
-	kcsan_check_read(from, n);
-}
-
-/**
- * instrument_copy_from_user - instrument writes of copy_from_user
- *
- * Instrument writes to kernel memory, that are due to copy_from_user (and
- * variants). The instrumentation should be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
-{
-	kasan_check_write(to, n);
-	kcsan_check_write(to, n);
-}
-
 #endif /* _LINUX_INSTRUMENTED_H */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac0394087f7d..c0c467e39657 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -3,7 +3,7 @@
 #define __LINUX_UACCESS_H__
 
 #include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 #include <linux/minmax.h>
 #include <linux/sched.h>
 #include <linux/thread_info.h>
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..3f9dc6df7102 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -12,7 +12,7 @@
 #include <linux/compat.h>
 #include <net/checksum.h>
 #include <linux/scatterlist.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 
 #define PIPE_PARANOIA /* for now */
 
diff --git a/lib/usercopy.c b/lib/usercopy.c
index 7413dd300516..1cd188e62d06 100644
--- a/lib/usercopy.c
+++ b/lib/usercopy.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bitops.h>
 #include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
 #include <linux/uaccess.h>
 
 /* out-of-line parts */
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 2/7] uaccess-buffer: add core code
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add the core code to support uaccess logging. Subsequent patches will
hook this up to the arch-specific kernel entry and exit code for
certain architectures.

Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v4:
- add CONFIG_UACCESS_BUFFER
- add kernel doc comments to uaccess-buffer.h
- outline uaccess_buffer_set_descriptor_addr_addr
- switch to using spin_lock_irqsave/spin_unlock_irqrestore during
  pre/post-exit-loop code because preemption is disabled at that point
- set kend to NULL if krealloc failed
- size_t -> unsigned long in copy_from_user_nolog signature

v3:
- performance optimizations for entry/exit code
- don't use kcur == NULL to mean overflow
- fix potential double free in clone()
- don't allocate a new kernel-side uaccess buffer for each syscall
- fix uaccess buffer leak on exit
- fix some sparse warnings

v2:
- New interface that avoids multiple syscalls per real syscall and
  is arch-generic
- Avoid logging uaccesses done by BPF programs
- Add documentation
- Split up into multiple patches
- Various code moves, renames etc as requested by Marco

 arch/Kconfig                         |  13 +++
 fs/exec.c                            |   3 +
 include/linux/instrumented-uaccess.h |   6 +-
 include/linux/sched.h                |   5 +
 include/linux/uaccess-buffer-info.h  |  46 ++++++++
 include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
 include/uapi/linux/prctl.h           |   3 +
 include/uapi/linux/uaccess-buffer.h  |  27 +++++
 kernel/Makefile                      |   1 +
 kernel/bpf/helpers.c                 |   7 +-
 kernel/fork.c                        |   4 +
 kernel/signal.c                      |   9 +-
 kernel/sys.c                         |   6 ++
 kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
 14 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/uaccess-buffer-info.h
 create mode 100644 include/linux/uaccess-buffer.h
 create mode 100644 include/uapi/linux/uaccess-buffer.h
 create mode 100644 kernel/uaccess-buffer.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d3c4ab249e9c..17819f53ea80 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
 config DYNAMIC_SIGFRAME
 	bool
 
+config HAVE_ARCH_UACCESS_BUFFER
+	bool
+	help
+	  Select if the architecture's syscall entry/exit code supports uaccess buffers.
+
+config UACCESS_BUFFER
+	bool "Uaccess logging" if EXPERT
+	default y
+	depends on HAVE_ARCH_UACCESS_BUFFER
+	help
+	  Select to enable support for uaccess logging
+	  (see Documentation/admin-guide/uaccess-logging.rst).
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..c9975e790f30 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
 #include <linux/vmalloc.h>
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
 	me->personality &= ~bprm->per_clear;
 
 	clear_syscall_work_syscall_user_dispatch(me);
+	uaccess_buffer_set_descriptor_addr_addr(0);
+	uaccess_buffer_free(current);
 
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
index ece549088e50..b967f4436d15 100644
--- a/include/linux/instrumented-uaccess.h
+++ b/include/linux/instrumented-uaccess.h
@@ -2,7 +2,8 @@
 
 /*
  * This header provides generic wrappers for memory access instrumentation for
- * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
+ * uaccess buffers.
  */
 #ifndef _LINUX_INSTRUMENTED_UACCESS_H
 #define _LINUX_INSTRUMENTED_UACCESS_H
@@ -11,6 +12,7 @@
 #include <linux/kasan-checks.h>
 #include <linux/kcsan-checks.h>
 #include <linux/types.h>
+#include <linux/uaccess-buffer.h>
 
 /**
  * instrument_copy_to_user - instrument reads of copy_to_user
@@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
 {
 	kasan_check_read(from, n);
 	kcsan_check_read(from, n);
+	uaccess_buffer_log_write(to, n);
 }
 
 /**
@@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
 {
 	kasan_check_write(to, n);
 	kcsan_check_write(to, n);
+	uaccess_buffer_log_read(from, n);
 }
 
 #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..96014dd2702e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/uaccess-buffer-info.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1484,6 +1485,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_UACCESS_BUFFER
+	struct uaccess_buffer_info	uaccess_buffer;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
new file mode 100644
index 000000000000..46e2b1a4a20f
--- /dev/null
+++ b/include/linux/uaccess-buffer-info.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_INFO_H
+#define _LINUX_UACCESS_BUFFER_INFO_H
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+struct uaccess_buffer_info {
+	/*
+	 * The pointer to pointer to struct uaccess_descriptor. This is the
+	 * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+	 */
+	struct uaccess_descriptor __user *__user *desc_ptr_ptr;
+
+	/*
+	 * The pointer to struct uaccess_descriptor read at syscall entry time.
+	 */
+	struct uaccess_descriptor __user *desc_ptr;
+
+	/*
+	 * A pointer to the kernel's temporary copy of the uaccess log for the
+	 * current syscall. We log to a kernel buffer in order to avoid leaking
+	 * timing information to userspace.
+	 */
+	struct uaccess_buffer_entry *kbegin;
+
+	/*
+	 * The position of the next uaccess buffer entry for the current
+	 * syscall, or NULL if we are not logging the current syscall.
+	 */
+	struct uaccess_buffer_entry *kcur;
+
+	/*
+	 * A pointer to the end of the kernel's uaccess log.
+	 */
+	struct uaccess_buffer_entry *kend;
+
+	/*
+	 * The pointer to the userspace uaccess log, as read from the
+	 * struct uaccess_descriptor.
+	 */
+	struct uaccess_buffer_entry __user *ubegin;
+};
+
+#endif
+
+#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..2e9b4010fb59
--- /dev/null
+++ b/include/linux/uaccess-buffer.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_H
+#define _LINUX_UACCESS_BUFFER_H
+
+#include <linux/sched.h>
+#include <uapi/linux/uaccess-buffer.h>
+
+#include <asm-generic/errno-base.h>
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+/*
+ * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
+ * blocked due to uaccess logging
+ * @tsk: the task.
+ */
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+	return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
+}
+
+void __uaccess_buffer_syscall_entry(void);
+/*
+ * uaccess_buffer_syscall_entry - hook to be run before syscall entry
+ */
+static inline void uaccess_buffer_syscall_entry(void)
+{
+	__uaccess_buffer_syscall_entry();
+}
+
+void __uaccess_buffer_syscall_exit(void);
+/*
+ * uaccess_buffer_syscall_exit - hook to be run after syscall exit
+ */
+static inline void uaccess_buffer_syscall_exit(void)
+{
+	__uaccess_buffer_syscall_exit();
+}
+
+bool __uaccess_buffer_pre_exit_loop(void);
+/*
+ * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
+ * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
+ * be passed to uaccess_buffer_post_exit_loop.
+ */
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+	if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
+		return false;
+	return __uaccess_buffer_pre_exit_loop();
+}
+
+void __uaccess_buffer_post_exit_loop(void);
+/*
+ * uaccess_buffer_post_exit_loop - hook to be run immediately after the
+ * pre-kernel-exit loop that handles signals, tracing etc.
+ * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
+ */
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+	if (pending)
+		__uaccess_buffer_post_exit_loop();
+}
+
+/*
+ * uaccess_buffer_set_descriptor_addr_addr - implements
+ * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+ */
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
+
+/*
+ * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
+ * logging. This is useful in special cases, such as when the kernel overreads a
+ * buffer.
+ * @to: the pointer to kernel memory.
+ * @from: the pointer to user memory.
+ * @len: the number of bytes to copy.
+ */
+unsigned long copy_from_user_nolog(void *to, const void __user *from,
+				   unsigned long len);
+
+/*
+ * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
+ * for uaccess logging to be cancelled for the current syscall
+ * @tsk: the task.
+ */
+void uaccess_buffer_free(struct task_struct *tsk);
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+/*
+ * uaccess_buffer_log_read - log a read access
+ * @from: the address of the access.
+ * @n: the number of bytes.
+ */
+static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+		__uaccess_buffer_log_read(from, n);
+}
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+/*
+ * uaccess_buffer_log_write - log a write access
+ * @to: the address of the access.
+ * @n: the number of bytes.
+ */
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+		__uaccess_buffer_log_write(to, n);
+}
+
+#else
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+	return false;
+}
+static inline void uaccess_buffer_syscall_entry(void)
+{
+}
+static inline void uaccess_buffer_syscall_exit(void)
+{
+}
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+	return false;
+}
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+}
+static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+	return -EINVAL;
+}
+static inline void uaccess_buffer_free(struct task_struct *tsk)
+{
+}
+
+#define copy_from_user_nolog copy_from_user
+
+static inline void uaccess_buffer_log_read(const void __user *from,
+					   unsigned long n)
+{
+}
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+}
+
+#endif
+
+#endif  /* _LINUX_UACCESS_BUFFER_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index bb73e9a0b24f..74b37469c7b3 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -272,4 +272,7 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
 # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
 
+/* Configure uaccess logging feature */
+#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR	63
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..bf10f7c78857
--- /dev/null
+++ b/include/uapi/linux/uaccess-buffer.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
+#define _UAPI_LINUX_UACCESS_BUFFER_H
+
+#include <linux/types.h>
+
+/* Location of the uaccess log. */
+struct uaccess_descriptor {
+	/* Address of the uaccess_buffer_entry array. */
+	__u64 addr;
+	/* Size of the uaccess_buffer_entry array in number of elements. */
+	__u64 size;
+};
+
+/* Format of the entries in the uaccess log. */
+struct uaccess_buffer_entry {
+	/* Address being accessed. */
+	__u64 addr;
+	/* Number of bytes that were accessed. */
+	__u64 size;
+	/* UACCESS_BUFFER_* flags. */
+	__u64 flags;
+};
+
+#define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
+
+#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 186c49582f45..e5f6c56696a2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
 obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
 obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
 obj-$(CONFIG_CFI_CLANG) += cfi.o
+obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 649f07623df6..ab6520a633ef 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -15,6 +15,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/proc_ns.h>
 #include <linux/security.h>
+#include <linux/uaccess-buffer.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
 BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
 	   const void __user *, user_ptr)
 {
-	int ret = copy_from_user(dst, user_ptr, size);
+	/*
+	 * Avoid logging uaccesses here as the BPF program may not be following
+	 * the uaccess log rules.
+	 */
+	int ret = copy_from_user_nolog(dst, user_ptr, size);
 
 	if (unlikely(ret)) {
 		memset(dst, 0, size);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..8be2ca528a65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -96,6 +96,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 	sched_core_free(tsk);
+	uaccess_buffer_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
@@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (memcg_charge_kernel_stack(tsk))
 		goto free_stack;
 
+	uaccess_buffer_free(orig);
+
 	stack_vm_area = task_stack_vm_area(tsk);
 
 	err = arch_dup_task_struct(tsk, orig);
diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..b85d7d4844f6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -45,6 +45,7 @@
 #include <linux/posix-timers.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/uaccess-buffer.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 	if (sig_fatal(p, sig) &&
 	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
 	    !sigismember(&t->real_blocked, sig) &&
-	    (sig == SIGKILL || !p->ptrace)) {
+	    (sig == SIGKILL ||
+	     !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
 		/*
 		 * This signal will be fatal to the whole group.
 		 */
@@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
 void __set_current_blocked(const sigset_t *newset)
 {
 	struct task_struct *tsk = current;
+	unsigned long flags;
 
 	/*
 	 * In case the signal mask hasn't changed, there is nothing we need
@@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
 	if (sigequalsets(&tsk->blocked, newset))
 		return;
 
-	spin_lock_irq(&tsk->sighand->siglock);
+	spin_lock_irqsave(&tsk->sighand->siglock, flags);
 	__set_task_blocked(tsk, newset);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
 }
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..c71a9a9c0f68 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/version.h>
 #include <linux/ctype.h>
 #include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = uaccess_buffer_set_descriptor_addr_addr(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
new file mode 100644
index 000000000000..d3129244b7d9
--- /dev/null
+++ b/kernel/uaccess-buffer.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Support for uaccess logging via uaccess buffers.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#include <linux/compat.h>
+#include <linux/mm.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uaccess-buffer.h>
+
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+	current->uaccess_buffer.desc_ptr_ptr =
+		(struct uaccess_descriptor __user * __user *)addr;
+	if (addr)
+		set_syscall_work(UACCESS_BUFFER_ENTRY);
+	else
+		clear_syscall_work(UACCESS_BUFFER_ENTRY);
+	return 0;
+}
+
+static void uaccess_buffer_log(unsigned long addr, unsigned long size,
+			      unsigned long flags)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_buffer_entry *entry = buf->kcur;
+
+	if (entry == buf->kend || unlikely(uaccess_kernel()))
+		return;
+	entry->addr = addr;
+	entry->size = size;
+	entry->flags = flags;
+
+	++buf->kcur;
+}
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+	uaccess_buffer_log((unsigned long)from, n, 0);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_read);
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_write);
+
+bool __uaccess_buffer_pre_exit_loop(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_descriptor __user *desc_ptr;
+	sigset_t tmp_mask;
+
+	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
+		return false;
+
+	current->real_blocked = current->blocked;
+	sigfillset(&tmp_mask);
+	set_current_blocked(&tmp_mask);
+	return true;
+}
+
+void __uaccess_buffer_post_exit_loop(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&current->sighand->siglock, flags);
+	current->blocked = current->real_blocked;
+	recalc_sigpending();
+	spin_unlock_irqrestore(&current->sighand->siglock, flags);
+}
+
+void uaccess_buffer_free(struct task_struct *tsk)
+{
+	struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
+
+	kfree(buf->kbegin);
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	buf->kbegin = buf->kcur = buf->kend = NULL;
+}
+
+void __uaccess_buffer_syscall_entry(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_descriptor desc;
+
+	if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
+	    put_user(0, buf->desc_ptr_ptr) ||
+	    copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
+		return;
+
+	if (desc.size > 1024)
+		desc.size = 1024;
+
+	if (buf->kend - buf->kbegin != desc.size)
+		buf->kbegin =
+			krealloc_array(buf->kbegin, desc.size,
+				       sizeof(struct uaccess_buffer_entry),
+				       GFP_KERNEL);
+	if (!buf->kbegin) {
+		buf->kend = NULL;
+		return;
+	}
+
+	set_syscall_work(UACCESS_BUFFER_EXIT);
+	buf->kcur = buf->kbegin;
+	buf->kend = buf->kbegin + desc.size;
+	buf->ubegin =
+		(struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
+}
+
+void __uaccess_buffer_syscall_exit(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	u64 num_entries = buf->kcur - buf->kbegin;
+	struct uaccess_descriptor desc;
+
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
+	desc.size = buf->kend - buf->kcur;
+	buf->kcur = NULL;
+	if (copy_to_user(buf->ubegin, buf->kbegin,
+			 num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
+		(void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
+}
+
+unsigned long copy_from_user_nolog(void *to, const void __user *from,
+				   unsigned long len)
+{
+	size_t retval;
+
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	retval = copy_from_user(to, from, len);
+	if (current->uaccess_buffer.kcur)
+		set_syscall_work(UACCESS_BUFFER_EXIT);
+	return retval;
+}
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 2/7] uaccess-buffer: add core code
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add the core code to support uaccess logging. Subsequent patches will
hook this up to the arch-specific kernel entry and exit code for
certain architectures.

Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v4:
- add CONFIG_UACCESS_BUFFER
- add kernel doc comments to uaccess-buffer.h
- outline uaccess_buffer_set_descriptor_addr_addr
- switch to using spin_lock_irqsave/spin_unlock_irqrestore during
  pre/post-exit-loop code because preemption is disabled at that point
- set kend to NULL if krealloc failed
- size_t -> unsigned long in copy_from_user_nolog signature

v3:
- performance optimizations for entry/exit code
- don't use kcur == NULL to mean overflow
- fix potential double free in clone()
- don't allocate a new kernel-side uaccess buffer for each syscall
- fix uaccess buffer leak on exit
- fix some sparse warnings

v2:
- New interface that avoids multiple syscalls per real syscall and
  is arch-generic
- Avoid logging uaccesses done by BPF programs
- Add documentation
- Split up into multiple patches
- Various code moves, renames etc as requested by Marco

 arch/Kconfig                         |  13 +++
 fs/exec.c                            |   3 +
 include/linux/instrumented-uaccess.h |   6 +-
 include/linux/sched.h                |   5 +
 include/linux/uaccess-buffer-info.h  |  46 ++++++++
 include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
 include/uapi/linux/prctl.h           |   3 +
 include/uapi/linux/uaccess-buffer.h  |  27 +++++
 kernel/Makefile                      |   1 +
 kernel/bpf/helpers.c                 |   7 +-
 kernel/fork.c                        |   4 +
 kernel/signal.c                      |   9 +-
 kernel/sys.c                         |   6 ++
 kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
 14 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/uaccess-buffer-info.h
 create mode 100644 include/linux/uaccess-buffer.h
 create mode 100644 include/uapi/linux/uaccess-buffer.h
 create mode 100644 kernel/uaccess-buffer.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d3c4ab249e9c..17819f53ea80 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
 config DYNAMIC_SIGFRAME
 	bool
 
+config HAVE_ARCH_UACCESS_BUFFER
+	bool
+	help
+	  Select if the architecture's syscall entry/exit code supports uaccess buffers.
+
+config UACCESS_BUFFER
+	bool "Uaccess logging" if EXPERT
+	default y
+	depends on HAVE_ARCH_UACCESS_BUFFER
+	help
+	  Select to enable support for uaccess logging
+	  (see Documentation/admin-guide/uaccess-logging.rst).
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..c9975e790f30 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
 #include <linux/vmalloc.h>
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
 	me->personality &= ~bprm->per_clear;
 
 	clear_syscall_work_syscall_user_dispatch(me);
+	uaccess_buffer_set_descriptor_addr_addr(0);
+	uaccess_buffer_free(current);
 
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
index ece549088e50..b967f4436d15 100644
--- a/include/linux/instrumented-uaccess.h
+++ b/include/linux/instrumented-uaccess.h
@@ -2,7 +2,8 @@
 
 /*
  * This header provides generic wrappers for memory access instrumentation for
- * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
+ * uaccess buffers.
  */
 #ifndef _LINUX_INSTRUMENTED_UACCESS_H
 #define _LINUX_INSTRUMENTED_UACCESS_H
@@ -11,6 +12,7 @@
 #include <linux/kasan-checks.h>
 #include <linux/kcsan-checks.h>
 #include <linux/types.h>
+#include <linux/uaccess-buffer.h>
 
 /**
  * instrument_copy_to_user - instrument reads of copy_to_user
@@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
 {
 	kasan_check_read(from, n);
 	kcsan_check_read(from, n);
+	uaccess_buffer_log_write(to, n);
 }
 
 /**
@@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
 {
 	kasan_check_write(to, n);
 	kcsan_check_write(to, n);
+	uaccess_buffer_log_read(from, n);
 }
 
 #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..96014dd2702e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/uaccess-buffer-info.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1484,6 +1485,10 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_UACCESS_BUFFER
+	struct uaccess_buffer_info	uaccess_buffer;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
new file mode 100644
index 000000000000..46e2b1a4a20f
--- /dev/null
+++ b/include/linux/uaccess-buffer-info.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_INFO_H
+#define _LINUX_UACCESS_BUFFER_INFO_H
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+struct uaccess_buffer_info {
+	/*
+	 * The pointer to pointer to struct uaccess_descriptor. This is the
+	 * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+	 */
+	struct uaccess_descriptor __user *__user *desc_ptr_ptr;
+
+	/*
+	 * The pointer to struct uaccess_descriptor read at syscall entry time.
+	 */
+	struct uaccess_descriptor __user *desc_ptr;
+
+	/*
+	 * A pointer to the kernel's temporary copy of the uaccess log for the
+	 * current syscall. We log to a kernel buffer in order to avoid leaking
+	 * timing information to userspace.
+	 */
+	struct uaccess_buffer_entry *kbegin;
+
+	/*
+	 * The position of the next uaccess buffer entry for the current
+	 * syscall, or NULL if we are not logging the current syscall.
+	 */
+	struct uaccess_buffer_entry *kcur;
+
+	/*
+	 * A pointer to the end of the kernel's uaccess log.
+	 */
+	struct uaccess_buffer_entry *kend;
+
+	/*
+	 * The pointer to the userspace uaccess log, as read from the
+	 * struct uaccess_descriptor.
+	 */
+	struct uaccess_buffer_entry __user *ubegin;
+};
+
+#endif
+
+#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..2e9b4010fb59
--- /dev/null
+++ b/include/linux/uaccess-buffer.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_H
+#define _LINUX_UACCESS_BUFFER_H
+
+#include <linux/sched.h>
+#include <uapi/linux/uaccess-buffer.h>
+
+#include <asm-generic/errno-base.h>
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+/*
+ * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
+ * blocked due to uaccess logging
+ * @tsk: the task.
+ */
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+	return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
+}
+
+void __uaccess_buffer_syscall_entry(void);
+/*
+ * uaccess_buffer_syscall_entry - hook to be run before syscall entry
+ */
+static inline void uaccess_buffer_syscall_entry(void)
+{
+	__uaccess_buffer_syscall_entry();
+}
+
+void __uaccess_buffer_syscall_exit(void);
+/*
+ * uaccess_buffer_syscall_exit - hook to be run after syscall exit
+ */
+static inline void uaccess_buffer_syscall_exit(void)
+{
+	__uaccess_buffer_syscall_exit();
+}
+
+bool __uaccess_buffer_pre_exit_loop(void);
+/*
+ * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
+ * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
+ * be passed to uaccess_buffer_post_exit_loop.
+ */
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+	if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
+		return false;
+	return __uaccess_buffer_pre_exit_loop();
+}
+
+void __uaccess_buffer_post_exit_loop(void);
+/*
+ * uaccess_buffer_post_exit_loop - hook to be run immediately after the
+ * pre-kernel-exit loop that handles signals, tracing etc.
+ * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
+ */
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+	if (pending)
+		__uaccess_buffer_post_exit_loop();
+}
+
+/*
+ * uaccess_buffer_set_descriptor_addr_addr - implements
+ * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+ */
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
+
+/*
+ * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
+ * logging. This is useful in special cases, such as when the kernel overreads a
+ * buffer.
+ * @to: the pointer to kernel memory.
+ * @from: the pointer to user memory.
+ * @len: the number of bytes to copy.
+ */
+unsigned long copy_from_user_nolog(void *to, const void __user *from,
+				   unsigned long len);
+
+/*
+ * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
+ * for uaccess logging to be cancelled for the current syscall
+ * @tsk: the task.
+ */
+void uaccess_buffer_free(struct task_struct *tsk);
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+/*
+ * uaccess_buffer_log_read - log a read access
+ * @from: the address of the access.
+ * @n: the number of bytes.
+ */
+static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+		__uaccess_buffer_log_read(from, n);
+}
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+/*
+ * uaccess_buffer_log_write - log a write access
+ * @to: the address of the access.
+ * @n: the number of bytes.
+ */
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+		__uaccess_buffer_log_write(to, n);
+}
+
+#else
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+	return false;
+}
+static inline void uaccess_buffer_syscall_entry(void)
+{
+}
+static inline void uaccess_buffer_syscall_exit(void)
+{
+}
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+	return false;
+}
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+}
+static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+	return -EINVAL;
+}
+static inline void uaccess_buffer_free(struct task_struct *tsk)
+{
+}
+
+#define copy_from_user_nolog copy_from_user
+
+static inline void uaccess_buffer_log_read(const void __user *from,
+					   unsigned long n)
+{
+}
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+}
+
+#endif
+
+#endif  /* _LINUX_UACCESS_BUFFER_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index bb73e9a0b24f..74b37469c7b3 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -272,4 +272,7 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
 # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
 
+/* Configure uaccess logging feature */
+#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR	63
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..bf10f7c78857
--- /dev/null
+++ b/include/uapi/linux/uaccess-buffer.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
+#define _UAPI_LINUX_UACCESS_BUFFER_H
+
+#include <linux/types.h>
+
+/* Location of the uaccess log. */
+struct uaccess_descriptor {
+	/* Address of the uaccess_buffer_entry array. */
+	__u64 addr;
+	/* Size of the uaccess_buffer_entry array in number of elements. */
+	__u64 size;
+};
+
+/* Format of the entries in the uaccess log. */
+struct uaccess_buffer_entry {
+	/* Address being accessed. */
+	__u64 addr;
+	/* Number of bytes that were accessed. */
+	__u64 size;
+	/* UACCESS_BUFFER_* flags. */
+	__u64 flags;
+};
+
+#define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
+
+#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 186c49582f45..e5f6c56696a2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
 obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
 obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
 obj-$(CONFIG_CFI_CLANG) += cfi.o
+obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 649f07623df6..ab6520a633ef 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -15,6 +15,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/proc_ns.h>
 #include <linux/security.h>
+#include <linux/uaccess-buffer.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
 BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
 	   const void __user *, user_ptr)
 {
-	int ret = copy_from_user(dst, user_ptr, size);
+	/*
+	 * Avoid logging uaccesses here as the BPF program may not be following
+	 * the uaccess log rules.
+	 */
+	int ret = copy_from_user_nolog(dst, user_ptr, size);
 
 	if (unlikely(ret)) {
 		memset(dst, 0, size);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..8be2ca528a65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -96,6 +96,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 	sched_core_free(tsk);
+	uaccess_buffer_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
@@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (memcg_charge_kernel_stack(tsk))
 		goto free_stack;
 
+	uaccess_buffer_free(orig);
+
 	stack_vm_area = task_stack_vm_area(tsk);
 
 	err = arch_dup_task_struct(tsk, orig);
diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..b85d7d4844f6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -45,6 +45,7 @@
 #include <linux/posix-timers.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/uaccess-buffer.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 	if (sig_fatal(p, sig) &&
 	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
 	    !sigismember(&t->real_blocked, sig) &&
-	    (sig == SIGKILL || !p->ptrace)) {
+	    (sig == SIGKILL ||
+	     !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
 		/*
 		 * This signal will be fatal to the whole group.
 		 */
@@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
 void __set_current_blocked(const sigset_t *newset)
 {
 	struct task_struct *tsk = current;
+	unsigned long flags;
 
 	/*
 	 * In case the signal mask hasn't changed, there is nothing we need
@@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
 	if (sigequalsets(&tsk->blocked, newset))
 		return;
 
-	spin_lock_irq(&tsk->sighand->siglock);
+	spin_lock_irqsave(&tsk->sighand->siglock, flags);
 	__set_task_blocked(tsk, newset);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
 }
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..c71a9a9c0f68 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/version.h>
 #include <linux/ctype.h>
 #include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = uaccess_buffer_set_descriptor_addr_addr(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
new file mode 100644
index 000000000000..d3129244b7d9
--- /dev/null
+++ b/kernel/uaccess-buffer.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Support for uaccess logging via uaccess buffers.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#include <linux/compat.h>
+#include <linux/mm.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uaccess-buffer.h>
+
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+	current->uaccess_buffer.desc_ptr_ptr =
+		(struct uaccess_descriptor __user * __user *)addr;
+	if (addr)
+		set_syscall_work(UACCESS_BUFFER_ENTRY);
+	else
+		clear_syscall_work(UACCESS_BUFFER_ENTRY);
+	return 0;
+}
+
+static void uaccess_buffer_log(unsigned long addr, unsigned long size,
+			      unsigned long flags)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_buffer_entry *entry = buf->kcur;
+
+	if (entry == buf->kend || unlikely(uaccess_kernel()))
+		return;
+	entry->addr = addr;
+	entry->size = size;
+	entry->flags = flags;
+
+	++buf->kcur;
+}
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+	uaccess_buffer_log((unsigned long)from, n, 0);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_read);
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_write);
+
+bool __uaccess_buffer_pre_exit_loop(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_descriptor __user *desc_ptr;
+	sigset_t tmp_mask;
+
+	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
+		return false;
+
+	current->real_blocked = current->blocked;
+	sigfillset(&tmp_mask);
+	set_current_blocked(&tmp_mask);
+	return true;
+}
+
+void __uaccess_buffer_post_exit_loop(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&current->sighand->siglock, flags);
+	current->blocked = current->real_blocked;
+	recalc_sigpending();
+	spin_unlock_irqrestore(&current->sighand->siglock, flags);
+}
+
+void uaccess_buffer_free(struct task_struct *tsk)
+{
+	struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
+
+	kfree(buf->kbegin);
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	buf->kbegin = buf->kcur = buf->kend = NULL;
+}
+
+void __uaccess_buffer_syscall_entry(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	struct uaccess_descriptor desc;
+
+	if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
+	    put_user(0, buf->desc_ptr_ptr) ||
+	    copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
+		return;
+
+	if (desc.size > 1024)
+		desc.size = 1024;
+
+	if (buf->kend - buf->kbegin != desc.size)
+		buf->kbegin =
+			krealloc_array(buf->kbegin, desc.size,
+				       sizeof(struct uaccess_buffer_entry),
+				       GFP_KERNEL);
+	if (!buf->kbegin) {
+		buf->kend = NULL;
+		return;
+	}
+
+	set_syscall_work(UACCESS_BUFFER_EXIT);
+	buf->kcur = buf->kbegin;
+	buf->kend = buf->kbegin + desc.size;
+	buf->ubegin =
+		(struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
+}
+
+void __uaccess_buffer_syscall_exit(void)
+{
+	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+	u64 num_entries = buf->kcur - buf->kbegin;
+	struct uaccess_descriptor desc;
+
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
+	desc.size = buf->kend - buf->kcur;
+	buf->kcur = NULL;
+	if (copy_to_user(buf->ubegin, buf->kbegin,
+			 num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
+		(void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
+}
+
+unsigned long copy_from_user_nolog(void *to, const void __user *from,
+				   unsigned long len)
+{
+	size_t retval;
+
+	clear_syscall_work(UACCESS_BUFFER_EXIT);
+	retval = copy_from_user(to, from, len);
+	if (current->uaccess_buffer.kcur)
+		set_syscall_work(UACCESS_BUFFER_EXIT);
+	return retval;
+}
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 3/7] fs: use copy_from_user_nolog() to copy mount() data
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

With uaccess logging the contract is that the kernel must not report
accessing more data than necessary, as this can lead to false positive
reports in downstream consumers. This generally works out of the box
when instrumenting copy_{from,to}_user(), but with the data argument
to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
much as we can, if the PAGE_SIZE sized access failed) and figure out
later how much we actually need.

To prevent this from leading to a false positive report, use
copy_from_user_nolog(), which will prevent the access from being logged.
Recall that it is valid for the kernel to report accessing less
data than it actually accessed, as uaccess logging is a best-effort
mechanism for reporting uaccesses.

Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 fs/namespace.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 659a8f39c61a..8f5f2aaca64e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
+#include <linux/uaccess-buffer.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
 	if (!copy)
 		return ERR_PTR(-ENOMEM);
 
-	left = copy_from_user(copy, data, PAGE_SIZE);
+	/*
+	 * Use copy_from_user_nolog to avoid reporting overly large accesses in
+	 * the uaccess buffer, as this can lead to false positive reports in
+	 * downstream consumers.
+	 */
+	left = copy_from_user_nolog(copy, data, PAGE_SIZE);
 
 	/*
 	 * Not all architectures have an exact copy_from_user(). Resort to
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 3/7] fs: use copy_from_user_nolog() to copy mount() data
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

With uaccess logging the contract is that the kernel must not report
accessing more data than necessary, as this can lead to false positive
reports in downstream consumers. This generally works out of the box
when instrumenting copy_{from,to}_user(), but with the data argument
to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
much as we can, if the PAGE_SIZE sized access failed) and figure out
later how much we actually need.

To prevent this from leading to a false positive report, use
copy_from_user_nolog(), which will prevent the access from being logged.
Recall that it is valid for the kernel to report accessing less
data than it actually accessed, as uaccess logging is a best-effort
mechanism for reporting uaccesses.

Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 fs/namespace.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 659a8f39c61a..8f5f2aaca64e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
+#include <linux/uaccess-buffer.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
 	if (!copy)
 		return ERR_PTR(-ENOMEM);
 
-	left = copy_from_user(copy, data, PAGE_SIZE);
+	/*
+	 * Use copy_from_user_nolog to avoid reporting overly large accesses in
+	 * the uaccess buffer, as this can lead to false positive reports in
+	 * downstream consumers.
+	 */
+	left = copy_from_user_nolog(copy, data, PAGE_SIZE);
 
 	/*
 	 * Not all architectures have an exact copy_from_user(). Resort to
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add uaccess logging support on architectures that use
CONFIG_GENERIC_ENTRY (currently only s390 and x86).

Link: https://linux-review.googlesource.com/id/I3c5eb19a7e4a1dbe6095f6971f7826c4b0663f7d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
---
v4:
- move pre/post-exit-loop calls into if statement

 arch/Kconfig                 |  1 +
 include/linux/entry-common.h |  2 ++
 include/linux/thread_info.h  |  4 ++++
 kernel/entry/common.c        | 14 +++++++++++++-
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 17819f53ea80..bc849a61b636 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -31,6 +31,7 @@ config HOTPLUG_SMT
 	bool
 
 config GENERIC_ENTRY
+       select HAVE_ARCH_UACCESS_BUFFER
        bool
 
 config KPROBES
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..973fcd1d48a3 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -42,12 +42,14 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_UACCESS_BUFFER_ENTRY |	\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
+				 SYSCALL_WORK_UACCESS_BUFFER_EXIT |	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
 /*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ad0c4e041030..b0f8ea86967f 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,8 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY,
+	SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +57,8 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_UACCESS_BUFFER_ENTRY	BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY)
+#define SYSCALL_WORK_UACCESS_BUFFER_EXIT	BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT)
 #endif
 
 #include <asm/thread_info.h>
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..59ec6e3f793b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/uaccess-buffer.h>
 
 #include "common.h"
 
@@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 			return ret;
 	}
 
+	if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
+		uaccess_buffer_syscall_entry();
+
 	/* Either of the above might have changed the syscall number */
 	syscall = syscall_get_nr(current, regs);
 
@@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	bool uaccess_buffer_pending;
 
 	lockdep_assert_irqs_disabled();
 
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
+		bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
+	}
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 
 	audit_syscall_exit(regs);
 
+	if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
+		uaccess_buffer_syscall_exit();
+
 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
 		trace_sys_exit(regs, syscall_get_return_value(current, regs));
 
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add uaccess logging support on architectures that use
CONFIG_GENERIC_ENTRY (currently only s390 and x86).

Link: https://linux-review.googlesource.com/id/I3c5eb19a7e4a1dbe6095f6971f7826c4b0663f7d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
---
v4:
- move pre/post-exit-loop calls into if statement

 arch/Kconfig                 |  1 +
 include/linux/entry-common.h |  2 ++
 include/linux/thread_info.h  |  4 ++++
 kernel/entry/common.c        | 14 +++++++++++++-
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 17819f53ea80..bc849a61b636 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -31,6 +31,7 @@ config HOTPLUG_SMT
 	bool
 
 config GENERIC_ENTRY
+       select HAVE_ARCH_UACCESS_BUFFER
        bool
 
 config KPROBES
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..973fcd1d48a3 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -42,12 +42,14 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_UACCESS_BUFFER_ENTRY |	\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
+				 SYSCALL_WORK_UACCESS_BUFFER_EXIT |	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
 /*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ad0c4e041030..b0f8ea86967f 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,8 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY,
+	SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +57,8 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_UACCESS_BUFFER_ENTRY	BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY)
+#define SYSCALL_WORK_UACCESS_BUFFER_EXIT	BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT)
 #endif
 
 #include <asm/thread_info.h>
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..59ec6e3f793b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/uaccess-buffer.h>
 
 #include "common.h"
 
@@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
 			return ret;
 	}
 
+	if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
+		uaccess_buffer_syscall_entry();
+
 	/* Either of the above might have changed the syscall number */
 	syscall = syscall_get_nr(current, regs);
 
@@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	bool uaccess_buffer_pending;
 
 	lockdep_assert_irqs_disabled();
 
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
+		bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
+	}
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 
 	audit_syscall_exit(regs);
 
+	if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
+		uaccess_buffer_syscall_exit();
+
 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
 		trace_sys_exit(regs, syscall_get_return_value(current, regs));
 
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 5/7] arm64: add support for uaccess logging
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
uaccess logging directly to the architecture.

Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v4:
- remove unnecessary hunk

 arch/arm64/Kconfig                   | 1 +
 arch/arm64/include/asm/thread_info.h | 7 ++++++-
 arch/arm64/kernel/ptrace.c           | 7 +++++++
 arch/arm64/kernel/signal.c           | 5 +++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..6023946abe4a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -161,6 +161,7 @@ config ARM64
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_UACCESS_BUFFER
 	select HAVE_ARCH_VMAP_STACK
 	select HAVE_ARM_SMCCC
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index e1317b7c4525..0461b36251ea 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define TIF_SVE_VL_INHERIT	24	/* Inherit SVE vl_onexec across exec */
 #define TIF_SSBD		25	/* Wants SSB mitigation */
 #define TIF_TAGGED_ADDR		26	/* Allow tagged user addresses */
+#define TIF_UACCESS_BUFFER_ENTRY 27     /* thread has non-zero uaccess_desc_addr_addr */
+#define TIF_UACCESS_BUFFER_EXIT  28     /* thread has non-zero kcur */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
@@ -98,6 +100,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define _TIF_SVE		(1 << TIF_SVE)
 #define _TIF_MTE_ASYNC_FAULT	(1 << TIF_MTE_ASYNC_FAULT)
 #define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_UACCESS_BUFFER_ENTRY	(1 << TIF_UACCESS_BUFFER_ENTRY)
+#define _TIF_UACCESS_BUFFER_EXIT	(1 << TIF_UACCESS_BUFFER_EXIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
@@ -106,7 +110,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
-				 _TIF_SYSCALL_EMU)
+				 _TIF_SYSCALL_EMU | _TIF_UACCESS_BUFFER_ENTRY | \
+				 _TIF_UACCESS_BUFFER_EXIT)
 
 #ifdef CONFIG_SHADOW_CALL_STACK
 #define INIT_SCS							\
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 88a9034fb9b5..283372eccaeb 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1854,6 +1855,9 @@ int syscall_trace_enter(struct pt_regs *regs)
 	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, regs->syscallno);
 
+	if (flags & _TIF_UACCESS_BUFFER_ENTRY)
+		uaccess_buffer_syscall_entry();
+
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
 			    regs->regs[2], regs->regs[3]);
 
@@ -1866,6 +1870,9 @@ void syscall_trace_exit(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
+	if (flags & _TIF_UACCESS_BUFFER_EXIT)
+		uaccess_buffer_syscall_exit();
+
 	if (flags & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_exit(regs, syscall_get_return_value(current, regs));
 
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 8f6372b44b65..5bbd98e5c257 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)
 
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
 {
+	bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
 	do {
 		if (thread_flags & _TIF_NEED_RESCHED) {
 			/* Unmask Debug and SError for the next task */
@@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
 	} while (thread_flags & _TIF_WORK_MASK);
+
+	uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 5/7] arm64: add support for uaccess logging
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
uaccess logging directly to the architecture.

Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v4:
- remove unnecessary hunk

 arch/arm64/Kconfig                   | 1 +
 arch/arm64/include/asm/thread_info.h | 7 ++++++-
 arch/arm64/kernel/ptrace.c           | 7 +++++++
 arch/arm64/kernel/signal.c           | 5 +++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..6023946abe4a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -161,6 +161,7 @@ config ARM64
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_UACCESS_BUFFER
 	select HAVE_ARCH_VMAP_STACK
 	select HAVE_ARM_SMCCC
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index e1317b7c4525..0461b36251ea 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define TIF_SVE_VL_INHERIT	24	/* Inherit SVE vl_onexec across exec */
 #define TIF_SSBD		25	/* Wants SSB mitigation */
 #define TIF_TAGGED_ADDR		26	/* Allow tagged user addresses */
+#define TIF_UACCESS_BUFFER_ENTRY 27     /* thread has non-zero uaccess_desc_addr_addr */
+#define TIF_UACCESS_BUFFER_EXIT  28     /* thread has non-zero kcur */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
@@ -98,6 +100,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 #define _TIF_SVE		(1 << TIF_SVE)
 #define _TIF_MTE_ASYNC_FAULT	(1 << TIF_MTE_ASYNC_FAULT)
 #define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_UACCESS_BUFFER_ENTRY	(1 << TIF_UACCESS_BUFFER_ENTRY)
+#define _TIF_UACCESS_BUFFER_EXIT	(1 << TIF_UACCESS_BUFFER_EXIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
@@ -106,7 +110,8 @@ int arch_dup_task_struct(struct task_struct *dst,
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
-				 _TIF_SYSCALL_EMU)
+				 _TIF_SYSCALL_EMU | _TIF_UACCESS_BUFFER_ENTRY | \
+				 _TIF_UACCESS_BUFFER_EXIT)
 
 #ifdef CONFIG_SHADOW_CALL_STACK
 #define INIT_SCS							\
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 88a9034fb9b5..283372eccaeb 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1854,6 +1855,9 @@ int syscall_trace_enter(struct pt_regs *regs)
 	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, regs->syscallno);
 
+	if (flags & _TIF_UACCESS_BUFFER_ENTRY)
+		uaccess_buffer_syscall_entry();
+
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
 			    regs->regs[2], regs->regs[3]);
 
@@ -1866,6 +1870,9 @@ void syscall_trace_exit(struct pt_regs *regs)
 
 	audit_syscall_exit(regs);
 
+	if (flags & _TIF_UACCESS_BUFFER_EXIT)
+		uaccess_buffer_syscall_exit();
+
 	if (flags & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_exit(regs, syscall_get_return_value(current, regs));
 
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 8f6372b44b65..5bbd98e5c257 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)
 
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
 {
+	bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
 	do {
 		if (thread_flags & _TIF_NEED_RESCHED) {
 			/* Unmask Debug and SError for the next task */
@@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
 	} while (thread_flags & _TIF_WORK_MASK);
+
+	uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 6/7] Documentation: document uaccess logging
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add documentation for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v3:
- document what happens if passing NULL to prctl
- be explicit about meaning of addr and size

 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 151 ++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..4f6ee447ab2f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -54,6 +54,7 @@ ABI will be found here.
    :maxdepth: 1
 
    sysfs-rules
+   uaccess-logging
 
 The rest of this manual consists of various unordered guides on how to
 configure specific aspects of kernel behavior to your liking.
diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
new file mode 100644
index 000000000000..24def38bbdf8
--- /dev/null
+++ b/Documentation/admin-guide/uaccess-logging.rst
@@ -0,0 +1,151 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Uaccess Logging
+===============
+
+Background
+----------
+
+Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
+making use of the ARM Memory Tagging Extension (MTE) need to
+monitor all memory accesses in a program so that they can detect
+memory errors. Furthermore, fuzzing tools such as syzkaller need to
+monitor all memory accesses so that they know which parts of memory
+to fuzz. For accesses made purely in userspace, this is achieved
+via compiler instrumentation, or for MTE, via direct hardware
+support. However, accesses made by the kernel on behalf of the user
+program via syscalls (i.e. uaccesses) are normally invisible to
+these tools.
+
+Traditionally, the sanitizers have handled this by interposing the libc
+syscall stubs with a wrapper that checks the memory based on what we
+believe the uaccesses will be. However, this creates a maintenance
+burden: each syscall must be annotated with its uaccesses in order
+to be recognized by the sanitizer, and these annotations must be
+continuously updated as the kernel changes.
+
+The kernel's uaccess logging feature provides userspace tools with
+the address and size of each userspace access, thereby allowing these
+tools to report memory errors involving these accesses without needing
+annotations for every syscall.
+
+By relying on the kernel's actual uaccesses, rather than a
+reimplementation of them, the userspace memory safety tools may
+play a dual role of verifying the validity of kernel accesses. Even
+a sanitizer whose syscall wrappers have complete knowledge of the
+kernel's intended API may vary from the kernel's actual uaccesses due
+to kernel bugs. A sanitizer with knowledge of the kernel's actual
+uaccesses may produce more accurate error reports that reveal such
+bugs. For example, a kernel that accesses more memory than expected
+by the userspace program could indicate that either userspace or the
+kernel has the wrong idea about which kernel functionality is being
+requested -- either way, there is a bug.
+
+Interface
+---------
+
+The feature may be used via the following prctl:
+
+.. code-block:: c
+
+  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
+
+Supplying a non-zero address as the second argument to ``prctl``
+will cause the kernel to read an address (referred to as the *uaccess
+descriptor address*) from that address on each kernel entry. Specifying
+an address of NULL as the second argument will restore the kernel's
+default behavior, i.e. no uaccess descriptor address is read.
+
+When entering the kernel with a non-zero uaccess descriptor address
+to handle a syscall, the kernel will read a data structure of type
+``struct uaccess_descriptor`` from the uaccess descriptor address,
+which is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_descriptor {
+    uint64_t addr, size;
+  };
+
+This data structure contains the address and size (in array elements)
+of a *uaccess buffer*, which is an array of data structures of type
+``struct uaccess_buffer_entry``. Before returning to userspace, the
+kernel will log information about uaccesses to sequential entries
+in the uaccess buffer. It will also store ``NULL`` to the uaccess
+descriptor address, and store the address and size of the unused
+portion of the uaccess buffer to the uaccess descriptor.
+
+The format of a uaccess buffer entry is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry {
+    uint64_t addr, size, flags;
+  };
+
+``addr`` and ``size`` contain the address and size of the user memory
+access. On arm64, tag bits are preserved in the ``addr`` field. There
+is currently one flag bit assignment for the ``flags`` field:
+
+.. code-block:: c
+
+  #define UACCESS_BUFFER_FLAG_WRITE 1
+
+This flag is set if the access was a write, or clear if it was a
+read. The meaning of all other flag bits is reserved.
+
+When entering the kernel with a non-zero uaccess descriptor
+address for a reason other than a syscall (for example, when
+IPI'd due to an incoming asynchronous signal), any signals other
+than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
+``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
+initialized with ``sigfillset(set)``. This is to prevent incoming
+signals from interfering with uaccess logging.
+
+Example
+-------
+
+Here is an example of a code snippet that will enumerate the accesses
+performed by a ``uname(2)`` syscall:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry entries[64];
+  struct uaccess_descriptor desc;
+  uint64_t desc_addr = 0;
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
+
+  desc.addr = (uint64_t)&entries;
+  desc.size = 64;
+  desc_addr = (uint64_t)&desc;
+
+  struct utsname un;
+  uname(&un);
+
+  struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
+  for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
+    printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
+           (unsigned long)entry->addr, (unsigned long)entry->size);
+  }
+
+Limitations
+-----------
+
+This feature is currently only supported on the arm64, s390 and x86
+architectures.
+
+Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
+course, not all of the accesses may fit in the buffer, but aside from
+that, not all internal kernel APIs that access userspace memory are
+covered. Therefore, userspace programs should tolerate unreported
+accesses.
+
+On the other hand, the kernel guarantees that it will not
+(intentionally) report accessing more data than it is specified
+to read. For example, if the kernel implements a syscall that is
+specified to read a data structure of size ``N`` bytes by first
+reading a page's worth of data and then only using the first ``N``
+bytes from it, the kernel will either report reading ``N`` bytes or
+not report the access at all.
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 6/7] Documentation: document uaccess logging
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add documentation for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
v3:
- document what happens if passing NULL to prctl
- be explicit about meaning of addr and size

 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 151 ++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..4f6ee447ab2f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -54,6 +54,7 @@ ABI will be found here.
    :maxdepth: 1
 
    sysfs-rules
+   uaccess-logging
 
 The rest of this manual consists of various unordered guides on how to
 configure specific aspects of kernel behavior to your liking.
diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
new file mode 100644
index 000000000000..24def38bbdf8
--- /dev/null
+++ b/Documentation/admin-guide/uaccess-logging.rst
@@ -0,0 +1,151 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Uaccess Logging
+===============
+
+Background
+----------
+
+Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
+making use of the ARM Memory Tagging Extension (MTE) need to
+monitor all memory accesses in a program so that they can detect
+memory errors. Furthermore, fuzzing tools such as syzkaller need to
+monitor all memory accesses so that they know which parts of memory
+to fuzz. For accesses made purely in userspace, this is achieved
+via compiler instrumentation, or for MTE, via direct hardware
+support. However, accesses made by the kernel on behalf of the user
+program via syscalls (i.e. uaccesses) are normally invisible to
+these tools.
+
+Traditionally, the sanitizers have handled this by interposing the libc
+syscall stubs with a wrapper that checks the memory based on what we
+believe the uaccesses will be. However, this creates a maintenance
+burden: each syscall must be annotated with its uaccesses in order
+to be recognized by the sanitizer, and these annotations must be
+continuously updated as the kernel changes.
+
+The kernel's uaccess logging feature provides userspace tools with
+the address and size of each userspace access, thereby allowing these
+tools to report memory errors involving these accesses without needing
+annotations for every syscall.
+
+By relying on the kernel's actual uaccesses, rather than a
+reimplementation of them, the userspace memory safety tools may
+play a dual role of verifying the validity of kernel accesses. Even
+a sanitizer whose syscall wrappers have complete knowledge of the
+kernel's intended API may vary from the kernel's actual uaccesses due
+to kernel bugs. A sanitizer with knowledge of the kernel's actual
+uaccesses may produce more accurate error reports that reveal such
+bugs. For example, a kernel that accesses more memory than expected
+by the userspace program could indicate that either userspace or the
+kernel has the wrong idea about which kernel functionality is being
+requested -- either way, there is a bug.
+
+Interface
+---------
+
+The feature may be used via the following prctl:
+
+.. code-block:: c
+
+  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
+
+Supplying a non-zero address as the second argument to ``prctl``
+will cause the kernel to read an address (referred to as the *uaccess
+descriptor address*) from that address on each kernel entry. Specifying
+an address of NULL as the second argument will restore the kernel's
+default behavior, i.e. no uaccess descriptor address is read.
+
+When entering the kernel with a non-zero uaccess descriptor address
+to handle a syscall, the kernel will read a data structure of type
+``struct uaccess_descriptor`` from the uaccess descriptor address,
+which is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_descriptor {
+    uint64_t addr, size;
+  };
+
+This data structure contains the address and size (in array elements)
+of a *uaccess buffer*, which is an array of data structures of type
+``struct uaccess_buffer_entry``. Before returning to userspace, the
+kernel will log information about uaccesses to sequential entries
+in the uaccess buffer. It will also store ``NULL`` to the uaccess
+descriptor address, and store the address and size of the unused
+portion of the uaccess buffer to the uaccess descriptor.
+
+The format of a uaccess buffer entry is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry {
+    uint64_t addr, size, flags;
+  };
+
+``addr`` and ``size`` contain the address and size of the user memory
+access. On arm64, tag bits are preserved in the ``addr`` field. There
+is currently one flag bit assignment for the ``flags`` field:
+
+.. code-block:: c
+
+  #define UACCESS_BUFFER_FLAG_WRITE 1
+
+This flag is set if the access was a write, or clear if it was a
+read. The meaning of all other flag bits is reserved.
+
+When entering the kernel with a non-zero uaccess descriptor
+address for a reason other than a syscall (for example, when
+IPI'd due to an incoming asynchronous signal), any signals other
+than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
+``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
+initialized with ``sigfillset(set)``. This is to prevent incoming
+signals from interfering with uaccess logging.
+
+Example
+-------
+
+Here is an example of a code snippet that will enumerate the accesses
+performed by a ``uname(2)`` syscall:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry entries[64];
+  struct uaccess_descriptor desc;
+  uint64_t desc_addr = 0;
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
+
+  desc.addr = (uint64_t)&entries;
+  desc.size = 64;
+  desc_addr = (uint64_t)&desc;
+
+  struct utsname un;
+  uname(&un);
+
+  struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
+  for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
+    printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
+           (unsigned long)entry->addr, (unsigned long)entry->size);
+  }
+
+Limitations
+-----------
+
+This feature is currently only supported on the arm64, s390 and x86
+architectures.
+
+Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
+course, not all of the accesses may fit in the buffer, but aside from
+that, not all internal kernel APIs that access userspace memory are
+covered. Therefore, userspace programs should tolerate unreported
+accesses.
+
+On the other hand, the kernel guarantees that it will not
+(intentionally) report accessing more data than it is specified
+to read. For example, if the kernel implements a syscall that is
+specified to read a data structure of size ``N`` bytes by first
+reading a page's worth of data and then only using the first ``N``
+bytes from it, the kernel will either report reading ``N`` bytes or
+not report the access at all.
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 7/7] selftests: test uaccess logging
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-09 22:15   ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add a kselftest for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/I39e1707fb8aef53747c42bd55b46ecaa67205199
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/uaccess_buffer/Makefile |   4 +
 .../uaccess_buffer/uaccess_buffer_test.c      | 126 ++++++++++++++++++
 3 files changed, 131 insertions(+)
 create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
 create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c852eb40c4f7..291b62430557 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += timers
 endif
 TARGETS += tmpfs
 TARGETS += tpm2
+TARGETS += uaccess_buffer
 TARGETS += user
 TARGETS += vDSO
 TARGETS += vm
diff --git a/tools/testing/selftests/uaccess_buffer/Makefile b/tools/testing/selftests/uaccess_buffer/Makefile
new file mode 100644
index 000000000000..e6e5fb43ce29
--- /dev/null
+++ b/tools/testing/selftests/uaccess_buffer/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := uaccess_buffer_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
new file mode 100644
index 000000000000..051062e4fbf9
--- /dev/null
+++ b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../kselftest_harness.h"
+
+#include <linux/uaccess-buffer.h>
+#include <sys/prctl.h>
+#include <sys/utsname.h>
+
+FIXTURE(uaccess_buffer)
+{
+	uint64_t addr;
+};
+
+FIXTURE_SETUP(uaccess_buffer)
+{
+	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &self->addr, 0,
+			   0, 0));
+}
+
+FIXTURE_TEARDOWN(uaccess_buffer)
+{
+	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, 0, 0, 0, 0));
+}
+
+TEST_F(uaccess_buffer, uname)
+{
+	struct uaccess_descriptor desc;
+	struct uaccess_buffer_entry entries[64];
+	struct utsname un;
+
+	desc.addr = (uint64_t)(unsigned long)entries;
+	desc.size = 64;
+	self->addr = (uint64_t)(unsigned long)&desc;
+	ASSERT_EQ(0, uname(&un));
+	ASSERT_EQ(0, self->addr);
+
+	if (desc.size == 63) {
+		ASSERT_EQ((uint64_t)(unsigned long)(entries + 1), desc.addr);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
+		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
+	} else {
+		/* See override_architecture in kernel/sys.c */
+		ASSERT_EQ(62, desc.size);
+		ASSERT_EQ((uint64_t)(unsigned long)(entries + 2), desc.addr);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
+		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un.machine,
+			  entries[1].addr);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[1].flags);
+	}
+}
+
+static bool handled;
+
+static void usr1_handler(int signo)
+{
+	handled = true;
+}
+
+TEST_F(uaccess_buffer, blocked_signals)
+{
+	struct uaccess_descriptor desc;
+	struct shared_buf {
+		bool ready;
+		bool killed;
+	} volatile *shared = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+				  MAP_ANON | MAP_SHARED, -1, 0);
+	struct sigaction act = {}, oldact;
+	int pid;
+
+	handled = false;
+	act.sa_handler = usr1_handler;
+	sigaction(SIGUSR1, &act, &oldact);
+
+	pid = fork();
+	if (pid == 0) {
+		/*
+		 * Busy loop to synchronize instead of issuing syscalls because
+		 * we need to test the behavior in the case where no syscall is
+		 * issued by the parent process.
+		 */
+		while (!shared->ready)
+			;
+		kill(getppid(), SIGUSR1);
+		shared->killed = true;
+		_exit(0);
+	} else {
+		int i;
+
+		desc.addr = 0;
+		desc.size = 0;
+		self->addr = (uint64_t)(unsigned long)&desc;
+
+		shared->ready = true;
+		while (!shared->killed)
+			;
+
+		/*
+		 * The kernel should have IPI'd us by now, but let's wait a bit
+		 * longer just in case.
+		 */
+		for (i = 0; i != 1000000; ++i)
+			;
+
+		ASSERT_FALSE(handled);
+
+		/*
+		 * Returning from the waitpid syscall should trigger the signal
+		 * handler. The signal itself may also interrupt waitpid, so
+		 * make sure to handle EINTR.
+		 */
+		while (waitpid(pid, NULL, 0) == -1)
+			ASSERT_EQ(EINTR, errno);
+		ASSERT_TRUE(handled);
+	}
+
+	munmap((void *)shared, getpagesize());
+	sigaction(SIGUSR1, &oldact, NULL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 7/7] selftests: test uaccess logging
@ 2021-12-09 22:15   ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-09 22:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Peter Collingbourne, Gabriel Krisman Bertazi, Chris Hyser,
	Daniel Vetter, Chris Wilson, Arnd Bergmann, Dmitry Vyukov,
	Christian Brauner, Eric W. Biederman, Alexey Gladkov,
	Ran Xiaokai, David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov,
	Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Add a kselftest for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/I39e1707fb8aef53747c42bd55b46ecaa67205199
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/uaccess_buffer/Makefile |   4 +
 .../uaccess_buffer/uaccess_buffer_test.c      | 126 ++++++++++++++++++
 3 files changed, 131 insertions(+)
 create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
 create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c852eb40c4f7..291b62430557 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += timers
 endif
 TARGETS += tmpfs
 TARGETS += tpm2
+TARGETS += uaccess_buffer
 TARGETS += user
 TARGETS += vDSO
 TARGETS += vm
diff --git a/tools/testing/selftests/uaccess_buffer/Makefile b/tools/testing/selftests/uaccess_buffer/Makefile
new file mode 100644
index 000000000000..e6e5fb43ce29
--- /dev/null
+++ b/tools/testing/selftests/uaccess_buffer/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := uaccess_buffer_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
new file mode 100644
index 000000000000..051062e4fbf9
--- /dev/null
+++ b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../kselftest_harness.h"
+
+#include <linux/uaccess-buffer.h>
+#include <sys/prctl.h>
+#include <sys/utsname.h>
+
+FIXTURE(uaccess_buffer)
+{
+	uint64_t addr;
+};
+
+FIXTURE_SETUP(uaccess_buffer)
+{
+	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &self->addr, 0,
+			   0, 0));
+}
+
+FIXTURE_TEARDOWN(uaccess_buffer)
+{
+	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, 0, 0, 0, 0));
+}
+
+TEST_F(uaccess_buffer, uname)
+{
+	struct uaccess_descriptor desc;
+	struct uaccess_buffer_entry entries[64];
+	struct utsname un;
+
+	desc.addr = (uint64_t)(unsigned long)entries;
+	desc.size = 64;
+	self->addr = (uint64_t)(unsigned long)&desc;
+	ASSERT_EQ(0, uname(&un));
+	ASSERT_EQ(0, self->addr);
+
+	if (desc.size == 63) {
+		ASSERT_EQ((uint64_t)(unsigned long)(entries + 1), desc.addr);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
+		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
+	} else {
+		/* See override_architecture in kernel/sys.c */
+		ASSERT_EQ(62, desc.size);
+		ASSERT_EQ((uint64_t)(unsigned long)(entries + 2), desc.addr);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
+		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
+
+		ASSERT_EQ((uint64_t)(unsigned long)&un.machine,
+			  entries[1].addr);
+		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[1].flags);
+	}
+}
+
+static bool handled;
+
+static void usr1_handler(int signo)
+{
+	handled = true;
+}
+
+TEST_F(uaccess_buffer, blocked_signals)
+{
+	struct uaccess_descriptor desc;
+	struct shared_buf {
+		bool ready;
+		bool killed;
+	} volatile *shared = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+				  MAP_ANON | MAP_SHARED, -1, 0);
+	struct sigaction act = {}, oldact;
+	int pid;
+
+	handled = false;
+	act.sa_handler = usr1_handler;
+	sigaction(SIGUSR1, &act, &oldact);
+
+	pid = fork();
+	if (pid == 0) {
+		/*
+		 * Busy loop to synchronize instead of issuing syscalls because
+		 * we need to test the behavior in the case where no syscall is
+		 * issued by the parent process.
+		 */
+		while (!shared->ready)
+			;
+		kill(getppid(), SIGUSR1);
+		shared->killed = true;
+		_exit(0);
+	} else {
+		int i;
+
+		desc.addr = 0;
+		desc.size = 0;
+		self->addr = (uint64_t)(unsigned long)&desc;
+
+		shared->ready = true;
+		while (!shared->killed)
+			;
+
+		/*
+		 * The kernel should have IPI'd us by now, but let's wait a bit
+		 * longer just in case.
+		 */
+		for (i = 0; i != 1000000; ++i)
+			;
+
+		ASSERT_FALSE(handled);
+
+		/*
+		 * Returning from the waitpid syscall should trigger the signal
+		 * handler. The signal itself may also interrupt waitpid, so
+		 * make sure to handle EINTR.
+		 */
+		while (waitpid(pid, NULL, 0) == -1)
+			ASSERT_EQ(EINTR, errno);
+		ASSERT_TRUE(handled);
+	}
+
+	munmap((void *)shared, getpagesize());
+	sigaction(SIGUSR1, &oldact, NULL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.34.1.173.g76aa8bc2d0-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 2/7] uaccess-buffer: add core code
  2021-12-09 22:15   ` Peter Collingbourne
@ 2021-12-10  3:52     ` Dmitry Vyukov
  -1 siblings, 0 replies; 46+ messages in thread
From: Dmitry Vyukov @ 2021-12-10  3:52 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, 9 Dec 2021 at 23:15, Peter Collingbourne <pcc@google.com> wrote:
>
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <pcc@google.com>

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

> ---
> v4:
> - add CONFIG_UACCESS_BUFFER
> - add kernel doc comments to uaccess-buffer.h
> - outline uaccess_buffer_set_descriptor_addr_addr
> - switch to using spin_lock_irqsave/spin_unlock_irqrestore during
>   pre/post-exit-loop code because preemption is disabled at that point
> - set kend to NULL if krealloc failed
> - size_t -> unsigned long in copy_from_user_nolog signature
>
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
>
> v2:
> - New interface that avoids multiple syscalls per real syscall and
>   is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
>
>  arch/Kconfig                         |  13 +++
>  fs/exec.c                            |   3 +
>  include/linux/instrumented-uaccess.h |   6 +-
>  include/linux/sched.h                |   5 +
>  include/linux/uaccess-buffer-info.h  |  46 ++++++++
>  include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
>  include/uapi/linux/prctl.h           |   3 +
>  include/uapi/linux/uaccess-buffer.h  |  27 +++++
>  kernel/Makefile                      |   1 +
>  kernel/bpf/helpers.c                 |   7 +-
>  kernel/fork.c                        |   4 +
>  kernel/signal.c                      |   9 +-
>  kernel/sys.c                         |   6 ++
>  kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
>  14 files changed, 422 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/uaccess-buffer-info.h
>  create mode 100644 include/linux/uaccess-buffer.h
>  create mode 100644 include/uapi/linux/uaccess-buffer.h
>  create mode 100644 kernel/uaccess-buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d3c4ab249e9c..17819f53ea80 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
>  config DYNAMIC_SIGFRAME
>         bool
>
> +config HAVE_ARCH_UACCESS_BUFFER
> +       bool
> +       help
> +         Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> +config UACCESS_BUFFER
> +       bool "Uaccess logging" if EXPERT
> +       default y
> +       depends on HAVE_ARCH_UACCESS_BUFFER
> +       help
> +         Select to enable support for uaccess logging
> +         (see Documentation/admin-guide/uaccess-logging.rst).
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>         me->personality &= ~bprm->per_clear;
>
>         clear_syscall_work_syscall_user_dispatch(me);
> +       uaccess_buffer_set_descriptor_addr_addr(0);
> +       uaccess_buffer_free(current);
>
>         /*
>          * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>
>  /*
>   * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_UACCESS_H
>  #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
>  #include <linux/kasan-checks.h>
>  #include <linux/kcsan-checks.h>
>  #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>
>  /**
>   * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>         kasan_check_read(from, n);
>         kcsan_check_read(from, n);
> +       uaccess_buffer_log_write(to, n);
>  }
>
>  /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>         kasan_check_write(to, n);
>         kcsan_check_write(to, n);
> +       uaccess_buffer_log_read(from, n);
>  }
>
>  #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..96014dd2702e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include <linux/rseq.h>
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
>  #include <asm/kmap_size.h>
>
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
>         struct callback_head            l1d_flush_kill;
>  #endif
>
> +#ifdef CONFIG_UACCESS_BUFFER
> +       struct uaccess_buffer_info      uaccess_buffer;
> +#endif
> +
>         /*
>          * New fields for task_struct should be added above here, so that
>          * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..46e2b1a4a20f
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +       /*
> +        * The pointer to pointer to struct uaccess_descriptor. This is the
> +        * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> +        */
> +       struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> +       /*
> +        * The pointer to struct uaccess_descriptor read at syscall entry time.
> +        */
> +       struct uaccess_descriptor __user *desc_ptr;
> +
> +       /*
> +        * A pointer to the kernel's temporary copy of the uaccess log for the
> +        * current syscall. We log to a kernel buffer in order to avoid leaking
> +        * timing information to userspace.
> +        */
> +       struct uaccess_buffer_entry *kbegin;
> +
> +       /*
> +        * The position of the next uaccess buffer entry for the current
> +        * syscall, or NULL if we are not logging the current syscall.
> +        */
> +       struct uaccess_buffer_entry *kcur;
> +
> +       /*
> +        * A pointer to the end of the kernel's uaccess log.
> +        */
> +       struct uaccess_buffer_entry *kend;
> +
> +       /*
> +        * The pointer to the userspace uaccess log, as read from the
> +        * struct uaccess_descriptor.
> +        */
> +       struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..2e9b4010fb59
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
> + * blocked due to uaccess logging
> + * @tsk: the task.
> + */
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +       return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +       __uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +       __uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +       if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> +               return false;
> +       return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +       if (pending)
> +               __uaccess_buffer_post_exit_loop();
> +}
> +
> +/*
> + * uaccess_buffer_set_descriptor_addr_addr - implements
> + * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> +
> +/*
> + * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
> + * logging. This is useful in special cases, such as when the kernel overreads a
> + * buffer.
> + * @to: the pointer to kernel memory.
> + * @from: the pointer to user memory.
> + * @len: the number of bytes to copy.
> + */
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +                                  unsigned long len);
> +
> +/*
> + * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
> + * for uaccess logging to be cancelled for the current syscall
> + * @tsk: the task.
> + */
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +/*
> + * uaccess_buffer_log_read - log a read access
> + * @from: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +               __uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +/*
> + * uaccess_buffer_log_write - log a write access
> + * @to: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +               __uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +       return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +       return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +       return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog copy_from_user
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +                                          unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP      1
>  # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP     2
>
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR    63
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> +       /* Address of the uaccess_buffer_entry array. */
> +       __u64 addr;
> +       /* Size of the uaccess_buffer_entry array in number of elements. */
> +       __u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +       /* Address being accessed. */
> +       __u64 addr;
> +       /* Number of bytes that were accessed. */
> +       __u64 size;
> +       /* UACCESS_BUFFER_* flags. */
> +       __u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE      1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..e5f6c56696a2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
>
>  obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/proc_ns.h>
>  #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include "../../lib/kstrtox.h"
>
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
>  BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
>            const void __user *, user_ptr)
>  {
> -       int ret = copy_from_user(dst, user_ptr, size);
> +       /*
> +        * Avoid logging uaccesses here as the BPF program may not be following
> +        * the uaccess log rules.
> +        */
> +       int ret = copy_from_user_nolog(dst, user_ptr, size);
>
>         if (unlikely(ret)) {
>                 memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
>         delayacct_tsk_free(tsk);
>         put_signal_struct(tsk->signal);
>         sched_core_free(tsk);
> +       uaccess_buffer_free(tsk);
>
>         if (!profile_handoff_task(tsk))
>                 free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>         if (memcg_charge_kernel_stack(tsk))
>                 goto free_stack;
>
> +       uaccess_buffer_free(orig);
> +
>         stack_vm_area = task_stack_vm_area(tsk);
>
>         err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..b85d7d4844f6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/cgroup.h>
>  #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>         if (sig_fatal(p, sig) &&
>             !(signal->flags & SIGNAL_GROUP_EXIT) &&
>             !sigismember(&t->real_blocked, sig) &&
> -           (sig == SIGKILL || !p->ptrace)) {
> +           (sig == SIGKILL ||
> +            !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>                 /*
>                  * This signal will be fatal to the whole group.
>                  */
> @@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
>  void __set_current_blocked(const sigset_t *newset)
>  {
>         struct task_struct *tsk = current;
> +       unsigned long flags;
>
>         /*
>          * In case the signal mask hasn't changed, there is nothing we need
> @@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
>         if (sigequalsets(&tsk->blocked, newset))
>                 return;
>
> -       spin_lock_irq(&tsk->sighand->siglock);
> +       spin_lock_irqsave(&tsk->sighand->siglock, flags);
>         __set_task_blocked(tsk, newset);
> -       spin_unlock_irq(&tsk->sighand->siglock);
> +       spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
>  }
>
>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/version.h>
>  #include <linux/ctype.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>                 error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>                 break;
>  #endif
> +       case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> +               break;
>         default:
>                 error = -EINVAL;
>                 break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..d3129244b7d9
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +       current->uaccess_buffer.desc_ptr_ptr =
> +               (struct uaccess_descriptor __user * __user *)addr;
> +       if (addr)
> +               set_syscall_work(UACCESS_BUFFER_ENTRY);
> +       else
> +               clear_syscall_work(UACCESS_BUFFER_ENTRY);
> +       return 0;
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +                             unsigned long flags)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_buffer_entry *entry = buf->kcur;
> +
> +       if (entry == buf->kend || unlikely(uaccess_kernel()))
> +               return;
> +       entry->addr = addr;
> +       entry->size = size;
> +       entry->flags = flags;
> +
> +       ++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +       uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +       uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_descriptor __user *desc_ptr;
> +       sigset_t tmp_mask;
> +
> +       if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +               return false;
> +
> +       current->real_blocked = current->blocked;
> +       sigfillset(&tmp_mask);
> +       set_current_blocked(&tmp_mask);
> +       return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&current->sighand->siglock, flags);
> +       current->blocked = current->real_blocked;
> +       recalc_sigpending();
> +       spin_unlock_irqrestore(&current->sighand->siglock, flags);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +       struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> +       kfree(buf->kbegin);
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_descriptor desc;
> +
> +       if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> +           put_user(0, buf->desc_ptr_ptr) ||
> +           copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> +               return;
> +
> +       if (desc.size > 1024)
> +               desc.size = 1024;
> +
> +       if (buf->kend - buf->kbegin != desc.size)
> +               buf->kbegin =
> +                       krealloc_array(buf->kbegin, desc.size,
> +                                      sizeof(struct uaccess_buffer_entry),
> +                                      GFP_KERNEL);
> +       if (!buf->kbegin) {
> +               buf->kend = NULL;
> +               return;
> +       }
> +
> +       set_syscall_work(UACCESS_BUFFER_EXIT);
> +       buf->kcur = buf->kbegin;
> +       buf->kend = buf->kbegin + desc.size;
> +       buf->ubegin =
> +               (struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       u64 num_entries = buf->kcur - buf->kbegin;
> +       struct uaccess_descriptor desc;
> +
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> +       desc.size = buf->kend - buf->kcur;
> +       buf->kcur = NULL;
> +       if (copy_to_user(buf->ubegin, buf->kbegin,
> +                        num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> +               (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +                                  unsigned long len)
> +{
> +       size_t retval;
> +
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       retval = copy_from_user(to, from, len);
> +       if (current->uaccess_buffer.kcur)
> +               set_syscall_work(UACCESS_BUFFER_EXIT);
> +       return retval;
> +}
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 2/7] uaccess-buffer: add core code
@ 2021-12-10  3:52     ` Dmitry Vyukov
  0 siblings, 0 replies; 46+ messages in thread
From: Dmitry Vyukov @ 2021-12-10  3:52 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, 9 Dec 2021 at 23:15, Peter Collingbourne <pcc@google.com> wrote:
>
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <pcc@google.com>

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

> ---
> v4:
> - add CONFIG_UACCESS_BUFFER
> - add kernel doc comments to uaccess-buffer.h
> - outline uaccess_buffer_set_descriptor_addr_addr
> - switch to using spin_lock_irqsave/spin_unlock_irqrestore during
>   pre/post-exit-loop code because preemption is disabled at that point
> - set kend to NULL if krealloc failed
> - size_t -> unsigned long in copy_from_user_nolog signature
>
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
>
> v2:
> - New interface that avoids multiple syscalls per real syscall and
>   is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
>
>  arch/Kconfig                         |  13 +++
>  fs/exec.c                            |   3 +
>  include/linux/instrumented-uaccess.h |   6 +-
>  include/linux/sched.h                |   5 +
>  include/linux/uaccess-buffer-info.h  |  46 ++++++++
>  include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
>  include/uapi/linux/prctl.h           |   3 +
>  include/uapi/linux/uaccess-buffer.h  |  27 +++++
>  kernel/Makefile                      |   1 +
>  kernel/bpf/helpers.c                 |   7 +-
>  kernel/fork.c                        |   4 +
>  kernel/signal.c                      |   9 +-
>  kernel/sys.c                         |   6 ++
>  kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
>  14 files changed, 422 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/uaccess-buffer-info.h
>  create mode 100644 include/linux/uaccess-buffer.h
>  create mode 100644 include/uapi/linux/uaccess-buffer.h
>  create mode 100644 kernel/uaccess-buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d3c4ab249e9c..17819f53ea80 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
>  config DYNAMIC_SIGFRAME
>         bool
>
> +config HAVE_ARCH_UACCESS_BUFFER
> +       bool
> +       help
> +         Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> +config UACCESS_BUFFER
> +       bool "Uaccess logging" if EXPERT
> +       default y
> +       depends on HAVE_ARCH_UACCESS_BUFFER
> +       help
> +         Select to enable support for uaccess logging
> +         (see Documentation/admin-guide/uaccess-logging.rst).
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>         me->personality &= ~bprm->per_clear;
>
>         clear_syscall_work_syscall_user_dispatch(me);
> +       uaccess_buffer_set_descriptor_addr_addr(0);
> +       uaccess_buffer_free(current);
>
>         /*
>          * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>
>  /*
>   * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_UACCESS_H
>  #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
>  #include <linux/kasan-checks.h>
>  #include <linux/kcsan-checks.h>
>  #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>
>  /**
>   * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>         kasan_check_read(from, n);
>         kcsan_check_read(from, n);
> +       uaccess_buffer_log_write(to, n);
>  }
>
>  /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>         kasan_check_write(to, n);
>         kcsan_check_write(to, n);
> +       uaccess_buffer_log_read(from, n);
>  }
>
>  #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..96014dd2702e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include <linux/rseq.h>
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
>  #include <asm/kmap_size.h>
>
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
>         struct callback_head            l1d_flush_kill;
>  #endif
>
> +#ifdef CONFIG_UACCESS_BUFFER
> +       struct uaccess_buffer_info      uaccess_buffer;
> +#endif
> +
>         /*
>          * New fields for task_struct should be added above here, so that
>          * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..46e2b1a4a20f
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +       /*
> +        * The pointer to pointer to struct uaccess_descriptor. This is the
> +        * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> +        */
> +       struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> +       /*
> +        * The pointer to struct uaccess_descriptor read at syscall entry time.
> +        */
> +       struct uaccess_descriptor __user *desc_ptr;
> +
> +       /*
> +        * A pointer to the kernel's temporary copy of the uaccess log for the
> +        * current syscall. We log to a kernel buffer in order to avoid leaking
> +        * timing information to userspace.
> +        */
> +       struct uaccess_buffer_entry *kbegin;
> +
> +       /*
> +        * The position of the next uaccess buffer entry for the current
> +        * syscall, or NULL if we are not logging the current syscall.
> +        */
> +       struct uaccess_buffer_entry *kcur;
> +
> +       /*
> +        * A pointer to the end of the kernel's uaccess log.
> +        */
> +       struct uaccess_buffer_entry *kend;
> +
> +       /*
> +        * The pointer to the userspace uaccess log, as read from the
> +        * struct uaccess_descriptor.
> +        */
> +       struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..2e9b4010fb59
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
> + * blocked due to uaccess logging
> + * @tsk: the task.
> + */
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +       return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +       __uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +       __uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +       if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> +               return false;
> +       return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +       if (pending)
> +               __uaccess_buffer_post_exit_loop();
> +}
> +
> +/*
> + * uaccess_buffer_set_descriptor_addr_addr - implements
> + * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> +
> +/*
> + * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
> + * logging. This is useful in special cases, such as when the kernel overreads a
> + * buffer.
> + * @to: the pointer to kernel memory.
> + * @from: the pointer to user memory.
> + * @len: the number of bytes to copy.
> + */
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +                                  unsigned long len);
> +
> +/*
> + * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
> + * for uaccess logging to be cancelled for the current syscall
> + * @tsk: the task.
> + */
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +/*
> + * uaccess_buffer_log_read - log a read access
> + * @from: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +               __uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +/*
> + * uaccess_buffer_log_write - log a write access
> + * @to: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +               __uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +       return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +       return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +       return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog copy_from_user
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +                                          unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP      1
>  # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP     2
>
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR    63
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> +       /* Address of the uaccess_buffer_entry array. */
> +       __u64 addr;
> +       /* Size of the uaccess_buffer_entry array in number of elements. */
> +       __u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +       /* Address being accessed. */
> +       __u64 addr;
> +       /* Number of bytes that were accessed. */
> +       __u64 size;
> +       /* UACCESS_BUFFER_* flags. */
> +       __u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE      1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..e5f6c56696a2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
>
>  obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/proc_ns.h>
>  #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include "../../lib/kstrtox.h"
>
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
>  BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
>            const void __user *, user_ptr)
>  {
> -       int ret = copy_from_user(dst, user_ptr, size);
> +       /*
> +        * Avoid logging uaccesses here as the BPF program may not be following
> +        * the uaccess log rules.
> +        */
> +       int ret = copy_from_user_nolog(dst, user_ptr, size);
>
>         if (unlikely(ret)) {
>                 memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
>         delayacct_tsk_free(tsk);
>         put_signal_struct(tsk->signal);
>         sched_core_free(tsk);
> +       uaccess_buffer_free(tsk);
>
>         if (!profile_handoff_task(tsk))
>                 free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>         if (memcg_charge_kernel_stack(tsk))
>                 goto free_stack;
>
> +       uaccess_buffer_free(orig);
> +
>         stack_vm_area = task_stack_vm_area(tsk);
>
>         err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..b85d7d4844f6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/cgroup.h>
>  #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>         if (sig_fatal(p, sig) &&
>             !(signal->flags & SIGNAL_GROUP_EXIT) &&
>             !sigismember(&t->real_blocked, sig) &&
> -           (sig == SIGKILL || !p->ptrace)) {
> +           (sig == SIGKILL ||
> +            !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>                 /*
>                  * This signal will be fatal to the whole group.
>                  */
> @@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
>  void __set_current_blocked(const sigset_t *newset)
>  {
>         struct task_struct *tsk = current;
> +       unsigned long flags;
>
>         /*
>          * In case the signal mask hasn't changed, there is nothing we need
> @@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
>         if (sigequalsets(&tsk->blocked, newset))
>                 return;
>
> -       spin_lock_irq(&tsk->sighand->siglock);
> +       spin_lock_irqsave(&tsk->sighand->siglock, flags);
>         __set_task_blocked(tsk, newset);
> -       spin_unlock_irq(&tsk->sighand->siglock);
> +       spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
>  }
>
>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/version.h>
>  #include <linux/ctype.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>                 error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>                 break;
>  #endif
> +       case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> +               break;
>         default:
>                 error = -EINVAL;
>                 break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..d3129244b7d9
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +       current->uaccess_buffer.desc_ptr_ptr =
> +               (struct uaccess_descriptor __user * __user *)addr;
> +       if (addr)
> +               set_syscall_work(UACCESS_BUFFER_ENTRY);
> +       else
> +               clear_syscall_work(UACCESS_BUFFER_ENTRY);
> +       return 0;
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +                             unsigned long flags)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_buffer_entry *entry = buf->kcur;
> +
> +       if (entry == buf->kend || unlikely(uaccess_kernel()))
> +               return;
> +       entry->addr = addr;
> +       entry->size = size;
> +       entry->flags = flags;
> +
> +       ++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +       uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +       uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_descriptor __user *desc_ptr;
> +       sigset_t tmp_mask;
> +
> +       if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +               return false;
> +
> +       current->real_blocked = current->blocked;
> +       sigfillset(&tmp_mask);
> +       set_current_blocked(&tmp_mask);
> +       return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&current->sighand->siglock, flags);
> +       current->blocked = current->real_blocked;
> +       recalc_sigpending();
> +       spin_unlock_irqrestore(&current->sighand->siglock, flags);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +       struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> +       kfree(buf->kbegin);
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       struct uaccess_descriptor desc;
> +
> +       if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> +           put_user(0, buf->desc_ptr_ptr) ||
> +           copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> +               return;
> +
> +       if (desc.size > 1024)
> +               desc.size = 1024;
> +
> +       if (buf->kend - buf->kbegin != desc.size)
> +               buf->kbegin =
> +                       krealloc_array(buf->kbegin, desc.size,
> +                                      sizeof(struct uaccess_buffer_entry),
> +                                      GFP_KERNEL);
> +       if (!buf->kbegin) {
> +               buf->kend = NULL;
> +               return;
> +       }
> +
> +       set_syscall_work(UACCESS_BUFFER_EXIT);
> +       buf->kcur = buf->kbegin;
> +       buf->kend = buf->kbegin + desc.size;
> +       buf->ubegin =
> +               (struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> +       struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +       u64 num_entries = buf->kcur - buf->kbegin;
> +       struct uaccess_descriptor desc;
> +
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> +       desc.size = buf->kend - buf->kcur;
> +       buf->kcur = NULL;
> +       if (copy_to_user(buf->ubegin, buf->kbegin,
> +                        num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> +               (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +                                  unsigned long len)
> +{
> +       size_t retval;
> +
> +       clear_syscall_work(UACCESS_BUFFER_EXIT);
> +       retval = copy_from_user(to, from, len);
> +       if (current->uaccess_buffer.kcur)
> +               set_syscall_work(UACCESS_BUFFER_EXIT);
> +       return retval;
> +}
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 2/7] uaccess-buffer: add core code
  2021-12-09 22:15   ` Peter Collingbourne
@ 2021-12-10 12:39     ` Marco Elver
  -1 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 12:39 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
> 
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <pcc@google.com>

Few minor issues that may help readability; apart from that only 2
bigger question below which are related to changes in v4 below.

> ---
> v4:
> - add CONFIG_UACCESS_BUFFER
> - add kernel doc comments to uaccess-buffer.h
> - outline uaccess_buffer_set_descriptor_addr_addr
> - switch to using spin_lock_irqsave/spin_unlock_irqrestore during
>   pre/post-exit-loop code because preemption is disabled at that point
> - set kend to NULL if krealloc failed
> - size_t -> unsigned long in copy_from_user_nolog signature
> 
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
> 
> v2:
> - New interface that avoids multiple syscalls per real syscall and
>   is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
> 
>  arch/Kconfig                         |  13 +++
>  fs/exec.c                            |   3 +
>  include/linux/instrumented-uaccess.h |   6 +-
>  include/linux/sched.h                |   5 +
>  include/linux/uaccess-buffer-info.h  |  46 ++++++++
>  include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
>  include/uapi/linux/prctl.h           |   3 +
>  include/uapi/linux/uaccess-buffer.h  |  27 +++++
>  kernel/Makefile                      |   1 +
>  kernel/bpf/helpers.c                 |   7 +-
>  kernel/fork.c                        |   4 +
>  kernel/signal.c                      |   9 +-
>  kernel/sys.c                         |   6 ++
>  kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
>  14 files changed, 422 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/uaccess-buffer-info.h
>  create mode 100644 include/linux/uaccess-buffer.h
>  create mode 100644 include/uapi/linux/uaccess-buffer.h
>  create mode 100644 kernel/uaccess-buffer.c
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d3c4ab249e9c..17819f53ea80 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
>  config DYNAMIC_SIGFRAME
>  	bool
>  
> +config HAVE_ARCH_UACCESS_BUFFER
> +	bool
> +	help
> +	  Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> +config UACCESS_BUFFER
> +	bool "Uaccess logging" if EXPERT

I think this is fine, based on perf data you provided. I think whether
or not to enable by default is probably something that needs to be
backed up with more data though, ideally with performance numbers added
to the commit description.

It had come up before, so I think clarifying this will probably address
one of the 2 major questions (the other being if it can be used to leak
sensitive data).

This is the first instance of spelling uaccess as "Uaccess" in the
kernel. I think this should just be spelled out as "User access". Here
and in documentation.

> +	default y
> +	depends on HAVE_ARCH_UACCESS_BUFFER
> +	help
> +	  Select to enable support for uaccess logging
> +	  (see Documentation/admin-guide/uaccess-logging.rst).
> +
>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	me->personality &= ~bprm->per_clear;
>  
>  	clear_syscall_work_syscall_user_dispatch(me);
> +	uaccess_buffer_set_descriptor_addr_addr(0);
> +	uaccess_buffer_free(current);
>  
>  	/*
>  	 * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>  
>  /*
>   * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_UACCESS_H
>  #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
>  #include <linux/kasan-checks.h>
>  #include <linux/kcsan-checks.h>
>  #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>  
>  /**
>   * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>  	kasan_check_read(from, n);
>  	kcsan_check_read(from, n);
> +	uaccess_buffer_log_write(to, n);
>  }
>  
>  /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>  	kasan_check_write(to, n);
>  	kcsan_check_write(to, n);
> +	uaccess_buffer_log_read(from, n);
>  }
>  
>  #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..96014dd2702e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include <linux/rseq.h>
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
>  #include <asm/kmap_size.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
>  	struct callback_head		l1d_flush_kill;
>  #endif
>  
> +#ifdef CONFIG_UACCESS_BUFFER
> +	struct uaccess_buffer_info	uaccess_buffer;
> +#endif
> +
>  	/*
>  	 * New fields for task_struct should be added above here, so that
>  	 * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..46e2b1a4a20f
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +	/*
> +	 * The pointer to pointer to struct uaccess_descriptor. This is the
> +	 * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> +	 */
> +	struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> +	/*
> +	 * The pointer to struct uaccess_descriptor read at syscall entry time.
> +	 */
> +	struct uaccess_descriptor __user *desc_ptr;
> +
> +	/*
> +	 * A pointer to the kernel's temporary copy of the uaccess log for the
> +	 * current syscall. We log to a kernel buffer in order to avoid leaking
> +	 * timing information to userspace.
> +	 */
> +	struct uaccess_buffer_entry *kbegin;
> +
> +	/*
> +	 * The position of the next uaccess buffer entry for the current
> +	 * syscall, or NULL if we are not logging the current syscall.
> +	 */
> +	struct uaccess_buffer_entry *kcur;
> +
> +	/*
> +	 * A pointer to the end of the kernel's uaccess log.
> +	 */
> +	struct uaccess_buffer_entry *kend;
> +
> +	/*
> +	 * The pointer to the userspace uaccess log, as read from the
> +	 * struct uaccess_descriptor.
> +	 */
> +	struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..2e9b4010fb59
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
> + * blocked due to uaccess logging
> + * @tsk: the task.
> + */

Kernel-doc comments need to start with "/**". See template:
https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation

> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +	return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +	__uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +	__uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +	if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> +		return false;
> +	return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +	if (pending)
> +		__uaccess_buffer_post_exit_loop();
> +}
> +
> +/*
> + * uaccess_buffer_set_descriptor_addr_addr - implements
> + * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> +
> +/*
> + * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
> + * logging. This is useful in special cases, such as when the kernel overreads a
> + * buffer.
> + * @to: the pointer to kernel memory.
> + * @from: the pointer to user memory.
> + * @len: the number of bytes to copy.
> + */
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +				   unsigned long len);
> +
> +/*
> + * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
> + * for uaccess logging to be cancelled for the current syscall
> + * @tsk: the task.
> + */
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +/*
> + * uaccess_buffer_log_read - log a read access
> + * @from: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +		__uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +/*
> + * uaccess_buffer_log_write - log a write access
> + * @to: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +		__uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +	return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +	return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +	return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog copy_from_user
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +					   unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
>  # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
>  
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR	63
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> +	/* Address of the uaccess_buffer_entry array. */
> +	__u64 addr;
> +	/* Size of the uaccess_buffer_entry array in number of elements. */
> +	__u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +	/* Address being accessed. */
> +	__u64 addr;
> +	/* Number of bytes that were accessed. */
> +	__u64 size;
> +	/* UACCESS_BUFFER_* flags. */
> +	__u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..e5f6c56696a2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/proc_ns.h>
>  #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include "../../lib/kstrtox.h"
>  
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
>  BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
>  	   const void __user *, user_ptr)
>  {
> -	int ret = copy_from_user(dst, user_ptr, size);
> +	/*
> +	 * Avoid logging uaccesses here as the BPF program may not be following
> +	 * the uaccess log rules.
> +	 */
> +	int ret = copy_from_user_nolog(dst, user_ptr, size);
>  

Like the fs change, shouldn't this also be in its own patch?

>  	if (unlikely(ret)) {
>  		memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	delayacct_tsk_free(tsk);
>  	put_signal_struct(tsk->signal);
>  	sched_core_free(tsk);
> +	uaccess_buffer_free(tsk);
>  
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (memcg_charge_kernel_stack(tsk))
>  		goto free_stack;
>  
> +	uaccess_buffer_free(orig);
> +
>  	stack_vm_area = task_stack_vm_area(tsk);
>  
>  	err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..b85d7d4844f6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/cgroup.h>
>  #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>  	if (sig_fatal(p, sig) &&
>  	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
>  	    !sigismember(&t->real_blocked, sig) &&
> -	    (sig == SIGKILL || !p->ptrace)) {
> +	    (sig == SIGKILL ||
> +	     !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>  		/*
>  		 * This signal will be fatal to the whole group.
>  		 */
> @@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
>  void __set_current_blocked(const sigset_t *newset)
>  {
>  	struct task_struct *tsk = current;
> +	unsigned long flags;
>  
>  	/*
>  	 * In case the signal mask hasn't changed, there is nothing we need
> @@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
>  	if (sigequalsets(&tsk->blocked, newset))
>  		return;
>  
> -	spin_lock_irq(&tsk->sighand->siglock);
> +	spin_lock_irqsave(&tsk->sighand->siglock, flags);
>  	__set_task_blocked(tsk, newset);
> -	spin_unlock_irq(&tsk->sighand->siglock);
> +	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
>  }

You say that when you call this in the pre/post-exit-loop that
preemption is disabled.

Is only preemption (via preempt_disable()) disabled or are interrupts
disabled?

If the latter, is it even valid to call set_current_blocked()? As-is,
one of its expected pre-conditions is that interrupts are enabled. Is
that a redundant expectation, which you are suggesting with this change?

>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/version.h>
>  #include <linux/ctype.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>  		break;
>  #endif
> +	case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..d3129244b7d9
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +	current->uaccess_buffer.desc_ptr_ptr =
> +		(struct uaccess_descriptor __user * __user *)addr;
> +	if (addr)
> +		set_syscall_work(UACCESS_BUFFER_ENTRY);
> +	else
> +		clear_syscall_work(UACCESS_BUFFER_ENTRY);
> +	return 0;
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +			      unsigned long flags)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_buffer_entry *entry = buf->kcur;
> +
> +	if (entry == buf->kend || unlikely(uaccess_kernel()))
> +		return;
> +	entry->addr = addr;
> +	entry->size = size;
> +	entry->flags = flags;
> +
> +	++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor __user *desc_ptr;
> +	sigset_t tmp_mask;
> +
> +	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +		return false;
> +
> +	current->real_blocked = current->blocked;
> +	sigfillset(&tmp_mask);
> +	set_current_blocked(&tmp_mask);
> +	return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&current->sighand->siglock, flags);
> +	current->blocked = current->real_blocked;
> +	recalc_sigpending();
> +	spin_unlock_irqrestore(&current->sighand->siglock, flags);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +	struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> +	kfree(buf->kbegin);
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor desc;
> +
> +	if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> +	    put_user(0, buf->desc_ptr_ptr) ||
> +	    copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> +		return;
> +
> +	if (desc.size > 1024)
> +		desc.size = 1024;
> +
> +	if (buf->kend - buf->kbegin != desc.size)
> +		buf->kbegin =
> +			krealloc_array(buf->kbegin, desc.size,
> +				       sizeof(struct uaccess_buffer_entry),
> +				       GFP_KERNEL);

The kernel is fine with 100 cols per line, if it helps readability. And
I think all cases where a "val =" is broken after the = fall into that
category. Here and everywhere else.

> +	if (!buf->kbegin) {
> +		buf->kend = NULL;
> +		return;
> +	}
> +
> +	set_syscall_work(UACCESS_BUFFER_EXIT);
> +	buf->kcur = buf->kbegin;
> +	buf->kend = buf->kbegin + desc.size;
> +	buf->ubegin =
> +		(struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	u64 num_entries = buf->kcur - buf->kbegin;
> +	struct uaccess_descriptor desc;
> +
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> +	desc.size = buf->kend - buf->kcur;
> +	buf->kcur = NULL;
> +	if (copy_to_user(buf->ubegin, buf->kbegin,
> +			 num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> +		(void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +				   unsigned long len)
> +{
> +	size_t retval;
> +
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	retval = copy_from_user(to, from, len);
> +	if (current->uaccess_buffer.kcur)
> +		set_syscall_work(UACCESS_BUFFER_EXIT);
> +	return retval;
> +}
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 2/7] uaccess-buffer: add core code
@ 2021-12-10 12:39     ` Marco Elver
  0 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 12:39 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
> 
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <pcc@google.com>

Few minor issues that may help readability; apart from that only 2
bigger question below which are related to changes in v4 below.

> ---
> v4:
> - add CONFIG_UACCESS_BUFFER
> - add kernel doc comments to uaccess-buffer.h
> - outline uaccess_buffer_set_descriptor_addr_addr
> - switch to using spin_lock_irqsave/spin_unlock_irqrestore during
>   pre/post-exit-loop code because preemption is disabled at that point
> - set kend to NULL if krealloc failed
> - size_t -> unsigned long in copy_from_user_nolog signature
> 
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
> 
> v2:
> - New interface that avoids multiple syscalls per real syscall and
>   is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
> 
>  arch/Kconfig                         |  13 +++
>  fs/exec.c                            |   3 +
>  include/linux/instrumented-uaccess.h |   6 +-
>  include/linux/sched.h                |   5 +
>  include/linux/uaccess-buffer-info.h  |  46 ++++++++
>  include/linux/uaccess-buffer.h       | 152 +++++++++++++++++++++++++++
>  include/uapi/linux/prctl.h           |   3 +
>  include/uapi/linux/uaccess-buffer.h  |  27 +++++
>  kernel/Makefile                      |   1 +
>  kernel/bpf/helpers.c                 |   7 +-
>  kernel/fork.c                        |   4 +
>  kernel/signal.c                      |   9 +-
>  kernel/sys.c                         |   6 ++
>  kernel/uaccess-buffer.c              | 145 +++++++++++++++++++++++++
>  14 files changed, 422 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/uaccess-buffer-info.h
>  create mode 100644 include/linux/uaccess-buffer.h
>  create mode 100644 include/uapi/linux/uaccess-buffer.h
>  create mode 100644 kernel/uaccess-buffer.c
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d3c4ab249e9c..17819f53ea80 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1312,6 +1312,19 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
>  config DYNAMIC_SIGFRAME
>  	bool
>  
> +config HAVE_ARCH_UACCESS_BUFFER
> +	bool
> +	help
> +	  Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> +config UACCESS_BUFFER
> +	bool "Uaccess logging" if EXPERT

I think this is fine, based on perf data you provided. I think whether
or not to enable by default is probably something that needs to be
backed up with more data though, ideally with performance numbers added
to the commit description.

It had come up before, so I think clarifying this will probably address
one of the 2 major questions (the other being if it can be used to leak
sensitive data).

This is the first instance of spelling uaccess as "Uaccess" in the
kernel. I think this should just be spelled out as "User access". Here
and in documentation.

> +	default y
> +	depends on HAVE_ARCH_UACCESS_BUFFER
> +	help
> +	  Select to enable support for uaccess logging
> +	  (see Documentation/admin-guide/uaccess-logging.rst).
> +
>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	me->personality &= ~bprm->per_clear;
>  
>  	clear_syscall_work_syscall_user_dispatch(me);
> +	uaccess_buffer_set_descriptor_addr_addr(0);
> +	uaccess_buffer_free(current);
>  
>  	/*
>  	 * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>  
>  /*
>   * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
>   */
>  #ifndef _LINUX_INSTRUMENTED_UACCESS_H
>  #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
>  #include <linux/kasan-checks.h>
>  #include <linux/kcsan-checks.h>
>  #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>  
>  /**
>   * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
>  {
>  	kasan_check_read(from, n);
>  	kcsan_check_read(from, n);
> +	uaccess_buffer_log_write(to, n);
>  }
>  
>  /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
>  {
>  	kasan_check_write(to, n);
>  	kcsan_check_write(to, n);
> +	uaccess_buffer_log_read(from, n);
>  }
>  
>  #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..96014dd2702e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
>  #include <linux/rseq.h>
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
>  #include <asm/kmap_size.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
>  	struct callback_head		l1d_flush_kill;
>  #endif
>  
> +#ifdef CONFIG_UACCESS_BUFFER
> +	struct uaccess_buffer_info	uaccess_buffer;
> +#endif
> +
>  	/*
>  	 * New fields for task_struct should be added above here, so that
>  	 * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..46e2b1a4a20f
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> +	/*
> +	 * The pointer to pointer to struct uaccess_descriptor. This is the
> +	 * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> +	 */
> +	struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> +	/*
> +	 * The pointer to struct uaccess_descriptor read at syscall entry time.
> +	 */
> +	struct uaccess_descriptor __user *desc_ptr;
> +
> +	/*
> +	 * A pointer to the kernel's temporary copy of the uaccess log for the
> +	 * current syscall. We log to a kernel buffer in order to avoid leaking
> +	 * timing information to userspace.
> +	 */
> +	struct uaccess_buffer_entry *kbegin;
> +
> +	/*
> +	 * The position of the next uaccess buffer entry for the current
> +	 * syscall, or NULL if we are not logging the current syscall.
> +	 */
> +	struct uaccess_buffer_entry *kcur;
> +
> +	/*
> +	 * A pointer to the end of the kernel's uaccess log.
> +	 */
> +	struct uaccess_buffer_entry *kend;
> +
> +	/*
> +	 * The pointer to the userspace uaccess log, as read from the
> +	 * struct uaccess_descriptor.
> +	 */
> +	struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..2e9b4010fb59
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * uaccess_buffer_maybe_blocked - returns whether a task potentially has signals
> + * blocked due to uaccess logging
> + * @tsk: the task.
> + */

Kernel-doc comments need to start with "/**". See template:
https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation

> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +	return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +	__uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +	__uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +	if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> +		return false;
> +	return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +	if (pending)
> +		__uaccess_buffer_post_exit_loop();
> +}
> +
> +/*
> + * uaccess_buffer_set_descriptor_addr_addr - implements
> + * prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> +
> +/*
> + * copy_from_user_nolog - a variant of copy_from_user that avoids uaccess
> + * logging. This is useful in special cases, such as when the kernel overreads a
> + * buffer.
> + * @to: the pointer to kernel memory.
> + * @from: the pointer to user memory.
> + * @len: the number of bytes to copy.
> + */
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +				   unsigned long len);
> +
> +/*
> + * uaccess_buffer_free - free the task's kernel-side uaccess buffer and arrange
> + * for uaccess logging to be cancelled for the current syscall
> + * @tsk: the task.
> + */
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +/*
> + * uaccess_buffer_log_read - log a read access
> + * @from: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +		__uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +/*
> + * uaccess_buffer_log_write - log a write access
> + * @to: the address of the access.
> + * @n: the number of bytes.
> + */
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> +		__uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> +	return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> +	return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +	return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog copy_from_user
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> +					   unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif  /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SCOPE_THREAD_GROUP	1
>  # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP	2
>  
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR	63
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> +	/* Address of the uaccess_buffer_entry array. */
> +	__u64 addr;
> +	/* Size of the uaccess_buffer_entry array in number of elements. */
> +	__u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> +	/* Address being accessed. */
> +	__u64 addr;
> +	/* Number of bytes that were accessed. */
> +	__u64 size;
> +	/* UACCESS_BUFFER_* flags. */
> +	__u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..e5f6c56696a2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
>  obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess-buffer.o
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/proc_ns.h>
>  #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include "../../lib/kstrtox.h"
>  
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
>  BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
>  	   const void __user *, user_ptr)
>  {
> -	int ret = copy_from_user(dst, user_ptr, size);
> +	/*
> +	 * Avoid logging uaccesses here as the BPF program may not be following
> +	 * the uaccess log rules.
> +	 */
> +	int ret = copy_from_user_nolog(dst, user_ptr, size);
>  

Like the fs change, shouldn't this also be in its own patch?

>  	if (unlikely(ret)) {
>  		memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	delayacct_tsk_free(tsk);
>  	put_signal_struct(tsk->signal);
>  	sched_core_free(tsk);
> +	uaccess_buffer_free(tsk);
>  
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (memcg_charge_kernel_stack(tsk))
>  		goto free_stack;
>  
> +	uaccess_buffer_free(orig);
> +
>  	stack_vm_area = task_stack_vm_area(tsk);
>  
>  	err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..b85d7d4844f6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/cgroup.h>
>  #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>  	if (sig_fatal(p, sig) &&
>  	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
>  	    !sigismember(&t->real_blocked, sig) &&
> -	    (sig == SIGKILL || !p->ptrace)) {
> +	    (sig == SIGKILL ||
> +	     !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>  		/*
>  		 * This signal will be fatal to the whole group.
>  		 */
> @@ -3027,6 +3029,7 @@ void set_current_blocked(sigset_t *newset)
>  void __set_current_blocked(const sigset_t *newset)
>  {
>  	struct task_struct *tsk = current;
> +	unsigned long flags;
>  
>  	/*
>  	 * In case the signal mask hasn't changed, there is nothing we need
> @@ -3035,9 +3038,9 @@ void __set_current_blocked(const sigset_t *newset)
>  	if (sigequalsets(&tsk->blocked, newset))
>  		return;
>  
> -	spin_lock_irq(&tsk->sighand->siglock);
> +	spin_lock_irqsave(&tsk->sighand->siglock, flags);
>  	__set_task_blocked(tsk, newset);
> -	spin_unlock_irq(&tsk->sighand->siglock);
> +	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
>  }

You say that when you call this in the pre/post-exit-loop that
preemption is disabled.

Is only preemption (via preempt_disable()) disabled or are interrupts
disabled?

If the latter, is it even valid to call set_current_blocked()? As-is,
one of its expected pre-conditions is that interrupts are enabled. Is
that a redundant expectation, which you are suggesting with this change?

>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/version.h>
>  #include <linux/ctype.h>
>  #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>  
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>  		break;
>  #endif
> +	case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..d3129244b7d9
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> +	current->uaccess_buffer.desc_ptr_ptr =
> +		(struct uaccess_descriptor __user * __user *)addr;
> +	if (addr)
> +		set_syscall_work(UACCESS_BUFFER_ENTRY);
> +	else
> +		clear_syscall_work(UACCESS_BUFFER_ENTRY);
> +	return 0;
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> +			      unsigned long flags)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_buffer_entry *entry = buf->kcur;
> +
> +	if (entry == buf->kend || unlikely(uaccess_kernel()))
> +		return;
> +	entry->addr = addr;
> +	entry->size = size;
> +	entry->flags = flags;
> +
> +	++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +	uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor __user *desc_ptr;
> +	sigset_t tmp_mask;
> +
> +	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +		return false;
> +
> +	current->real_blocked = current->blocked;
> +	sigfillset(&tmp_mask);
> +	set_current_blocked(&tmp_mask);
> +	return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&current->sighand->siglock, flags);
> +	current->blocked = current->real_blocked;
> +	recalc_sigpending();
> +	spin_unlock_irqrestore(&current->sighand->siglock, flags);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +	struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> +	kfree(buf->kbegin);
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor desc;
> +
> +	if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> +	    put_user(0, buf->desc_ptr_ptr) ||
> +	    copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> +		return;
> +
> +	if (desc.size > 1024)
> +		desc.size = 1024;
> +
> +	if (buf->kend - buf->kbegin != desc.size)
> +		buf->kbegin =
> +			krealloc_array(buf->kbegin, desc.size,
> +				       sizeof(struct uaccess_buffer_entry),
> +				       GFP_KERNEL);

The kernel is fine with 100 cols per line, if it helps readability. And
I think all cases where a "val =" is broken after the = fall into that
category. Here and everywhere else.

> +	if (!buf->kbegin) {
> +		buf->kend = NULL;
> +		return;
> +	}
> +
> +	set_syscall_work(UACCESS_BUFFER_EXIT);
> +	buf->kcur = buf->kbegin;
> +	buf->kend = buf->kbegin + desc.size;
> +	buf->ubegin =
> +		(struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	u64 num_entries = buf->kcur - buf->kbegin;
> +	struct uaccess_descriptor desc;
> +
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> +	desc.size = buf->kend - buf->kcur;
> +	buf->kcur = NULL;
> +	if (copy_to_user(buf->ubegin, buf->kbegin,
> +			 num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> +		(void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +unsigned long copy_from_user_nolog(void *to, const void __user *from,
> +				   unsigned long len)
> +{
> +	size_t retval;
> +
> +	clear_syscall_work(UACCESS_BUFFER_EXIT);
> +	retval = copy_from_user(to, from, len);
> +	if (current->uaccess_buffer.kcur)
> +		set_syscall_work(UACCESS_BUFFER_EXIT);
> +	return retval;
> +}
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/7] include: split out uaccess instrumentation into a separate header
  2021-12-09 22:15   ` Peter Collingbourne
@ 2021-12-10 12:45     ` Marco Elver
  -1 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 12:45 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> In an upcoming change we are going to add uaccess instrumentation
> that uses inline access to struct task_struct from the
> instrumentation routines. Because instrumentation.h is included
> from many places including (recursively) from sched.h this would
> otherwise lead to a circular dependency. Break the dependency by
> moving uaccess instrumentation routines into a separate header,
> instrumentation-uaccess.h.
> 
> Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> Acked-by: Dmitry Vyukov <dvyukov@google.com>

Reviewed-by: Marco Elver <elver@google.com>

> ---
>  include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
>  include/linux/instrumented.h         | 34 -------------------
>  include/linux/uaccess.h              |  2 +-
>  lib/iov_iter.c                       |  2 +-
>  lib/usercopy.c                       |  2 +-
>  5 files changed, 52 insertions(+), 37 deletions(-)
>  create mode 100644 include/linux/instrumented-uaccess.h
> 
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> new file mode 100644
> index 000000000000..ece549088e50
> --- /dev/null
> +++ b/include/linux/instrumented-uaccess.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * This header provides generic wrappers for memory access instrumentation for
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + */
> +#ifndef _LINUX_INSTRUMENTED_UACCESS_H
> +#define _LINUX_INSTRUMENTED_UACCESS_H
> +
> +#include <linux/compiler.h>
> +#include <linux/kasan-checks.h>
> +#include <linux/kcsan-checks.h>
> +#include <linux/types.h>
> +
> +/**
> + * instrument_copy_to_user - instrument reads of copy_to_user
> + *
> + * Instrument reads from kernel memory, that are due to copy_to_user (and
> + * variants). The instrumentation must be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> +	kasan_check_read(from, n);
> +	kcsan_check_read(from, n);
> +}
> +
> +/**
> + * instrument_copy_from_user - instrument writes of copy_from_user
> + *
> + * Instrument writes to kernel memory, that are due to copy_from_user (and
> + * variants). The instrumentation should be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> +{
> +	kasan_check_write(to, n);
> +	kcsan_check_write(to, n);
> +}
> +
> +#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..b68f415510c7 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
>  	kcsan_check_atomic_read_write(v, size);
>  }
>  
> -/**
> - * instrument_copy_to_user - instrument reads of copy_to_user
> - *
> - * Instrument reads from kernel memory, that are due to copy_to_user (and
> - * variants). The instrumentation must be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> -{
> -	kasan_check_read(from, n);
> -	kcsan_check_read(from, n);
> -}
> -
> -/**
> - * instrument_copy_from_user - instrument writes of copy_from_user
> - *
> - * Instrument writes to kernel memory, that are due to copy_from_user (and
> - * variants). The instrumentation should be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> -{
> -	kasan_check_write(to, n);
> -	kcsan_check_write(to, n);
> -}
> -
>  #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index ac0394087f7d..c0c467e39657 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -3,7 +3,7 @@
>  #define __LINUX_UACCESS_H__
>  
>  #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  #include <linux/minmax.h>
>  #include <linux/sched.h>
>  #include <linux/thread_info.h>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 66a740e6e153..3f9dc6df7102 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -12,7 +12,7 @@
>  #include <linux/compat.h>
>  #include <net/checksum.h>
>  #include <linux/scatterlist.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  
>  #define PIPE_PARANOIA /* for now */
>  
> diff --git a/lib/usercopy.c b/lib/usercopy.c
> index 7413dd300516..1cd188e62d06 100644
> --- a/lib/usercopy.c
> +++ b/lib/usercopy.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/bitops.h>
>  #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  #include <linux/uaccess.h>
>  
>  /* out-of-line parts */
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/7] include: split out uaccess instrumentation into a separate header
@ 2021-12-10 12:45     ` Marco Elver
  0 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 12:45 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> In an upcoming change we are going to add uaccess instrumentation
> that uses inline access to struct task_struct from the
> instrumentation routines. Because instrumentation.h is included
> from many places including (recursively) from sched.h this would
> otherwise lead to a circular dependency. Break the dependency by
> moving uaccess instrumentation routines into a separate header,
> instrumentation-uaccess.h.
> 
> Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> Acked-by: Dmitry Vyukov <dvyukov@google.com>

Reviewed-by: Marco Elver <elver@google.com>

> ---
>  include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
>  include/linux/instrumented.h         | 34 -------------------
>  include/linux/uaccess.h              |  2 +-
>  lib/iov_iter.c                       |  2 +-
>  lib/usercopy.c                       |  2 +-
>  5 files changed, 52 insertions(+), 37 deletions(-)
>  create mode 100644 include/linux/instrumented-uaccess.h
> 
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> new file mode 100644
> index 000000000000..ece549088e50
> --- /dev/null
> +++ b/include/linux/instrumented-uaccess.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * This header provides generic wrappers for memory access instrumentation for
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + */
> +#ifndef _LINUX_INSTRUMENTED_UACCESS_H
> +#define _LINUX_INSTRUMENTED_UACCESS_H
> +
> +#include <linux/compiler.h>
> +#include <linux/kasan-checks.h>
> +#include <linux/kcsan-checks.h>
> +#include <linux/types.h>
> +
> +/**
> + * instrument_copy_to_user - instrument reads of copy_to_user
> + *
> + * Instrument reads from kernel memory, that are due to copy_to_user (and
> + * variants). The instrumentation must be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> +	kasan_check_read(from, n);
> +	kcsan_check_read(from, n);
> +}
> +
> +/**
> + * instrument_copy_from_user - instrument writes of copy_from_user
> + *
> + * Instrument writes to kernel memory, that are due to copy_from_user (and
> + * variants). The instrumentation should be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> +{
> +	kasan_check_write(to, n);
> +	kcsan_check_write(to, n);
> +}
> +
> +#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..b68f415510c7 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
>  	kcsan_check_atomic_read_write(v, size);
>  }
>  
> -/**
> - * instrument_copy_to_user - instrument reads of copy_to_user
> - *
> - * Instrument reads from kernel memory, that are due to copy_to_user (and
> - * variants). The instrumentation must be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> -{
> -	kasan_check_read(from, n);
> -	kcsan_check_read(from, n);
> -}
> -
> -/**
> - * instrument_copy_from_user - instrument writes of copy_from_user
> - *
> - * Instrument writes to kernel memory, that are due to copy_from_user (and
> - * variants). The instrumentation should be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> -{
> -	kasan_check_write(to, n);
> -	kcsan_check_write(to, n);
> -}
> -
>  #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index ac0394087f7d..c0c467e39657 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -3,7 +3,7 @@
>  #define __LINUX_UACCESS_H__
>  
>  #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  #include <linux/minmax.h>
>  #include <linux/sched.h>
>  #include <linux/thread_info.h>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 66a740e6e153..3f9dc6df7102 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -12,7 +12,7 @@
>  #include <linux/compat.h>
>  #include <net/checksum.h>
>  #include <linux/scatterlist.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  
>  #define PIPE_PARANOIA /* for now */
>  
> diff --git a/lib/usercopy.c b/lib/usercopy.c
> index 7413dd300516..1cd188e62d06 100644
> --- a/lib/usercopy.c
> +++ b/lib/usercopy.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/bitops.h>
>  #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>  #include <linux/uaccess.h>
>  
>  /* out-of-line parts */
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/7] selftests: test uaccess logging
  2021-12-09 22:15   ` Peter Collingbourne
@ 2021-12-10 13:30     ` Marco Elver
  -1 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 13:30 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> Add a kselftest for the uaccess logging feature.
> 
> Link: https://linux-review.googlesource.com/id/I39e1707fb8aef53747c42bd55b46ecaa67205199
> Signed-off-by: Peter Collingbourne <pcc@google.com>

It would be good to also test:

	- Logging of reads.

	- Exhausting the uaccess buffer, ideally somehow checking that
	  the kernel hasn't written out-of-bounds, e.g. by using some
	  canary.

	- Passing an invalid address to some syscall, for which the
	  access should not be logged?

	- Passing an invalid address to the
	  PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR prctl().

	- Passing a valid address to the prctl(), but that address
	  points to an invalid address.

> ---
>  tools/testing/selftests/Makefile              |   1 +
>  .../testing/selftests/uaccess_buffer/Makefile |   4 +
>  .../uaccess_buffer/uaccess_buffer_test.c      | 126 ++++++++++++++++++
>  3 files changed, 131 insertions(+)
>  create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
>  create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index c852eb40c4f7..291b62430557 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -71,6 +71,7 @@ TARGETS += timers
>  endif
>  TARGETS += tmpfs
>  TARGETS += tpm2
> +TARGETS += uaccess_buffer
>  TARGETS += user
>  TARGETS += vDSO
>  TARGETS += vm
> diff --git a/tools/testing/selftests/uaccess_buffer/Makefile b/tools/testing/selftests/uaccess_buffer/Makefile
> new file mode 100644
> index 000000000000..e6e5fb43ce29
> --- /dev/null
> +++ b/tools/testing/selftests/uaccess_buffer/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
> +TEST_GEN_PROGS := uaccess_buffer_test
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> new file mode 100644
> index 000000000000..051062e4fbf9
> --- /dev/null
> +++ b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "../kselftest_harness.h"
> +
> +#include <linux/uaccess-buffer.h>
> +#include <sys/prctl.h>
> +#include <sys/utsname.h>
> +
> +FIXTURE(uaccess_buffer)
> +{
> +	uint64_t addr;
> +};
> +
> +FIXTURE_SETUP(uaccess_buffer)
> +{
> +	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &self->addr, 0,
> +			   0, 0));
> +}
> +
> +FIXTURE_TEARDOWN(uaccess_buffer)
> +{
> +	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, 0, 0, 0, 0));
> +}
> +
> +TEST_F(uaccess_buffer, uname)
> +{
> +	struct uaccess_descriptor desc;
> +	struct uaccess_buffer_entry entries[64];
> +	struct utsname un;
> +
> +	desc.addr = (uint64_t)(unsigned long)entries;
> +	desc.size = 64;
> +	self->addr = (uint64_t)(unsigned long)&desc;
> +	ASSERT_EQ(0, uname(&un));
> +	ASSERT_EQ(0, self->addr);
> +
> +	if (desc.size == 63) {
> +		ASSERT_EQ((uint64_t)(unsigned long)(entries + 1), desc.addr);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
> +		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
> +	} else {
> +		/* See override_architecture in kernel/sys.c */
> +		ASSERT_EQ(62, desc.size);
> +		ASSERT_EQ((uint64_t)(unsigned long)(entries + 2), desc.addr);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
> +		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un.machine,
> +			  entries[1].addr);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[1].flags);
> +	}
> +}
> +
> +static bool handled;
> +
> +static void usr1_handler(int signo)
> +{
> +	handled = true;
> +}
> +
> +TEST_F(uaccess_buffer, blocked_signals)
> +{
> +	struct uaccess_descriptor desc;
> +	struct shared_buf {
> +		bool ready;
> +		bool killed;
> +	} volatile *shared = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
> +				  MAP_ANON | MAP_SHARED, -1, 0);

I know it's a synonym, but to be consistent with other code, MAP_ANONYMOUS?

> +	struct sigaction act = {}, oldact;
> +	int pid;
> +
> +	handled = false;
> +	act.sa_handler = usr1_handler;
> +	sigaction(SIGUSR1, &act, &oldact);
> +
> +	pid = fork();
> +	if (pid == 0) {
> +		/*
> +		 * Busy loop to synchronize instead of issuing syscalls because
> +		 * we need to test the behavior in the case where no syscall is
> +		 * issued by the parent process.
> +		 */
> +		while (!shared->ready)
> +			;
> +		kill(getppid(), SIGUSR1);
> +		shared->killed = true;
> +		_exit(0);
> +	} else {
> +		int i;
> +
> +		desc.addr = 0;
> +		desc.size = 0;
> +		self->addr = (uint64_t)(unsigned long)&desc;
> +
> +		shared->ready = true;
> +		while (!shared->killed)
> +			;
> +
> +		/*
> +		 * The kernel should have IPI'd us by now, but let's wait a bit
> +		 * longer just in case.

Is IPI = signalled? Because in the kernel, IPI = inter-processor
interrupt.

> +		 */
> +		for (i = 0; i != 1000000; ++i)
> +			;

This is probably optimized out.  usleep() should work, or add compiler
barrier if usleep doesn't work.

> +
> +		ASSERT_FALSE(handled);
> +
> +		/*
> +		 * Returning from the waitpid syscall should trigger the signal
> +		 * handler. The signal itself may also interrupt waitpid, so
> +		 * make sure to handle EINTR.
> +		 */
> +		while (waitpid(pid, NULL, 0) == -1)
> +			ASSERT_EQ(EINTR, errno);
> +		ASSERT_TRUE(handled);
> +	}
> +
> +	munmap((void *)shared, getpagesize());
> +	sigaction(SIGUSR1, &oldact, NULL);
> +}
> +
> +TEST_HARNESS_MAIN
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/7] selftests: test uaccess logging
@ 2021-12-10 13:30     ` Marco Elver
  0 siblings, 0 replies; 46+ messages in thread
From: Marco Elver @ 2021-12-10 13:30 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Thu, Dec 09, 2021 at 02:15PM -0800, Peter Collingbourne wrote:
> Add a kselftest for the uaccess logging feature.
> 
> Link: https://linux-review.googlesource.com/id/I39e1707fb8aef53747c42bd55b46ecaa67205199
> Signed-off-by: Peter Collingbourne <pcc@google.com>

It would be good to also test:

	- Logging of reads.

	- Exhausting the uaccess buffer, ideally somehow checking that
	  the kernel hasn't written out-of-bounds, e.g. by using some
	  canary.

	- Passing an invalid address to some syscall, for which the
	  access should not be logged?

	- Passing an invalid address to the
	  PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR prctl().

	- Passing a valid address to the prctl(), but that address
	  points to an invalid address.

> ---
>  tools/testing/selftests/Makefile              |   1 +
>  .../testing/selftests/uaccess_buffer/Makefile |   4 +
>  .../uaccess_buffer/uaccess_buffer_test.c      | 126 ++++++++++++++++++
>  3 files changed, 131 insertions(+)
>  create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
>  create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index c852eb40c4f7..291b62430557 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -71,6 +71,7 @@ TARGETS += timers
>  endif
>  TARGETS += tmpfs
>  TARGETS += tpm2
> +TARGETS += uaccess_buffer
>  TARGETS += user
>  TARGETS += vDSO
>  TARGETS += vm
> diff --git a/tools/testing/selftests/uaccess_buffer/Makefile b/tools/testing/selftests/uaccess_buffer/Makefile
> new file mode 100644
> index 000000000000..e6e5fb43ce29
> --- /dev/null
> +++ b/tools/testing/selftests/uaccess_buffer/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
> +TEST_GEN_PROGS := uaccess_buffer_test
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> new file mode 100644
> index 000000000000..051062e4fbf9
> --- /dev/null
> +++ b/tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "../kselftest_harness.h"
> +
> +#include <linux/uaccess-buffer.h>
> +#include <sys/prctl.h>
> +#include <sys/utsname.h>
> +
> +FIXTURE(uaccess_buffer)
> +{
> +	uint64_t addr;
> +};
> +
> +FIXTURE_SETUP(uaccess_buffer)
> +{
> +	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &self->addr, 0,
> +			   0, 0));
> +}
> +
> +FIXTURE_TEARDOWN(uaccess_buffer)
> +{
> +	ASSERT_EQ(0, prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, 0, 0, 0, 0));
> +}
> +
> +TEST_F(uaccess_buffer, uname)
> +{
> +	struct uaccess_descriptor desc;
> +	struct uaccess_buffer_entry entries[64];
> +	struct utsname un;
> +
> +	desc.addr = (uint64_t)(unsigned long)entries;
> +	desc.size = 64;
> +	self->addr = (uint64_t)(unsigned long)&desc;
> +	ASSERT_EQ(0, uname(&un));
> +	ASSERT_EQ(0, self->addr);
> +
> +	if (desc.size == 63) {
> +		ASSERT_EQ((uint64_t)(unsigned long)(entries + 1), desc.addr);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
> +		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
> +	} else {
> +		/* See override_architecture in kernel/sys.c */
> +		ASSERT_EQ(62, desc.size);
> +		ASSERT_EQ((uint64_t)(unsigned long)(entries + 2), desc.addr);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un, entries[0].addr);
> +		ASSERT_EQ(sizeof(struct utsname), entries[0].size);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[0].flags);
> +
> +		ASSERT_EQ((uint64_t)(unsigned long)&un.machine,
> +			  entries[1].addr);
> +		ASSERT_EQ(UACCESS_BUFFER_FLAG_WRITE, entries[1].flags);
> +	}
> +}
> +
> +static bool handled;
> +
> +static void usr1_handler(int signo)
> +{
> +	handled = true;
> +}
> +
> +TEST_F(uaccess_buffer, blocked_signals)
> +{
> +	struct uaccess_descriptor desc;
> +	struct shared_buf {
> +		bool ready;
> +		bool killed;
> +	} volatile *shared = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
> +				  MAP_ANON | MAP_SHARED, -1, 0);

I know it's a synonym, but to be consistent with other code, MAP_ANONYMOUS?

> +	struct sigaction act = {}, oldact;
> +	int pid;
> +
> +	handled = false;
> +	act.sa_handler = usr1_handler;
> +	sigaction(SIGUSR1, &act, &oldact);
> +
> +	pid = fork();
> +	if (pid == 0) {
> +		/*
> +		 * Busy loop to synchronize instead of issuing syscalls because
> +		 * we need to test the behavior in the case where no syscall is
> +		 * issued by the parent process.
> +		 */
> +		while (!shared->ready)
> +			;
> +		kill(getppid(), SIGUSR1);
> +		shared->killed = true;
> +		_exit(0);
> +	} else {
> +		int i;
> +
> +		desc.addr = 0;
> +		desc.size = 0;
> +		self->addr = (uint64_t)(unsigned long)&desc;
> +
> +		shared->ready = true;
> +		while (!shared->killed)
> +			;
> +
> +		/*
> +		 * The kernel should have IPI'd us by now, but let's wait a bit
> +		 * longer just in case.

Is IPI = signalled? Because in the kernel, IPI = inter-processor
interrupt.

> +		 */
> +		for (i = 0; i != 1000000; ++i)
> +			;

This is probably optimized out.  usleep() should work, or add compiler
barrier if usleep doesn't work.

> +
> +		ASSERT_FALSE(handled);
> +
> +		/*
> +		 * Returning from the waitpid syscall should trigger the signal
> +		 * handler. The signal itself may also interrupt waitpid, so
> +		 * make sure to handle EINTR.
> +		 */
> +		while (waitpid(pid, NULL, 0) == -1)
> +			ASSERT_EQ(EINTR, errno);
> +		ASSERT_TRUE(handled);
> +	}
> +
> +	munmap((void *)shared, getpagesize());
> +	sigaction(SIGUSR1, &oldact, NULL);
> +}
> +
> +TEST_HARNESS_MAIN
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-09 22:15   ` Peter Collingbourne
@ 2021-12-11 11:50     ` Thomas Gleixner
  -1 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-11 11:50 UTC (permalink / raw)
  To: Peter Collingbourne, Catalin Marinas, Will Deacon, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Andy Lutomirski, Kees Cook,
	Andrew Morton, Masahiro Yamada, Sami Tolvanen, YiFei Zhu,
	Mark Rutland, Frederic Weisbecker, Viresh Kumar,
	Andrey Konovalov, Peter Collingbourne, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Thu, Dec 09 2021 at 14:15, Peter Collingbourne wrote:
> @@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  static void exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
>  	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	bool uaccess_buffer_pending;
>  
>  	lockdep_assert_irqs_disabled();
>  
>  	/* Flush pending rcuog wakeup before the last need_resched() check */
>  	tick_nohz_user_enter_prepare();
>  
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
> +		bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> +
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
> +		uaccess_buffer_post_exit_loop(uaccess_buffer_pending);

What? Let me look at the these two functions, which are so full of useful
comments:

> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor __user *desc_ptr;
> +	sigset_t tmp_mask;
> +
> +	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +		return false;
> +
> +	current->real_blocked = current->blocked;
> +	sigfillset(&tmp_mask);
> +	set_current_blocked(&tmp_mask);

This prevents signal delivery in exit_to_user_mode_loop(), right?

> +	return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&current->sighand->siglock, flags);
> +	current->blocked = current->real_blocked;
> +	recalc_sigpending();

This restores the signal blocked mask _after_ exit_to_user_mode_loop()
has completed, recalculates pending signals and goes out to user space
with eventually pending signals.

How is this supposed to be even remotely correct?

But that aside, let me look at the whole picture as I understand it from
reverse engineering it. Yes, reverse engineering, because there are
neither comments in the code nor any useful information in the
changelogs of 2/7 and 4/7. Also the cover letter and the "documentation"
are not explaining any of this and just blurb about sanitizers and how
wonderful this all is.
 
> @@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
>  			return ret;
>  	}
>  
> +	if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
> +		uaccess_buffer_syscall_entry();

This conditionally sets SYSCALL_WORK_UACCESS_BUFFER_EXIT.

> @@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
>  
>  	audit_syscall_exit(regs);
>  
> +	if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
> +		uaccess_buffer_syscall_exit();

When returning from the syscall and SYSCALL_WORK_UACCESS_BUFFER_EXIT is
set, then uaccess_buffer_syscall_exit() clears
SYSCALL_WORK_UACCESS_BUFFER_EXIT, right?

This is called _before_ exit_to_user_mode_prepare(). So why is this
__uaccess_buffer_pre/post_exit_loop() required at all?

It's not required at all. Why?

Simply because there are only two ways how exit_to_user_mode_prepare()
can be reached:

  1) When returning from a syscall

  2) When returning from an interrupt which hit user mode execution

#1 SYSCALL_WORK_UACCESS_BUFFER_EXIT is cleared _before_
   exit_to_user_mode_prepare() is reached as documented above.

#2 SYSCALL_WORK_UACCESS_BUFFER_EXIT cannot be set because the entry
   to the kernel does not go through syscall_trace_enter().

So what is this pre/post exit loop code about? Handle something which
cannot happen in the first place?

If at all this would warrant a:

	if (WARN_ON_ONCE(test_syscall_work(UACCESS_BUFFER_ENTRY)))
        	do_something_sensible();

instead of adding undocumented voodoo w/o providing any rationale. Well,
I can see why that was not provided because there is no rationale to
begin with.

Seriously, I'm all for better instrumentation and analysis, but if the
code provided for that is incomprehensible, uncommented and
undocumented, then the result is worse than what we have now.

If you think that this qualifies as documentation:

> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */

> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */

> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */

> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */

then we have a very differrent understanding of what documentation
should provide.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-11 11:50     ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-11 11:50 UTC (permalink / raw)
  To: Peter Collingbourne, Catalin Marinas, Will Deacon, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Andy Lutomirski, Kees Cook,
	Andrew Morton, Masahiro Yamada, Sami Tolvanen, YiFei Zhu,
	Mark Rutland, Frederic Weisbecker, Viresh Kumar,
	Andrey Konovalov, Peter Collingbourne, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Thu, Dec 09 2021 at 14:15, Peter Collingbourne wrote:
> @@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  static void exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
>  	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	bool uaccess_buffer_pending;
>  
>  	lockdep_assert_irqs_disabled();
>  
>  	/* Flush pending rcuog wakeup before the last need_resched() check */
>  	tick_nohz_user_enter_prepare();
>  
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
> +		bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> +
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
> +		uaccess_buffer_post_exit_loop(uaccess_buffer_pending);

What? Let me look at the these two functions, which are so full of useful
comments:

> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> +	struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> +	struct uaccess_descriptor __user *desc_ptr;
> +	sigset_t tmp_mask;
> +
> +	if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> +		return false;
> +
> +	current->real_blocked = current->blocked;
> +	sigfillset(&tmp_mask);
> +	set_current_blocked(&tmp_mask);

This prevents signal delivery in exit_to_user_mode_loop(), right?

> +	return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&current->sighand->siglock, flags);
> +	current->blocked = current->real_blocked;
> +	recalc_sigpending();

This restores the signal blocked mask _after_ exit_to_user_mode_loop()
has completed, recalculates pending signals and goes out to user space
with eventually pending signals.

How is this supposed to be even remotely correct?

But that aside, let me look at the whole picture as I understand it from
reverse engineering it. Yes, reverse engineering, because there are
neither comments in the code nor any useful information in the
changelogs of 2/7 and 4/7. Also the cover letter and the "documentation"
are not explaining any of this and just blurb about sanitizers and how
wonderful this all is.
 
> @@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
>  			return ret;
>  	}
>  
> +	if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
> +		uaccess_buffer_syscall_entry();

This conditionally sets SYSCALL_WORK_UACCESS_BUFFER_EXIT.

> @@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
>  
>  	audit_syscall_exit(regs);
>  
> +	if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
> +		uaccess_buffer_syscall_exit();

When returning from the syscall and SYSCALL_WORK_UACCESS_BUFFER_EXIT is
set, then uaccess_buffer_syscall_exit() clears
SYSCALL_WORK_UACCESS_BUFFER_EXIT, right?

This is called _before_ exit_to_user_mode_prepare(). So why is this
__uaccess_buffer_pre/post_exit_loop() required at all?

It's not required at all. Why?

Simply because there are only two ways how exit_to_user_mode_prepare()
can be reached:

  1) When returning from a syscall

  2) When returning from an interrupt which hit user mode execution

#1 SYSCALL_WORK_UACCESS_BUFFER_EXIT is cleared _before_
   exit_to_user_mode_prepare() is reached as documented above.

#2 SYSCALL_WORK_UACCESS_BUFFER_EXIT cannot be set because the entry
   to the kernel does not go through syscall_trace_enter().

So what is this pre/post exit loop code about? Handle something which
cannot happen in the first place?

If at all this would warrant a:

	if (WARN_ON_ONCE(test_syscall_work(UACCESS_BUFFER_ENTRY)))
        	do_something_sensible();

instead of adding undocumented voodoo w/o providing any rationale. Well,
I can see why that was not provided because there is no rationale to
begin with.

Seriously, I'm all for better instrumentation and analysis, but if the
code provided for that is incomprehensible, uncommented and
undocumented, then the result is worse than what we have now.

If you think that this qualifies as documentation:

> +/*
> + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> + */

> +/*
> + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> + */

> +/*
> + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> + * be passed to uaccess_buffer_post_exit_loop.
> + */

> +/*
> + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> + * pre-kernel-exit loop that handles signals, tracing etc.
> + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> + */

then we have a very differrent understanding of what documentation
should provide.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 0/7] kernel: introduce uaccess logging
  2021-12-09 22:15 ` Peter Collingbourne
@ 2021-12-11 17:23   ` David Laight
  -1 siblings, 0 replies; 46+ messages in thread
From: David Laight @ 2021-12-11 17:23 UTC (permalink / raw)
  To: 'Peter Collingbourne',
	Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

From: Peter Collingbourne
> Sent: 09 December 2021 22:16
> 
> This patch series introduces a kernel feature known as uaccess
> logging, which allows userspace programs to be made aware of the
> address and size of uaccesses performed by the kernel during
> the servicing of a syscall. More details on the motivation
> for and interface to this feature are available in the file
> Documentation/admin-guide/uaccess-logging.rst added by the final
> patch in the series.

How does this work when get_user() and put_user() are used to
do optimised copies?

While adding checks to copy_to/from_user() is going to have
a measurable performance impact - even if nothing is done,
adding them to get/put_user() (and friends) is going to
make some hot paths really slow.

So maybe you could add it so KASAN test kernels, but you can't
sensibly enable it on a production kernel.

Now, it might be that you could semi-sensibly log 'data' transfers.
But have you actually looked at all the transfers that happen
for something like sendmsg().
The 'user copy hardening' code already has a significant impact
on that code (in many places).


	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-11 17:23   ` David Laight
  0 siblings, 0 replies; 46+ messages in thread
From: David Laight @ 2021-12-11 17:23 UTC (permalink / raw)
  To: 'Peter Collingbourne',
	Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko
  Cc: linux-kernel, linux-arm-kernel, Evgenii Stepanov

From: Peter Collingbourne
> Sent: 09 December 2021 22:16
> 
> This patch series introduces a kernel feature known as uaccess
> logging, which allows userspace programs to be made aware of the
> address and size of uaccesses performed by the kernel during
> the servicing of a syscall. More details on the motivation
> for and interface to this feature are available in the file
> Documentation/admin-guide/uaccess-logging.rst added by the final
> patch in the series.

How does this work when get_user() and put_user() are used to
do optimised copies?

While adding checks to copy_to/from_user() is going to have
a measurable performance impact - even if nothing is done,
adding them to get/put_user() (and friends) is going to
make some hot paths really slow.

So maybe you could add it so KASAN test kernels, but you can't
sensibly enable it on a production kernel.

Now, it might be that you could semi-sensibly log 'data' transfers.
But have you actually looked at all the transfers that happen
for something like sendmsg().
The 'user copy hardening' code already has a significant impact
on that code (in many places).


	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
  2021-12-11 17:23   ` David Laight
@ 2021-12-13 19:48     ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-13 19:48 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 09 December 2021 22:16
> >
> > This patch series introduces a kernel feature known as uaccess
> > logging, which allows userspace programs to be made aware of the
> > address and size of uaccesses performed by the kernel during
> > the servicing of a syscall. More details on the motivation
> > for and interface to this feature are available in the file
> > Documentation/admin-guide/uaccess-logging.rst added by the final
> > patch in the series.
>
> How does this work when get_user() and put_user() are used to
> do optimised copies?
>
> While adding checks to copy_to/from_user() is going to have
> a measurable performance impact - even if nothing is done,
> adding them to get/put_user() (and friends) is going to
> make some hot paths really slow.
>
> So maybe you could add it so KASAN test kernels, but you can't
> sensibly enable it on a production kernel.
>
> Now, it might be that you could semi-sensibly log 'data' transfers.
> But have you actually looked at all the transfers that happen
> for something like sendmsg().
> The 'user copy hardening' code already has a significant impact
> on that code (in many places).

Hi David,

Yes, I realised after I sent out my patch (and while writing test
cases for it) that it didn't cover get_user()/put_user(). I have a
patch under development that will add this coverage. I used it to run
my invalid syscall and uname benchmarks and the results were basically
the same as without the coverage.

Are you aware of any benchmarks that cover sendmsg()? I can try to
look at writing my own if not. I was also planning to write a
benchmark that uses getresuid() as this was the simplest syscall that
I could find that does multiple put_user() calls.

Peter

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-13 19:48     ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-13 19:48 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 09 December 2021 22:16
> >
> > This patch series introduces a kernel feature known as uaccess
> > logging, which allows userspace programs to be made aware of the
> > address and size of uaccesses performed by the kernel during
> > the servicing of a syscall. More details on the motivation
> > for and interface to this feature are available in the file
> > Documentation/admin-guide/uaccess-logging.rst added by the final
> > patch in the series.
>
> How does this work when get_user() and put_user() are used to
> do optimised copies?
>
> While adding checks to copy_to/from_user() is going to have
> a measurable performance impact - even if nothing is done,
> adding them to get/put_user() (and friends) is going to
> make some hot paths really slow.
>
> So maybe you could add it so KASAN test kernels, but you can't
> sensibly enable it on a production kernel.
>
> Now, it might be that you could semi-sensibly log 'data' transfers.
> But have you actually looked at all the transfers that happen
> for something like sendmsg().
> The 'user copy hardening' code already has a significant impact
> on that code (in many places).

Hi David,

Yes, I realised after I sent out my patch (and while writing test
cases for it) that it didn't cover get_user()/put_user(). I have a
patch under development that will add this coverage. I used it to run
my invalid syscall and uname benchmarks and the results were basically
the same as without the coverage.

Are you aware of any benchmarks that cover sendmsg()? I can try to
look at writing my own if not. I was also planning to write a
benchmark that uses getresuid() as this was the simplest syscall that
I could find that does multiple put_user() calls.

Peter

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 0/7] kernel: introduce uaccess logging
  2021-12-13 19:48     ` Peter Collingbourne
@ 2021-12-13 23:07       ` David Laight
  -1 siblings, 0 replies; 46+ messages in thread
From: David Laight @ 2021-12-13 23:07 UTC (permalink / raw)
  To: 'Peter Collingbourne'
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

From: Peter Collingbourne
> Sent: 13 December 2021 19:49
> 
> On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 09 December 2021 22:16
> > >
> > > This patch series introduces a kernel feature known as uaccess
> > > logging, which allows userspace programs to be made aware of the
> > > address and size of uaccesses performed by the kernel during
> > > the servicing of a syscall. More details on the motivation
> > > for and interface to this feature are available in the file
> > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > patch in the series.
> >
> > How does this work when get_user() and put_user() are used to
> > do optimised copies?
> >
> > While adding checks to copy_to/from_user() is going to have
> > a measurable performance impact - even if nothing is done,
> > adding them to get/put_user() (and friends) is going to
> > make some hot paths really slow.
> >
> > So maybe you could add it so KASAN test kernels, but you can't
> > sensibly enable it on a production kernel.
> >
> > Now, it might be that you could semi-sensibly log 'data' transfers.
> > But have you actually looked at all the transfers that happen
> > for something like sendmsg().
> > The 'user copy hardening' code already has a significant impact
> > on that code (in many places).
> 
> Hi David,
> 
> Yes, I realised after I sent out my patch (and while writing test
> cases for it) that it didn't cover get_user()/put_user(). I have a
> patch under development that will add this coverage. I used it to run
> my invalid syscall and uname benchmarks and the results were basically
> the same as without the coverage.
> 
> Are you aware of any benchmarks that cover sendmsg()? I can try to
> look at writing my own if not. I was also planning to write a
> benchmark that uses getresuid() as this was the simplest syscall that
> I could find that does multiple put_user() calls.

Also look at sys_poll() I think that uses __put/get_user().

I think you'll find some of the socket option code also uses get_user().

There is also the compat code for import_iovec().
IIRC that is actually faster than the non-compat version at the moment.

I did some benchmarking of writev("/dev/null", iov, 10);
The cost of reading in the iovec is significant in that case.
Maybe I ought to find time to sort out my patches.

For sendmsg() using __copy_from_user() to avoid the user-copy
hardening checks also makes a measurable difference when sending UDP
through raw sockets - which we do a lot of.

I think you'd need to instrument user_access_begin() and also be able
to merge trace entries (for multiple get_user() calls).

You really don't have to look far to find places where copy_to/from_user()
is optimised to multiple get/put_user() or __get/put_user() (or are they
the 'nofault' variants?)
Those are all hot paths - at least for some workloads.
So adding anything there isn't likely to be accepted for production kernels.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-13 23:07       ` David Laight
  0 siblings, 0 replies; 46+ messages in thread
From: David Laight @ 2021-12-13 23:07 UTC (permalink / raw)
  To: 'Peter Collingbourne'
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

From: Peter Collingbourne
> Sent: 13 December 2021 19:49
> 
> On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 09 December 2021 22:16
> > >
> > > This patch series introduces a kernel feature known as uaccess
> > > logging, which allows userspace programs to be made aware of the
> > > address and size of uaccesses performed by the kernel during
> > > the servicing of a syscall. More details on the motivation
> > > for and interface to this feature are available in the file
> > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > patch in the series.
> >
> > How does this work when get_user() and put_user() are used to
> > do optimised copies?
> >
> > While adding checks to copy_to/from_user() is going to have
> > a measurable performance impact - even if nothing is done,
> > adding them to get/put_user() (and friends) is going to
> > make some hot paths really slow.
> >
> > So maybe you could add it so KASAN test kernels, but you can't
> > sensibly enable it on a production kernel.
> >
> > Now, it might be that you could semi-sensibly log 'data' transfers.
> > But have you actually looked at all the transfers that happen
> > for something like sendmsg().
> > The 'user copy hardening' code already has a significant impact
> > on that code (in many places).
> 
> Hi David,
> 
> Yes, I realised after I sent out my patch (and while writing test
> cases for it) that it didn't cover get_user()/put_user(). I have a
> patch under development that will add this coverage. I used it to run
> my invalid syscall and uname benchmarks and the results were basically
> the same as without the coverage.
> 
> Are you aware of any benchmarks that cover sendmsg()? I can try to
> look at writing my own if not. I was also planning to write a
> benchmark that uses getresuid() as this was the simplest syscall that
> I could find that does multiple put_user() calls.

Also look at sys_poll() I think that uses __put/get_user().

I think you'll find some of the socket option code also uses get_user().

There is also the compat code for import_iovec().
IIRC that is actually faster than the non-compat version at the moment.

I did some benchmarking of writev("/dev/null", iov, 10);
The cost of reading in the iovec is significant in that case.
Maybe I ought to find time to sort out my patches.

For sendmsg() using __copy_from_user() to avoid the user-copy
hardening checks also makes a measurable difference when sending UDP
through raw sockets - which we do a lot of.

I think you'd need to instrument user_access_begin() and also be able
to merge trace entries (for multiple get_user() calls).

You really don't have to look far to find places where copy_to/from_user()
is optimised to multiple get/put_user() or __get/put_user() (or are they
the 'nofault' variants?)
Those are all hot paths - at least for some workloads.
So adding anything there isn't likely to be accepted for production kernels.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
  2021-12-13 23:07       ` David Laight
@ 2021-12-14  3:47         ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-14  3:47 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 13 December 2021 19:49
> >
> > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Peter Collingbourne
> > > > Sent: 09 December 2021 22:16
> > > >
> > > > This patch series introduces a kernel feature known as uaccess
> > > > logging, which allows userspace programs to be made aware of the
> > > > address and size of uaccesses performed by the kernel during
> > > > the servicing of a syscall. More details on the motivation
> > > > for and interface to this feature are available in the file
> > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > patch in the series.
> > >
> > > How does this work when get_user() and put_user() are used to
> > > do optimised copies?
> > >
> > > While adding checks to copy_to/from_user() is going to have
> > > a measurable performance impact - even if nothing is done,
> > > adding them to get/put_user() (and friends) is going to
> > > make some hot paths really slow.
> > >
> > > So maybe you could add it so KASAN test kernels, but you can't
> > > sensibly enable it on a production kernel.
> > >
> > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > But have you actually looked at all the transfers that happen
> > > for something like sendmsg().
> > > The 'user copy hardening' code already has a significant impact
> > > on that code (in many places).
> >
> > Hi David,
> >
> > Yes, I realised after I sent out my patch (and while writing test
> > cases for it) that it didn't cover get_user()/put_user(). I have a
> > patch under development that will add this coverage. I used it to run
> > my invalid syscall and uname benchmarks and the results were basically
> > the same as without the coverage.
> >
> > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > look at writing my own if not. I was also planning to write a
> > benchmark that uses getresuid() as this was the simplest syscall that
> > I could find that does multiple put_user() calls.
>
> Also look at sys_poll() I think that uses __put/get_user().
>
> I think you'll find some of the socket option code also uses get_user().
>
> There is also the compat code for import_iovec().
> IIRC that is actually faster than the non-compat version at the moment.
>
> I did some benchmarking of writev("/dev/null", iov, 10);
> The cost of reading in the iovec is significant in that case.
> Maybe I ought to find time to sort out my patches.
>
> For sendmsg() using __copy_from_user() to avoid the user-copy
> hardening checks also makes a measurable difference when sending UDP
> through raw sockets - which we do a lot of.
>
> I think you'd need to instrument user_access_begin() and also be able
> to merge trace entries (for multiple get_user() calls).
>
> You really don't have to look far to find places where copy_to/from_user()
> is optimised to multiple get/put_user() or __get/put_user() (or are they
> the 'nofault' variants?)
> Those are all hot paths - at least for some workloads.
> So adding anything there isn't likely to be accepted for production kernels.

Okay, but let's see what the benchmarks say first.

I added calls to uaccess_buffer_log_{read,write}() to
__{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
variant of my usual test program that does getresuid() in a loop and
measured an overhead of 5.9% on the small cores and 2.5% on the big
cores. The overhead appears to come from two sources:
1) The calling convention for the call to
__uaccess_buffer_log_read/write() transforms some leaf functions into
non-leaf functions, and as a result we end up spilling more registers.
2) We need to reload the flags from the task_struct every time to
determine the feature enabled state.
I think it should be possible to reduce the cost down to one
instruction (a conditional branch) per get/put_user() call, at least
on arm64, by using a different calling convention for the call and
maintaining the feature enabled state in one of the currently-ignored
bits 60-63 of a reserved register (x18).

I have the patch below which reduces the overhead with my test program
to 2.3% small 1.5% big (not fully functional yet because the rest of
the code never actually flips that bit in x18). For syscalls like
sendmsg() and poll() I would expect the overall overhead to be lower
because of other required work done by these syscalls (glancing
through their implementations this appears to be the case). But I'll
benchmark it of course. The preserve_most attribute is Clang-only but
I think it should be possible to use inline asm to get the same effect
with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
enabled but I was unable to measure an overhead for getresuid() when
reserving x18 unconditionally.

Peter

diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
index 122d9e1ccbbd..81b43a0a5505 100644
--- a/include/linux/uaccess-buffer.h
+++ b/include/linux/uaccess-buffer.h
@@ -99,7 +99,8 @@ unsigned long copy_from_user_with_log_flags(void
*to, const void __user *from,
  */
 void uaccess_buffer_free(struct task_struct *tsk);

-void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+__attribute__((preserve_most)) void
+__uaccess_buffer_log_read(const void __user *from, unsigned long n);
 /**
  * uaccess_buffer_log_read - log a read access
  * @from: the address of the access.
@@ -107,11 +108,14 @@ void __uaccess_buffer_log_read(const void __user
*from, unsigned long n);
  */
 static inline void uaccess_buffer_log_read(const void __user *from,
unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_read(from, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_read(from, n);
 }

-void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+__attribute__((preserve_most)) void __uaccess_buffer_log_write(void __user *to,
+                                                              unsigned long n);
 /**
  * uaccess_buffer_log_write - log a write access
  * @to: the address of the access.
@@ -119,8 +123,10 @@ void __uaccess_buffer_log_write(void __user *to,
unsigned long n);
  */
 static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_write(to, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_write(to, n);
 }

 #else

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-14  3:47         ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-14  3:47 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 13 December 2021 19:49
> >
> > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Peter Collingbourne
> > > > Sent: 09 December 2021 22:16
> > > >
> > > > This patch series introduces a kernel feature known as uaccess
> > > > logging, which allows userspace programs to be made aware of the
> > > > address and size of uaccesses performed by the kernel during
> > > > the servicing of a syscall. More details on the motivation
> > > > for and interface to this feature are available in the file
> > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > patch in the series.
> > >
> > > How does this work when get_user() and put_user() are used to
> > > do optimised copies?
> > >
> > > While adding checks to copy_to/from_user() is going to have
> > > a measurable performance impact - even if nothing is done,
> > > adding them to get/put_user() (and friends) is going to
> > > make some hot paths really slow.
> > >
> > > So maybe you could add it so KASAN test kernels, but you can't
> > > sensibly enable it on a production kernel.
> > >
> > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > But have you actually looked at all the transfers that happen
> > > for something like sendmsg().
> > > The 'user copy hardening' code already has a significant impact
> > > on that code (in many places).
> >
> > Hi David,
> >
> > Yes, I realised after I sent out my patch (and while writing test
> > cases for it) that it didn't cover get_user()/put_user(). I have a
> > patch under development that will add this coverage. I used it to run
> > my invalid syscall and uname benchmarks and the results were basically
> > the same as without the coverage.
> >
> > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > look at writing my own if not. I was also planning to write a
> > benchmark that uses getresuid() as this was the simplest syscall that
> > I could find that does multiple put_user() calls.
>
> Also look at sys_poll() I think that uses __put/get_user().
>
> I think you'll find some of the socket option code also uses get_user().
>
> There is also the compat code for import_iovec().
> IIRC that is actually faster than the non-compat version at the moment.
>
> I did some benchmarking of writev("/dev/null", iov, 10);
> The cost of reading in the iovec is significant in that case.
> Maybe I ought to find time to sort out my patches.
>
> For sendmsg() using __copy_from_user() to avoid the user-copy
> hardening checks also makes a measurable difference when sending UDP
> through raw sockets - which we do a lot of.
>
> I think you'd need to instrument user_access_begin() and also be able
> to merge trace entries (for multiple get_user() calls).
>
> You really don't have to look far to find places where copy_to/from_user()
> is optimised to multiple get/put_user() or __get/put_user() (or are they
> the 'nofault' variants?)
> Those are all hot paths - at least for some workloads.
> So adding anything there isn't likely to be accepted for production kernels.

Okay, but let's see what the benchmarks say first.

I added calls to uaccess_buffer_log_{read,write}() to
__{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
variant of my usual test program that does getresuid() in a loop and
measured an overhead of 5.9% on the small cores and 2.5% on the big
cores. The overhead appears to come from two sources:
1) The calling convention for the call to
__uaccess_buffer_log_read/write() transforms some leaf functions into
non-leaf functions, and as a result we end up spilling more registers.
2) We need to reload the flags from the task_struct every time to
determine the feature enabled state.
I think it should be possible to reduce the cost down to one
instruction (a conditional branch) per get/put_user() call, at least
on arm64, by using a different calling convention for the call and
maintaining the feature enabled state in one of the currently-ignored
bits 60-63 of a reserved register (x18).

I have the patch below which reduces the overhead with my test program
to 2.3% small 1.5% big (not fully functional yet because the rest of
the code never actually flips that bit in x18). For syscalls like
sendmsg() and poll() I would expect the overall overhead to be lower
because of other required work done by these syscalls (glancing
through their implementations this appears to be the case). But I'll
benchmark it of course. The preserve_most attribute is Clang-only but
I think it should be possible to use inline asm to get the same effect
with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
enabled but I was unable to measure an overhead for getresuid() when
reserving x18 unconditionally.

Peter

diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
index 122d9e1ccbbd..81b43a0a5505 100644
--- a/include/linux/uaccess-buffer.h
+++ b/include/linux/uaccess-buffer.h
@@ -99,7 +99,8 @@ unsigned long copy_from_user_with_log_flags(void
*to, const void __user *from,
  */
 void uaccess_buffer_free(struct task_struct *tsk);

-void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+__attribute__((preserve_most)) void
+__uaccess_buffer_log_read(const void __user *from, unsigned long n);
 /**
  * uaccess_buffer_log_read - log a read access
  * @from: the address of the access.
@@ -107,11 +108,14 @@ void __uaccess_buffer_log_read(const void __user
*from, unsigned long n);
  */
 static inline void uaccess_buffer_log_read(const void __user *from,
unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_read(from, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_read(from, n);
 }

-void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+__attribute__((preserve_most)) void __uaccess_buffer_log_write(void __user *to,
+                                                              unsigned long n);
 /**
  * uaccess_buffer_log_write - log a write access
  * @to: the address of the access.
@@ -119,8 +123,10 @@ void __uaccess_buffer_log_write(void __user *to,
unsigned long n);
  */
 static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_write(to, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_write(to, n);
 }

 #else

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
  2021-12-14  3:47         ` Peter Collingbourne
@ 2021-12-15  4:27           ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-15  4:27 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Mon, Dec 13, 2021 at 7:47 PM Peter Collingbourne <pcc@google.com> wrote:
>
> On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 13 December 2021 19:49
> > >
> > > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Peter Collingbourne
> > > > > Sent: 09 December 2021 22:16
> > > > >
> > > > > This patch series introduces a kernel feature known as uaccess
> > > > > logging, which allows userspace programs to be made aware of the
> > > > > address and size of uaccesses performed by the kernel during
> > > > > the servicing of a syscall. More details on the motivation
> > > > > for and interface to this feature are available in the file
> > > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > > patch in the series.
> > > >
> > > > How does this work when get_user() and put_user() are used to
> > > > do optimised copies?
> > > >
> > > > While adding checks to copy_to/from_user() is going to have
> > > > a measurable performance impact - even if nothing is done,
> > > > adding them to get/put_user() (and friends) is going to
> > > > make some hot paths really slow.
> > > >
> > > > So maybe you could add it so KASAN test kernels, but you can't
> > > > sensibly enable it on a production kernel.
> > > >
> > > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > > But have you actually looked at all the transfers that happen
> > > > for something like sendmsg().
> > > > The 'user copy hardening' code already has a significant impact
> > > > on that code (in many places).
> > >
> > > Hi David,
> > >
> > > Yes, I realised after I sent out my patch (and while writing test
> > > cases for it) that it didn't cover get_user()/put_user(). I have a
> > > patch under development that will add this coverage. I used it to run
> > > my invalid syscall and uname benchmarks and the results were basically
> > > the same as without the coverage.
> > >
> > > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > > look at writing my own if not. I was also planning to write a
> > > benchmark that uses getresuid() as this was the simplest syscall that
> > > I could find that does multiple put_user() calls.
> >
> > Also look at sys_poll() I think that uses __put/get_user().
> >
> > I think you'll find some of the socket option code also uses get_user().
> >
> > There is also the compat code for import_iovec().
> > IIRC that is actually faster than the non-compat version at the moment.
> >
> > I did some benchmarking of writev("/dev/null", iov, 10);
> > The cost of reading in the iovec is significant in that case.
> > Maybe I ought to find time to sort out my patches.
> >
> > For sendmsg() using __copy_from_user() to avoid the user-copy
> > hardening checks also makes a measurable difference when sending UDP
> > through raw sockets - which we do a lot of.
> >
> > I think you'd need to instrument user_access_begin() and also be able
> > to merge trace entries (for multiple get_user() calls).
> >
> > You really don't have to look far to find places where copy_to/from_user()
> > is optimised to multiple get/put_user() or __get/put_user() (or are they
> > the 'nofault' variants?)
> > Those are all hot paths - at least for some workloads.
> > So adding anything there isn't likely to be accepted for production kernels.
>
> Okay, but let's see what the benchmarks say first.
>
> I added calls to uaccess_buffer_log_{read,write}() to
> __{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
> variant of my usual test program that does getresuid() in a loop and
> measured an overhead of 5.9% on the small cores and 2.5% on the big
> cores. The overhead appears to come from two sources:
> 1) The calling convention for the call to
> __uaccess_buffer_log_read/write() transforms some leaf functions into
> non-leaf functions, and as a result we end up spilling more registers.
> 2) We need to reload the flags from the task_struct every time to
> determine the feature enabled state.
> I think it should be possible to reduce the cost down to one
> instruction (a conditional branch) per get/put_user() call, at least
> on arm64, by using a different calling convention for the call and
> maintaining the feature enabled state in one of the currently-ignored
> bits 60-63 of a reserved register (x18).
>
> I have the patch below which reduces the overhead with my test program
> to 2.3% small 1.5% big (not fully functional yet because the rest of
> the code never actually flips that bit in x18). For syscalls like
> sendmsg() and poll() I would expect the overall overhead to be lower
> because of other required work done by these syscalls (glancing
> through their implementations this appears to be the case). But I'll
> benchmark it of course. The preserve_most attribute is Clang-only but
> I think it should be possible to use inline asm to get the same effect
> with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
> enabled but I was unable to measure an overhead for getresuid() when
> reserving x18 unconditionally.

I developed benchmarks for sendmsg() and poll() and, as I expected,
the performance overhead of my patch was in the noise, except that I
measured 1.0% overhead for poll(), but only on the little cores. So it
looks like getresuid() still represents the worst case. I'm not sure
it's worth spending time developing additional benchmarks because I
would expect them to show similar results.

By the way, is there any interest in maintaining my syscall latency
benchmarks in tree? They have been invaluable for measuring the
performance impact of my kernel changes. The benchmarks that I
developed are written in arm64 assembly to avoid measurement error due
to overheads introduced by the libc syscall wrappers (i.e. they make
my benchmark results look as bad as possible, by spending as little
time in userspace as possible). I looked around in the tree and found
"perf bench syscall" but that uses the libc wrappers.

Peter

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 0/7] kernel: introduce uaccess logging
@ 2021-12-15  4:27           ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-15  4:27 UTC (permalink / raw)
  To: David Laight
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, Andy Lutomirski, Kees Cook, Andrew Morton,
	Masahiro Yamada, Sami Tolvanen, YiFei Zhu, Mark Rutland,
	Frederic Weisbecker, Viresh Kumar, Andrey Konovalov,
	Gabriel Krisman Bertazi, Chris Hyser, Daniel Vetter,
	Chris Wilson, Arnd Bergmann, Dmitry Vyukov, Christian Brauner,
	Eric W. Biederman, Alexey Gladkov, Ran Xiaokai,
	David Hildenbrand, Xiaofeng Cao, Cyrill Gorcunov, Thomas Cedeno,
	Marco Elver, Alexander Potapenko, linux-kernel, linux-arm-kernel,
	Evgenii Stepanov

On Mon, Dec 13, 2021 at 7:47 PM Peter Collingbourne <pcc@google.com> wrote:
>
> On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 13 December 2021 19:49
> > >
> > > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Peter Collingbourne
> > > > > Sent: 09 December 2021 22:16
> > > > >
> > > > > This patch series introduces a kernel feature known as uaccess
> > > > > logging, which allows userspace programs to be made aware of the
> > > > > address and size of uaccesses performed by the kernel during
> > > > > the servicing of a syscall. More details on the motivation
> > > > > for and interface to this feature are available in the file
> > > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > > patch in the series.
> > > >
> > > > How does this work when get_user() and put_user() are used to
> > > > do optimised copies?
> > > >
> > > > While adding checks to copy_to/from_user() is going to have
> > > > a measurable performance impact - even if nothing is done,
> > > > adding them to get/put_user() (and friends) is going to
> > > > make some hot paths really slow.
> > > >
> > > > So maybe you could add it so KASAN test kernels, but you can't
> > > > sensibly enable it on a production kernel.
> > > >
> > > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > > But have you actually looked at all the transfers that happen
> > > > for something like sendmsg().
> > > > The 'user copy hardening' code already has a significant impact
> > > > on that code (in many places).
> > >
> > > Hi David,
> > >
> > > Yes, I realised after I sent out my patch (and while writing test
> > > cases for it) that it didn't cover get_user()/put_user(). I have a
> > > patch under development that will add this coverage. I used it to run
> > > my invalid syscall and uname benchmarks and the results were basically
> > > the same as without the coverage.
> > >
> > > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > > look at writing my own if not. I was also planning to write a
> > > benchmark that uses getresuid() as this was the simplest syscall that
> > > I could find that does multiple put_user() calls.
> >
> > Also look at sys_poll() I think that uses __put/get_user().
> >
> > I think you'll find some of the socket option code also uses get_user().
> >
> > There is also the compat code for import_iovec().
> > IIRC that is actually faster than the non-compat version at the moment.
> >
> > I did some benchmarking of writev("/dev/null", iov, 10);
> > The cost of reading in the iovec is significant in that case.
> > Maybe I ought to find time to sort out my patches.
> >
> > For sendmsg() using __copy_from_user() to avoid the user-copy
> > hardening checks also makes a measurable difference when sending UDP
> > through raw sockets - which we do a lot of.
> >
> > I think you'd need to instrument user_access_begin() and also be able
> > to merge trace entries (for multiple get_user() calls).
> >
> > You really don't have to look far to find places where copy_to/from_user()
> > is optimised to multiple get/put_user() or __get/put_user() (or are they
> > the 'nofault' variants?)
> > Those are all hot paths - at least for some workloads.
> > So adding anything there isn't likely to be accepted for production kernels.
>
> Okay, but let's see what the benchmarks say first.
>
> I added calls to uaccess_buffer_log_{read,write}() to
> __{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
> variant of my usual test program that does getresuid() in a loop and
> measured an overhead of 5.9% on the small cores and 2.5% on the big
> cores. The overhead appears to come from two sources:
> 1) The calling convention for the call to
> __uaccess_buffer_log_read/write() transforms some leaf functions into
> non-leaf functions, and as a result we end up spilling more registers.
> 2) We need to reload the flags from the task_struct every time to
> determine the feature enabled state.
> I think it should be possible to reduce the cost down to one
> instruction (a conditional branch) per get/put_user() call, at least
> on arm64, by using a different calling convention for the call and
> maintaining the feature enabled state in one of the currently-ignored
> bits 60-63 of a reserved register (x18).
>
> I have the patch below which reduces the overhead with my test program
> to 2.3% small 1.5% big (not fully functional yet because the rest of
> the code never actually flips that bit in x18). For syscalls like
> sendmsg() and poll() I would expect the overall overhead to be lower
> because of other required work done by these syscalls (glancing
> through their implementations this appears to be the case). But I'll
> benchmark it of course. The preserve_most attribute is Clang-only but
> I think it should be possible to use inline asm to get the same effect
> with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
> enabled but I was unable to measure an overhead for getresuid() when
> reserving x18 unconditionally.

I developed benchmarks for sendmsg() and poll() and, as I expected,
the performance overhead of my patch was in the noise, except that I
measured 1.0% overhead for poll(), but only on the little cores. So it
looks like getresuid() still represents the worst case. I'm not sure
it's worth spending time developing additional benchmarks because I
would expect them to show similar results.

By the way, is there any interest in maintaining my syscall latency
benchmarks in tree? They have been invaluable for measuring the
performance impact of my kernel changes. The benchmarks that I
developed are written in arm64 assembly to avoid measurement error due
to overheads introduced by the libc syscall wrappers (i.e. they make
my benchmark results look as bad as possible, by spending as little
time in userspace as possible). I looked around in the tree and found
"perf bench syscall" but that uses the libc wrappers.

Peter

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-11 11:50     ` Thomas Gleixner
@ 2021-12-16  1:25       ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-16  1:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Thu, Dec 09 2021 at 14:15, Peter Collingbourne wrote:
> > @@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >  static void exit_to_user_mode_prepare(struct pt_regs *regs)
> >  {
> >       unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +     bool uaccess_buffer_pending;
> >
> >       lockdep_assert_irqs_disabled();
> >
> >       /* Flush pending rcuog wakeup before the last need_resched() check */
> >       tick_nohz_user_enter_prepare();
> >
> > -     if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> > +     if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
> > +             bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> > +
> >               ti_work = exit_to_user_mode_loop(regs, ti_work);
> > +             uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
>
> What? Let me look at the these two functions, which are so full of useful
> comments:
>
> > +bool __uaccess_buffer_pre_exit_loop(void)
> > +{
> > +     struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > +     struct uaccess_descriptor __user *desc_ptr;
> > +     sigset_t tmp_mask;
> > +
> > +     if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > +             return false;
> > +
> > +     current->real_blocked = current->blocked;
> > +     sigfillset(&tmp_mask);
> > +     set_current_blocked(&tmp_mask);
>
> This prevents signal delivery in exit_to_user_mode_loop(), right?

It prevents asynchronous signal delivery, same as with
sigprocmask(SIG_SETMASK, set, NULL) with a full set.

> > +     return true;
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void)
> > +{
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&current->sighand->siglock, flags);
> > +     current->blocked = current->real_blocked;
> > +     recalc_sigpending();
>
> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
> has completed, recalculates pending signals and goes out to user space
> with eventually pending signals.
>
> How is this supposed to be even remotely correct?

Please see this paragraph from the documentation:

When entering the kernel with a non-zero uaccess descriptor
address for a reason other than a syscall (for example, when
IPI'd due to an incoming asynchronous signal), any signals other
than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
initialized with ``sigfillset(set)``. This is to prevent incoming
signals from interfering with uaccess logging.

I believe that we will also go out to userspace with pending signals
when one of the signals that came in was a masked (via sigprocmask)
asynchronous signal, so this is an expected state.

> But that aside, let me look at the whole picture as I understand it from
> reverse engineering it. Yes, reverse engineering, because there are
> neither comments in the code nor any useful information in the
> changelogs of 2/7 and 4/7. Also the cover letter and the "documentation"
> are not explaining any of this and just blurb about sanitizers and how
> wonderful this all is.

The whole business with pre/post_exit_loop() is implementing the
paragraph mentioned above. I imagine that the kerneldoc comments could
be improved by referencing that paragraph.

> > @@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> >                       return ret;
> >       }
> >
> > +     if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
> > +             uaccess_buffer_syscall_entry();
>
> This conditionally sets SYSCALL_WORK_UACCESS_BUFFER_EXIT.

Right.

> > @@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
> >
> >       audit_syscall_exit(regs);
> >
> > +     if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
> > +             uaccess_buffer_syscall_exit();
>
> When returning from the syscall and SYSCALL_WORK_UACCESS_BUFFER_EXIT is
> set, then uaccess_buffer_syscall_exit() clears
> SYSCALL_WORK_UACCESS_BUFFER_EXIT, right?

Right.

> This is called _before_ exit_to_user_mode_prepare(). So why is this
> __uaccess_buffer_pre/post_exit_loop() required at all?
>
> It's not required at all. Why?
>
> Simply because there are only two ways how exit_to_user_mode_prepare()
> can be reached:
>
>   1) When returning from a syscall
>
>   2) When returning from an interrupt which hit user mode execution
>
> #1 SYSCALL_WORK_UACCESS_BUFFER_EXIT is cleared _before_
>    exit_to_user_mode_prepare() is reached as documented above.
>
> #2 SYSCALL_WORK_UACCESS_BUFFER_EXIT cannot be set because the entry
>    to the kernel does not go through syscall_trace_enter().
>
> So what is this pre/post exit loop code about? Handle something which
> cannot happen in the first place?

The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
used to set the uaccess descriptor address address to a non-zero
value. It is a different flag from UACCESS_BUFFER_EXIT. It is
certainly possible for the ENTRY flag to be set in your 2) above,
since that flag is not normally modified while inside the kernel.

> If at all this would warrant a:
>
>         if (WARN_ON_ONCE(test_syscall_work(UACCESS_BUFFER_ENTRY)))
>                 do_something_sensible();
>
> instead of adding undocumented voodoo w/o providing any rationale. Well,
> I can see why that was not provided because there is no rationale to
> begin with.
>
> Seriously, I'm all for better instrumentation and analysis, but if the
> code provided for that is incomprehensible, uncommented and
> undocumented, then the result is worse than what we have now.

Okay, as well as improving the kerneldoc I'll add some code comments
to make it clearer what's going on.

> If you think that this qualifies as documentation:
>
> > +/*
> > + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> > + */
>
> > +/*
> > + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> > + */
>
> > +/*
> > + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> > + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> > + * be passed to uaccess_buffer_post_exit_loop.
> > + */
>
> > +/*
> > + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> > + * pre-kernel-exit loop that handles signals, tracing etc.
> > + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> > + */
>
> then we have a very differrent understanding of what documentation
> should provide.

This was intended as interface documentation, so it doesn't go into
too many details. It could certainly be improved though by referencing
the user documentation, as I mentioned above.

Peter

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-16  1:25       ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-16  1:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Thu, Dec 09 2021 at 14:15, Peter Collingbourne wrote:
> > @@ -197,14 +201,19 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >  static void exit_to_user_mode_prepare(struct pt_regs *regs)
> >  {
> >       unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +     bool uaccess_buffer_pending;
> >
> >       lockdep_assert_irqs_disabled();
> >
> >       /* Flush pending rcuog wakeup before the last need_resched() check */
> >       tick_nohz_user_enter_prepare();
> >
> > -     if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> > +     if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
> > +             bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> > +
> >               ti_work = exit_to_user_mode_loop(regs, ti_work);
> > +             uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
>
> What? Let me look at the these two functions, which are so full of useful
> comments:
>
> > +bool __uaccess_buffer_pre_exit_loop(void)
> > +{
> > +     struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > +     struct uaccess_descriptor __user *desc_ptr;
> > +     sigset_t tmp_mask;
> > +
> > +     if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > +             return false;
> > +
> > +     current->real_blocked = current->blocked;
> > +     sigfillset(&tmp_mask);
> > +     set_current_blocked(&tmp_mask);
>
> This prevents signal delivery in exit_to_user_mode_loop(), right?

It prevents asynchronous signal delivery, same as with
sigprocmask(SIG_SETMASK, set, NULL) with a full set.

> > +     return true;
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void)
> > +{
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&current->sighand->siglock, flags);
> > +     current->blocked = current->real_blocked;
> > +     recalc_sigpending();
>
> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
> has completed, recalculates pending signals and goes out to user space
> with eventually pending signals.
>
> How is this supposed to be even remotely correct?

Please see this paragraph from the documentation:

When entering the kernel with a non-zero uaccess descriptor
address for a reason other than a syscall (for example, when
IPI'd due to an incoming asynchronous signal), any signals other
than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
initialized with ``sigfillset(set)``. This is to prevent incoming
signals from interfering with uaccess logging.

I believe that we will also go out to userspace with pending signals
when one of the signals that came in was a masked (via sigprocmask)
asynchronous signal, so this is an expected state.

> But that aside, let me look at the whole picture as I understand it from
> reverse engineering it. Yes, reverse engineering, because there are
> neither comments in the code nor any useful information in the
> changelogs of 2/7 and 4/7. Also the cover letter and the "documentation"
> are not explaining any of this and just blurb about sanitizers and how
> wonderful this all is.

The whole business with pre/post_exit_loop() is implementing the
paragraph mentioned above. I imagine that the kerneldoc comments could
be improved by referencing that paragraph.

> > @@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> >                       return ret;
> >       }
> >
> > +     if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
> > +             uaccess_buffer_syscall_entry();
>
> This conditionally sets SYSCALL_WORK_UACCESS_BUFFER_EXIT.

Right.

> > @@ -247,6 +256,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
> >
> >       audit_syscall_exit(regs);
> >
> > +     if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
> > +             uaccess_buffer_syscall_exit();
>
> When returning from the syscall and SYSCALL_WORK_UACCESS_BUFFER_EXIT is
> set, then uaccess_buffer_syscall_exit() clears
> SYSCALL_WORK_UACCESS_BUFFER_EXIT, right?

Right.

> This is called _before_ exit_to_user_mode_prepare(). So why is this
> __uaccess_buffer_pre/post_exit_loop() required at all?
>
> It's not required at all. Why?
>
> Simply because there are only two ways how exit_to_user_mode_prepare()
> can be reached:
>
>   1) When returning from a syscall
>
>   2) When returning from an interrupt which hit user mode execution
>
> #1 SYSCALL_WORK_UACCESS_BUFFER_EXIT is cleared _before_
>    exit_to_user_mode_prepare() is reached as documented above.
>
> #2 SYSCALL_WORK_UACCESS_BUFFER_EXIT cannot be set because the entry
>    to the kernel does not go through syscall_trace_enter().
>
> So what is this pre/post exit loop code about? Handle something which
> cannot happen in the first place?

The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
used to set the uaccess descriptor address address to a non-zero
value. It is a different flag from UACCESS_BUFFER_EXIT. It is
certainly possible for the ENTRY flag to be set in your 2) above,
since that flag is not normally modified while inside the kernel.

> If at all this would warrant a:
>
>         if (WARN_ON_ONCE(test_syscall_work(UACCESS_BUFFER_ENTRY)))
>                 do_something_sensible();
>
> instead of adding undocumented voodoo w/o providing any rationale. Well,
> I can see why that was not provided because there is no rationale to
> begin with.
>
> Seriously, I'm all for better instrumentation and analysis, but if the
> code provided for that is incomprehensible, uncommented and
> undocumented, then the result is worse than what we have now.

Okay, as well as improving the kerneldoc I'll add some code comments
to make it clearer what's going on.

> If you think that this qualifies as documentation:
>
> > +/*
> > + * uaccess_buffer_syscall_entry - hook to be run before syscall entry
> > + */
>
> > +/*
> > + * uaccess_buffer_syscall_exit - hook to be run after syscall exit
> > + */
>
> > +/*
> > + * uaccess_buffer_pre_exit_loop - hook to be run immediately before the
> > + * pre-kernel-exit loop that handles signals, tracing etc. Returns a bool to
> > + * be passed to uaccess_buffer_post_exit_loop.
> > + */
>
> > +/*
> > + * uaccess_buffer_post_exit_loop - hook to be run immediately after the
> > + * pre-kernel-exit loop that handles signals, tracing etc.
> > + * @pending: the bool returned from uaccess_buffer_pre_exit_loop.
> > + */
>
> then we have a very differrent understanding of what documentation
> should provide.

This was intended as interface documentation, so it doesn't go into
too many details. It could certainly be improved though by referencing
the user documentation, as I mentioned above.

Peter

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-16  1:25       ` Peter Collingbourne
@ 2021-12-16 13:05         ` Thomas Gleixner
  -1 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-16 13:05 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Wed, Dec 15 2021 at 17:25, Peter Collingbourne wrote:
> On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
>> has completed, recalculates pending signals and goes out to user space
>> with eventually pending signals.
>>
>> How is this supposed to be even remotely correct?
>
> Please see this paragraph from the documentation:
>
> When entering the kernel with a non-zero uaccess descriptor
> address for a reason other than a syscall (for example, when
> IPI'd due to an incoming asynchronous signal), any signals other
> than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> ``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> initialized with ``sigfillset(set)``. This is to prevent incoming
> signals from interfering with uaccess logging.
>
> I believe that we will also go out to userspace with pending signals
> when one of the signals that came in was a masked (via sigprocmask)
> asynchronous signal, so this is an expected state.

Believe is not part of a technical analysis, believe belongs into the
realm of religion.

It's a fundamental difference whether the program masks signals itself
or the kernel decides to do that just because.

Pending signals, which are not masked by the process, have to be
delivered _before_ returning to user space.

    That's the expected behaviour. Period.

Instrumentation which changes the semantics of the observed code is
broken by definition.

>> So what is this pre/post exit loop code about? Handle something which
>> cannot happen in the first place?
>
> The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
> which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
> used to set the uaccess descriptor address address to a non-zero
> value. It is a different flag from UACCESS_BUFFER_EXIT. It is
> certainly possible for the ENTRY flag to be set in your 2) above,
> since that flag is not normally modified while inside the kernel.

Let me try again. The logger is only active when:

    1) PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR has set an address, which
       sets UACCESS_BUFFER_ENTRY

    2) The task enters the kernel via syscall and the syscall entry
       observes UACCESS_BUFFER_ENTRY and sets UACCESS_BUFFER_EXIT

because the log functions only log when UACCESS_BUFFER_EXIT is set.

UACCESS_BUFFER_EXIT is cleared in the syscall exit path _before_ the
exit to usermode loop is reached, which means signal delivery is _NOT_
logged at all.

A non-syscall entry from user space - interrupt, exception, NMI - will
_NOT_ set UACCESS_BUFFER_EXIT because it takes a different entry
path. So when that non-syscall entry returns and delivers a signal then
there is no logging.

When the task has entered the kernel via a syscall and the kernel gets
interrupted and that interruption raises a signal, then there is no
signal delivery. The interrupt returns to kernel mode, which obviously
does not go through exit_to_user_mode(). The raised signal is delivered
when the task returns from the syscall to user mode, but that won't be
logged because UACCESS_BUFFER_EXIT is already cleared before the exit to
user mode loop is reached.

See?

>> then we have a very differrent understanding of what documentation
>> should provide.
>
> This was intended as interface documentation, so it doesn't go into
> too many details. It could certainly be improved though by referencing
> the user documentation, as I mentioned above.

Explanations which are required to make the code understandable have to
be in the code/kernel-doc comments and not in some disjunct place. This
disjunct documentation is guaranteed to be out of date within no time.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-16 13:05         ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-16 13:05 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Wed, Dec 15 2021 at 17:25, Peter Collingbourne wrote:
> On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
>> has completed, recalculates pending signals and goes out to user space
>> with eventually pending signals.
>>
>> How is this supposed to be even remotely correct?
>
> Please see this paragraph from the documentation:
>
> When entering the kernel with a non-zero uaccess descriptor
> address for a reason other than a syscall (for example, when
> IPI'd due to an incoming asynchronous signal), any signals other
> than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> ``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> initialized with ``sigfillset(set)``. This is to prevent incoming
> signals from interfering with uaccess logging.
>
> I believe that we will also go out to userspace with pending signals
> when one of the signals that came in was a masked (via sigprocmask)
> asynchronous signal, so this is an expected state.

Believe is not part of a technical analysis, believe belongs into the
realm of religion.

It's a fundamental difference whether the program masks signals itself
or the kernel decides to do that just because.

Pending signals, which are not masked by the process, have to be
delivered _before_ returning to user space.

    That's the expected behaviour. Period.

Instrumentation which changes the semantics of the observed code is
broken by definition.

>> So what is this pre/post exit loop code about? Handle something which
>> cannot happen in the first place?
>
> The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
> which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
> used to set the uaccess descriptor address address to a non-zero
> value. It is a different flag from UACCESS_BUFFER_EXIT. It is
> certainly possible for the ENTRY flag to be set in your 2) above,
> since that flag is not normally modified while inside the kernel.

Let me try again. The logger is only active when:

    1) PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR has set an address, which
       sets UACCESS_BUFFER_ENTRY

    2) The task enters the kernel via syscall and the syscall entry
       observes UACCESS_BUFFER_ENTRY and sets UACCESS_BUFFER_EXIT

because the log functions only log when UACCESS_BUFFER_EXIT is set.

UACCESS_BUFFER_EXIT is cleared in the syscall exit path _before_ the
exit to usermode loop is reached, which means signal delivery is _NOT_
logged at all.

A non-syscall entry from user space - interrupt, exception, NMI - will
_NOT_ set UACCESS_BUFFER_EXIT because it takes a different entry
path. So when that non-syscall entry returns and delivers a signal then
there is no logging.

When the task has entered the kernel via a syscall and the kernel gets
interrupted and that interruption raises a signal, then there is no
signal delivery. The interrupt returns to kernel mode, which obviously
does not go through exit_to_user_mode(). The raised signal is delivered
when the task returns from the syscall to user mode, but that won't be
logged because UACCESS_BUFFER_EXIT is already cleared before the exit to
user mode loop is reached.

See?

>> then we have a very differrent understanding of what documentation
>> should provide.
>
> This was intended as interface documentation, so it doesn't go into
> too many details. It could certainly be improved though by referencing
> the user documentation, as I mentioned above.

Explanations which are required to make the code understandable have to
be in the code/kernel-doc comments and not in some disjunct place. This
disjunct documentation is guaranteed to be out of date within no time.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-16 13:05         ` Thomas Gleixner
@ 2021-12-17  0:09           ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-17  0:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

On Thu, Dec 16, 2021 at 5:05 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Wed, Dec 15 2021 at 17:25, Peter Collingbourne wrote:
> > On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
> >> has completed, recalculates pending signals and goes out to user space
> >> with eventually pending signals.
> >>
> >> How is this supposed to be even remotely correct?
> >
> > Please see this paragraph from the documentation:
> >
> > When entering the kernel with a non-zero uaccess descriptor
> > address for a reason other than a syscall (for example, when
> > IPI'd due to an incoming asynchronous signal), any signals other
> > than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> > ``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> > initialized with ``sigfillset(set)``. This is to prevent incoming
> > signals from interfering with uaccess logging.
> >
> > I believe that we will also go out to userspace with pending signals
> > when one of the signals that came in was a masked (via sigprocmask)
> > asynchronous signal, so this is an expected state.
>
> Believe is not part of a technical analysis, believe belongs into the
> realm of religion.
>
> It's a fundamental difference whether the program masks signals itself
> or the kernel decides to do that just because.
>
> Pending signals, which are not masked by the process, have to be
> delivered _before_ returning to user space.
>
>     That's the expected behaviour. Period.
>
> Instrumentation which changes the semantics of the observed code is
> broken by definition.

The idea is that the uaccess descriptor address would be set to a
non-zero value inside the syscall wrapper, before performing the
syscall. Since the kernel will set the uaccess descriptor address to
zero before returning from a syscall, at no point should the caller of
the syscall wrapper be executing with a non-zero uaccess descriptor
address. At worst, a signal will be delivered to a task executing a
syscall wrapper a few instructions later than it would otherwise, but
that's not really important because the determination of the exact
delivery point of an asynchronous signal is fundamentally racy anyway.

Basically, it's as if the syscall wrapper did this:

// During task startup:
uint64_t addr;
prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);

// Wrapper for syscall "x"
int x(...) {
  sigset_t set, old_set;
  struct uaccess_descriptor desc = { ... };

  addr = &desc;
  // The following two statements implicitly occur atomically together
with setting addr:
  sigfillset(set);
  sigprocmask(SIG_SETMASK, set, old_set);

  syscall(__NR_x ...,);
  // The following two statements implicitly occur atomically together
with the syscall:
  sigprocmask(SIG_SETMASK, old_set, NULL);
  addr = 0;

  // Now the uaccesses for syscall "x" are logged to "desc".
}

Aside from the guarantees of atomicity, this really seems no different
from the kernel providing another API for setting the signal mask.

> >> So what is this pre/post exit loop code about? Handle something which
> >> cannot happen in the first place?
> >
> > The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
> > which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
> > used to set the uaccess descriptor address address to a non-zero
> > value. It is a different flag from UACCESS_BUFFER_EXIT. It is
> > certainly possible for the ENTRY flag to be set in your 2) above,
> > since that flag is not normally modified while inside the kernel.
>
> Let me try again. The logger is only active when:
>
>     1) PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR has set an address, which
>        sets UACCESS_BUFFER_ENTRY
>
>     2) The task enters the kernel via syscall and the syscall entry
>        observes UACCESS_BUFFER_ENTRY and sets UACCESS_BUFFER_EXIT
>
> because the log functions only log when UACCESS_BUFFER_EXIT is set.
>
> UACCESS_BUFFER_EXIT is cleared in the syscall exit path _before_ the
> exit to usermode loop is reached, which means signal delivery is _NOT_
> logged at all.

Right. It is not the intent to log uaccesses associated with signal
delivery. Only uaccesses that occur while handling the syscall itself
are logged.

> A non-syscall entry from user space - interrupt, exception, NMI - will
> _NOT_ set UACCESS_BUFFER_EXIT because it takes a different entry
> path. So when that non-syscall entry returns and delivers a signal then
> there is no logging.

Again, that's fine, there's no intent to log that.

> When the task has entered the kernel via a syscall and the kernel gets
> interrupted and that interruption raises a signal, then there is no
> signal delivery. The interrupt returns to kernel mode, which obviously
> does not go through exit_to_user_mode(). The raised signal is delivered
> when the task returns from the syscall to user mode, but that won't be
> logged because UACCESS_BUFFER_EXIT is already cleared before the exit to
> user mode loop is reached.
>
> See?

Perhaps there is a misunderstanding of the purpose of the signal
blocking with non-zero uaccess descriptor address. It isn't there
because we want to log anything about these signals. It's there
because we don't want a signal handler to be invoked between when we
arrange for the kernel to log the next syscall and when we issue the
syscall that we want to log, because that could lead to the signal
handler's syscalls being logged instead of the syscall that we intend
to log.

Consider the syscall wrapper for syscall "x" above, and imagine that
we didn't have the sigprocmask statements, and imagine that a signal
came in after storing &desc to addr but before the call to syscall.
Also imagine that the handler for that signal is unaware of uaccess
logging, so it just issues syscalls directly without touching addr.
Now the first syscall performed by the signal handler will be logged,
instead of the intended syscall "x", because the kernel will read the
uaccess descriptor intended for logging syscall "x" from addr when
entering the kernel for the signal handler's first syscall.

The kernel setting addr to 0 during the syscall is also necessary in
order for the kernel to continue processing signals normally once the
logged syscall has returned. Effectively any incoming signals are
queued until we have finished processing the logged syscall. Because
the kernel has set addr to 0, it refrains from blocking signals when
returning to userspace from the logged syscall, and therefore any
pending signals are delivered. Any syscalls that occur in any signal
handlers invoked via this signal delivery will not interfere with the
previously collected log for syscall "x", precisely because addr was
set to 0 by the kernel. If we left it up to userspace to set addr back
to 0, we would have the same problem as if we didn't have the
sigprocmask statements, but now the critical section extends until the
userspace program sets addr to 0. Furthermore, a userspace program
setting addr to 0 would not automatically cause the pending signals to
be delivered (because simply storing a value to memory from userspace
will not necessarily trigger a kernel entry), and signals could
therefore be left blocked for longer than expected (at least until the
next kernel entry).

> >> then we have a very differrent understanding of what documentation
> >> should provide.
> >
> > This was intended as interface documentation, so it doesn't go into
> > too many details. It could certainly be improved though by referencing
> > the user documentation, as I mentioned above.
>
> Explanations which are required to make the code understandable have to
> be in the code/kernel-doc comments and not in some disjunct place. This
> disjunct documentation is guaranteed to be out of date within no time.

Got it. From our discussion it's clear that the justification for the
design of the uaccess logging interface (especially the signal
handling parts) needs to be documented in the code in order to avoid
confusion.

Peter

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-17  0:09           ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2021-12-17  0:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

On Thu, Dec 16, 2021 at 5:05 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Wed, Dec 15 2021 at 17:25, Peter Collingbourne wrote:
> > On Sat, Dec 11, 2021 at 3:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> This restores the signal blocked mask _after_ exit_to_user_mode_loop()
> >> has completed, recalculates pending signals and goes out to user space
> >> with eventually pending signals.
> >>
> >> How is this supposed to be even remotely correct?
> >
> > Please see this paragraph from the documentation:
> >
> > When entering the kernel with a non-zero uaccess descriptor
> > address for a reason other than a syscall (for example, when
> > IPI'd due to an incoming asynchronous signal), any signals other
> > than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> > ``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> > initialized with ``sigfillset(set)``. This is to prevent incoming
> > signals from interfering with uaccess logging.
> >
> > I believe that we will also go out to userspace with pending signals
> > when one of the signals that came in was a masked (via sigprocmask)
> > asynchronous signal, so this is an expected state.
>
> Believe is not part of a technical analysis, believe belongs into the
> realm of religion.
>
> It's a fundamental difference whether the program masks signals itself
> or the kernel decides to do that just because.
>
> Pending signals, which are not masked by the process, have to be
> delivered _before_ returning to user space.
>
>     That's the expected behaviour. Period.
>
> Instrumentation which changes the semantics of the observed code is
> broken by definition.

The idea is that the uaccess descriptor address would be set to a
non-zero value inside the syscall wrapper, before performing the
syscall. Since the kernel will set the uaccess descriptor address to
zero before returning from a syscall, at no point should the caller of
the syscall wrapper be executing with a non-zero uaccess descriptor
address. At worst, a signal will be delivered to a task executing a
syscall wrapper a few instructions later than it would otherwise, but
that's not really important because the determination of the exact
delivery point of an asynchronous signal is fundamentally racy anyway.

Basically, it's as if the syscall wrapper did this:

// During task startup:
uint64_t addr;
prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);

// Wrapper for syscall "x"
int x(...) {
  sigset_t set, old_set;
  struct uaccess_descriptor desc = { ... };

  addr = &desc;
  // The following two statements implicitly occur atomically together
with setting addr:
  sigfillset(set);
  sigprocmask(SIG_SETMASK, set, old_set);

  syscall(__NR_x ...,);
  // The following two statements implicitly occur atomically together
with the syscall:
  sigprocmask(SIG_SETMASK, old_set, NULL);
  addr = 0;

  // Now the uaccesses for syscall "x" are logged to "desc".
}

Aside from the guarantees of atomicity, this really seems no different
from the kernel providing another API for setting the signal mask.

> >> So what is this pre/post exit loop code about? Handle something which
> >> cannot happen in the first place?
> >
> > The pre/post_exit_loop() functions are checking UACCESS_BUFFER_ENTRY,
> > which is set when prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR) has been
> > used to set the uaccess descriptor address address to a non-zero
> > value. It is a different flag from UACCESS_BUFFER_EXIT. It is
> > certainly possible for the ENTRY flag to be set in your 2) above,
> > since that flag is not normally modified while inside the kernel.
>
> Let me try again. The logger is only active when:
>
>     1) PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR has set an address, which
>        sets UACCESS_BUFFER_ENTRY
>
>     2) The task enters the kernel via syscall and the syscall entry
>        observes UACCESS_BUFFER_ENTRY and sets UACCESS_BUFFER_EXIT
>
> because the log functions only log when UACCESS_BUFFER_EXIT is set.
>
> UACCESS_BUFFER_EXIT is cleared in the syscall exit path _before_ the
> exit to usermode loop is reached, which means signal delivery is _NOT_
> logged at all.

Right. It is not the intent to log uaccesses associated with signal
delivery. Only uaccesses that occur while handling the syscall itself
are logged.

> A non-syscall entry from user space - interrupt, exception, NMI - will
> _NOT_ set UACCESS_BUFFER_EXIT because it takes a different entry
> path. So when that non-syscall entry returns and delivers a signal then
> there is no logging.

Again, that's fine, there's no intent to log that.

> When the task has entered the kernel via a syscall and the kernel gets
> interrupted and that interruption raises a signal, then there is no
> signal delivery. The interrupt returns to kernel mode, which obviously
> does not go through exit_to_user_mode(). The raised signal is delivered
> when the task returns from the syscall to user mode, but that won't be
> logged because UACCESS_BUFFER_EXIT is already cleared before the exit to
> user mode loop is reached.
>
> See?

Perhaps there is a misunderstanding of the purpose of the signal
blocking with non-zero uaccess descriptor address. It isn't there
because we want to log anything about these signals. It's there
because we don't want a signal handler to be invoked between when we
arrange for the kernel to log the next syscall and when we issue the
syscall that we want to log, because that could lead to the signal
handler's syscalls being logged instead of the syscall that we intend
to log.

Consider the syscall wrapper for syscall "x" above, and imagine that
we didn't have the sigprocmask statements, and imagine that a signal
came in after storing &desc to addr but before the call to syscall.
Also imagine that the handler for that signal is unaware of uaccess
logging, so it just issues syscalls directly without touching addr.
Now the first syscall performed by the signal handler will be logged,
instead of the intended syscall "x", because the kernel will read the
uaccess descriptor intended for logging syscall "x" from addr when
entering the kernel for the signal handler's first syscall.

The kernel setting addr to 0 during the syscall is also necessary in
order for the kernel to continue processing signals normally once the
logged syscall has returned. Effectively any incoming signals are
queued until we have finished processing the logged syscall. Because
the kernel has set addr to 0, it refrains from blocking signals when
returning to userspace from the logged syscall, and therefore any
pending signals are delivered. Any syscalls that occur in any signal
handlers invoked via this signal delivery will not interfere with the
previously collected log for syscall "x", precisely because addr was
set to 0 by the kernel. If we left it up to userspace to set addr back
to 0, we would have the same problem as if we didn't have the
sigprocmask statements, but now the critical section extends until the
userspace program sets addr to 0. Furthermore, a userspace program
setting addr to 0 would not automatically cause the pending signals to
be delivered (because simply storing a value to memory from userspace
will not necessarily trigger a kernel entry), and signals could
therefore be left blocked for longer than expected (at least until the
next kernel entry).

> >> then we have a very differrent understanding of what documentation
> >> should provide.
> >
> > This was intended as interface documentation, so it doesn't go into
> > too many details. It could certainly be improved though by referencing
> > the user documentation, as I mentioned above.
>
> Explanations which are required to make the code understandable have to
> be in the code/kernel-doc comments and not in some disjunct place. This
> disjunct documentation is guaranteed to be out of date within no time.

Got it. From our discussion it's clear that the justification for the
design of the uaccess logging interface (especially the signal
handling parts) needs to be documented in the code in order to avoid
confusion.

Peter

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-17  0:09           ` Peter Collingbourne
@ 2021-12-17 18:42             ` Thomas Gleixner
  -1 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-17 18:42 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Thu, Dec 16 2021 at 16:09, Peter Collingbourne wrote:
> userspace program sets addr to 0. Furthermore, a userspace program
> setting addr to 0 would not automatically cause the pending signals to
> be delivered (because simply storing a value to memory from userspace
> will not necessarily trigger a kernel entry), and signals could
> therefore be left blocked for longer than expected (at least until the
> next kernel entry).

Groan, so what you are trying to prevent is:

   *ptr = addr;

--> interrupt
       signal raised

    signal delivery

    signal handler
      syscall()   <- Logs this syscall
      sigreturn;

   syscall() <- Is not logged

I must have missed that detail in these novel sized comments all over
the place.

Yes, I can see how that pre/post muck solves this, but TBH while it is
admittedly a smart hack it's also a horrible hack.

There are a two aspects which I really dislike:

  - It's yet another ad hoc 'solution' to scratch 'my particular itch'

  - It's adding a horrorshow in the syscall hotpath. We have already
    enough gunk there. No need to add more.

The problem you are trying to solve is to instrument user accesses of
the kernel, which is special purpose tracing, right?

May I ask why this is not solvable via tracepoints?

  DECLARE_EVENT_CLASS(uaccess_class,....);
  DECLARE_EVENT(uaccess_class, uaccess_read,...);
  DECLARE_EVENT(uaccess_class, uaccess_write,...);

    trace_uaccess_read(from, n);

    trace_uaccess_write(to, n);

Tracepoints have filters, tooling, libraries etc. Zero code except for
the tracepoints themself. They are disabled by default with a static
key, which means very close to 0 overhead.

Aside of that such tracepoints can be used for other purposes as well
and are therefore not bound to 'my particular itch'.

There are obviously some questions to solve versus filtering, but even
with a stupid tid based filter, it's easy enough to filter out the stuff
you're interested in. E.g. to filter out the signal scenario above all
you need is to enable two more tracepoints:

   signal:signal_deliver
   syscalls:sys_enter_rt_sigreturn

and when analyzing the event stream you can just skip the noise between
signal:signal_deliver and syscalls:sys_enter_rt_sigreturn trace entries.

There are other fancies like BPF which can be used for filtering and
filling a map with entries.

The only downside is that system configuration restricts access and
requires certain priviledges. But is that a real problem for the purpose
at hand, sanitizers and validation tools?

I don't think it is, but you surely can provide more information here.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2021-12-17 18:42             ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2021-12-17 18:42 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	linux-kernel, linux-arm-kernel, Evgenii Stepanov

Peter,

On Thu, Dec 16 2021 at 16:09, Peter Collingbourne wrote:
> userspace program sets addr to 0. Furthermore, a userspace program
> setting addr to 0 would not automatically cause the pending signals to
> be delivered (because simply storing a value to memory from userspace
> will not necessarily trigger a kernel entry), and signals could
> therefore be left blocked for longer than expected (at least until the
> next kernel entry).

Groan, so what you are trying to prevent is:

   *ptr = addr;

--> interrupt
       signal raised

    signal delivery

    signal handler
      syscall()   <- Logs this syscall
      sigreturn;

   syscall() <- Is not logged

I must have missed that detail in these novel sized comments all over
the place.

Yes, I can see how that pre/post muck solves this, but TBH while it is
admittedly a smart hack it's also a horrible hack.

There are a two aspects which I really dislike:

  - It's yet another ad hoc 'solution' to scratch 'my particular itch'

  - It's adding a horrorshow in the syscall hotpath. We have already
    enough gunk there. No need to add more.

The problem you are trying to solve is to instrument user accesses of
the kernel, which is special purpose tracing, right?

May I ask why this is not solvable via tracepoints?

  DECLARE_EVENT_CLASS(uaccess_class,....);
  DECLARE_EVENT(uaccess_class, uaccess_read,...);
  DECLARE_EVENT(uaccess_class, uaccess_write,...);

    trace_uaccess_read(from, n);

    trace_uaccess_write(to, n);

Tracepoints have filters, tooling, libraries etc. Zero code except for
the tracepoints themself. They are disabled by default with a static
key, which means very close to 0 overhead.

Aside of that such tracepoints can be used for other purposes as well
and are therefore not bound to 'my particular itch'.

There are obviously some questions to solve versus filtering, but even
with a stupid tid based filter, it's easy enough to filter out the stuff
you're interested in. E.g. to filter out the signal scenario above all
you need is to enable two more tracepoints:

   signal:signal_deliver
   syscalls:sys_enter_rt_sigreturn

and when analyzing the event stream you can just skip the noise between
signal:signal_deliver and syscalls:sys_enter_rt_sigreturn trace entries.

There are other fancies like BPF which can be used for filtering and
filling a map with entries.

The only downside is that system configuration restricts access and
requires certain priviledges. But is that a real problem for the purpose
at hand, sanitizers and validation tools?

I don't think it is, but you surely can provide more information here.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  2021-12-17 18:42             ` Thomas Gleixner
@ 2022-01-10 21:43               ` Peter Collingbourne
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2022-01-10 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	Linux Kernel Mailing List, Linux ARM, Evgenii Stepanov

On Fri, Dec 17, 2021 at 10:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Thu, Dec 16 2021 at 16:09, Peter Collingbourne wrote:
> > userspace program sets addr to 0. Furthermore, a userspace program
> > setting addr to 0 would not automatically cause the pending signals to
> > be delivered (because simply storing a value to memory from userspace
> > will not necessarily trigger a kernel entry), and signals could
> > therefore be left blocked for longer than expected (at least until the
> > next kernel entry).
>
> Groan, so what you are trying to prevent is:
>
>    *ptr = addr;
>
> --> interrupt
>        signal raised
>
>     signal delivery
>
>     signal handler
>       syscall()   <- Logs this syscall
>       sigreturn;
>
>    syscall() <- Is not logged
>
> I must have missed that detail in these novel sized comments all over
> the place.
>
> Yes, I can see how that pre/post muck solves this, but TBH while it is
> admittedly a smart hack it's also a horrible hack.
>
> There are a two aspects which I really dislike:
>
>   - It's yet another ad hoc 'solution' to scratch 'my particular itch'
>
>   - It's adding a horrorshow in the syscall hotpath. We have already
>     enough gunk there. No need to add more.
>
> The problem you are trying to solve is to instrument user accesses of
> the kernel, which is special purpose tracing, right?
>
> May I ask why this is not solvable via tracepoints?
>
>   DECLARE_EVENT_CLASS(uaccess_class,....);
>   DECLARE_EVENT(uaccess_class, uaccess_read,...);
>   DECLARE_EVENT(uaccess_class, uaccess_write,...);
>
>     trace_uaccess_read(from, n);
>
>     trace_uaccess_write(to, n);
>
> Tracepoints have filters, tooling, libraries etc. Zero code except for
> the tracepoints themself. They are disabled by default with a static
> key, which means very close to 0 overhead.
>
> Aside of that such tracepoints can be used for other purposes as well
> and are therefore not bound to 'my particular itch'.
>
> There are obviously some questions to solve versus filtering, but even
> with a stupid tid based filter, it's easy enough to filter out the stuff
> you're interested in. E.g. to filter out the signal scenario above all
> you need is to enable two more tracepoints:
>
>    signal:signal_deliver
>    syscalls:sys_enter_rt_sigreturn
>
> and when analyzing the event stream you can just skip the noise between
> signal:signal_deliver and syscalls:sys_enter_rt_sigreturn trace entries.

I'm afraid that it won't always work. This is because the control flow
can leave a signal handler without going through sigreturn (e.g. when
throwing an exception through the signal handler in C++, or via
longjmp).

But aside from that, there seem to be more fundamental problems with
using tracepoints for this. Let me start by saying that the uaccess
monitoring needs to be synchronous. This is because we need to read
the sanitizer metadata (e.g. memory tags for HWASan) for the memory
accessed by the syscall at the time that the syscall takes place,
before the syscall wrapper function returns. If we read it later, the
memory accessed by the syscall may have been deallocated or
reallocated, so we may end up producing a false positive
use-after-free (if deallocated) or buffer-overflow (if reallocated)
report.

As far as I can tell, there are four main ways that a userspace
program can access tracepoint events:

1) /sys/kernel/debug/tracing/trace_pipe
2) perf_event_open()
3) BPF
4) SIGTRAP on perf events

Let me go over each of them in turn. I don't think any of them will
work, but let me know if you think I made a mistake.

1) The event monitoring via trace_pipe appears to be a mechanism that
is global to the system, so it isn't suitable for an environment where
multiple processes may have sanitizers enabled.

2) This seems to be closest to being feasible, but it looks like there
are a few issues. From reading the perf_event_open man page, the way I
imagined it could work is that we sample the events uaccess_read,
uaccess_write, signal_deliver and sys_enter_rt_sigreturn to a ring
buffer. Then in the syscall wrapper we read the buffer and filter out
the events as you proposed. We do, however, need to be careful to
avoid memory corruption with this approach -- suppose that a signal
comes in while the syscall wrapper is part way through reading the
ring buffer. The syscalls performed by the signal handler may cause
the kernel to overwrite the part of the ring buffer that the syscall
wrapper is currently reading. I think something like this may work to
avoid the overwriting:

// thread local variables:
perf_event_mmap_page *page; // initialized at thread startup
bool in_syscall = false;

long syscall(...) {
  if (in_syscall) {
    // just do the syscall without logging to avoid reentrancy issues
    raw_syscall(...);
  }
  in_syscall = true;
  old_data_head = page->data_head;
  page->data_tail = old_data_head;
  raw_syscall(...);
  new_data_head = page->data_head;
  // read events between old_data_head and new_data_head
  in_syscall = false;
}

With this scheme we end up dropping events from signal handlers in
rare cases. This may be acceptable because the whole scheme is
best-effort anyway. However, if we abnormally return from the syscall
wrapper (exception or longjmp from signal handler), we will be stuck
in the state where logging is disabled, which seems more problematic.

For this to work we will need access to the arguments to the
uaccess_read and uaccess_write events. I couldn't find anything
suitable in the man page, but it does offer this tantalizing snippet:

              PERF_SAMPLE_RAW
                     Records additional data, if applicable.  Usually
returned by tracepoint events.

Maybe the arguments can be accessed via this record? Reading through
the kernel code it looks like *something* is copied in response to
this flag being set, but it is not very clear what. A comment in the
include/uapi/linux/perf_event.h header seems to indicate, however,
that even if the arguments are accessible this way, that isn't part of
the ABI:

         *      #
         *      # The RAW record below is opaque data wrt the ABI
         *      #
         *      # That is, the ABI doesn't make any promises wrt to
         *      # the stability of its content, it may vary depending
         *      # on event, hardware, kernel version and phase of
         *      # the moon.
         *      #
         *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
         *      #
         *
         *      { u32                   size;
         *        char                  data[size];}&& PERF_SAMPLE_RAW

This would make PERF_SAMPLE_RAW unsuitable for use in sanitizers
because they are generally expected not to break with new kernel
versions.

Another issue is that we need to be able to read the id file from
tracefs in order to issue the perf_event_open() syscall, but there
isn't necessarily a guarantee that tracefs will always be available
given that we will want to use this from almost every process on the
system including init and sandboxed processes. It also looks like the
whole "events" tree of tracefs is restricted to root on at least some
Android devices. There may be other privileges issues that I haven't
run into yet.

3) Installing BPF programs requires additional privileges. As
explained below, this makes it unsuitable for use with sanitizers.

4) I explained why I don't think it will work in the cover letter of
my most recent patch series:

- Tracing would need to be synchronous in order to produce useful
  stack traces. For example this could be achieved using the new SIGTRAP
  on perf events mechanism. However, this would require logging each
  access to the stack (in the form of a sigcontext) and this is more
  likely to overflow the stack due to being much larger than a uaccess
  buffer entry as well as being unbounded, in contrast to the bounded
  buffer size passed to prctl(). An approach based on signal handlers is
  also likely to fall foul of the asynchronous signal issues mentioned
  previously, together with needing sigreturn to be handled specially
  (because it copies a sigcontext from userspace) otherwise we could
  never return from the signal handler. Furthermore, arguments to the
  trace events are not available to SIGTRAP. (This on its own wouldn't
  be insurmountable though -- we could add the arguments as fields
  to siginfo.)

> There are other fancies like BPF which can be used for filtering and
> filling a map with entries.
>
> The only downside is that system configuration restricts access and
> requires certain priviledges. But is that a real problem for the purpose
> at hand, sanitizers and validation tools?
>
> I don't think it is, but you surely can provide more information here.

I'm afraid that it is a problem. We frequently deploy sanitizers in
production-like environments, which means that processes may not only
be running as non-root but may also be sandboxed. For example, a
typical HWASan deployment on an Android device will have HWASan
enabled in almost every process, including untrusted application
processes and sandboxed processes used for parsing untrusted data (and
which are particularly security sensitive).

Furthermore, we plan to use uaccess logging with the ARM Memory
Tagging Extension, which is intended to be deployed to end-user
devices. While opening holes in the sandbox for HWASan may be possible
in some cases (though inadvisable because sanitizers are meant to be
transparent to the program as much as possible, and we don't control
every sandbox used by every Android app), we shouldn't open holes in
the sandbox on end-user devices even if we were able to get every app
with a sandbox to do so as this would increase the risk to end users
(whereas the entire goal of deploying this is to *reduce* the risk).

Peter

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support
@ 2022-01-10 21:43               ` Peter Collingbourne
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Collingbourne @ 2022-01-10 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, Will Deacon, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Andy Lutomirski, Kees Cook, Andrew Morton, Masahiro Yamada,
	Sami Tolvanen, YiFei Zhu, Mark Rutland, Frederic Weisbecker,
	Viresh Kumar, Andrey Konovalov, Gabriel Krisman Bertazi,
	Chris Hyser, Daniel Vetter, Chris Wilson, Arnd Bergmann,
	Dmitry Vyukov, Christian Brauner, Eric W. Biederman,
	Alexey Gladkov, Ran Xiaokai, David Hildenbrand, Xiaofeng Cao,
	Cyrill Gorcunov, Thomas Cedeno, Marco Elver, Alexander Potapenko,
	Linux Kernel Mailing List, Linux ARM, Evgenii Stepanov

On Fri, Dec 17, 2021 at 10:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Peter,
>
> On Thu, Dec 16 2021 at 16:09, Peter Collingbourne wrote:
> > userspace program sets addr to 0. Furthermore, a userspace program
> > setting addr to 0 would not automatically cause the pending signals to
> > be delivered (because simply storing a value to memory from userspace
> > will not necessarily trigger a kernel entry), and signals could
> > therefore be left blocked for longer than expected (at least until the
> > next kernel entry).
>
> Groan, so what you are trying to prevent is:
>
>    *ptr = addr;
>
> --> interrupt
>        signal raised
>
>     signal delivery
>
>     signal handler
>       syscall()   <- Logs this syscall
>       sigreturn;
>
>    syscall() <- Is not logged
>
> I must have missed that detail in these novel sized comments all over
> the place.
>
> Yes, I can see how that pre/post muck solves this, but TBH while it is
> admittedly a smart hack it's also a horrible hack.
>
> There are a two aspects which I really dislike:
>
>   - It's yet another ad hoc 'solution' to scratch 'my particular itch'
>
>   - It's adding a horrorshow in the syscall hotpath. We have already
>     enough gunk there. No need to add more.
>
> The problem you are trying to solve is to instrument user accesses of
> the kernel, which is special purpose tracing, right?
>
> May I ask why this is not solvable via tracepoints?
>
>   DECLARE_EVENT_CLASS(uaccess_class,....);
>   DECLARE_EVENT(uaccess_class, uaccess_read,...);
>   DECLARE_EVENT(uaccess_class, uaccess_write,...);
>
>     trace_uaccess_read(from, n);
>
>     trace_uaccess_write(to, n);
>
> Tracepoints have filters, tooling, libraries etc. Zero code except for
> the tracepoints themself. They are disabled by default with a static
> key, which means very close to 0 overhead.
>
> Aside of that such tracepoints can be used for other purposes as well
> and are therefore not bound to 'my particular itch'.
>
> There are obviously some questions to solve versus filtering, but even
> with a stupid tid based filter, it's easy enough to filter out the stuff
> you're interested in. E.g. to filter out the signal scenario above all
> you need is to enable two more tracepoints:
>
>    signal:signal_deliver
>    syscalls:sys_enter_rt_sigreturn
>
> and when analyzing the event stream you can just skip the noise between
> signal:signal_deliver and syscalls:sys_enter_rt_sigreturn trace entries.

I'm afraid that it won't always work. This is because the control flow
can leave a signal handler without going through sigreturn (e.g. when
throwing an exception through the signal handler in C++, or via
longjmp).

But aside from that, there seem to be more fundamental problems with
using tracepoints for this. Let me start by saying that the uaccess
monitoring needs to be synchronous. This is because we need to read
the sanitizer metadata (e.g. memory tags for HWASan) for the memory
accessed by the syscall at the time that the syscall takes place,
before the syscall wrapper function returns. If we read it later, the
memory accessed by the syscall may have been deallocated or
reallocated, so we may end up producing a false positive
use-after-free (if deallocated) or buffer-overflow (if reallocated)
report.

As far as I can tell, there are four main ways that a userspace
program can access tracepoint events:

1) /sys/kernel/debug/tracing/trace_pipe
2) perf_event_open()
3) BPF
4) SIGTRAP on perf events

Let me go over each of them in turn. I don't think any of them will
work, but let me know if you think I made a mistake.

1) The event monitoring via trace_pipe appears to be a mechanism that
is global to the system, so it isn't suitable for an environment where
multiple processes may have sanitizers enabled.

2) This seems to be closest to being feasible, but it looks like there
are a few issues. From reading the perf_event_open man page, the way I
imagined it could work is that we sample the events uaccess_read,
uaccess_write, signal_deliver and sys_enter_rt_sigreturn to a ring
buffer. Then in the syscall wrapper we read the buffer and filter out
the events as you proposed. We do, however, need to be careful to
avoid memory corruption with this approach -- suppose that a signal
comes in while the syscall wrapper is part way through reading the
ring buffer. The syscalls performed by the signal handler may cause
the kernel to overwrite the part of the ring buffer that the syscall
wrapper is currently reading. I think something like this may work to
avoid the overwriting:

// thread local variables:
perf_event_mmap_page *page; // initialized at thread startup
bool in_syscall = false;

long syscall(...) {
  if (in_syscall) {
    // just do the syscall without logging to avoid reentrancy issues
    raw_syscall(...);
  }
  in_syscall = true;
  old_data_head = page->data_head;
  page->data_tail = old_data_head;
  raw_syscall(...);
  new_data_head = page->data_head;
  // read events between old_data_head and new_data_head
  in_syscall = false;
}

With this scheme we end up dropping events from signal handlers in
rare cases. This may be acceptable because the whole scheme is
best-effort anyway. However, if we abnormally return from the syscall
wrapper (exception or longjmp from signal handler), we will be stuck
in the state where logging is disabled, which seems more problematic.

For this to work we will need access to the arguments to the
uaccess_read and uaccess_write events. I couldn't find anything
suitable in the man page, but it does offer this tantalizing snippet:

              PERF_SAMPLE_RAW
                     Records additional data, if applicable.  Usually
returned by tracepoint events.

Maybe the arguments can be accessed via this record? Reading through
the kernel code it looks like *something* is copied in response to
this flag being set, but it is not very clear what. A comment in the
include/uapi/linux/perf_event.h header seems to indicate, however,
that even if the arguments are accessible this way, that isn't part of
the ABI:

         *      #
         *      # The RAW record below is opaque data wrt the ABI
         *      #
         *      # That is, the ABI doesn't make any promises wrt to
         *      # the stability of its content, it may vary depending
         *      # on event, hardware, kernel version and phase of
         *      # the moon.
         *      #
         *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
         *      #
         *
         *      { u32                   size;
         *        char                  data[size];}&& PERF_SAMPLE_RAW

This would make PERF_SAMPLE_RAW unsuitable for use in sanitizers
because they are generally expected not to break with new kernel
versions.

Another issue is that we need to be able to read the id file from
tracefs in order to issue the perf_event_open() syscall, but there
isn't necessarily a guarantee that tracefs will always be available
given that we will want to use this from almost every process on the
system including init and sandboxed processes. It also looks like the
whole "events" tree of tracefs is restricted to root on at least some
Android devices. There may be other privileges issues that I haven't
run into yet.

3) Installing BPF programs requires additional privileges. As
explained below, this makes it unsuitable for use with sanitizers.

4) I explained why I don't think it will work in the cover letter of
my most recent patch series:

- Tracing would need to be synchronous in order to produce useful
  stack traces. For example this could be achieved using the new SIGTRAP
  on perf events mechanism. However, this would require logging each
  access to the stack (in the form of a sigcontext) and this is more
  likely to overflow the stack due to being much larger than a uaccess
  buffer entry as well as being unbounded, in contrast to the bounded
  buffer size passed to prctl(). An approach based on signal handlers is
  also likely to fall foul of the asynchronous signal issues mentioned
  previously, together with needing sigreturn to be handled specially
  (because it copies a sigcontext from userspace) otherwise we could
  never return from the signal handler. Furthermore, arguments to the
  trace events are not available to SIGTRAP. (This on its own wouldn't
  be insurmountable though -- we could add the arguments as fields
  to siginfo.)

> There are other fancies like BPF which can be used for filtering and
> filling a map with entries.
>
> The only downside is that system configuration restricts access and
> requires certain priviledges. But is that a real problem for the purpose
> at hand, sanitizers and validation tools?
>
> I don't think it is, but you surely can provide more information here.

I'm afraid that it is a problem. We frequently deploy sanitizers in
production-like environments, which means that processes may not only
be running as non-root but may also be sandboxed. For example, a
typical HWASan deployment on an Android device will have HWASan
enabled in almost every process, including untrusted application
processes and sandboxed processes used for parsing untrusted data (and
which are particularly security sensitive).

Furthermore, we plan to use uaccess logging with the ARM Memory
Tagging Extension, which is intended to be deployed to end-user
devices. While opening holes in the sandbox for HWASan may be possible
in some cases (though inadvisable because sanitizers are meant to be
transparent to the program as much as possible, and we don't control
every sandbox used by every Android app), we shouldn't open holes in
the sandbox on end-user devices even if we were able to get every app
with a sandbox to do so as this would increase the risk to end users
(whereas the entire goal of deploying this is to *reduce* the risk).

Peter

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-01-10 21:45 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-09 22:15 [PATCH v4 0/7] kernel: introduce uaccess logging Peter Collingbourne
2021-12-09 22:15 ` Peter Collingbourne
2021-12-09 22:15 ` [PATCH v4 1/7] include: split out uaccess instrumentation into a separate header Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-10 12:45   ` Marco Elver
2021-12-10 12:45     ` Marco Elver
2021-12-09 22:15 ` [PATCH v4 2/7] uaccess-buffer: add core code Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-10  3:52   ` Dmitry Vyukov
2021-12-10  3:52     ` Dmitry Vyukov
2021-12-10 12:39   ` Marco Elver
2021-12-10 12:39     ` Marco Elver
2021-12-09 22:15 ` [PATCH v4 3/7] fs: use copy_from_user_nolog() to copy mount() data Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-09 22:15 ` [PATCH v4 4/7] uaccess-buffer: add CONFIG_GENERIC_ENTRY support Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-11 11:50   ` Thomas Gleixner
2021-12-11 11:50     ` Thomas Gleixner
2021-12-16  1:25     ` Peter Collingbourne
2021-12-16  1:25       ` Peter Collingbourne
2021-12-16 13:05       ` Thomas Gleixner
2021-12-16 13:05         ` Thomas Gleixner
2021-12-17  0:09         ` Peter Collingbourne
2021-12-17  0:09           ` Peter Collingbourne
2021-12-17 18:42           ` Thomas Gleixner
2021-12-17 18:42             ` Thomas Gleixner
2022-01-10 21:43             ` Peter Collingbourne
2022-01-10 21:43               ` Peter Collingbourne
2021-12-09 22:15 ` [PATCH v4 5/7] arm64: add support for uaccess logging Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-09 22:15 ` [PATCH v4 6/7] Documentation: document " Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-09 22:15 ` [PATCH v4 7/7] selftests: test " Peter Collingbourne
2021-12-09 22:15   ` Peter Collingbourne
2021-12-10 13:30   ` Marco Elver
2021-12-10 13:30     ` Marco Elver
2021-12-11 17:23 ` [PATCH v4 0/7] kernel: introduce " David Laight
2021-12-11 17:23   ` David Laight
2021-12-13 19:48   ` Peter Collingbourne
2021-12-13 19:48     ` Peter Collingbourne
2021-12-13 23:07     ` David Laight
2021-12-13 23:07       ` David Laight
2021-12-14  3:47       ` Peter Collingbourne
2021-12-14  3:47         ` Peter Collingbourne
2021-12-15  4:27         ` Peter Collingbourne
2021-12-15  4:27           ` Peter Collingbourne

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.