linux-kernel.vger.kernel.org archive mirror
* [PATCH bpf 0/3] Fix the read of vsyscall page through bpf
@ 2024-01-19  7:30 Hou Tao
  2024-01-19  7:30 ` [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h Hou Tao
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-19  7:30 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

As reported by syzbot [1] and [2], an oops happens when trying to read
the vsyscall page by using bpf_probe_read_kernel() or bpf_probe_read().

Thomas Gleixner proposed a test patch [3], but no formal patch has been
posted after about one month [4], so I am posting it instead and have
added an Originally-from tag to patch #2.

Patch #1 makes is_vsyscall_vaddr() a common helper. Patch #2 fixes the
problem by disallowing vsyscall page reads for
copy_from_kernel_nofault(). Patch #3 adds a test case to ensure that
reading the vsyscall page through bpf is rejected. Although the vsyscall
page can be disabled by vsyscall=none, that doesn't affect the
reproduction of the problem or the added test.
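
For reference, below is a minimal sketch of the kind of bpf program that
hits the oops. The selftest in patch #3 is the real reproducer; the
section name, hooked syscall and hardcoded address here are only
illustrative:

  /* Sketch only: illustrative reproducer, not part of this series. */
  #include <linux/types.h>
  #include <bpf/bpf_helpers.h>

  /* x86-64 vsyscall page address, hardcoded for illustration */
  #define VSYSCALL_ADDR 0xffffffffff600000UL

  char _license[] SEC("license") = "GPL";

  SEC("tracepoint/syscalls/sys_enter_nanosleep")
  int read_vsyscall_page(void *ctx)
  {
  	char buf[8];

  	/* Without the fix, this read oopses instead of returning an error. */
  	bpf_probe_read_kernel(buf, sizeof(buf), (const void *)VSYSCALL_ADDR);
  	return 0;
  }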

Comments are always welcome.

[1]: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/
[4]: https://lore.kernel.org/bpf/e24b125c-8ff4-9031-6c53-67ff2e01f316@huaweicloud.com/

Hou Tao (3):
  x86/mm: Move is_vsyscall_vaddr() into mm_internal.h
  x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  selftest/bpf: Test the read of vsyscall page under x86-64

 arch/x86/mm/fault.c                           | 11 +---
 arch/x86/mm/maccess.c                         |  6 ++
 arch/x86/mm/mm_internal.h                     | 13 ++++
 .../selftests/bpf/prog_tests/read_vsyscall.c  | 61 +++++++++++++++++++
 .../selftests/bpf/progs/read_vsyscall.c       | 45 ++++++++++++++
 5 files changed, 127 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

-- 
2.29.2



* [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h
  2024-01-19  7:30 [PATCH bpf 0/3] Fix the read of vsyscall page through bpf Hou Tao
@ 2024-01-19  7:30 ` Hou Tao
  2024-01-20  0:35   ` Sohil Mehta
  2024-01-19  7:30 ` [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
  2024-01-19  7:30 ` [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
  2 siblings, 1 reply; 12+ messages in thread
From: Hou Tao @ 2024-01-19  7:30 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

From: Hou Tao <houtao1@huawei.com>

Move is_vsyscall_vaddr() into mm_internal.h to make it available to
copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 arch/x86/mm/fault.c       | 11 ++---------
 arch/x86/mm/mm_internal.h | 13 +++++++++++++
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 679b09cfe241c..69e007761d9a9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -38,6 +38,8 @@
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
 
+#include "mm_internal.h"
+
 /*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
@@ -798,15 +800,6 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 	show_opcodes(regs, loglvl);
 }
 
-/*
- * The (legacy) vsyscall page is the long page in the kernel portion
- * of the address space that has user-accessible permissions.
- */
-static bool is_vsyscall_vaddr(unsigned long vaddr)
-{
-	return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
-}
-
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		       unsigned long address, u32 pkey, int si_code)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 3f37b5c80bb32..4ebf6051e1ed7 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -2,6 +2,10 @@
 #ifndef __X86_MM_INTERNAL_H
 #define __X86_MM_INTERNAL_H
 
+#include <uapi/asm/vsyscall.h>
+
+#include <asm/page_types.h>
+
 void *alloc_low_pages(unsigned int num);
 static inline void *alloc_low_page(void)
 {
@@ -25,4 +29,13 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache);
 
 extern unsigned long tlb_single_page_flush_ceiling;
 
+/*
+ * The (legacy) vsyscall page is the long page in the kernel portion
+ * of the address space that has user-accessible permissions.
+ */
+static inline bool is_vsyscall_vaddr(unsigned long vaddr)
+{
+	return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
+}
+
 #endif	/* __X86_MM_INTERNAL_H */
-- 
2.29.2



* [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-19  7:30 [PATCH bpf 0/3] Fix the read of vsyscall page through bpf Hou Tao
  2024-01-19  7:30 ` [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h Hou Tao
@ 2024-01-19  7:30 ` Hou Tao
  2024-01-23  0:18   ` Sohil Mehta
  2024-01-19  7:30 ` [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
  2 siblings, 1 reply; 12+ messages in thread
From: Hou Tao @ 2024-01-19  7:30 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

From: Hou Tao <houtao1@huawei.com>

When trying to use copy_from_kernel_nofault() to read the vsyscall page
through a bpf program, the following oops was reported:

  BUG: unable to handle page fault for address: ffffffffff600000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
  ......
  Call Trace:
   <TASK>
   ? copy_from_kernel_nofault+0x6f/0x110
   bpf_probe_read_kernel+0x1d/0x50
   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
   trace_call_bpf+0xc5/0x1c0
   perf_call_bpf_enter.isra.0+0x69/0xb0
   perf_syscall_enter+0x13e/0x200
   syscall_trace_enter+0x188/0x1c0
   do_syscall_64+0xb5/0xe0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
  ......
  ---[ end trace 0000000000000000 ]---

The oops happens as follows: a bpf program uses bpf_probe_read_kernel()
to read from the vsyscall page; bpf_probe_read_kernel() invokes
copy_from_kernel_nofault(), which in turn invokes __get_user_asm(). A
page fault exception is triggered accordingly, but handle_page_fault()
considers the vsyscall page address to be a userspace address instead of
a kernel space address, so the fixup set up by bpf isn't applied.
Because the exception happens in kernel space and page fault handling is
disabled, page_fault_oops() is invoked and an oops happens.
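
Conceptually, the address classification behaves like the userspace
sketch below, which mirrors the kind of check done by
fault_in_kernel_space()/is_vsyscall_vaddr() in arch/x86/mm/fault.c. The
constants are hardcoded for illustration and assume x86-64 with 4-level
paging; the real kernel code differs in detail:

  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_MASK      (~0xfffUL)
  #define VSYSCALL_ADDR  0xffffffffff600000UL
  /* assumed TASK_SIZE_MAX for x86-64 with 4-level paging */
  #define TASK_SIZE_MAX  0x00007ffffffff000UL

  static bool is_vsyscall_vaddr(unsigned long vaddr)
  {
  	return (vaddr & PAGE_MASK) == VSYSCALL_ADDR;
  }

  /* The vsyscall page lies above TASK_SIZE_MAX but is still treated
   * as a user address by the fault handler.
   */
  static bool fault_in_kernel_space(unsigned long addr)
  {
  	if (is_vsyscall_vaddr(addr))
  		return false;
  	return addr >= TASK_SIZE_MAX;
  }

  int main(void)
  {
  	printf("vsyscall in kernel space? %d\n",
  	       fault_in_kernel_space(VSYSCALL_ADDR));
  	return 0;
  }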

Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().

Originally-from: Thomas Gleixner <tglx@linutronix.de>
Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 arch/x86/mm/maccess.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
index 6993f026adec9..bb454e0abbfcf 100644
--- a/arch/x86/mm/maccess.c
+++ b/arch/x86/mm/maccess.c
@@ -3,6 +3,8 @@
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 
+#include "mm_internal.h"
+
 #ifdef CONFIG_X86_64
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
@@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
 		return false;
 
+	/* vsyscall page is also considered as userspace address. */
+	if (is_vsyscall_vaddr(vaddr))
+		return false;
+
 	/*
 	 * Allow everything during early boot before 'x86_virt_bits'
 	 * is initialized.  Needed for instruction decoding in early
-- 
2.29.2



* [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-19  7:30 [PATCH bpf 0/3] Fix the read of vsyscall page through bpf Hou Tao
  2024-01-19  7:30 ` [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h Hou Tao
  2024-01-19  7:30 ` [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
@ 2024-01-19  7:30 ` Hou Tao
  2024-01-22  6:30   ` Yonghong Song
  2024-01-23  0:25   ` Sohil Mehta
  2 siblings, 2 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-19  7:30 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

From: Hou Tao <houtao1@huawei.com>

Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
from the vsyscall page under x86-64 triggers an oops, so add a test case
to ensure that the problem is fixed.

Besides the four bpf helpers mentioned above, also test reading the
vsyscall page by using bpf_probe_read_user{_str}() and
bpf_copy_from_user{_task}().

The vsyscall page can be disabled by CONFIG_LEGACY_VSYSCALL_NONE or the
vsyscall=none boot command line, but that doesn't affect the
reproduction of the problem or the returned error codes.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/read_vsyscall.c  | 61 +++++++++++++++++++
 .../selftests/bpf/progs/read_vsyscall.c       | 45 ++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
new file mode 100644
index 0000000000000..d9247cc89cf3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include "test_progs.h"
+#include "read_vsyscall.skel.h"
+
+#if defined(__x86_64__)
+/* For VSYSCALL_ADDR */
+#include <asm/vsyscall.h>
+#else
+/* To prevent build failure on non-x86 arch */
+#define VSYSCALL_ADDR 0UL
+#endif
+
+struct read_ret_desc {
+	const char *name;
+	int ret;
+} all_read[] = {
+	{ .name = "probe_read_kernel", .ret = -ERANGE },
+	{ .name = "probe_read_kernel_str", .ret = -ERANGE },
+	{ .name = "probe_read", .ret = -ERANGE },
+	{ .name = "probe_read_str", .ret = -ERANGE },
+	/* __access_ok() will fail */
+	{ .name = "probe_read_user", .ret = -EFAULT },
+	/* __access_ok() will fail */
+	{ .name = "probe_read_user_str", .ret = -EFAULT },
+	/* access_ok() will fail */
+	{ .name = "copy_from_user", .ret = -EFAULT },
+	/* both vma_lookup() and expand_stack() will fail */
+	{ .name = "copy_from_user_task", .ret = -EFAULT },
+};
+
+void test_read_vsyscall(void)
+{
+	struct read_vsyscall *skel;
+	unsigned int i;
+	int err;
+
+#if !defined(__x86_64__)
+	test__skip();
+	return;
+#endif
+	skel = read_vsyscall__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
+		return;
+
+	skel->bss->target_pid = getpid();
+	err = read_vsyscall__attach(skel);
+	if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
+		goto out;
+
+	/* userspace may not have the vsyscall page due to LEGACY_VSYSCALL_NONE,
+	 * but it doesn't affect the returned error codes.
+	 */
+	skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
+	usleep(1);
+
+	for (i = 0; i < ARRAY_SIZE(all_read); i++)
+		ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret, all_read[i].name);
+out:
+	read_vsyscall__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/read_vsyscall.c b/tools/testing/selftests/bpf/progs/read_vsyscall.c
new file mode 100644
index 0000000000000..986f96687ae15
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/read_vsyscall.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+int target_pid = 0;
+void *user_ptr = 0;
+int read_ret[8];
+
+char _license[] SEC("license") = "GPL";
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+int do_probe_read(void *ctx)
+{
+	char buf[8];
+
+	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+		return 0;
+
+	read_ret[0] = bpf_probe_read_kernel(buf, sizeof(buf), user_ptr);
+	read_ret[1] = bpf_probe_read_kernel_str(buf, sizeof(buf), user_ptr);
+	read_ret[2] = bpf_probe_read(buf, sizeof(buf), user_ptr);
+	read_ret[3] = bpf_probe_read_str(buf, sizeof(buf), user_ptr);
+	read_ret[4] = bpf_probe_read_user(buf, sizeof(buf), user_ptr);
+	read_ret[5] = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+	return 0;
+}
+
+SEC("fentry.s/" SYS_PREFIX "sys_nanosleep")
+int do_copy_from_user(void *ctx)
+{
+	char buf[8];
+
+	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+		return 0;
+
+	read_ret[6] = bpf_copy_from_user(buf, sizeof(buf), user_ptr);
+	read_ret[7] = bpf_copy_from_user_task(buf, sizeof(buf), user_ptr,
+					      bpf_get_current_task_btf(), 0);
+
+	return 0;
+}
-- 
2.29.2



* Re: [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h
  2024-01-19  7:30 ` [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h Hou Tao
@ 2024-01-20  0:35   ` Sohil Mehta
  2024-01-20  1:29     ` Hou Tao
  0 siblings, 1 reply; 12+ messages in thread
From: Sohil Mehta @ 2024-01-20  0:35 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

On 1/18/2024 11:30 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Moving is_vsyscall_vaddr() into mm_internal.h to make it available for
> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
> 

Instead of mm_internal.h, would arch/x86/include/asm/vsyscall.h be a
better place for is_vsyscall_vaddr()?

Sohil


* Re: [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h
  2024-01-20  0:35   ` Sohil Mehta
@ 2024-01-20  1:29     ` Hou Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-20  1:29 UTC (permalink / raw)
  To: Sohil Mehta, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

Hi,

On 1/20/2024 8:35 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Moving is_vsyscall_vaddr() into mm_internal.h to make it available for
>> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
>>
> Instead of mm_internal.h would a better place for is_vsyscall_vaddr() be
> arch/x86/include/asm/vsyscall.h?

Yes, asm/vsyscall.h is better indeed. Will update in v2. Thanks for the
suggestion.
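
For what it's worth, a rough sketch of what that could look like is
below (illustrative only; the exact placement, guards and includes in v2
may differ):

  /* arch/x86/include/asm/vsyscall.h (sketch, not the posted patch) */
  #include <uapi/asm/vsyscall.h>
  #include <asm/page_types.h>

  /*
   * The (legacy) vsyscall page is the long page in the kernel portion
   * of the address space that has user-accessible permissions.
   */
  static inline bool is_vsyscall_vaddr(unsigned long vaddr)
  {
  	return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
  }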
>
> Sohil
>
> .



* Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-19  7:30 ` [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
@ 2024-01-22  6:30   ` Yonghong Song
  2024-01-25  6:37     ` Hou Tao
  2024-01-23  0:25   ` Sohil Mehta
  1 sibling, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2024-01-22  6:30 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1


On 1/18/24 11:30 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
> from vsyscall page under x86-64 will trigger oops, so add one test case
> to ensure that the problem is fixed.
>
> Beside those four bpf helpers mentioned above, testing the read of
> vsyscall page by using bpf_probe_read_user{_str} and
> bpf_copy_from_user{_task}() as well.
>
> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
> problem and the returned error codes.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   .../selftests/bpf/prog_tests/read_vsyscall.c  | 61 +++++++++++++++++++
>   .../selftests/bpf/progs/read_vsyscall.c       | 45 ++++++++++++++
>   2 files changed, 106 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>   create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
> new file mode 100644
> index 0000000000000..d9247cc89cf3e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
> +#include "test_progs.h"
> +#include "read_vsyscall.skel.h"
> +
> +#if defined(__x86_64__)
> +/* For VSYSCALL_ADDR */
> +#include <asm/vsyscall.h>
> +#else
> +/* To prevent build failure on non-x86 arch */
> +#define VSYSCALL_ADDR 0UL
> +#endif
> +
> +struct read_ret_desc {
> +	const char *name;
> +	int ret;
> +} all_read[] = {
> +	{ .name = "probe_read_kernel", .ret = -ERANGE },
> +	{ .name = "probe_read_kernel_str", .ret = -ERANGE },
> +	{ .name = "probe_read", .ret = -ERANGE },
> +	{ .name = "probe_read_str", .ret = -ERANGE },
> +	/* __access_ok() will fail */
> +	{ .name = "probe_read_user", .ret = -EFAULT },
> +	/* __access_ok() will fail */
> +	{ .name = "probe_read_user_str", .ret = -EFAULT },
> +	/* access_ok() will fail */
> +	{ .name = "copy_from_user", .ret = -EFAULT },
> +	/* both vma_lookup() and expand_stack() will fail */
> +	{ .name = "copy_from_user_task", .ret = -EFAULT },

The above comments are not clear enough. For example, for
'__access_ok() will fail', the user will need to check the source code
to see where __access_ok() is called, and this could be hard, e.g., for
probe_read_user_str(). Another example: for 'both vma_lookup() and
expand_stack() will fail', where are vma_lookup()/expand_stack()
called? The user needs to check further to make sense of it.

I suggest removing the above comments and adding a more detailed
explanation in the commit message, with a call stack indicating where
the fail/error return happens.

> +};
> +
> +void test_read_vsyscall(void)
> +{
> +	struct read_vsyscall *skel;
> +	unsigned int i;
> +	int err;
> +
> +#if !defined(__x86_64__)
> +	test__skip();
> +	return;
> +#endif
> +	skel = read_vsyscall__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
> +		return;
> +
> +	skel->bss->target_pid = getpid();
> +	err = read_vsyscall__attach(skel);
> +	if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
> +		goto out;
> +
> +	/* userspace may don't have vsyscall page due to LEGACY_VSYSCALL_NONE,
> +	 * but it doesn't affect the returned error codes.
> +	 */
> +	skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
> +	usleep(1);
> +
> +	for (i = 0; i < ARRAY_SIZE(all_read); i++)
> +		ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret, all_read[i].name);
> +out:
> +	read_vsyscall__destroy(skel);
> +}
[...]


* Re: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-19  7:30 ` [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
@ 2024-01-23  0:18   ` Sohil Mehta
  2024-01-25  7:18     ` Hou Tao
  0 siblings, 1 reply; 12+ messages in thread
From: Sohil Mehta @ 2024-01-23  0:18 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

On 1/18/2024 11:30 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> When trying to use copy_from_kernel_nofault() to read vsyscall page
> through a bpf program, the following oops was reported:
> 
>   BUG: unable to handle page fault for address: ffffffffff600000
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
>   Oops: 0000 [#1] PREEMPT SMP PTI
>   CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>   RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
>   ......
>   Call Trace:
>    <TASK>
>    ? copy_from_kernel_nofault+0x6f/0x110
>    bpf_probe_read_kernel+0x1d/0x50
>    bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
>    trace_call_bpf+0xc5/0x1c0
>    perf_call_bpf_enter.isra.0+0x69/0xb0
>    perf_syscall_enter+0x13e/0x200
>    syscall_trace_enter+0x188/0x1c0
>    do_syscall_64+0xb5/0xe0
>    entry_SYSCALL_64_after_hwframe+0x6e/0x76
>    </TASK>
>   ......
>   ---[ end trace 0000000000000000 ]---
> 
> The oops happens as follows: A bpf program uses bpf_probe_read_kernel()
> to read from vsyscall page, bpf_probe_read_kernel() invokes
> copy_from_kernel_nofault() in turn and then invokes __get_user_asm(). A
> page fault exception is triggered accordingly, but handle_page_fault()
> considers the vsyscall page address as a userspace address instead of
> a kernel space address, so the fix-up set-up by bpf isn't applied.

This comment and the one in the code below seem contradictory and
confusing. Do we want the vsyscall page address to be considered as a
userspace address or not?

IIUC, the issue here is that the vsyscall page (in xonly mode) is not
really mapped and therefore running copy_from_kernel_nofault() on this
address is incorrect. This patch fixes this by making
copy_from_kernel_nofault() return an error for a vsyscall address.


> Because the exception happens in kernel space and page fault handling is
> disabled, page_fault_oops() is invoked and an oops happens.
> 
> Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().
> 

[Maybe I have misunderstood the issue here and the following questions
are not even relevant.]

But, what about vsyscall=emulate? In that mode the page is actually
mapped. Would we want the page read to go through then?

> Originally-from: Thomas Gleixner <tglx@linutronix.de>

Documentation/process/maintainer-tip.rst says to use "Originally-by:"


> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
> index 6993f026adec9..bb454e0abbfcf 100644
> --- a/arch/x86/mm/maccess.c
> +++ b/arch/x86/mm/maccess.c
> @@ -3,6 +3,8 @@
>  #include <linux/uaccess.h>
>  #include <linux/kernel.h>
>  
> +#include "mm_internal.h"
> +
>  #ifdef CONFIG_X86_64
>  bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>  {
> @@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>  	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>  		return false;
>  
> +	/* vsyscall page is also considered as userspace address. */

A bit more explanation about why this should happen might be useful.

> +	if (is_vsyscall_vaddr(vaddr))
> +		return false;
> +
>  	/*
>  	 * Allow everything during early boot before 'x86_virt_bits'
>  	 * is initialized.  Needed for instruction decoding in early



* Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-19  7:30 ` [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
  2024-01-22  6:30   ` Yonghong Song
@ 2024-01-23  0:25   ` Sohil Mehta
  2024-01-25  7:54     ` Hou Tao
  1 sibling, 1 reply; 12+ messages in thread
From: Sohil Mehta @ 2024-01-23  0:25 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

On 1/18/2024 11:30 PM, Hou Tao wrote:

> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
> problem and the returned error codes.
> 

With vsyscall=emulate, a direct read of the vsyscall address from
userspace is expected to go through. This mode is deprecated, so maybe
it wouldn't matter much. Without the fix in patch 2/3, do you see the
same behavior with vsyscall=emulate set in the cmdline?

Sohil



* Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-22  6:30   ` Yonghong Song
@ 2024-01-25  6:37     ` Hou Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-25  6:37 UTC (permalink / raw)
  To: Yonghong Song, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

Hi,

On 1/22/2024 2:30 PM, Yonghong Song wrote:
>
> On 1/18/24 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
>> from vsyscall page under x86-64 will trigger oops, so add one test case
>> to ensure that the problem is fixed.
>>
>> Beside those four bpf helpers mentioned above, testing the read of
>> vsyscall page by using bpf_probe_read_user{_str} and
>> bpf_copy_from_user{_task}() as well.
>>
>> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
>> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
>> problem and the returned error codes.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>   .../selftests/bpf/prog_tests/read_vsyscall.c  | 61 +++++++++++++++++++
>>   .../selftests/bpf/progs/read_vsyscall.c       | 45 ++++++++++++++
>>   2 files changed, 106 insertions(+)
>>   create mode 100644
>> tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> new file mode 100644
>> index 0000000000000..d9247cc89cf3e
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> @@ -0,0 +1,61 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
>> +#include "test_progs.h"
>> +#include "read_vsyscall.skel.h"
>> +
>> +#if defined(__x86_64__)
>> +/* For VSYSCALL_ADDR */
>> +#include <asm/vsyscall.h>
>> +#else
>> +/* To prevent build failure on non-x86 arch */
>> +#define VSYSCALL_ADDR 0UL
>> +#endif
>> +
>> +struct read_ret_desc {
>> +    const char *name;
>> +    int ret;
>> +} all_read[] = {
>> +    { .name = "probe_read_kernel", .ret = -ERANGE },
>> +    { .name = "probe_read_kernel_str", .ret = -ERANGE },
>> +    { .name = "probe_read", .ret = -ERANGE },
>> +    { .name = "probe_read_str", .ret = -ERANGE },
>> +    /* __access_ok() will fail */
>> +    { .name = "probe_read_user", .ret = -EFAULT },
>> +    /* __access_ok() will fail */
>> +    { .name = "probe_read_user_str", .ret = -EFAULT },
>> +    /* access_ok() will fail */
>> +    { .name = "copy_from_user", .ret = -EFAULT },
>> +    /* both vma_lookup() and expand_stack() will fail */
>> +    { .name = "copy_from_user_task", .ret = -EFAULT },
>
> The above comments are not clear enough. For example,
> '__access_ok() will fail', user will need to
> check the source code where __access_ok() is and
> this could be hard e.g., for probe_read_user_str().
> Another example, 'both vma_lookup() and expand_stack() will fail',
> where is vma_lookup()/expand_stack()? User needs to further
> check to make sense.

Yes. These comments are tightly coupled with the implementation.
>
> I suggest remove the above comments and add more
> detailed explanation in commit messages with callstack
> indicating where the fail/error return happens.

Will do in v2. Thanks for the suggestions.
>
>> +};
>> +
>> +void test_read_vsyscall(void)
>> +{
>> +    struct read_vsyscall *skel;
>> +    unsigned int i;
>> +    int err;
>> +
>> +#if !defined(__x86_64__)
>> +    test__skip();
>> +    return;
>> +#endif
>> +    skel = read_vsyscall__open_and_load();
>> +    if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
>> +        return;
>> +
>> +    skel->bss->target_pid = getpid();
>> +    err = read_vsyscall__attach(skel);
>> +    if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
>> +        goto out;
>> +
>> +    /* userspace may don't have vsyscall page due to
>> LEGACY_VSYSCALL_NONE,
>> +     * but it doesn't affect the returned error codes.
>> +     */
>> +    skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
>> +    usleep(1);
>> +
>> +    for (i = 0; i < ARRAY_SIZE(all_read); i++)
>> +        ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret,
>> all_read[i].name);
>> +out:
>> +    read_vsyscall__destroy(skel);
>> +}
> [...]



* Re: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-23  0:18   ` Sohil Mehta
@ 2024-01-25  7:18     ` Hou Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-25  7:18 UTC (permalink / raw)
  To: Sohil Mehta, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1

Hi,

On 1/23/2024 8:18 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When trying to use copy_from_kernel_nofault() to read vsyscall page
>> through a bpf program, the following oops was reported:
>>
>>   BUG: unable to handle page fault for address: ffffffffff600000
>>   #PF: supervisor read access in kernel mode
>>   #PF: error_code(0x0000) - not-present page
>>   PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
>>   Oops: 0000 [#1] PREEMPT SMP PTI
>>   CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>>   RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
>>   ......
>>   Call Trace:
>>    <TASK>
>>    ? copy_from_kernel_nofault+0x6f/0x110
>>    bpf_probe_read_kernel+0x1d/0x50
>>    bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
>>    trace_call_bpf+0xc5/0x1c0
>>    perf_call_bpf_enter.isra.0+0x69/0xb0
>>    perf_syscall_enter+0x13e/0x200
>>    syscall_trace_enter+0x188/0x1c0
>>    do_syscall_64+0xb5/0xe0
>>    entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>    </TASK>
>>   ......
>>   ---[ end trace 0000000000000000 ]---
>>
>> The oops happens as follows: A bpf program uses bpf_probe_read_kernel()
>> to read from vsyscall page, bpf_probe_read_kernel() invokes
>> copy_from_kernel_nofault() in turn and then invokes __get_user_asm(). A
>> page fault exception is triggered accordingly, but handle_page_fault()
>> considers the vsyscall page address as a userspace address instead of
>> a kernel space address, so the fix-up set-up by bpf isn't applied.
> This comment and the one in the code below seem contradictory and
> confusing. Do we want the vsyscall page address to be considered as a
> userspace address or not?

handle_page_fault() already considers the vsyscall page as a userspace
address, and in this patch we update copy_from_kernel_nofault() to
consider the vsyscall page as a userspace address as well.
>
> IIUC, the issue here is that the vsyscall page (in xonly mode) is not
> really mapped and therefore running copy_from_kernel_nofault() on this
> address is incorrect. This patch fixes this by making
> copy_from_kernel_nofault() return an error for a vsyscall address.
>

Yes, but the issue may occur in the vsyscall=none case as well. Because
fault_in_kernel_space(), invoked by handle_page_fault(), returns false,
do_user_addr_fault() handles the fault, and when the SMAP feature is
enabled the invocation of copy_from_kernel_nofault() will trigger an
oops due to the following code snippet:

        if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
                     !(error_code & X86_PF_USER) &&
                     !(regs->flags & X86_EFLAGS_AC))) {
                /*
                 * No extable entry here.  This was a kernel access to an
                 * invalid pointer.  get_kernel_nofault() will not get here.
                 */
                page_fault_oops(regs, error_code, address);
                return;
        }

>> Because the exception happens in kernel space and page fault handling is
>> disabled, page_fault_oops() is invoked and an oops happens.
>>
>> Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().
>>
> [Maybe I have misunderstood the issue here and following questions are
> not even relevant.]
>
> But, what about vsyscall=emulate? In that mode the page is actually
> mapped. Would we want the page read to go through then?

Er, now that the vsyscall page is considered a userspace address, I
think we should reject reading it through copy_from_kernel_nofault()
even if it is mapped.

>
>> Originally-from: Thomas Gleixner <tglx@linutronix.de>
> Documentation/process/maintainer-tip.rst says to use "Originally-by:"

Thanks for the tip. Will update.
>
>
>> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
>> index 6993f026adec9..bb454e0abbfcf 100644
>> --- a/arch/x86/mm/maccess.c
>> +++ b/arch/x86/mm/maccess.c
>> @@ -3,6 +3,8 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/kernel.h>
>>  
>> +#include "mm_internal.h"
>> +
>>  #ifdef CONFIG_X86_64
>>  bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  {
>> @@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>>  		return false;
>>  
>> +	/* vsyscall page is also considered as userspace address. */
> A bit more explanation about why this should happen might be useful.
>
>> +	if (is_vsyscall_vaddr(vaddr))
>> +		return false;
>> +
>>  	/*
>>  	 * Allow everything during early boot before 'x86_virt_bits'
>>  	 * is initialized.  Needed for instruction decoding in early



* Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-23  0:25   ` Sohil Mehta
@ 2024-01-25  7:54     ` Hou Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Hou Tao @ 2024-01-25  7:54 UTC (permalink / raw)
  To: Sohil Mehta, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, houtao1



On 1/23/2024 8:25 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>
>> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
>> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
>> problem and the returned error codes.
>>
> With vsyscall=emulate a direct read of the vsyscall address from
> userspace is expected to go through. This is mode deprecated so maybe it
> wouldn't matter much. Without the fix in patch 2/3, do you see the same
> behavior with vsyscall=emulate set in the cmdline?

Er, I think it depends on whether or not the SMAP [1] feature is
available. When the SMAP feature is enabled, even if the vsyscall page
is populated, reading the vsyscall page through bpf_probe_read_kernel()
will trigger a page fault and then an oops. But when there is no SMAP,
bpf_probe_read_kernel() will succeed. So I think the test may need to be
skipped if vsyscall_mode is emulate.

[1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
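
As a purely hypothetical sketch (not part of this series), the selftest
could detect the explicit vsyscall=emulate case by scanning
/proc/cmdline and skip itself, e.g.:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  /* Only catches an explicit vsyscall=emulate option; the default mode
   * (no vsyscall= option) is xonly on recent kernels.
   */
  static bool vsyscall_mode_is_emulate(void)
  {
  	char cmdline[4096];
  	bool emulate = false;
  	FILE *f = fopen("/proc/cmdline", "r");

  	if (!f)
  		return false;
  	if (fgets(cmdline, sizeof(cmdline), f))
  		emulate = strstr(cmdline, "vsyscall=emulate") != NULL;
  	fclose(f);
  	return emulate;
  }

  /* In test_read_vsyscall():
   *	if (vsyscall_mode_is_emulate()) {
   *		test__skip();
   *		return;
   *	}
   */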
>
> Sohil


