* [PATCH bpf v2 0/3] Fix the read of vsyscall page through bpf
@ 2024-01-26 11:54 Hou Tao
  2024-01-26 11:54 ` [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h Hou Tao
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Hou Tao @ 2024-01-26 11:54 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Sohil Mehta, Yonghong Song, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

As reported by syzbot [1] and [2], an oops may happen when trying to
read the vsyscall page by using bpf_probe_read_kernel() or
bpf_probe_read().

Thomas Gleixner had proposed a test patch [3], but no formal patch has
been posted in about a month [4], so I am posting it instead and have
added an Originally-by tag in patch #2.

Patch #1 makes is_vsyscall_vaddr() a common helper. Patch #2 fixes the
problem by disallowing vsyscall page reads for
copy_from_kernel_nofault(). Patch #3 adds a test case to ensure that
the read of the vsyscall page through bpf is rejected. Please see the
individual patches for more details.

Comments are always welcome.

[1]: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/
[4]: https://lore.kernel.org/bpf/e24b125c-8ff4-9031-6c53-67ff2e01f316@huaweicloud.com/

Change Log:
v2:
  * move is_vsyscall_vaddr to asm/vsyscall.h instead (Sohil)
  * elaborate on the reason for disallowing of vsyscall page read in
    copy_from_kernel_nofault_allowed() (Sohil)
  * update the commit message of patch #2 to more clearly explain how
    the oops occurs. (Sohil)
  * update the commit message of patch #3 to explain the expected return
    values of various bpf helpers (Yonghong)

v1: https://lore.kernel.org/bpf/20240119073019.1528573-1-houtao@huaweicloud.com/

Hou Tao (3):
  x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
  x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  selftest/bpf: Test the read of vsyscall page under x86-64

 arch/x86/include/asm/vsyscall.h               | 10 ++++
 arch/x86/mm/fault.c                           |  9 ---
 arch/x86/mm/maccess.c                         |  9 +++
 .../selftests/bpf/prog_tests/read_vsyscall.c  | 57 +++++++++++++++++++
 .../selftests/bpf/progs/read_vsyscall.c       | 45 +++++++++++++++
 5 files changed, 121 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

-- 
2.29.2



* [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
  2024-01-26 11:54 [PATCH bpf v2 0/3] Fix the read of vsyscall page through bpf Hou Tao
@ 2024-01-26 11:54 ` Hou Tao
  2024-01-29 23:56   ` Sohil Mehta
  2024-01-26 11:54 ` [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
  2024-01-26 11:54 ` [PATCH bpf v2 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
  2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-01-26 11:54 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Sohil Mehta, Yonghong Song, houtao1

From: Hou Tao <houtao1@huawei.com>

Moving is_vsyscall_vaddr() into asm/vsyscall.h to make it available for
copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 arch/x86/include/asm/vsyscall.h | 10 ++++++++++
 arch/x86/mm/fault.c             |  9 ---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index ab60a71a8dcb9..472f0263dbc61 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -4,6 +4,7 @@
 
 #include <linux/seqlock.h>
 #include <uapi/asm/vsyscall.h>
+#include <asm/page_types.h>
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 extern void map_vsyscall(void);
@@ -24,4 +25,13 @@ static inline bool emulate_vsyscall(unsigned long error_code,
 }
 #endif
 
+/*
+ * The (legacy) vsyscall page is the long page in the kernel portion
+ * of the address space that has user-accessible permissions.
+ */
+static inline bool is_vsyscall_vaddr(unsigned long vaddr)
+{
+	return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
+}
+
 #endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 679b09cfe241c..d6375b3c633bc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -798,15 +798,6 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 	show_opcodes(regs, loglvl);
 }
 
-/*
- * The (legacy) vsyscall page is the long page in the kernel portion
- * of the address space that has user-accessible permissions.
- */
-static bool is_vsyscall_vaddr(unsigned long vaddr)
-{
-	return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
-}
-
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		       unsigned long address, u32 pkey, int si_code)
-- 
2.29.2



* [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-26 11:54 [PATCH bpf v2 0/3] Fix the read of vsyscall page through bpf Hou Tao
  2024-01-26 11:54 ` [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h Hou Tao
@ 2024-01-26 11:54 ` Hou Tao
  2024-01-29 23:50   ` Sohil Mehta
  2024-01-26 11:54 ` [PATCH bpf v2 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
  2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-01-26 11:54 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Sohil Mehta, Yonghong Song, houtao1

From: Hou Tao <houtao1@huawei.com>

When trying to use copy_from_kernel_nofault() to read vsyscall page
through a bpf program, the following oops was reported:

  BUG: unable to handle page fault for address: ffffffffff600000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
  ......
  Call Trace:
   <TASK>
   ? copy_from_kernel_nofault+0x6f/0x110
   bpf_probe_read_kernel+0x1d/0x50
   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
   trace_call_bpf+0xc5/0x1c0
   perf_call_bpf_enter.isra.0+0x69/0xb0
   perf_syscall_enter+0x13e/0x200
   syscall_trace_enter+0x188/0x1c0
   do_syscall_64+0xb5/0xe0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
  ......
  ---[ end trace 0000000000000000 ]---

It seems the occurrence of the oops depends on the SMAP feature of the
CPU. It happens as follows: a bpf program uses bpf_probe_read_kernel()
to read from the vsyscall page, bpf_probe_read_kernel() invokes
copy_from_kernel_nofault() in turn and then invokes __get_user_asm().
Because the vsyscall page address is not readable for kernel space,
a page fault exception is triggered accordingly. handle_page_fault()
considers the vsyscall page address as a userspace address instead of a
kernel space address, so the fix-up set up by bpf isn't applied. Because
the CPU has the SMAP feature and the access happens in kernel mode,
page_fault_oops() is invoked and an oops happens. If there is no SMAP
feature, the fix-up set up by bpf will be applied and
copy_from_kernel_nofault() will return -EFAULT instead.

Considering handle_page_fault() has already considered the vsyscall page
address as a userspace address, fix the problem by disallowing vsyscall
page read for copy_from_kernel_nofault().

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 arch/x86/mm/maccess.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
index 6993f026adec9..d9272e1db5224 100644
--- a/arch/x86/mm/maccess.c
+++ b/arch/x86/mm/maccess.c
@@ -3,6 +3,8 @@
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 
+#include <asm/vsyscall.h>
+
 #ifdef CONFIG_X86_64
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
@@ -15,6 +17,13 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
 		return false;
 
+	/* Also consider the vsyscall page as userspace address. Otherwise,
+	 * reading the vsyscall page in copy_from_kernel_nofault() may
+	 * trigger an oops due to an unhandled page fault.
+	 */
+	if (is_vsyscall_vaddr(vaddr))
+		return false;
+
 	/*
 	 * Allow everything during early boot before 'x86_virt_bits'
 	 * is initialized.  Needed for instruction decoding in early
-- 
2.29.2



* [PATCH bpf v2 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-26 11:54 [PATCH bpf v2 0/3] Fix the read of vsyscall page through bpf Hou Tao
  2024-01-26 11:54 ` [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h Hou Tao
  2024-01-26 11:54 ` [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
@ 2024-01-26 11:54 ` Hou Tao
  2024-01-26 19:36   ` Yonghong Song
  2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-01-26 11:54 UTC (permalink / raw)
  To: x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Sohil Mehta, Yonghong Song, houtao1

From: Hou Tao <houtao1@huawei.com>

Under x86-64, when using bpf_probe_read_kernel{_str}() or
bpf_probe_read{_str}() to read the vsyscall page, the read may trigger
an oops, so add a test case to ensure that the problem is fixed.
Besides the four bpf helpers mentioned above, the read of the vsyscall
page is also tested by using bpf_probe_read_user{_str}() and
bpf_copy_from_user{_task}().

The test case passes the address of vsyscall page to these six helpers
and checks whether the returned values are expected:

1) For bpf_probe_read_kernel{_str}()/bpf_probe_read{_str}(), the
   expected return value is -ERANGE as shown below:

bpf_probe_read_kernel_common
  copy_from_kernel_nofault
    // false, return -ERANGE
    copy_from_kernel_nofault_allowed

2) For bpf_probe_read_user{_str}(), the expected return value is -EFAULT
   as shown below:

bpf_probe_read_user_common
  copy_from_user_nofault
    // false, return -EFAULT
    __access_ok

3) For bpf_copy_from_user(), the expected return value is -EFAULT:

// return -EFAULT
bpf_copy_from_user
  copy_from_user
    _copy_from_user
      // return false
      access_ok

4) For bpf_copy_from_user_task(), the expected return value is -EFAULT:

// return -EFAULT
bpf_copy_from_user_task
  access_process_vm
    // return 0
    vma_lookup()
    // return 0
    expand_stack()

The occurrence of the oops depends on the availability of the CPU SMAP
[1] feature, and there are three possible configurations of the
vsyscall page in the boot cmd-line: vsyscall={xonly|none|emulate}, so
there are six possible combinations in total. The test case passes
under all of these combinations.

[1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/read_vsyscall.c  | 57 +++++++++++++++++++
 .../selftests/bpf/progs/read_vsyscall.c       | 45 +++++++++++++++
 2 files changed, 102 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
new file mode 100644
index 0000000000000..3405923fe4e65
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include "test_progs.h"
+#include "read_vsyscall.skel.h"
+
+#if defined(__x86_64__)
+/* For VSYSCALL_ADDR */
+#include <asm/vsyscall.h>
+#else
+/* To prevent build failure on non-x86 arch */
+#define VSYSCALL_ADDR 0UL
+#endif
+
+struct read_ret_desc {
+	const char *name;
+	int ret;
+} all_read[] = {
+	{ .name = "probe_read_kernel", .ret = -ERANGE },
+	{ .name = "probe_read_kernel_str", .ret = -ERANGE },
+	{ .name = "probe_read", .ret = -ERANGE },
+	{ .name = "probe_read_str", .ret = -ERANGE },
+	{ .name = "probe_read_user", .ret = -EFAULT },
+	{ .name = "probe_read_user_str", .ret = -EFAULT },
+	{ .name = "copy_from_user", .ret = -EFAULT },
+	{ .name = "copy_from_user_task", .ret = -EFAULT },
+};
+
+void test_read_vsyscall(void)
+{
+	struct read_vsyscall *skel;
+	unsigned int i;
+	int err;
+
+#if !defined(__x86_64__)
+	test__skip();
+	return;
+#endif
+	skel = read_vsyscall__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
+		return;
+
+	skel->bss->target_pid = getpid();
+	err = read_vsyscall__attach(skel);
+	if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
+		goto out;
+
+	/* userspace may not have the vsyscall page due to LEGACY_VSYSCALL_NONE,
+	 * but it doesn't affect the returned error codes.
+	 */
+	skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
+	usleep(1);
+
+	for (i = 0; i < ARRAY_SIZE(all_read); i++)
+		ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret, all_read[i].name);
+out:
+	read_vsyscall__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/read_vsyscall.c b/tools/testing/selftests/bpf/progs/read_vsyscall.c
new file mode 100644
index 0000000000000..986f96687ae15
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/read_vsyscall.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+int target_pid = 0;
+void *user_ptr = 0;
+int read_ret[8];
+
+char _license[] SEC("license") = "GPL";
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+int do_probe_read(void *ctx)
+{
+	char buf[8];
+
+	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+		return 0;
+
+	read_ret[0] = bpf_probe_read_kernel(buf, sizeof(buf), user_ptr);
+	read_ret[1] = bpf_probe_read_kernel_str(buf, sizeof(buf), user_ptr);
+	read_ret[2] = bpf_probe_read(buf, sizeof(buf), user_ptr);
+	read_ret[3] = bpf_probe_read_str(buf, sizeof(buf), user_ptr);
+	read_ret[4] = bpf_probe_read_user(buf, sizeof(buf), user_ptr);
+	read_ret[5] = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+	return 0;
+}
+
+SEC("fentry.s/" SYS_PREFIX "sys_nanosleep")
+int do_copy_from_user(void *ctx)
+{
+	char buf[8];
+
+	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+		return 0;
+
+	read_ret[6] = bpf_copy_from_user(buf, sizeof(buf), user_ptr);
+	read_ret[7] = bpf_copy_from_user_task(buf, sizeof(buf), user_ptr,
+					      bpf_get_current_task_btf(), 0);
+
+	return 0;
+}
-- 
2.29.2



* Re: [PATCH bpf v2 3/3] selftest/bpf: Test the read of vsyscall page under x86-64
  2024-01-26 11:54 ` [PATCH bpf v2 3/3] selftest/bpf: Test the read of vsyscall page under x86-64 Hou Tao
@ 2024-01-26 19:36   ` Yonghong Song
  0 siblings, 0 replies; 9+ messages in thread
From: Yonghong Song @ 2024-01-26 19:36 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Sohil Mehta, houtao1


On 1/26/24 3:54 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Under x86-64, when using bpf_probe_read_kernel{_str}() or
> bpf_probe_read{_str}() to read vsyscall page, the read may trigger oops,
> so add one test case to ensure that the problem is fixed. Beside those
> four bpf helpers mentioned above, testing the read of vsyscall page by
> using bpf_probe_read_user{_str} and bpf_copy_from_user{_task}() as well.
>
> The test case passes the address of vsyscall page to these six helpers
> and checks whether the returned values are expected:
>
> 1) For bpf_probe_read_kernel{_str}()/bpf_probe_read{_str}(), the
>     expected return value is -ERANGE as shown below:
>
> bpf_probe_read_kernel_common
>    copy_from_kernel_nofault
>      // false, return -ERANGE
>      copy_from_kernel_nofault_allowed
>
> 2) For bpf_probe_read_user{_str}(), the expected return value is -EFAULT
>     as show below:
>
> bpf_probe_read_user_common
>    copy_from_user_nofault
>      // false, return -EFAULT
>      __access_ok
>
> 3) For bpf_copy_from_user(), the expected return value is -EFAULT:
>
> // return -EFAULT
> bpf_copy_from_user
>    copy_from_user
>      _copy_from_user
>        // return false
>        access_ok
>
> 4) For bpf_copy_from_user_task(), the expected return value is -EFAULT:
>
> // return -EFAULT
> bpf_copy_from_user_task
>    access_process_vm
>      // return 0
>      vma_lookup()
>      // return 0
>      expand_stack()
>
> The occurrence of oops depends on the availability of CPU SMAP [1]
> feature and there are three possible configurations of vsyscall page in
> boot cmd-line: vsyscall={xonly|none|emulate}, so there are totally six
> possible combinations. Under all these combinations, the running of the
> test case succeeds.
>
> [1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>

The first two patches look good to me, but I think it would be better
if the x86 folks could ack them. The selftest patch LGTM.

Acked-by: Yonghong Song <yonghong.song@linux.dev>



* Re: [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-26 11:54 ` [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() Hou Tao
@ 2024-01-29 23:50   ` Sohil Mehta
  2024-01-30  4:18     ` Hou Tao
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2024-01-29 23:50 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Yonghong Song, houtao1

Hi Hou Tao,

I agree with your approach in this patch. Please see some comments below.

On 1/26/2024 3:54 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> When trying to use copy_from_kernel_nofault() to read vsyscall page
> through a bpf program, the following oops was reported:
> 
>   BUG: unable to handle page fault for address: ffffffffff600000
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
>   Oops: 0000 [#1] PREEMPT SMP PTI
>   CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>   RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
>   ......
>   Call Trace:
>    <TASK>
>    ? copy_from_kernel_nofault+0x6f/0x110
>    bpf_probe_read_kernel+0x1d/0x50
>    bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
>    trace_call_bpf+0xc5/0x1c0
>    perf_call_bpf_enter.isra.0+0x69/0xb0
>    perf_syscall_enter+0x13e/0x200
>    syscall_trace_enter+0x188/0x1c0
>    do_syscall_64+0xb5/0xe0
>    entry_SYSCALL_64_after_hwframe+0x6e/0x76
>    </TASK>
>   ......
>   ---[ end trace 0000000000000000 ]---
> 


> It seems the occurrence of oops depends on SMAP feature of CPU. It
> happens as follow: a bpf program uses bpf_probe_read_kernel() to read
> from vsyscall page, bpf_probe_read_kernel() invokes
> copy_from_kernel_nofault() in turn and then invokes __get_user_asm().
> Because the vsyscall page address is not readable for kernel space,
> a page fault exception is triggered accordingly, handle_page_fault()
> considers the vsyscall page address as a userspace address instead of a
> kernel space address, so the fix-up set-up by bpf isn't applied. Because
> the CPU has SMAP feature and the access happens in kernel mode, so
> page_fault_oops() is invoked and an oops happens. If these is no SMAP
> feature, the fix-up set-up by bpf will be applied and
> copy_from_kernel_nofault() will return -EFAULT instead.
> 

I find this paragraph to be a bit hard to follow. I think we can
minimize the reference to SMAP here since it is only helping detect
cross address space accesses.  How about something like the following:

The oops is triggered when:

1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
page and invokes copy_from_kernel_nofault() which in turn calls
__get_user_asm().

2) Because the vsyscall page address is not readable from kernel space,
a page fault exception is triggered accordingly.

3) handle_page_fault() considers the vsyscall page address as a user
space address instead of a kernel space address. This results in the
fix-up setup by bpf not being applied and a page_fault_oops() is invoked
due to SMAP.

> Considering handle_page_fault() has already considered the vsyscall page
> address as a userspace address, fix the problem by disallowing vsyscall
> page read for copy_from_kernel_nofault().
> 

I agree, following the same approach as handle_page_fault() seems
reasonable.

> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
> Reported-by: xingwei lee <xrivendell7@gmail.com>
> Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  arch/x86/mm/maccess.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
> index 6993f026adec9..d9272e1db5224 100644
> --- a/arch/x86/mm/maccess.c
> +++ b/arch/x86/mm/maccess.c
> @@ -3,6 +3,8 @@
>  #include <linux/uaccess.h>
>  #include <linux/kernel.h>
>  
> +#include <asm/vsyscall.h>
> +
>  #ifdef CONFIG_X86_64
>  bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>  {
> @@ -15,6 +17,13 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>  	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>  		return false;
>  
> +	/* Also consider the vsyscall page as userspace address. Otherwise,
> +	 * reading the vsyscall page in copy_from_kernel_nofault() may
> +	 * trigger an oops due to an unhandled page fault.
> +	 */

x86 prefers a slightly different style for multi-line comments. Please
refer to https://docs.kernel.org/process/maintainer-tip.html#comment-style.

How about rewording the above as:

/*
 * Reading from the vsyscall page may cause an unhandled fault in
 * certain cases.  Though it is at an address above TASK_SIZE_MAX, it is
 * usually considered as a user space address.
 */


> +	if (is_vsyscall_vaddr(vaddr))
> +		return false;
> +

It would have been convenient if we had a common check for whether a
particular address is a kernel address or not. fault_in_kernel_space()
serves that purpose to an extent in other places.

I thought we could rename fault_in_kernel_space() to
vaddr_in_kernel_space() and use it here. But the check in
copy_from_kernel_nofault_allowed() includes the user guard page as well.
So the checks wouldn't exactly be the same.

I am unsure of the implications if we get rid of that difference. Maybe
we can leave it as-is for now unless someone else chimes in.

Sohil



* Re: [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
  2024-01-26 11:54 ` [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h Hou Tao
@ 2024-01-29 23:56   ` Sohil Mehta
  2024-01-30  4:20     ` Hou Tao
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2024-01-29 23:56 UTC (permalink / raw)
  To: Hou Tao, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Yonghong Song, houtao1

On 1/26/2024 3:54 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Moving is_vsyscall_vaddr() into asm/vsyscall.h to make it available for

s/Moving/Move

> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  arch/x86/include/asm/vsyscall.h | 10 ++++++++++
>  arch/x86/mm/fault.c             |  9 ---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 


Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>


* Re: [PATCH bpf v2 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
  2024-01-29 23:50   ` Sohil Mehta
@ 2024-01-30  4:18     ` Hou Tao
  0 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2024-01-30  4:18 UTC (permalink / raw)
  To: Sohil Mehta, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Yonghong Song, houtao1

Hi,

On 1/30/2024 7:50 AM, Sohil Mehta wrote:
> Hi Hou Tao,
>
> I agree with your approach in this patch. Please see some comments below.
>
> On 1/26/2024 3:54 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When trying to use copy_from_kernel_nofault() to read vsyscall page
>> through a bpf program, the following oops was reported:

[SNIP]
>> It seems the occurrence of oops depends on SMAP feature of CPU. It
>> happens as follow: a bpf program uses bpf_probe_read_kernel() to read
>> from vsyscall page, bpf_probe_read_kernel() invokes
>> copy_from_kernel_nofault() in turn and then invokes __get_user_asm().
>> Because the vsyscall page address is not readable for kernel space,
>> a page fault exception is triggered accordingly, handle_page_fault()
>> considers the vsyscall page address as a userspace address instead of a
>> kernel space address, so the fix-up set-up by bpf isn't applied. Because
>> the CPU has SMAP feature and the access happens in kernel mode, so
>> page_fault_oops() is invoked and an oops happens. If these is no SMAP
>> feature, the fix-up set-up by bpf will be applied and
>> copy_from_kernel_nofault() will return -EFAULT instead.
>>
> I find this paragraph to be a bit hard to follow. I think we can
> minimize the reference to SMAP here since it is only helping detect
> cross address space accesses.  How about something like the following:
>
> The oops is triggered when:
>
> 1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
> page and invokes copy_from_kernel_nofault() which in turn calls
> __get_user_asm().
>
> 2) Because the vsyscall page address is not readable from kernel space,
> a page fault exception is triggered accordingly.
>
> 3) handle_page_fault() considers the vsyscall page address as a user
> space address instead of a kernel space address. This results in the
> fix-up setup by bpf not being applied and a page_fault_oops() is invoked
> due to SMAP.

Thanks for the rephrasing. It is much better now.
>> Considering handle_page_fault() has already considered the vsyscall page
>> address as a userspace address, fix the problem by disallowing vsyscall
>> page read for copy_from_kernel_nofault().
>>
> I agree, following the same approach as handle_page_fault() seems
> reasonable.
>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
>> Reported-by: xingwei lee <xrivendell7@gmail.com>
>> Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  arch/x86/mm/maccess.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
>> index 6993f026adec9..d9272e1db5224 100644
>> --- a/arch/x86/mm/maccess.c
>> +++ b/arch/x86/mm/maccess.c
>> @@ -3,6 +3,8 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/kernel.h>
>>  
>> +#include <asm/vsyscall.h>
>> +
>>  #ifdef CONFIG_X86_64
>>  bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  {
>> @@ -15,6 +17,13 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>>  		return false;
>>  
>> +	/* Also consider the vsyscall page as userspace address. Otherwise,
>> +	 * reading the vsyscall page in copy_from_kernel_nofault() may
>> +	 * trigger an oops due to an unhandled page fault.
>> +	 */
> x86 prefers a slightly different style for multi-line comments. Please
> refer to https://docs.kernel.org/process/maintainer-tip.html#comment-style.

I see. Will update.
>
> How about rewording the above as:
>
> /*
>  * Reading from the vsyscall page may cause an unhandled fault in
>  * certain cases.  Though it is at an address above TASK_SIZE_MAX, it is
>  * usually considered as a user space address.
>  */

Thanks for the rewording. Will do in v3.
>
>> +	if (is_vsyscall_vaddr(vaddr))
>> +		return false;
>> +
> It would have been convenient if we had a common check for whether a
> particular address is a kernel address or not. fault_in_kernel_space()
> serves that purpose to an extent in other places.
>
> I thought we could rename fault_in_kernel_space() to
> vaddr_in_kernel_space() and use it here. But the check in
> copy_from_kernel_nofault_allowed() includes the user guard page as well.
> So the checks wouldn't exactly be the same.
>
> I am unsure of the implications if we get rid of that difference. Maybe
> we can leave it as-is for now unless someone else chimes in.

There is another difference between fault_in_kernel_space() and
copy_from_kernel_nofault_allowed(): fault_in_kernel_space() uses
"address >= TASK_SIZE_MAX" to check for a kernel space address, but
copy_from_kernel_nofault_allowed() uses "vaddr >= TASK_SIZE_MAX +
PAGE_SIZE", so I prefer to keep it as-is.
>
> Sohil



* Re: [PATCH bpf v2 1/3] x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
  2024-01-29 23:56   ` Sohil Mehta
@ 2024-01-30  4:20     ` Hou Tao
  0 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2024-01-30  4:20 UTC (permalink / raw)
  To: Sohil Mehta, x86, bpf
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, linux-kernel,
	xingwei lee, Jann Horn, Yonghong Song, houtao1



On 1/30/2024 7:56 AM, Sohil Mehta wrote:
> On 1/26/2024 3:54 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Moving is_vsyscall_vaddr() into asm/vsyscall.h to make it available for
> s/Moving/Move

Will update in v3.
>
>> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  arch/x86/include/asm/vsyscall.h | 10 ++++++++++
>>  arch/x86/mm/fault.c             |  9 ---------
>>  2 files changed, 10 insertions(+), 9 deletions(-)
>>
>
> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>

Thank you for the review and all of your suggestions.


