All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2 v2] [GIT PULL (take two)] tracing: Two more fixes
@ 2019-02-15 17:47 Steven Rostedt
  2019-02-15 17:47 ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Steven Rostedt
  2019-02-15 17:47 ` [PATCH 2/2 v2] tracing: Fix number of entries in trace header Steven Rostedt
  0 siblings, 2 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-15 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton


Linus,

Two more tracing fixes

 - Have kprobes not use copy_from_user() to access kernel addresses,
   because kprobes can legitimately poke at bad kernel memory, which
   will fault. Copy from user code should never fault in kernel space.
   Using probe_mem_read() can handle kernel address space faulting.

 - Put back the entries counter in the tracing output that was accidentally
   removed.

Please pull the latest trace-v5.0-rc4-3 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-v5.0-rc4-3

Tag SHA1: 633d24cef06bed019a34e84ff82c20782449ee5c
Head SHA1: 9e7382153f80ba45a0bbcd540fb77d4b15f6e966


Changbin Du (1):
      kprobe: Do not use uaccess functions to access kernel memory that can fault

Quentin Perret (1):
      tracing: Fix number of entries in trace header

----
 kernel/trace/trace.c        |  2 ++
 kernel/trace/trace_kprobe.c | 10 +---------
 2 files changed, 3 insertions(+), 9 deletions(-)

Changes from v1:

  - Explain that the kprobe change to probe_mem_read() is because
    copy_from_user() can not handle bad memory in kernel space,
    which kprobes can cause. The code itself is unchanged.
    I tweaked the change log of that patch to clarify that as well.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 17:47 [PATCH 0/2 v2] [GIT PULL (take two)] tracing: Two more fixes Steven Rostedt
@ 2019-02-15 17:47 ` Steven Rostedt
  2019-02-15 17:55   ` Linus Torvalds
  2019-02-21  7:52   ` Masami Hiramatsu
  2019-02-15 17:47 ` [PATCH 2/2 v2] tracing: Fix number of entries in trace header Steven Rostedt
  1 sibling, 2 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-15 17:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, stable, Changbin Du

From: Changbin Du <changbin.du@gmail.com>

The userspace can ask kprobe to intercept strings at any memory address,
including invalid kernel address. In this case, fetch_store_strlen()
would crash since it uses general usercopy function, and user access
functions are no longer allowed to access kernel memory.

For example, we can crash the kernel by doing something as below:

$ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'

[  103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
[  103.622104] general protection fault: 0000 [#1] SMP PTI
[  103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96
[  103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
[  103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
[  103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63
 ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
[  103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
[  103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
[  103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
[  103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  103.644040] FS:  0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
[  103.646019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
[  103.649147] Call Trace:
[  103.649781]  ? sched_clock_cpu+0xc/0xa0
[  103.650747]  ? do_sys_open+0x5/0x220
[  103.651635]  kprobe_trace_func+0x303/0x380
[  103.652645]  ? do_sys_open+0x5/0x220
[  103.653528]  kprobe_dispatcher+0x45/0x50
[  103.654682]  ? do_sys_open+0x1/0x220
[  103.655875]  kprobe_ftrace_handler+0x90/0xf0
[  103.657282]  ftrace_ops_assist_func+0x54/0xf0
[  103.658564]  ? __call_rcu+0x1dc/0x280
[  103.659482]  0xffffffffc00000bf
[  103.660384]  ? __ia32_sys_open+0x20/0x20
[  103.661682]  ? do_sys_open+0x1/0x220
[  103.662863]  do_sys_open+0x5/0x220
[  103.663988]  do_syscall_64+0x60/0x210
[  103.665201]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  103.666862] RIP: 0033:0x7fc22fadccdd
[  103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff
 ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
[  103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[  103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
[  103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
[  103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
[  103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  103.688056] Modules linked in:
[  103.689131] ---[ end trace 43792035c28984a1 ]---

This can be fixed by using probe_mem_read() instead, as it can handle faulting
kernel memory addresses, which kprobes can legitimately do.

Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com

Cc: stable@vger.kernel.org
Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace_kprobe.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5fb09ebba8b..9eaf07f99212 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = {
 static nokprobe_inline int
 fetch_store_strlen(unsigned long addr)
 {
-	mm_segment_t old_fs;
 	int ret, len = 0;
 	u8 c;
 
-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-
 	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
 		len++;
 	} while (c && ret == 0 && len < MAX_STRING_SIZE);
 
-	pagefault_enable();
-	set_fs(old_fs);
-
 	return (ret < 0) ? ret : len;
 }
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 2/2 v2] tracing: Fix number of entries in trace header
  2019-02-15 17:47 [PATCH 0/2 v2] [GIT PULL (take two)] tracing: Two more fixes Steven Rostedt
  2019-02-15 17:47 ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Steven Rostedt
@ 2019-02-15 17:47 ` Steven Rostedt
  1 sibling, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-15 17:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Joel Fernandes,
	stable, Quentin Perret

From: Quentin Perret <quentin.perret@arm.com>

The following commit

  441dae8f2f29 ("tracing: Add support for display of tgid in trace output")

removed the call to print_event_info() from print_func_help_header_irq()
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the orginal
behaviour.

Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com

Acked-by: Joel Fernandes <joelaf@google.com>
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c521b7347482..c4238b441624 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3384,6 +3384,8 @@ static void print_func_help_header_irq(struct trace_buffer *buf, struct seq_file
 	const char tgid_space[] = "          ";
 	const char space[] = "  ";
 
+	print_event_info(buf, m);
+
 	seq_printf(m, "#                          %s  _-----=> irqs-off\n",
 		   tgid ? tgid_space : space);
 	seq_printf(m, "#                          %s / _----=> need-resched\n",
-- 
2.20.1



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 17:47 ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Steven Rostedt
@ 2019-02-15 17:55   ` Linus Torvalds
  2019-02-15 22:15     ` Steven Rostedt
  2019-02-21  7:52   ` Masami Hiramatsu
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-15 17:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du

On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: Changbin Du <changbin.du@gmail.com>
>
> The userspace can ask kprobe to intercept strings at any memory address,
> including invalid kernel address.

Again, this is not about a "kernel address" at all.

It's neither a kernel address _nor_ a user address.  It's an invalid
address entirely, and there is nothing that makes it "kernel".

Please don't talk about "invalid kernel addresses" when it is no such thing.

                      Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 17:55   ` Linus Torvalds
@ 2019-02-15 22:15     ` Steven Rostedt
  2019-02-15 23:49       ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-15 22:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski

On Fri, 15 Feb 2019 09:55:32 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > From: Changbin Du <changbin.du@gmail.com>
> >
> > The userspace can ask kprobe to intercept strings at any memory address,
> > including invalid kernel address.  
> 
> Again, this is not about a "kernel address" at all.
> 
> It's neither a kernel address _nor_ a user address.  It's an invalid
> address entirely, and there is nothing that makes it "kernel".
> 
> Please don't talk about "invalid kernel addresses" when it is no such thing.
> 

Ah, I see what you are saying. The example of the bug that the patch
fixed used a non-canonical address, which is "garbage", and not kernel
or user space. Point taken.

But the issue is deeper than that. This code never bugged until the
following commit was added:

9da3f2b7405 "x86/fault: BUG() when uaccess helpers fault on kernel addresses"

Before that commit, this worked fine.

In fact, we can trigger the same bug (with a slightly different
message), if we use a kernel space address.

To test this, I added the following variable somewhere in the code:

 void *sdr_ptr = 0xffffffff70112200;

And then did the following:

# echo 'p:fault do_sys_open arg=+0(@sdr_ptr):string' > /debug/tracing/kprobe_events

Which will read the sdr_ptr variable just like the example did with +0(+0(%si)):string.
But this time I control the what address it is triggered on.

And it crashed with:

[ 1447.876799] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess

The above message in the bug in the patch was:
"BUG: GPF in non-whitelisted uaccess (non-canonical address?)"

[ 1447.885053] BUG: unable to handle kernel paging request at ffffffff70112200
[ 1447.891997] #PF error: [normal kernel read fault]
[ 1447.896695] PGD 8a212067 P4D 8a212067 PUD 0 
[ 1447.900679] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess
[ 1447.900967] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1447.913394] CPU: 7 PID: 1688 Comm: cat Not tainted 5.0.0-rc5-test+ #28
[ 1447.919905] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[ 1447.928843] RIP: 0010:process_fetch_insn+0x1a0/0x470
[ 1447.933804] Code: ff f0 80 48 03 80 83 80 18 1a 00 00 01 31 c9 eb 10 48 83 c2 01 85 c0 75 1f 81 f9 ff 0f 00 00 7f 17 0f 1f 00 0f ae e8 44 89 e0 <40> 8a 32 0f 1f 00 83 c1 01 40 84 f6 75 d9 65 48 8b 14 25 00 5c 01
[ 1447.952520] RSP: 0018:ffffb77b80673d08 EFLAGS: 00010246
[ 1447.957736] RAX: 0000000000000000 RBX: ffff91bfb7ab2c30 RCX: 0000000000000000
[ 1447.964857] RDX: ffffffff70112200 RSI: ffffffff97267a80 RDI: 00007ffffffff000
[ 1447.971981] RBP: 0000000000000000 R08: ffffffff70112200 R09: ffff91c014518000
[ 1447.979105] R10: ffff91c01451801c R11: ffff91c0185e5800 R12: 0000000000000000
[ 1447.986226] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1447.993342] FS:  0000000000000000(0000) GS:ffff91c01a9c0000(0000) knlGS:0000000000000000
[ 1448.001417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1448.007157] CR2: ffffffff70112200 CR3: 00000000b3f32004 CR4: 00000000001606e0
[ 1448.014280] Call Trace:
[ 1448.016737]  kprobe_trace_func+0x278/0x360
[ 1448.020834]  ? preempt_count_sub+0x98/0xe0
[ 1448.024931]  ? do_sys_open+0x5/0x220
[ 1448.028503]  kprobe_dispatcher+0x36/0x50
[ 1448.032426]  ? do_sys_open+0x1/0x220
[ 1448.036002]  kprobe_ftrace_handler+0xb5/0x120
[ 1448.040356]  ftrace_ops_assist_func+0x87/0x110
[ 1448.044797]  0xffffffffc06a30bf
[ 1448.047939]  ? __ia32_sys_open+0x20/0x20
[ 1448.051860]  ? do_syscall_64+0x1c/0x160
[ 1448.055694]  ? do_sys_open+0x1/0x220
[ 1448.059268]  do_sys_open+0x5/0x220
[ 1448.062672]  do_syscall_64+0x60/0x160
[ 1448.066335]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Which triggered on the address: 0xffffffff70112200

Which is perfectly canonical.

The reason I use "kernel address" is because this became an issue after
"x86/fault: BUG() when uaccess helpers fault on kernel addresses" was
added and that explicitly states "kernel address".

That patch added:

+       /*
+        * This is a faulting memory access in kernel space, on a kernel
+        * address, in a usercopy function. This can e.g. be caused by improper
+        * use of helpers like __put_user and by improper attempts to access
+        * userspace addresses in KERNEL_DS regions.
+        * The one (semi-)legitimate exception are probe_kernel_{read,write}(),
+        * which can be invoked from places like kgdb, /dev/mem (for reading)
+        * and privileged BPF code (for reading).
+        * The probe_kernel_*() functions set the kernel_uaccess_faults_ok flag
+        * to tell us that faulting on kernel addresses, and even noncanonical
+        * addresses, in a userspace accessor does not necessarily imply a
+        * kernel bug, root might just be doing weird stuff.
+        */
+       if (current->kernel_uaccess_faults_ok)
+               return false;
+
+       /* This is bad. Refuse the fixup so that we go into die(). */
+       if (trapnr == X86_TRAP_PF) {
+               pr_emerg("BUG: pagefault on kernel address 0x%lx in non-whitelisted uaccess\n",
+                        fault_addr);
+       } else {
+               pr_emerg("BUG: GPF in non-whitelisted uaccess (non-canonical address?)\n");
+       }

Where this path triggered by the kprobe using copy_from_user(). So
yeah, it can happen if it triggered on a non-canonical address (which is
just garbage), but it can also trigger if it's a kernel address that
isn't mapped either.

So the comment in the code is not 100% accurate, because it isn't just
a kernel address, but also a non-canonical one. Something tells me that
the change log of the commit that checks this isn't written well
either. What exactly can't be done now with uaccess code? I'm guessing
that it should only be allowed to fault on user space addresses? So
should I change this commit log to:

 kprobe: Do not use uaccess function to access non-user address that can fault

And change all the "kernel address" mentions to "non-user address" as
non-user covers both kernel address and non-canonical?


-- Steve

This patch allows me to modify the sdr_ptr as well from the tracing directory:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2cf3c747a357..292fe2ef0a45 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7928,6 +7928,45 @@ static const struct file_operations buffer_percent_fops = {
 	.llseek		= default_llseek,
 };
 
+void *sdr_ptr = 0xffffffff70112200;
+
+static ssize_t
+sdr_ptr_read(struct file *filp, char __user *ubuf,
+	       size_t cnt, loff_t *ppos)
+{
+	char buf[64];
+	int r;
+
+	r = sprintf(buf, "%px\n", sdr_ptr);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+sdr_ptr_write(struct file *filp, const char __user *ubuf,
+		size_t cnt, loff_t *ppos)
+{
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 0, &val);
+	if (ret)
+		return ret;
+
+	sdr_ptr = val;
+
+	(*ppos)++;
+
+	return cnt;
+}
+
+static const struct file_operations sdr_ptr_fops = {
+	.open		= tracing_open_generic,
+	.read		= sdr_ptr_read,
+	.write		= sdr_ptr_write,
+	.llseek		= default_llseek,
+};
+
 struct dentry *trace_instance_dir;
 
 static void
@@ -8188,6 +8227,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	struct trace_event_file *file;
 	int cpu;
 
+	trace_create_file("sdr_ptr", 0644, d_tracer,
+			NULL, &sdr_ptr_fops);
+
 	trace_create_file("available_tracers", 0444, d_tracer,
 			tr, &show_traces_fops);
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 22:15     ` Steven Rostedt
@ 2019-02-15 23:49       ` Andy Lutomirski
  2019-02-16  0:19         ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-15 23:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski



> On Feb 15, 2019, at 2:15 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Fri, 15 Feb 2019 09:55:32 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>>> On Fri, Feb 15, 2019 at 9:49 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>> 
>>> From: Changbin Du <changbin.du@gmail.com>
>>> 
>>> The userspace can ask kprobe to intercept strings at any memory address,
>>> including invalid kernel address.  
>> 
>> Again, this is not about a "kernel address" at all.
>> 
>> It's neither a kernel address _nor_ a user address.  It's an invalid
>> address entirely, and there is nothing that makes it "kernel".
>> 
>> Please don't talk about "invalid kernel addresses" when it is no such thing.
>> 
> 
> Ah, I see what you are saying. The example of the bug that the patch
> fixed used a non-canonical address, which is "garbage", and not kernel
> or user space. Point taken.
> 
> But the issue is deeper than that. This code never bugged until the
> following commit was added:
> 
> 9da3f2b7405 "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
> 
> Before that commit, this worked fine.
> 
> In fact, we can trigger the same bug (with a slightly different
> message), if we use a kernel space address.
> 
> To test this, I added the following variable somewhere in the code:
> 
> void *sdr_ptr = 0xffffffff70112200;
> 
> And then did the following:
> 
> # echo 'p:fault do_sys_open arg=+0(@sdr_ptr):string' > /debug/tracing/kprobe_events
> 
> Which will read the sdr_ptr variable just like the example did with +0(+0(%si)):string.
> But this time I control the what address it is triggered on.
> 
> And it crashed with:
> 
> [ 1447.876799] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess
> 
> The above message in the bug in the patch was:
> "BUG: GPF in non-whitelisted uaccess (non-canonical address?)"
> 
> [ 1447.885053] BUG: unable to handle kernel paging request at ffffffff70112200
> [ 1447.891997] #PF error: [normal kernel read fault]
> [ 1447.896695] PGD 8a212067 P4D 8a212067 PUD 0 
> [ 1447.900679] BUG: pagefault on kernel address 0xffffffff70112200 in non-whitelisted uaccess
> [ 1447.900967] Oops: 0000 [#1] PREEMPT SMP PTI
> [ 1447.913394] CPU: 7 PID: 1688 Comm: cat Not tainted 5.0.0-rc5-test+ #28
> [ 1447.919905] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> [ 1447.928843] RIP: 0010:process_fetch_insn+0x1a0/0x470
> [ 1447.933804] Code: ff f0 80 48 03 80 83 80 18 1a 00 00 01 31 c9 eb 10 48 83 c2 01 85 c0 75 1f 81 f9 ff 0f 00 00 7f 17 0f 1f 00 0f ae e8 44 89 e0 <40> 8a 32 0f 1f 00 83 c1 01 40 84 f6 75 d9 65 48 8b 14 25 00 5c 01
> [ 1447.952520] RSP: 0018:ffffb77b80673d08 EFLAGS: 00010246
> [ 1447.957736] RAX: 0000000000000000 RBX: ffff91bfb7ab2c30 RCX: 0000000000000000
> [ 1447.964857] RDX: ffffffff70112200 RSI: ffffffff97267a80 RDI: 00007ffffffff000
> [ 1447.971981] RBP: 0000000000000000 R08: ffffffff70112200 R09: ffff91c014518000
> [ 1447.979105] R10: ffff91c01451801c R11: ffff91c0185e5800 R12: 0000000000000000
> [ 1447.986226] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1447.993342] FS:  0000000000000000(0000) GS:ffff91c01a9c0000(0000) knlGS:0000000000000000
> [ 1448.001417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1448.007157] CR2: ffffffff70112200 CR3: 00000000b3f32004 CR4: 00000000001606e0
> [ 1448.014280] Call Trace:
> [ 1448.016737]  kprobe_trace_func+0x278/0x360
> [ 1448.020834]  ? preempt_count_sub+0x98/0xe0
> [ 1448.024931]  ? do_sys_open+0x5/0x220
> [ 1448.028503]  kprobe_dispatcher+0x36/0x50
> [ 1448.032426]  ? do_sys_open+0x1/0x220
> [ 1448.036002]  kprobe_ftrace_handler+0xb5/0x120
> [ 1448.040356]  ftrace_ops_assist_func+0x87/0x110
> [ 1448.044797]  0xffffffffc06a30bf
> [ 1448.047939]  ? __ia32_sys_open+0x20/0x20
> [ 1448.051860]  ? do_syscall_64+0x1c/0x160
> [ 1448.055694]  ? do_sys_open+0x1/0x220
> [ 1448.059268]  do_sys_open+0x5/0x220
> [ 1448.062672]  do_syscall_64+0x60/0x160
> [ 1448.066335]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Which triggered on the address: 0xffffffff70112200
> 
> Which is perfectly canonical.
> 
> The reason I use "kernel address" is because this became an issue after
> "x86/fault: BUG() when uaccess helpers fault on kernel addresses" was
> added and that explicitly states "kernel address".
> 
> That patch added:
> 
> +       /*
> +        * This is a faulting memory access in kernel space, on a kernel
> +        * address, in a usercopy function. This can e.g. be caused by improper
> +        * use of helpers like __put_user and by improper attempts to access
> +        * userspace addresses in KERNEL_DS regions.
> +        * The one (semi-)legitimate exception are probe_kernel_{read,write}(),
> +        * which can be invoked from places like kgdb, /dev/mem (for reading)
> +        * and privileged BPF code (for reading).
> +        * The probe_kernel_*() functions set the kernel_uaccess_faults_ok flag
> +        * to tell us that faulting on kernel addresses, and even noncanonical
> +        * addresses, in a userspace accessor does not necessarily imply a
> +        * kernel bug, root might just be doing weird stuff.
> +        */
> +       if (current->kernel_uaccess_faults_ok)
> +               return false;
> +
> +       /* This is bad. Refuse the fixup so that we go into die(). */
> +       if (trapnr == X86_TRAP_PF) {
> +               pr_emerg("BUG: pagefault on kernel address 0x%lx in non-whitelisted uaccess\n",
> +                        fault_addr);
> +       } else {
> +               pr_emerg("BUG: GPF in non-whitelisted uaccess (non-canonical address?)\n");
> +       }
> 
> Where this path triggered by the kprobe using copy_from_user(). So
> yeah, it can happen if it triggered on a non-canonical address (which is
> just garbage), but it can also trigger if it's a kernel address that
> isn't mapped either.
> 
> So the comment in the code is not 100% accurate, because it isn't just
> a kernel address, but also a non-canonical one. Something tells me that
> the change log of the commit that checks this isn't written well
> either. What exactly can't be done now with uaccess code? I'm guessing
> that it should only be allowed to fault on user space addresses? So
> should I change this commit log to:
> 
> kprobe: Do not use uaccess function to access non-user address that can fault
> 
> And change all the "kernel address" mentions to "non-user address" as
> non-user covers both kernel address and non-canonical?
> 
> 
> -- Steve
> 
> This patch allows me to modify the sdr_ptr as well from the tracing directory:
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2cf3c747a357..292fe2ef0a45 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7928,6 +7928,45 @@ static const struct file_operations buffer_percent_fops = {
>    .llseek        = default_llseek,
> };
> 
> +void *sdr_ptr = 0xffffffff70112200;
> +
> +static ssize_t
> +sdr_ptr_read(struct file *filp, char __user *ubuf,
> +           size_t cnt, loff_t *ppos)
> +{
> +    char buf[64];
> +    int r;
> +
> +    r = sprintf(buf, "%px\n", sdr_ptr);
> +
> +    return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> +}
> +
> +static ssize_t
> +sdr_ptr_write(struct file *filp, const char __user *ubuf,
> +        size_t cnt, loff_t *ppos)
> +{
> +    unsigned long val;
> +    int ret;
> +
> +    ret = kstrtoul_from_user(ubuf, cnt, 0, &val);
> +    if (ret)
> +        return ret;
> +
> +    sdr_ptr = val;
> +
> +    (*ppos)++;
> +
> +    return cnt;
> +}
> +
> +static const struct file_operations sdr_ptr_fops = {
> +    .open        = tracing_open_generic,
> +    .read        = sdr_ptr_read,
> +    .write        = sdr_ptr_write,
> +    .llseek        = default_llseek,
> +};
> +
> struct dentry *trace_instance_dir;
> 
> static void
> @@ -8188,6 +8227,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
>    struct trace_event_file *file;
>    int cpu;
> 
> +    trace_create_file("sdr_ptr", 0644, d_tracer,
> +            NULL, &sdr_ptr_fops);
> +
>    trace_create_file("available_tracers", 0444, d_tracer,
>            tr, &show_traces_fops);
> 


I’m missing most of the context here, but even probe_kernel_...() is unwise for a totally untrustworthy address. It could be MMIO, for example.

If needed, we could come up with a safe-ish helper for tracing.  For direct-map addresses, probe_kernel_...() is probably okay.  Same for the current stack. Otherwise we could walk the page tables and check that the address is cacheable, I suppose, although this is slightly dubious if we don’t also check MTRRs. We could also check that the PA is in main memory, I suppose, although this may have unfortunate interactions with the MCE code.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 23:49       ` Andy Lutomirski
@ 2019-02-16  0:19         ` Steven Rostedt
  2019-02-16  1:32           ` Andy Lutomirski
  2019-02-18 17:58           ` Linus Torvalds
  0 siblings, 2 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-16  0:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski

On Fri, 15 Feb 2019 15:49:35 -0800
Andy Lutomirski <luto@amacapital.net> wrote:

> I’m missing most of the context here, but even probe_kernel_...() is
> unwise for a totally untrustworthy address. It could be MMIO, for
> example.

True, but kprobes are used like modules, and only allowed by root. They
are used to poke literally anywhere one wants. That's the entire
purpose of kprobes.

> 
> If needed, we could come up with a safe-ish helper for tracing.  For
> direct-map addresses, probe_kernel_...() is probably okay.  Same for
> the current stack. Otherwise we could walk the page tables and check
> that the address is cacheable, I suppose, although this is slightly
> dubious if we don’t also check MTRRs. We could also check that the PA
> is in main memory, I suppose, although this may have unfortunate
> interactions with the MCE code.

I added you just because I wanted help getting the change log correct,
as that's what Linus was complaining about. I kept using "kernel
address" when the sample bug used for the patch was really a
non-canonical address (as Linus said, it's just garbage. Neither kernel
or user space). But I pointed out that this can also bug if the
address is canonical and in the kernel address space. The old code
didn't complain about non-canonical or kernel address faulting before
commit 9da3f2b7405, which only talks about kernel address space
faulting (which is why I only mentioned that in my messages).

Would changing all the mention of "kernel address" to "non user space"
be accurate?

 For reference:

  http://lkml.kernel.org/r/20190215174945.557218316@goodmis.org
  http://lkml.kernel.org/r/20190215142015.860423791@goodmis.org

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-16  0:19         ` Steven Rostedt
@ 2019-02-16  1:32           ` Andy Lutomirski
  2019-02-16  2:08             ` Steven Rostedt
  2019-02-18 17:58           ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-16  1:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski




> On Feb 15, 2019, at 4:19 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Fri, 15 Feb 2019 15:49:35 -0800
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>> I’m missing most of the context here, but even probe_kernel_...() is
>> unwise for a totally untrustworthy address. It could be MMIO, for
>> example.
> 
> True, but kprobes are used like modules, and only allowed by root. They
> are used to poke literally anywhere one wants. That's the entire
> purpose of kprobes.
> 
>> 
>> If needed, we could come up with a safe-ish helper for tracing.  For
>> direct-map addresses, probe_kernel_...() is probably okay.  Same for
>> the current stack. Otherwise we could walk the page tables and check
>> that the address is cacheable, I suppose, although this is slightly
>> dubious if we don’t also check MTRRs. We could also check that the PA
>> is in main memory, I suppose, although this may have unfortunate
>> interactions with the MCE code.
> 
> I added you just because I wanted help getting the change log correct,
> as that's what Linus was complaining about. I kept using "kernel
> address" when the sample bug used for the patch was really a
> non-canonical address (as Linus said, it's just garbage. Neither kernel
> or user space). But I pointed out that this can also bug if the
> address is canonical and in the kernel address space. The old code
> didn't complain about non-canonical or kernel address faulting before
> commit 9da3f2b7405, which only talks about kernel address space
> faulting (which is why I only mentioned that in my messages).
> 
> Would changing all the mention of "kernel address" to "non user space"
> be accurate?
> 

I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.

The old __copy seems likely to have always been a bit bogus.

BTW, what is this probe_mem_read() thing?  Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read().  Can you delete it and just use probe_kernel_read() directly?

> For reference:
> 
>  http://lkml.kernel.org/r/20190215174945.557218316@goodmis.org
>  http://lkml.kernel.org/r/20190215142015.860423791@goodmis.org
> 
> -- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-16  1:32           ` Andy Lutomirski
@ 2019-02-16  2:08             ` Steven Rostedt
  2019-02-16  2:14               ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-16  2:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski

On Fri, 15 Feb 2019 17:32:55 -0800
Andy Lutomirski <luto@amacapital.net> wrote:

> > I added you just because I wanted help getting the change log correct,
> > as that's what Linus was complaining about. I kept using "kernel
> > address" when the sample bug used for the patch was really a
> > non-canonical address (as Linus said, it's just garbage. Neither kernel
> > or user space). But I pointed out that this can also bug if the
> > address is canonical and in the kernel address space. The old code
> > didn't complain about non-canonical or kernel address faulting before
> > commit 9da3f2b7405, which only talks about kernel address space
> > faulting (which is why I only mentioned that in my messages).
> > 
> > Would changing all the mention of "kernel address" to "non user space"
> > be accurate?
> >   
> 
> I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.

But an non-canonical address is not a "kernel address", and that will
cause a bug too. This patch came about because it was changed that if
we do a uaccess on something other than a user space address and take a
fault (either because it was a non-canonical address, or a kernel
address), we BUG! Where before that one patch, it would just return a
fault.

> 
> The old __copy seems likely to have always been a bit bogus.
> 
> BTW, what is this probe_mem_read() thing?  Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read().  Can you delete it and just use probe_kernel_read() directly?

Well, the issue is that we have trace_probe_tmpl.h in that same
directory, which does the work for kprobes and uprobes. The
trace_kprobes.c defines all the functions for handling kprobes, and
trace_uprobes.c does all the handling of uprobes, then they include
trace_probe_tmpl.h which does the bulk of the work.

In the uprobes case, we have:

static nokprobe_inline int
probe_mem_read(void *dest, void *src, size_t size)
{
	void __user *vaddr = (void __force __user *)src;

	return copy_from_user(dest, vaddr, size) ? -EFAULT : 0;
}

Because that is adding probes on userspace code.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-16  2:08             ` Steven Rostedt
@ 2019-02-16  2:14               ` Andy Lutomirski
  2019-02-16  2:21                 ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-16  2:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski


> On Feb 15, 2019, at 6:08 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Fri, 15 Feb 2019 17:32:55 -0800
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> I added you just because I wanted help getting the change log correct,
>>> as that's what Linus was complaining about. I kept using "kernel
>>> address" when the sample bug used for the patch was really a
>>> non-canonical address (as Linus said, it's just garbage. Neither kernel
>>> or user space). But I pointed out that this can also bug if the
>>> address is canonical and in the kernel address space. The old code
>>> didn't complain about non-canonical or kernel address faulting before
>>> commit 9da3f2b7405, which only talks about kernel address space
>>> faulting (which is why I only mentioned that in my messages).
>>> 
>>> Would changing all the mention of "kernel address" to "non user space"
>>> be accurate?
>>> 
>> 
>> I think “kernel address” is right. It’s illegal to access anything that isn’t known to be a valid kernel address while in KERNEL_DS.
> 
> But an non-canonical address is not a "kernel address", and that will
> cause a bug too.

Indeed. A non-canonical address is not known to be a valid kernel address.  I stand by my slightly pedantic statement :)

> This patch came about because it was changed that if
> we do a uaccess on something other than a user space address and take a
> fault (either because it was a non-canonical address, or a kernel
> address), we BUG! Where before that one patch, it would just return a
> fault.
> 
>> 
>> The old __copy seems likely to have always been a bit bogus.
>> 
>> BTW, what is this probe_mem_read() thing?  Some minimal inspection suggests it’s a buggy reimplementation of probe_kernel_read().  Can you delete it and just use probe_kernel_read() directly?
> 
> Well, the issue is that we have trace_probe_tmpl.h in that same
> directory, which does the work for kprobes and uprobes. The
> trace_kprobes.c defines all the functions for handling kprobes, and
> trace_uprobes.c does all the handling of uprobes, then they include
> trace_probe_tmpl.h which does the bulk of the work.
> 
> In the uprobes case, we have:
> 
> static nokprobe_inline int
> probe_mem_read(void *dest, void *src, size_t size)
> {
>    void __user *vaddr = (void __force __user *)src;
> 
>    return copy_from_user(dest, vaddr, size) ? -EFAULT : 0;
> }
> 
> Because that is adding probes on userspace code.
> 
> 

Can the kprobe case call probe_kernel_read?  Maybe it does already?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-16  2:14               ` Andy Lutomirski
@ 2019-02-16  2:21                 ` Steven Rostedt
  0 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-16  2:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski

On Fri, 15 Feb 2019 18:14:21 -0800
Andy Lutomirski <luto@amacapital.net> wrote:

> > In the uprobes case, we have:
> > 
> > static nokprobe_inline int
> > probe_mem_read(void *dest, void *src, size_t size)
> > {
> >    void __user *vaddr = (void __force __user *)src;
> > 
> >    return copy_from_user(dest, vaddr, size) ? -EFAULT : 0;
> > }
> > 
> > Because that is adding probes on userspace code.
> > 
> >   
> 
> Can the kprobe case call probe_kernel_read?  Maybe it does already?

Yes, the probe_mem_read() is only used in the trace_probe_tmpl.h which
for uprobes is the above "copy_from_user()" and for the kprobes case it
is probe_kernel_read().

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-16  0:19         ` Steven Rostedt
  2019-02-16  1:32           ` Andy Lutomirski
@ 2019-02-18 17:58           ` Linus Torvalds
  2019-02-18 18:23             ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-18 17:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski

[ Sorry for not looking at this earlier, I have family in town so I
was mostly busy over the weekend... ]

On Fri, Feb 15, 2019 at 4:19 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Would changing all the mention of "kernel address" to "non user space"
> be accurate?

That would have worked, but in the meantime I just decided to pull the
existing tag, because it's not _horribly_ misleading any more and now
we're just talking details of what is a "kernel address" etc. And the
patch itself is better than what we used to have.

That said, I do agree with Andy that both the old
_copy_from_user_inatomic() thing and the new probe_mem_read() are just
fundamentally broken, nasty and slow. It just so happens that
probe_mem_read() works _reasonably_ well in practice on x86.

Basically, probe_mem_read() -> probe_kernel_read() really only works
on true kernel addresses. And some of that is very fundamental: some
architectures can have two entirely different address spaces for user
and kernel memory, so if you give _any_ function an "try to read this
address", it fundamentally has to be one or the other.

The fact that on x86, there is a unified address space for
kernel/user, and it can work for one or the other, is just
happenstance (and admittedly the common case).

So I've pulled the existing pull request, because it papers over one
particular issue, but the real fix would require:

 - knowing whether it's kernel or user space you access

 - actually using that knowledge to then limit the addresses we are
willing to probe, and _how_ we probe them.

The user-space case is fairly easy: just check the address with
"access_ok()", and then use _copy_user_atomic() without any set_fs()
games. That should "JustWork(tm)".

And if it's truly supposed to be a kernel address, then we probably
need to add a "is this possibly a valid kernel data pointer"
interface. Before we then do what "probe_kernel_read()" currently
does.

               Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-18 17:58           ` Linus Torvalds
@ 2019-02-18 18:23             ` Linus Torvalds
  2019-02-19 16:18               ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-18 18:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski

On Mon, Feb 18, 2019 at 9:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I've pulled the existing pull request, because it papers over one
> particular issue, but the real fix would require:
>
>  - knowing whether it's kernel or user space you access

Actually, it would be good to split the "kernel space" access into
finer granularities, because it would be good to limit those addresses
by what you expect to happen.

For example, on x86-64, we have a separate virtual mapping for kernel
text. If you're following a data pointer, you probably shouldn't even
look at that area, and just disallow it up-front.

And the vmalloc space, we should probably look up the vmalloc
descriptor for the address, to make sure we don't go into ioremap'ed
areas (but we'd generally *do* want to follow pages for module data,
or for vmap'ed stacks).

And fixmap and/or percpu pointers may or may not be something that
we'd want to follow.

Etc.

So it would be good to not just say "user or kernel", but actually say
what *kind* of kernel access it expects.

               Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-18 18:23             ` Linus Torvalds
@ 2019-02-19 16:18               ` Steven Rostedt
  2019-02-19 18:43                 ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-19 16:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski, Masami Hiramatsu


[ Added Masami too. Start of thread is here:
  http://lkml.kernel.org/r/20190215174712.372898450@goodmis.org ]

On Mon, 18 Feb 2019 10:23:44 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So it would be good to not just say "user or kernel", but actually say
> what *kind* of kernel access it expects.

Note, kprobes are a different kind of beast. I've used kprobes to probe
userspace information as well as kernel. Heck, I could see someone
even using kprobes to probe IO memory to check if a device is doing
what they expect it's doing.

Basically, a kprobe is mostly used for debugging what's happening in a
live kernel, to read any address. But for those that are more security
minded, perhaps we could add "layers" via CONFIG or /sys files that
prevent kprobes from doing certain kinds of probes?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-19 16:18               ` Steven Rostedt
@ 2019-02-19 18:43                 ` Linus Torvalds
  2019-02-19 19:03                   ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-19 18:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski, Masami Hiramatsu

On Tue, Feb 19, 2019 at 8:18 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > So it would be good to not just say "user or kernel", but actually say
> > what *kind* of kernel access it expects.
>
> Note, kprobes are a different kind of beast. I've used kprobes to probe
> userspace information as well as kernel. Heck, I could see someone
> even using kprobes to probe IO memory to check if a device is doing
> what they expect it's doing.

Note that even if that is the case, you _need_ to special "user vs
kernel" information.

Because the exact same address might exist in both.

Right now I think that only happens on sparc32, but vendors used to
have that issue on x86-32 too (if they had the 4G:4G patches).

> Basically, a kprobe is mostly used for debugging what's happening in a
> live kernel, to read any address.

My point is that "any address" is not sufficient to begin with. You
need "kernel or user".

Having a flag for what _kind_ of kernel address is ok might then be
required for other cases if they might not be ok with following page
tables to IO space..

          Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-19 18:43                 ` Linus Torvalds
@ 2019-02-19 19:03                   ` Steven Rostedt
  2019-02-20  8:10                     ` Masami Hiramatsu
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-19 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andy Lutomirski, Masami Hiramatsu

On Tue, 19 Feb 2019 10:43:36 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 19, 2019 at 8:18 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > > So it would be good to not just say "user or kernel", but actually say
> > > what *kind* of kernel access it expects.  
> >
> > Note, kprobes are a different kind of beast. I've used kprobes to probe
> > userspace information as well as kernel. Heck, I could see someone
> > even using kprobes to probe IO memory to check if a device is doing
> > what they expect it's doing.  
> 
> Note that even if that is the case, you _need_ to special "user vs
> kernel" information.
> 
> Because the exact same address might exist in both.
> 
> Right now I think that only happens on sparc32, but vendors used to
> have that issue on x86-32 too (if they had the 4G:4G patches).

Hmm, I didn't realize that Linux allowed the same address to be
different depending on being in kernel or user space (learn something
new everyday). I guess it makes sense, and even easier with the switch
of cr3 from user to kernel. And I even knew of 4G:4G, but never used
it, nor put too much thought about its implementation.

> 
> > Basically, a kprobe is mostly used for debugging what's happening in a
> > live kernel, to read any address.  
> 
> My point is that "any address" is not sufficient to begin with. You
> need "kernel or user".
> 
> Having a flag for what _kind_ of kernel address is ok might then be
> required for other cases if they might not be ok with following page
> tables to IO space..
> 

Good point. Looks like we should add a new flag for kprobe
trace parameters, that tell kprobes if the address is expected to be
user or kernel. That would be good regardless of the duplicate
meanings, as we could use copy_from_user without touching KERNEL_DS, if
the probe argument specifically states "this is user space". For
example, when probing do_sys_open, and you want to read what path string
was passed into the kernel.

Masami, thoughts?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-19 19:03                   ` Steven Rostedt
@ 2019-02-20  8:10                     ` Masami Hiramatsu
  2019-02-20 13:57                       ` Jann Horn
  2019-02-20 14:49                       ` Steven Rostedt
  0 siblings, 2 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-20  8:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski, Masami Hiramatsu

On Tue, 19 Feb 2019 14:03:30 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > > Basically, a kprobe is mostly used for debugging what's happening in a
> > > live kernel, to read any address.  
> > 
> > My point is that "any address" is not sufficient to begin with. You
> > need "kernel or user".
> > 
> > Having a flag for what _kind_ of kernel address is ok might then be
> > required for other cases if they might not be ok with following page
> > tables to IO space..
> > 
> 
> Good point. Looks like we should add a new flag for kprobe
> trace parameters, that tell kprobes if the address is expected to be
> user or kernel. That would be good regardless of the duplicate
> meanings, as we could use copy_from_user without touching KERNEL_DS, if
> the probe argument specifically states "this is user space". For
> example, when probing do_sys_open, and you want to read what path string
> was passed into the kernel.
> 
> Masami, thoughts?

Let me ensure what you want. So you want to access a "string" in user-space,
not a data structure? In that case, it is very easy to me. It is enough to
add a "ustring" type to kprobe events. For example, do_sys_opsn's path 
variable is one example. That will be +0(+0(%si)):ustring, and fetcher
finally copy the string using strncpy_from_user() instead of 
strncpy_from_unsafe(). (*)

But if you consider to access a field in a data-structure in user space,
it might need some more work (E.g. ioctl's parameter), becase if the __user
pointer to the data structure is on the memory, we have to dereference 
the address inside kernel using probe_kernel_read(), but after getting 
the data strucutre address, we have to dereference the address with copy_from_user().
At this moment, we have no such strong syntax... 

To solve that, maybe we need to introduce something like "back reference"
of arguments in the event, e.g.

p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user

or

p somewhere +0(%si) field_val=+8(\1):u32:user

This ":user" additional suffix tells kprobe events to change fetching method 
to fetch the data by copy_from_user().

(*) BTW, there is another concern to use _from_user APIs in kprobe. Are those
APIs might sleep??


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20  8:10                     ` Masami Hiramatsu
@ 2019-02-20 13:57                       ` Jann Horn
  2019-02-20 14:47                         ` Steven Rostedt
  2019-02-20 15:08                         ` Masami Hiramatsu
  2019-02-20 14:49                       ` Steven Rostedt
  1 sibling, 2 replies; 85+ messages in thread
From: Jann Horn @ 2019-02-20 13:57 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Kees Cook, Andy Lutomirski

On Wed, Feb 20, 2019 at 9:10 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Tue, 19 Feb 2019 14:03:30 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > > Basically, a kprobe is mostly used for debugging what's happening in a
> > > > live kernel, to read any address.
> > >
> > > My point is that "any address" is not sufficient to begin with. You
> > > need "kernel or user".
> > >
> > > Having a flag for what _kind_ of kernel address is ok might then be
> > > required for other cases if they might not be ok with following page
> > > tables to IO space..
> > >
> >
> > Good point. Looks like we should add a new flag for kprobe
> > trace parameters, that tell kprobes if the address is expected to be
> > user or kernel. That would be good regardless of the duplicate
> > meanings, as we could use copy_from_user without touching KERNEL_DS, if
> > the probe argument specifically states "this is user space". For
> > example, when probing do_sys_open, and you want to read what path string
> > was passed into the kernel.
> >
> > Masami, thoughts?
>
> Let me ensure what you want. So you want to access a "string" in user-space,
> not a data structure? In that case, it is very easy to me. It is enough to
> add a "ustring" type to kprobe events. For example, do_sys_opsn's path
> variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> finally copy the string using strncpy_from_user() instead of
> strncpy_from_unsafe(). (*)
[...]
> (*) BTW, there is another concern to use _from_user APIs in kprobe. Are those
> APIs might sleep??

If you want to access userspace without sleeping, and ignore data in
non-present pages, you can do `pagefault_disable(); err =
__copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe
the kernel should have a helper for that...)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 13:57                       ` Jann Horn
@ 2019-02-20 14:47                         ` Steven Rostedt
  2019-02-20 15:08                         ` Masami Hiramatsu
  1 sibling, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-20 14:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: Masami Hiramatsu, Linus Torvalds, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Kees Cook, Andy Lutomirski

On Wed, 20 Feb 2019 14:57:31 +0100
Jann Horn <jannh@google.com> wrote:

> > (*) BTW, there is another concern to use _from_user APIs in kprobe. Are those
> > APIs might sleep??  
> 
> If you want to access userspace without sleeping, and ignore data in
> non-present pages, you can do `pagefault_disable(); err =
> __copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe
> the kernel should have a helper for that...)

I was about to say pretty much the same thing (about using
_inatomic()), but yeah, we should also have a helper function if we
don't already to include the page_fault_disable/enable() too.

-- Steve


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20  8:10                     ` Masami Hiramatsu
  2019-02-20 13:57                       ` Jann Horn
@ 2019-02-20 14:49                       ` Steven Rostedt
  2019-02-20 16:04                         ` Masami Hiramatsu
  2019-02-22  8:27                         ` Masami Hiramatsu
  1 sibling, 2 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-20 14:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Wed, 20 Feb 2019 17:10:19 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Let me ensure what you want. So you want to access a "string" in user-space,
> not a data structure? In that case, it is very easy to me. It is enough to
> add a "ustring" type to kprobe events. For example, do_sys_opsn's path 
> variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> finally copy the string using strncpy_from_user() instead of 
> strncpy_from_unsafe(). (*)

ustring would be good.

> 
> But if you consider to access a field in a data-structure in user space,
> it might need some more work (E.g. ioctl's parameter), becase if the __user
> pointer to the data structure is on the memory, we have to dereference 
> the address inside kernel using probe_kernel_read(), but after getting 
> the data strucutre address, we have to dereference the address with copy_from_user().
> At this moment, we have no such strong syntax... 
> 
> To solve that, maybe we need to introduce something like "back reference"
> of arguments in the event, e.g.
> 
> p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user
> 
> or
> 
> p somewhere +0(%si) field_val=+8(\1):u32:user
> 
> This ":user" additional suffix tells kprobe events to change fetching method 
> to fetch the data by copy_from_user().

What about just adding 'u' to the end of the offset? Say you have a
data structure in kernel space that has a field in user space you want
to reference?


	field_val=+8u(+0(%si))

Although, I would say having ways to access current parameters may also
be a nice touch ;-)

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 13:57                       ` Jann Horn
  2019-02-20 14:47                         ` Steven Rostedt
@ 2019-02-20 15:08                         ` Masami Hiramatsu
  1 sibling, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-20 15:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Kees Cook, Andy Lutomirski

Hi Jann,

On Wed, 20 Feb 2019 14:57:31 +0100
Jann Horn <jannh@google.com> wrote:

> On Wed, Feb 20, 2019 at 9:10 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > On Tue, 19 Feb 2019 14:03:30 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > > > > Basically, a kprobe is mostly used for debugging what's happening in a
> > > > > live kernel, to read any address.
> > > >
> > > > My point is that "any address" is not sufficient to begin with. You
> > > > need "kernel or user".
> > > >
> > > > Having a flag for what _kind_ of kernel address is ok might then be
> > > > required for other cases if they might not be ok with following page
> > > > tables to IO space..
> > > >
> > >
> > > Good point. Looks like we should add a new flag for kprobe
> > > trace parameters, that tell kprobes if the address is expected to be
> > > user or kernel. That would be good regardless of the duplicate
> > > meanings, as we could use copy_from_user without touching KERNEL_DS, if
> > > the probe argument specifically states "this is user space". For
> > > example, when probing do_sys_open, and you want to read what path string
> > > was passed into the kernel.
> > >
> > > Masami, thoughts?
> >
> > Let me ensure what you want. So you want to access a "string" in user-space,
> > not a data structure? In that case, it is very easy to me. It is enough to
> > add a "ustring" type to kprobe events. For example, do_sys_opsn's path
> > variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> > finally copy the string using strncpy_from_user() instead of
> > strncpy_from_unsafe(). (*)
> [...]
> > (*) BTW, there is another concern to use _from_user APIs in kprobe. Are those
> > APIs might sleep??
> 
> If you want to access userspace without sleeping, and ignore data in
> non-present pages, you can do `pagefault_disable(); err =
> __copy_from_user_inatomic(...); pagefault_enable();`. (Actually, maybe
> the kernel should have a helper for that...)

Ok, we are going back to the start point of this thread :)

http://lkml.kernel.org/r/20190215174712.372898450@goodmis.org

So, if user tells kprobe it is user-pointer, we check it with access_ok(),
and will do something similar to the strnlen_user() and strncpy_from_user(),
but using __copy_from_user_inatomic() and pagefault_disable() for kprobes.

Thank you!

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 14:49                       ` Steven Rostedt
@ 2019-02-20 16:04                         ` Masami Hiramatsu
  2019-02-20 16:42                           ` Steven Rostedt
  2019-02-22  8:27                         ` Masami Hiramatsu
  1 sibling, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-20 16:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Wed, 20 Feb 2019 09:49:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 20 Feb 2019 17:10:19 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > Let me ensure what you want. So you want to access a "string" in user-space,
> > not a data structure? In that case, it is very easy to me. It is enough to
> > add a "ustring" type to kprobe events. For example, do_sys_opsn's path 
> > variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> > finally copy the string using strncpy_from_user() instead of 
> > strncpy_from_unsafe(). (*)
> 
> ustring would be good.

OK.

> 
> > 
> > But if you consider to access a field in a data-structure in user space,
> > it might need some more work (E.g. ioctl's parameter), becase if the __user
> > pointer to the data structure is on the memory, we have to dereference 
> > the address inside kernel using probe_kernel_read(), but after getting 
> > the data strucutre address, we have to dereference the address with copy_from_user().
> > At this moment, we have no such strong syntax... 
> > 
> > To solve that, maybe we need to introduce something like "back reference"
> > of arguments in the event, e.g.
> > 
> > p somewhere user_data=+0(%si) field_val=+8(\user_data):u32:user
> > 
> > or
> > 
> > p somewhere +0(%si) field_val=+8(\1):u32:user
> > 
> > This ":user" additional suffix tells kprobe events to change fetching method 
> > to fetch the data by copy_from_user().
> 
> What about just adding 'u' to the end of the offset? Say you have a
> data structure in kernel space that has a field in user space you want
> to reference?
> 
> 
> 	field_val=+8u(+0(%si))

Ah, that looks good :~) thank you for this idea!

> 
> Although, I would say having ways to access current parameters may also
> be a nice touch ;-)

Hehe, OK, let's implement both.

Thanks!

> 
> -- Steve


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 16:04                         ` Masami Hiramatsu
@ 2019-02-20 16:42                           ` Steven Rostedt
  2019-02-21  7:37                             ` Masami Hiramatsu
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-20 16:42 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Thu, 21 Feb 2019 01:04:53 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> > What about just adding 'u' to the end of the offset? Say you have a
> > data structure in kernel space that has a field in user space you want
> > to reference?
> > 
> > 
> > 	field_val=+8u(+0(%si))  
> 
> Ah, that looks good :~) thank you for this idea!

<bike-shed>

Hmm, I wonder if we should make it +u8 or u+8? as +8u may be confused
as unsigned? Like 8ULL. I don't know. Kernel developers suck at
naming :-p

</bike-shed>

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 16:42                           ` Steven Rostedt
@ 2019-02-21  7:37                             ` Masami Hiramatsu
  0 siblings, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-21  7:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Wed, 20 Feb 2019 11:42:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 21 Feb 2019 01:04:53 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > > What about just adding 'u' to the end of the offset? Say you have a
> > > data structure in kernel space that has a field in user space you want
> > > to reference?
> > > 
> > > 
> > > 	field_val=+8u(+0(%si))  
> > 
> > Ah, that looks good :~) thank you for this idea!
> 
> <bike-shed>
> 
> Hmm, I wonder if we should make it +u8 or u+8? as +8u may be confused
> as unsigned? Like 8ULL. I don't know. Kernel developers suck at
> naming :-p

I like +u8 since it is easier to implement :-p.

Thank you,

> 
> </bike-shed>
> 
> -- Steve


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-15 17:47 ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Steven Rostedt
  2019-02-15 17:55   ` Linus Torvalds
@ 2019-02-21  7:52   ` Masami Hiramatsu
  2019-02-21 14:36     ` Steven Rostedt
  2019-02-23 14:48     ` Masami Hiramatsu
  1 sibling, 2 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-21  7:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton, stable,
	Changbin Du

On Fri, 15 Feb 2019 12:47:13 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Changbin Du <changbin.du@gmail.com>
> 
> The userspace can ask kprobe to intercept strings at any memory address,
> including invalid kernel address. In this case, fetch_store_strlen()
> would crash since it uses general usercopy function, and user access
> functions are no longer allowed to access kernel memory.
> 
> For example, we can crash the kernel by doing something as below:
> 
> $ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'
> 
> [  103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
> [  103.622104] general protection fault: 0000 [#1] SMP PTI
> [  103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96
> [  103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
> [  103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
> [  103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63
>  ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
> [  103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
> [  103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
> [  103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
> [  103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [  103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [  103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  103.644040] FS:  0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
> [  103.646019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
> [  103.649147] Call Trace:
> [  103.649781]  ? sched_clock_cpu+0xc/0xa0
> [  103.650747]  ? do_sys_open+0x5/0x220
> [  103.651635]  kprobe_trace_func+0x303/0x380
> [  103.652645]  ? do_sys_open+0x5/0x220
> [  103.653528]  kprobe_dispatcher+0x45/0x50
> [  103.654682]  ? do_sys_open+0x1/0x220
> [  103.655875]  kprobe_ftrace_handler+0x90/0xf0
> [  103.657282]  ftrace_ops_assist_func+0x54/0xf0
> [  103.658564]  ? __call_rcu+0x1dc/0x280
> [  103.659482]  0xffffffffc00000bf
> [  103.660384]  ? __ia32_sys_open+0x20/0x20
> [  103.661682]  ? do_sys_open+0x1/0x220
> [  103.662863]  do_sys_open+0x5/0x220
> [  103.663988]  do_syscall_64+0x60/0x210
> [  103.665201]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  103.666862] RIP: 0033:0x7fc22fadccdd
> [  103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff
>  ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
> [  103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
> [  103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
> [  103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
> [  103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
> [  103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
> [  103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
> [  103.688056] Modules linked in:
> [  103.689131] ---[ end trace 43792035c28984a1 ]---
> 
> This can be fixed by using probe_mem_read() instead, as it can handle faulting
> kernel memory addresses, which kprobes can legitimately do.

Basically OK to me.
Could you use probe_kernel_read() in this context, since probe_mem_read() is a
wrapper function for template code.

With that change,

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

And for the long term, I need to find more efficient (or smarter) way to do it,
like strnlen_user() does.

Thank you,

> 
> Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com
> 
> Cc: stable@vger.kernel.org
> Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses")
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace_kprobe.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index d5fb09ebba8b..9eaf07f99212 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = {
>  static nokprobe_inline int
>  fetch_store_strlen(unsigned long addr)
>  {
> -	mm_segment_t old_fs;
>  	int ret, len = 0;
>  	u8 c;
>  
> -	old_fs = get_fs();
> -	set_fs(KERNEL_DS);
> -	pagefault_disable();
> -
>  	do {
> -		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
> +		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
>  		len++;
>  	} while (c && ret == 0 && len < MAX_STRING_SIZE);
>  
> -	pagefault_enable();
> -	set_fs(old_fs);
> -
>  	return (ret < 0) ? ret : len;
>  }
>  
> -- 
> 2.20.1
> 
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-21  7:52   ` Masami Hiramatsu
@ 2019-02-21 14:36     ` Steven Rostedt
  2019-02-21 15:58       ` Masami Hiramatsu
  2019-02-23 14:48     ` Masami Hiramatsu
  1 sibling, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-21 14:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton, stable,
	Changbin Du

On Thu, 21 Feb 2019 16:52:52 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Basically OK to me.
> Could you use probe_kernel_read() in this context, since probe_mem_read() is a
> wrapper function for template code.
> 
> With that change,
> 
> Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

This already hit Linus's tree. I was able to reproduce the crash, so I
streamlined it. I should have still pushed more for your ack first.
Sorry about that.

For some reason, I thought the change was in the generic probe code,
and accepted the probe_mem_read(). Anyway, did you want to send a patch
to change it to probe_kernel_read(), for the merge window?

> 
> And for the long term, I need to find more efficient (or smarter) way to do it,
> like strnlen_user() does.

Agreed.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-21 14:36     ` Steven Rostedt
@ 2019-02-21 15:58       ` Masami Hiramatsu
  2019-02-21 16:16         ` Masami Hiramatsu
  0 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-21 15:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton, stable,
	Changbin Du

On Thu, 21 Feb 2019 09:36:25 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 21 Feb 2019 16:52:52 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > Basically OK to me.
> > Could you use probe_kernel_read() in this context, since probe_mem_read() is a
> > wrapper function for template code.
> > 
> > With that change,
> > 
> > Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
> 
> This already hit Linus's tree. I was able to reproduce the crash, so I
> streamlined it. I should have still pushed more for your ack first.
> Sorry about that.

Oh, never mind. That seems urgent issue for kprobe event.
Thank you very much for fixing it!

> For some reason, I thought the change was in the generic probe code,
> and accepted the probe_mem_read(). Anyway, did you want to send a patch
> to change it to probe_kernel_read(), for the merge window?

No problem.

> > And for the long term, I need to find more efficient (or smarter) way to do it,
> > like strnlen_user() does.
> 
> Agreed.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-21 15:58       ` Masami Hiramatsu
@ 2019-02-21 16:16         ` Masami Hiramatsu
  2019-02-21 16:32           ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-21 16:16 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, stable, Changbin Du

Hi Steve,

On Fri, 22 Feb 2019 00:58:12 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Thu, 21 Feb 2019 09:36:25 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Thu, 21 Feb 2019 16:52:52 +0900
> > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> > > Basically OK to me.
> > > Could you use probe_kernel_read() in this context, since probe_mem_read() is a
> > > wrapper function for template code.
> > > 
> > > With that change,
> > > 
> > > Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
> > 
> > This already hit Linus's tree. I was able to reproduce the crash, so I
> > streamlined it. I should have still pushed more for your ack first.
> > Sorry about that.
> 
> Oh, never mind. That seems urgent issue for kprobe event.
> Thank you very much for fixing it!
> 
> > For some reason, I thought the change was in the generic probe code,
> > and accepted the probe_mem_read(). Anyway, did you want to send a patch
> > to change it to probe_kernel_read(), for the merge window?
> 
> No problem.

Oops, I mean No, not yet. but it is a simple and cosmetic patch like below.
Feel free to merge it to ftrace/core.

---------
tracing/kprobes: Use probe_kernel_read instead of probe_mem_read

From: Masami Hiramatsu <mhiramat@kernel.org>

Use probe_kernel_read() instead of probe_mem_read() because
probe_mem_read() is a kind of wrapper for switching memory
read function between uprobes and kprobes.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 kernel/trace/trace_kprobe.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 9eaf07f99212..99592c27465e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -865,7 +865,7 @@ fetch_store_strlen(unsigned long addr)
 	u8 c;
 
 	do {
-		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
+		ret = probe_kernel_read(&c, (u8 *)addr + len, 1);
 		len++;
 	} while (c && ret == 0 && len < MAX_STRING_SIZE);
 

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-21 16:16         ` Masami Hiramatsu
@ 2019-02-21 16:32           ` Steven Rostedt
  0 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-21 16:32 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton, stable,
	Changbin Du


[ Linus, you may want to read this ]

On Fri, 22 Feb 2019 01:16:43 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:


Yep, I'll take this patch. Hmm, my for-next isn't based on my urgent
branch, and this would go on top of urgent. I can do one of three
things to push this to Linus for the merge window:

 1) Create a separate branch to add updates for on my urgent branch.
    This would have me do two pull requests to Linus.

 2) Cherry pick the urgent commit that this updates.
    But that has a stable tag, which could confuse things as it would
    create the same commit twice, both with stable tags (I could take
    the stable tag off of the cherry pick though).

 3) Merge the urgent branch into my for-next branch at that commit,
    with a message to why I'm doing this. The urgent branch is based
    off of a older tag in Linus's tree, so that would only pull in
    tracing changes (my stuff), and wont pull in anyone else's work.

I may go with option 3, because I believe this may be one of the cases
that is allowed to have merges in pull requests.

 a) Only my stuff got merged
 b) Have a dependency on something that already went into Linus's tree
    (prevent me from having to rebase already tested work).
 c) Have it documented in the merge commit to why its being merged

Unless Linus feels otherwise (use one of the other options?), I may go
ahead and do that.

-- Steve

> ---------
> tracing/kprobes: Use probe_kernel_read instead of probe_mem_read
> 
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Use probe_kernel_read() instead of probe_mem_read() because
> probe_mem_read() is a kind of wrapper for switching memory
> read function between uprobes and kprobes.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> ---
>  kernel/trace/trace_kprobe.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 9eaf07f99212..99592c27465e 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -865,7 +865,7 @@ fetch_store_strlen(unsigned long addr)
>  	u8 c;
>  
>  	do {
> -		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
> +		ret = probe_kernel_read(&c, (u8 *)addr + len, 1);
>  		len++;
>  	} while (c && ret == 0 && len < MAX_STRING_SIZE);
>  
> 


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-20 14:49                       ` Steven Rostedt
  2019-02-20 16:04                         ` Masami Hiramatsu
@ 2019-02-22  8:27                         ` Masami Hiramatsu
  2019-02-22  8:35                           ` Masami Hiramatsu
  1 sibling, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-22  8:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

Hi Steve,

On Wed, 20 Feb 2019 09:49:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 20 Feb 2019 17:10:19 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > Let me ensure what you want. So you want to access a "string" in user-space,
> > not a data structure? In that case, it is very easy to me. It is enough to
> > add a "ustring" type to kprobe events. For example, do_sys_opsn's path 
> > variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> > finally copy the string using strncpy_from_user() instead of 
> > strncpy_from_unsafe(). (*)
> 
> ustring would be good.

I've tried to implement ustring and u-offsets, but I got some issues.

- access_ok() warns if it is called in IRQ context (kprobes is.)
- copy_from_user uses access_ok(), so it is not designed for irq handler.

Moreover, if we have different kernel/user address spaces, we have to
assign target user-pages to kernel vma. Can we do that (doesn't it involve
mutex locks)?

If not, I think what we can do "in kprobes" is only probe_kernel_read()
and strncpy_from_unsafe(). This means, on the architechture whoes kernel
address space doesn't map user space, we can not support user-space
data fetching in kprobe evevnts.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22  8:27                         ` Masami Hiramatsu
@ 2019-02-22  8:35                           ` Masami Hiramatsu
  2019-02-22 17:43                             ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-22  8:35 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski

On Fri, 22 Feb 2019 17:27:45 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, 20 Feb 2019 09:49:26 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 20 Feb 2019 17:10:19 +0900
> > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> > > Let me ensure what you want. So you want to access a "string" in user-space,
> > > not a data structure? In that case, it is very easy to me. It is enough to
> > > add a "ustring" type to kprobe events. For example, do_sys_opsn's path 
> > > variable is one example. That will be +0(+0(%si)):ustring, and fetcher
> > > finally copy the string using strncpy_from_user() instead of 
> > > strncpy_from_unsafe(). (*)
> > 
> > ustring would be good.
> 
> I've tried to implement ustring and u-offsets, but I got some issues.
> 
> - access_ok() warns if it is called in IRQ context (kprobes is.)
> - copy_from_user uses access_ok(), so it is not designed for irq handler.
> 
> Moreover, if we have different kernel/user address spaces, we have to
> assign target user-pages to kernel vma. Can we do that (doesn't it involve
> mutex locks)?

Or, can we do this?

long __probe_user_read(void *dst, const void *src, size_t size)
{
        long ret;
        mm_segment_t old_fs = get_fs();

        set_fs(USER_DS);	/* Only this is changed */
        pagefault_disable();
        current->kernel_uaccess_faults_ok++;
        ret = __copy_from_user_inatomic(dst,
                        (__force const void __user *)src, size);
        current->kernel_uaccess_faults_ok--;
        pagefault_enable();
        set_fs(old_fs);

        return ret ? -EFAULT : 0;
}


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22  8:35                           ` Masami Hiramatsu
@ 2019-02-22 17:43                             ` Linus Torvalds
  2019-02-22 17:48                               ` Andy Lutomirski
                                                 ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Linus Torvalds @ 2019-02-22 17:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> Or, can we do this?
>
> long __probe_user_read(void *dst, const void *src, size_t size)
> {

Add a

        if (!access_ok(src, size))
                ret = -EFAULT;
        else {
.. do the pagefault_disable() etc ..
        }

to after the "set_fs()", and it looks good to me. Make it clear that
yes, this works _only_ for user reads.

Adn that makes all the games with "kernel_uaccess_faults_ok"
pointless, so you can just remove them.

(note that the "access_ok()" has to come after we've done "set_fs()",
because it takes the address limit from that).

Also, since normally we'd expect that we already have USER_DS, it
might be worthwhile to do this with a wrapper, something along the
lines of

        mm_segment_t old_fs = get_fs();

        if (segment_eq(old_fs, USER_DS))
              return __normal_probe_user_read();
        set_fs(USER_DS);
        ret = __normal_probe_user_read();
        set_fs(old_fs);
        return ret;

and have that __normal_probe_user_read() just do

        if (!access_ok(src, size))
                return -EFAULT;
        pagefault_disable();
        ret = __copy_from_user_inatomic(dst, ...);
        pagefault_enable();
        return ret ? -EFAULT : 0;

which looks more obvious.

Also, I would suggest that you just make the argument type be "const
void __user *", since the whole point is that this takes a user
pointer, and nothing else.

Then we should still probably fix up "__probe_kernel_read()" to not
allow user accesses. The easiest way to do that is actually likely to
use the "unsafe_get_user()" functions *without* doing a
uaccess_begin(), which will mean that modern CPU's will simply fault
on a kernel access to user space.

The nice thing about that is that usually developers will have access
to exactly those modern boxes, so the people who notice that it
doesn't work are the right people.

Alternatively, we should just make it be architecture-specific, so
that architectures can decide "this address cannot be a kernel
address" and refuse to do it.

                Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 17:43                             ` Linus Torvalds
@ 2019-02-22 17:48                               ` Andy Lutomirski
  2019-02-22 18:28                                 ` Linus Torvalds
  2019-02-22 19:27                               ` Alexei Starovoitov
  2019-02-23  3:47                               ` Masami Hiramatsu
  2 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-22 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski



> On Feb 22, 2019, at 9:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>> 
>> Or, can we do this?
>> 
>> long __probe_user_read(void *dst, const void *src, size_t size)
>> {
> 
> Add a
> 
>        if (!access_ok(src, size))
>                ret = -EFAULT;
>        else {
> .. do the pagefault_disable() etc ..
>        }
> 
> to after the "set_fs()", and it looks good to me. Make it clear that
> yes, this works _only_ for user reads.
> 
> Adn that makes all the games with "kernel_uaccess_faults_ok"
> pointless, so you can just remove them.
> 
> (note that the "access_ok()" has to come after we've done "set_fs()",
> because it takes the address limit from that).
> 
> Also, since normally we'd expect that we already have USER_DS, it
> might be worthwhile to do this with a wrapper, something along the
> lines of
> 
>        mm_segment_t old_fs = get_fs();
> 
>        if (segment_eq(old_fs, USER_DS))
>              return __normal_probe_user_read();
>        set_fs(USER_DS);
>        ret = __normal_probe_user_read();
>        set_fs(old_fs);
>        return ret;
> 
> and have that __normal_probe_user_read() just do
> 
>        if (!access_ok(src, size))
>                return -EFAULT;
>        pagefault_disable();
>        ret = __copy_from_user_inatomic(dst, ...);
>        pagefault_enable();
>        return ret ? -EFAULT : 0;
> 
> which looks more obvious.
> 
> Also, I would suggest that you just make the argument type be "const
> void __user *", since the whole point is that this takes a user
> pointer, and nothing else.
> 
> Then we should still probably fix up "__probe_kernel_read()" to not
> allow user accesses. The easiest way to do that is actually likely to
> use the "unsafe_get_user()" functions *without* doing a
> uaccess_begin(), which will mean that modern CPU's will simply fault
> on a kernel access to user space.
> 
> The nice thing about that is that usually developers will have access
> to exactly those modern boxes, so the people who notice that it
> doesn't work are the right people.

We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 17:48                               ` Andy Lutomirski
@ 2019-02-22 18:28                                 ` Linus Torvalds
  2019-02-22 19:52                                   ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-22 18:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Masami Hiramatsu, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Fri, Feb 22, 2019 at 9:48 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> > On Feb 22, 2019, at 9:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Then we should still probably fix up "__probe_kernel_read()" to not
> > allow user accesses. The easiest way to do that is actually likely to
> > use the "unsafe_get_user()" functions *without* doing a
> > uaccess_begin(), which will mean that modern CPU's will simply fault
> > on a kernel access to user space.
> >
> > The nice thing about that is that usually developers will have access
> > to exactly those modern boxes, so the people who notice that it
> > doesn't work are the right people.
>
> We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.

It would still do that.

Using the unsafe_get_user() macros doesn't remove the exception
handling, and we wouldn't remove the whole "pagefault_disable()"
either. So it would work exactly the same way it does now, except on a
modern CPU it would return -EFAULT for a user space access due to AC
not being set.

So no "oops harder", only "safer accesses".

                Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 17:43                             ` Linus Torvalds
  2019-02-22 17:48                               ` Andy Lutomirski
@ 2019-02-22 19:27                               ` Alexei Starovoitov
  2019-02-22 19:30                                 ` Steven Rostedt
                                                   ` (2 more replies)
  2019-02-23  3:47                               ` Masami Hiramatsu
  2 siblings, 3 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> 
> Then we should still probably fix up "__probe_kernel_read()" to not
> allow user accesses. The easiest way to do that is actually likely to
> use the "unsafe_get_user()" functions *without* doing a
> uaccess_begin(), which will mean that modern CPU's will simply fault
> on a kernel access to user space.

On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
and users pass both user and kernel addresses into it and expect
that the helper will actually try to read from that address.

If __probe_kernel_read will suddenly start failing on all user addresses
it will break the expectations.
How do we solve it in bpf_probe_read?
Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
in the loop?
That's doable, but people already complain that bpf_probe_read() is slow
and shows up in their perf report.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                               ` Alexei Starovoitov
@ 2019-02-22 19:30                                 ` Steven Rostedt
  2019-02-22 19:34                                   ` Alexei Starovoitov
  2019-02-22 21:20                                 ` Linus Torvalds
  2019-02-26 15:24                                 ` Joel Fernandes
  2 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-22 19:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, 22 Feb 2019 11:27:05 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > 
> > Then we should still probably fix up "__probe_kernel_read()" to not
> > allow user accesses. The easiest way to do that is actually likely to
> > use the "unsafe_get_user()" functions *without* doing a
> > uaccess_begin(), which will mean that modern CPU's will simply fault
> > on a kernel access to user space.  
> 
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.
> 
> If __probe_kernel_read will suddenly start failing on all user addresses
> it will break the expectations.
> How do we solve it in bpf_probe_read?
> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> in the loop?
> That's doable, but people already complain that bpf_probe_read() is slow
> and shows up in their perf report.

We're changing kprobes to add a specific flag to say that we want to
differentiate between kernel or user reads. Can this be done with
bpf_probe_read()? If it's showing up in perf report, I doubt a single
check is going to cause an issue. In fact, it may actually help speed
things up as the read will be optimized for either user or kernel
address reading.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:30                                 ` Steven Rostedt
@ 2019-02-22 19:34                                   ` Alexei Starovoitov
  2019-02-22 19:39                                     ` Steven Rostedt
  2019-02-22 19:55                                     ` Andy Lutomirski
  0 siblings, 2 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 19:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> On Fri, 22 Feb 2019 11:27:05 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > 
> > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > allow user accesses. The easiest way to do that is actually likely to
> > > use the "unsafe_get_user()" functions *without* doing a
> > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > on a kernel access to user space.  
> > 
> > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > and users pass both user and kernel addresses into it and expect
> > that the helper will actually try to read from that address.
> > 
> > If __probe_kernel_read will suddenly start failing on all user addresses
> > it will break the expectations.
> > How do we solve it in bpf_probe_read?
> > Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > in the loop?
> > That's doable, but people already complain that bpf_probe_read() is slow
> > and shows up in their perf report.
> 
> We're changing kprobes to add a specific flag to say that we want to
> differentiate between kernel or user reads. Can this be done with
> bpf_probe_read()? If it's showing up in perf report, I doubt a single

so you're saying you will break existing kprobe scripts?
I don't think it's a good idea.
It's not acceptable to break bpf_probe_read uapi.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:34                                   ` Alexei Starovoitov
@ 2019-02-22 19:39                                     ` Steven Rostedt
  2019-02-22 19:55                                     ` Andy Lutomirski
  1 sibling, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-22 19:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, 22 Feb 2019 11:34:58 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> so you're saying you will break existing kprobe scripts?

Yes we may.

> I don't think it's a good idea.
> It's not acceptable to break bpf_probe_read uapi.

Then you may need to add more code to determine if the address is user
space or not in the kernel, and then go the appropriate route, before
calling probe_kernel_read().

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 18:28                                 ` Linus Torvalds
@ 2019-02-22 19:52                                   ` Andy Lutomirski
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-22 19:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski



> On Feb 22, 2019, at 10:28 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, Feb 22, 2019 at 9:48 AM Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>> On Feb 22, 2019, at 9:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> 
>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>> allow user accesses. The easiest way to do that is actually likely to
>>> use the "unsafe_get_user()" functions *without* doing a
>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>> on a kernel access to user space.
>>> 
>>> The nice thing about that is that usually developers will have access
>>> to exactly those modern boxes, so the people who notice that it
>>> doesn't work are the right people.
>> 
>> We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than oops harder and kill the first oops.
> 
> It would still do that.
> 
> Using the unsafe_get_user() macros doesn't remove the exception
> handling, and we wouldn't remove the whole "pagefault_disable()"
> either. So it would work exactly the same way it does now, except on a
> modern CPU it would return -EFAULT for a user space access due to AC
> not being set.
> 
> 

Hmm. I misunderstood you. I thought you wanted the oops.

We’d have to check that we don’t trip the “SMAP violation, egads!” check. 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:34                                   ` Alexei Starovoitov
  2019-02-22 19:39                                     ` Steven Rostedt
@ 2019-02-22 19:55                                     ` Andy Lutomirski
  2019-02-22 21:43                                       ` Jann Horn
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-22 19:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Linus Torvalds, Masami Hiramatsu,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf



> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>> On Fri, 22 Feb 2019 11:27:05 -0800
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>> 
>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>> 
>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>> allow user accesses. The easiest way to do that is actually likely to
>>>> use the "unsafe_get_user()" functions *without* doing a
>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>> on a kernel access to user space.  
>>> 
>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>> and users pass both user and kernel addresses into it and expect
>>> that the helper will actually try to read from that address.
>>> 
>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>> it will break the expectations.
>>> How do we solve it in bpf_probe_read?
>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>> in the loop?
>>> That's doable, but people already complain that bpf_probe_read() is slow
>>> and shows up in their perf report.
>> 
>> We're changing kprobes to add a specific flag to say that we want to
>> differentiate between kernel or user reads. Can this be done with
>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> 
> so you're saying you will break existing kprobe scripts?
> I don't think it's a good idea.
> It's not acceptable to break bpf_probe_read uapi.
> 

If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.

What to do about existing scripts is a different question.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                               ` Alexei Starovoitov
  2019-02-22 19:30                                 ` Steven Rostedt
@ 2019-02-22 21:20                                 ` Linus Torvalds
  2019-02-22 21:38                                   ` David Miller
  2019-02-26 15:24                                 ` Joel Fernandes
  2 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-22 21:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.

As mentioned earlier in the thread, that's actually fundamentally broken.

There are architectures that have physically separate address spaces,
with the same pointer value in both kernel and user space.

They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.

So a pointer really should always unambiguously always be explicitly
_either_ a kernel pointer, or a user pointer. You can't have "this is
a pointer", and then try to figure it out by looking at the value.
That may happen to work on x86-64, but it's literally a "happen to
work on the most common architectures", not a design thing.

                Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:20                                 ` Linus Torvalds
@ 2019-02-22 21:38                                   ` David Miller
  2019-02-22 21:59                                     ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: David Miller @ 2019-02-22 21:38 UTC (permalink / raw)
  To: torvalds
  Cc: alexei.starovoitov, mhiramat, rostedt, luto, linux-kernel, mingo,
	akpm, stable, changbin.du, jannh, keescook, luto, daniel, netdev,
	bpf

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 22 Feb 2019 13:20:58 -0800

> On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>> and users pass both user and kernel addresses into it and expect
>> that the helper will actually try to read from that address.
> 
> As mentioned earlier in the thread, that's actually fundamentally broken.
> 
> There are architectures that have physically separate address spaces,
> with the same pointer value in both kernel and user space.
> 
> They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.

And sparc64.

> So a pointer really should always unambiguously always be explicitly
> _either_ a kernel pointer, or a user pointer. You can't have "this is
> a pointer", and then try to figure it out by looking at the value.
> That may happen to work on x86-64, but it's literally a "happen to
> work on the most common architectures", not a design thing.

Don't be surprised if we see more separation like this in the future too.

So it's not a smart thing to code against even if you can discount all of
the examples Linus gives above.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:55                                     ` Andy Lutomirski
@ 2019-02-22 21:43                                       ` Jann Horn
  2019-02-22 22:08                                         ` Nadav Amit
  0 siblings, 1 reply; 85+ messages in thread
From: Jann Horn @ 2019-02-22 21:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Network Development, bpf, Nadav Amit,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

(adding some people from the text_poke series to the thread, removing stable@)

On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >> On Fri, 22 Feb 2019 11:27:05 -0800
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>
> >>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>> allow user accesses. The easiest way to do that is actually likely to
> >>>> use the "unsafe_get_user()" functions *without* doing a
> >>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>> on a kernel access to user space.
> >>>
> >>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>> and users pass both user and kernel addresses into it and expect
> >>> that the helper will actually try to read from that address.
> >>>
> >>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>> it will break the expectations.
> >>> How do we solve it in bpf_probe_read?
> >>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>> in the loop?
> >>> That's doable, but people already complain that bpf_probe_read() is slow
> >>> and shows up in their perf report.
> >>
> >> We're changing kprobes to add a specific flag to say that we want to
> >> differentiate between kernel or user reads. Can this be done with
> >> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >
> > so you're saying you will break existing kprobe scripts?
> > I don't think it's a good idea.
> > It's not acceptable to break bpf_probe_read uapi.
> >
>
> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>
> What to do about existing scripts is a different question.

This lack of logical separation between user and kernel addresses
might interact interestingly with the text_poke series, specifically
"[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
right? If someone manages to get a tracing BPF program to trigger in a
task that has switched to the patching mm, could they use
bpf_probe_write_user() - which uses probe_kernel_write() after
checking that KERNEL_DS isn't active and that access_ok() passes - to
overwrite kernel text that is mapped writable in the patching mm?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:38                                   ` David Miller
@ 2019-02-22 21:59                                     ` Linus Torvalds
  2019-02-22 22:51                                       ` Alexei Starovoitov
  2019-02-26  3:57                                       ` Christoph Hellwig
  0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2019-02-22 21:59 UTC (permalink / raw)
  To: David Miller
  Cc: Alexei Starovoitov, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
>
> Don't be surprised if we see more separation like this in the future too.

Yes, with the whole meltdown fiasco, there's actually more pressure to
add more support for separation of kernel/user address spaces. As Andy
pointed out, it's been discussed as a future wish-list for x86-64 too.

But yeah, right now the *common* architectures all distinguish kernel
and user space by pointers (ie x86-64, arm64 and powerpc).

                 Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:43                                       ` Jann Horn
@ 2019-02-22 22:08                                         ` Nadav Amit
  2019-02-22 22:17                                           ` Jann Horn
  0 siblings, 1 reply; 85+ messages in thread
From: Nadav Amit @ 2019-02-22 22:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> 
> (adding some people from the text_poke series to the thread, removing stable@)
> 
> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 
>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>> 
>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>> on a kernel access to user space.
>>>>> 
>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>> and users pass both user and kernel addresses into it and expect
>>>>> that the helper will actually try to read from that address.
>>>>> 
>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>> it will break the expectations.
>>>>> How do we solve it in bpf_probe_read?
>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>> in the loop?
>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>> and shows up in their perf report.
>>>> 
>>>> We're changing kprobes to add a specific flag to say that we want to
>>>> differentiate between kernel or user reads. Can this be done with
>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>> 
>>> so you're saying you will break existing kprobe scripts?
>>> I don't think it's a good idea.
>>> It's not acceptable to break bpf_probe_read uapi.
>> 
>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>> 
>> What to do about existing scripts is a different question.
> 
> This lack of logical separation between user and kernel addresses
> might interact interestingly with the text_poke series, specifically
> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd44d6f0765dd49b20db708d6990ee7e8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864686717142892&amp;sdata=gVALdkEULEhj4iJNEWAGxyYWe2lxnHRdamW5ZA2A5RQ%3D&amp;reserved=0)
> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd44d6f0765dd49b20db708d6990ee7e8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864686717142892&amp;sdata=nu2J1FtJsZJmt53SKJz8C8ktWE9eycwdAA%2BiCi1TfCc%3D&amp;reserved=0),
> right? If someone manages to get a tracing BPF program to trigger in a
> task that has switched to the patching mm, could they use
> bpf_probe_write_user() - which uses probe_kernel_write() after
> checking that KERNEL_DS isn't active and that access_ok() passes - to
> overwrite kernel text that is mapped writable in the patching mm?

Yes, this is a good point. I guess text_poke() should be defined with
“__kprobes” and open-code memcpy.

Does it sound reasonable?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:08                                         ` Nadav Amit
@ 2019-02-22 22:17                                           ` Jann Horn
  2019-02-22 22:21                                             ` Nadav Amit
  0 siblings, 1 reply; 85+ messages in thread
From: Jann Horn @ 2019-02-22 22:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >
> > (adding some people from the text_poke series to the thread, removing stable@)
> >
> > On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>
> >>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>
> >>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>> on a kernel access to user space.
> >>>>>
> >>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>> and users pass both user and kernel addresses into it and expect
> >>>>> that the helper will actually try to read from that address.
> >>>>>
> >>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>> it will break the expectations.
> >>>>> How do we solve it in bpf_probe_read?
> >>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>> in the loop?
> >>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>> and shows up in their perf report.
> >>>>
> >>>> We're changing kprobes to add a specific flag to say that we want to
> >>>> differentiate between kernel or user reads. Can this be done with
> >>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>
> >>> so you're saying you will break existing kprobe scripts?
> >>> I don't think it's a good idea.
> >>> It's not acceptable to break bpf_probe_read uapi.
> >>
> >> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>
> >> What to do about existing scripts is a different question.
> >
> > This lack of logical separation between user and kernel addresses
> > might interact interestingly with the text_poke series, specifically
> > "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd44d6f0765dd49b20db708d6990ee7e8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864686717142892&amp;sdata=gVALdkEULEhj4iJNEWAGxyYWe2lxnHRdamW5ZA2A5RQ%3D&amp;reserved=0)
> > and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd44d6f0765dd49b20db708d6990ee7e8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864686717142892&amp;sdata=nu2J1FtJsZJmt53SKJz8C8ktWE9eycwdAA%2BiCi1TfCc%3D&amp;reserved=0),
> > right? If someone manages to get a tracing BPF program to trigger in a
> > task that has switched to the patching mm, could they use
> > bpf_probe_write_user() - which uses probe_kernel_write() after
> > checking that KERNEL_DS isn't active and that access_ok() passes - to
> > overwrite kernel text that is mapped writable in the patching mm?
>
> Yes, this is a good point. I guess text_poke() should be defined with
> “__kprobes” and open-code memcpy.
>
> Does it sound reasonable?

Doesn't __text_poke() as implemented in the proposed patch use a
couple other kernel functions, too? Like switch_mm_irqs_off() and
pte_clear() (which can be a call into a separate function on paravirt
kernels)?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:17                                           ` Jann Horn
@ 2019-02-22 22:21                                             ` Nadav Amit
  2019-02-22 22:39                                               ` Nadav Amit
  0 siblings, 1 reply; 85+ messages in thread
From: Nadav Amit @ 2019-02-22 22:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>> 
>>> (adding some people from the text_poke series to the thread, removing stable@)
>>> 
>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>> 
>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>> 
>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>> on a kernel access to user space.
>>>>>>> 
>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>> that the helper will actually try to read from that address.
>>>>>>> 
>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>> it will break the expectations.
>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>> in the loop?
>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>> and shows up in their perf report.
>>>>>> 
>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>> 
>>>>> so you're saying you will break existing kprobe scripts?
>>>>> I don't think it's a good idea.
>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>> 
>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>> 
>>>> What to do about existing scripts is a different question.
>>> 
>>> This lack of logical separation between user and kernel addresses
>>> might interact interestingly with the text_poke series, specifically
>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
>>> right? If someone manages to get a tracing BPF program to trigger in a
>>> task that has switched to the patching mm, could they use
>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>> overwrite kernel text that is mapped writable in the patching mm?
>> 
>> Yes, this is a good point. I guess text_poke() should be defined with
>> “__kprobes” and open-code memcpy.
>> 
>> Does it sound reasonable?
> 
> Doesn't __text_poke() as implemented in the proposed patch use a
> couple other kernel functions, too? Like switch_mm_irqs_off() and
> pte_clear() (which can be a call into a separate function on paravirt
> kernels)?

I will move the pte_clear() to be done after the poking mm was unloaded.
Give me a few minutes to send a sketch of what I think should be done.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:21                                             ` Nadav Amit
@ 2019-02-22 22:39                                               ` Nadav Amit
  2019-02-22 23:02                                                 ` Jann Horn
  0 siblings, 1 reply; 85+ messages in thread
From: Nadav Amit @ 2019-02-22 22:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>> 
>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>> 
>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>> 
>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>> 
>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>> 
>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>> on a kernel access to user space.
>>>>>>>> 
>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>> 
>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>> it will break the expectations.
>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>> in the loop?
>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>> and shows up in their perf report.
>>>>>>> 
>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>> 
>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>> I don't think it's a good idea.
>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>> 
>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>> 
>>>>> What to do about existing scripts is a different question.
>>>> 
>>>> This lack of logical separation between user and kernel addresses
>>>> might interact interestingly with the text_poke series, specifically
>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>> task that has switched to the patching mm, could they use
>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>> overwrite kernel text that is mapped writable in the patching mm?
>>> 
>>> Yes, this is a good point. I guess text_poke() should be defined with
>>> “__kprobes” and open-code memcpy.
>>> 
>>> Does it sound reasonable?
>> 
>> Doesn't __text_poke() as implemented in the proposed patch use a
>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>> pte_clear() (which can be a call into a separate function on paravirt
>> kernels)?
> 
> I will move the pte_clear() to be done after the poking mm was unloaded.
> Give me a few minutes to send a sketch of what I think should be done.

Err.. You are right, I don’t see an easy way of preventing a kprobe from
being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.

The reasonable solution seems to me as taking all the relevant pieces of
code (and data) that might be used during text-poking and encapsulating them, so they
will be set in a memory area which cannot be kprobe'd. This can also be
useful to write-protect data structures of code that calls text_poke(),
e.g., static-keys. It can also protect data on that stack that is used
during text_poke() from being overwritten from another core.

This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
when doing write-rarely operations.

Right now, I think that text_poke() will keep being susceptible to such
an attack, unless you have a better suggestion.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:59                                     ` Linus Torvalds
@ 2019-02-22 22:51                                       ` Alexei Starovoitov
  2019-02-22 23:11                                         ` Jann Horn
  2019-02-22 23:16                                         ` Linus Torvalds
  2019-02-26  3:57                                       ` Christoph Hellwig
  1 sibling, 2 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 22:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> >
> > Don't be surprised if we see more separation like this in the future too.
> 
> Yes, with the whole meltdown fiasco, there's actually more pressure to
> add more support for separation of kernel/user address spaces. As Andy
> pointed out, it's been discussed as a future wish-list for x86-64 too.
> 
> But yeah, right now the *common* architectures all distinguish kernel
> and user space by pointers (ie x86-64, arm64 and powerpc).

That's all fine. I'm missing rationale for making probe_kernel_read()
fail on user addresses.
What is fundamentally wrong with a function probe_any_address_read() ?

For context, typical bpf kprobe program looks like this:
#define probe_read(P) \
    ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
SEC("kprobe/__set_task_comm")
int bpf_prog(struct pt_regs *ctx)
{
    struct signal_struct *signal;
    struct task_struct *tsk;
    char oldcomm[16] = {};
    char newcomm[16] = {};
    u16 oom_score_adj;
    u32 pid;

    tsk = (void *)PT_REGS_PARM1(ctx);
    pid = probe_read(tsk->pid);
    bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
    bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
    signal = probe_read(tsk->signal);
    oom_score_adj = probe_read(signal->oom_score_adj);
    ...
}

where PT_REGS_PARMx macros are defined per architecture.
On x86 it's #define PT_REGS_PARM1(x) ((x)->di)

The program writer has to know the meaning of function arguments.
In this example they need to know that __set_task_comm is defined as
void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.

Right now these programs just call bpf_probe_read() on whatever data
they need to access and not differentiating whether it's user or kernel.

One idea we discussed is to split bpf_probe_read() into kernel_read and user_read
helpers, but in the BPF verifier we cannot determine which address space
the program wants to access. The prog writer needs to manually analyze the program
to use correct one. But mistakes are possible and cannot be fatal.
On the kernel side we have to be safe.
Both probe_kernel_read and probe_user_read must not panic if a pointer
from wrong address space was passed.

Hence my preference is to keep probe_kernel_read as "try read any address".
The function can be renamed to indicate so.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:39                                               ` Nadav Amit
@ 2019-02-22 23:02                                                 ` Jann Horn
  2019-02-22 23:22                                                   ` Nadav Amit
  2019-02-22 23:59                                                   ` Andy Lutomirski
  0 siblings, 2 replies; 85+ messages in thread
From: Jann Horn @ 2019-02-22 23:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> >
> >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> >>
> >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >>>>
> >>>> (adding some people from the text_poke series to the thread, removing stable@)
> >>>>
> >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>
> >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>>>>
> >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>>>>> on a kernel access to user space.
> >>>>>>>>
> >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>>>>> and users pass both user and kernel addresses into it and expect
> >>>>>>>> that the helper will actually try to read from that address.
> >>>>>>>>
> >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>>>>> it will break the expectations.
> >>>>>>>> How do we solve it in bpf_probe_read?
> >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>>>>> in the loop?
> >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>>>>> and shows up in their perf report.
> >>>>>>>
> >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> >>>>>>> differentiate between kernel or user reads. Can this be done with
> >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>>>>
> >>>>>> so you're saying you will break existing kprobe scripts?
> >>>>>> I don't think it's a good idea.
> >>>>>> It's not acceptable to break bpf_probe_read uapi.
> >>>>>
> >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>>>>
> >>>>> What to do about existing scripts is a different question.
> >>>>
> >>>> This lack of logical separation between user and kernel addresses
> >>>> might interact interestingly with the text_poke series, specifically
> >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> >>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
> >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> >>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
> >>>> right? If someone manages to get a tracing BPF program to trigger in a
> >>>> task that has switched to the patching mm, could they use
> >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> >>>> overwrite kernel text that is mapped writable in the patching mm?
> >>>
> >>> Yes, this is a good point. I guess text_poke() should be defined with
> >>> “__kprobes” and open-code memcpy.
> >>>
> >>> Does it sound reasonable?
> >>
> >> Doesn't __text_poke() as implemented in the proposed patch use a
> >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> >> pte_clear() (which can be a call into a separate function on paravirt
> >> kernels)?
> >
> > I will move the pte_clear() to be done after the poking mm was unloaded.
> > Give me a few minutes to send a sketch of what I think should be done.
>
> Err.. You are right, I don’t see an easy way of preventing a kprobe from
> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>
> The reasonable solution seems to me as taking all the relevant pieces of
> code (and data) that might be used during text-poking and encapsulating them, so they
> will be set in a memory area which cannot be kprobe'd. This can also be
> useful to write-protect data structures of code that calls text_poke(),
> e.g., static-keys. It can also protect data on that stack that is used
> during text_poke() from being overwritten from another core.
>
> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> when doing write-rarely operations.
>
> Right now, I think that text_poke() will keep being susceptible to such
> an attack, unless you have a better suggestion.

A relatively simple approach might be to teach BPF not to run kprobe
programs and such in contexts where current->mm isn't the active mm?
Maybe using nmi_uaccess_okay(), or something like that? It looks like
perf_callchain_user() also already uses that. Except that a lot of
this code is x86-specific...

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:51                                       ` Alexei Starovoitov
@ 2019-02-22 23:11                                         ` Jann Horn
  2019-02-22 23:16                                           ` David Miller
  2019-02-22 23:16                                         ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Jann Horn @ 2019-02-22 23:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 11:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> > >
> > > Don't be surprised if we see more separation like this in the future too.
> >
> > Yes, with the whole meltdown fiasco, there's actually more pressure to
> > add more support for separation of kernel/user address spaces. As Andy
> > pointed out, it's been discussed as a future wish-list for x86-64 too.
> >
> > But yeah, right now the *common* architectures all distinguish kernel
> > and user space by pointers (ie x86-64, arm64 and powerpc).
>
> That's all fine. I'm missing rationale for making probe_kernel_read()
> fail on user addresses.
> What is fundamentally wrong with a function probe_any_address_read() ?

I think what Linus is saying is: There are some scenarios (like a
system with the old 4G/4G X86 patch) where *the same* address can
refer to two different pieces of memory, depending on whether you
interpret it as a kernel pointer or a user pointer. So for example, if
your BPF program tries to read tsk->comm, it works, but if the BPF
program then tries to read from PT_REGS_PARM2(ctx), it's going to
actually interpret the userspace address as a kernel address and read
completely different memory.

On top of that, from the security angle, this means that if a user
passes a kernel pointer into a syscall, they can trick a tracing BPF
program into looking at random kernel memory instead of the user's
memory. That may or may not be problematic, depending on what you do
afterwards with the data you've read. (For example, if this is a
service that collects performance data and then saves it to some
world-readable location on disk because the data it is collecting
(including comm strings) isn't supposed to be sensitive, you might
have a problem.)

> For context, typical bpf kprobe program looks like this:
> #define probe_read(P) \
>     ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
> SEC("kprobe/__set_task_comm")
> int bpf_prog(struct pt_regs *ctx)
> {
>     struct signal_struct *signal;
>     struct task_struct *tsk;
>     char oldcomm[16] = {};
>     char newcomm[16] = {};
>     u16 oom_score_adj;
>     u32 pid;
>
>     tsk = (void *)PT_REGS_PARM1(ctx);
>     pid = probe_read(tsk->pid);
>     bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
>     bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
>     signal = probe_read(tsk->signal);
>     oom_score_adj = probe_read(signal->oom_score_adj);
>     ...
> }
>
> where PT_REGS_PARMx macros are defined per architecture.
> On x86 it's #define PT_REGS_PARM1(x) ((x)->di)
>
> The program writer has to know the meaning of function arguments.
> In this example they need to know that __set_task_comm is defined as
> void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.
>
> Right now these programs just call bpf_probe_read() on whatever data
> they need to access and not differentiating whether it's user or kernel.
>
> One idea we discussed is to split bpf_probe_read() into kernel_read and user_read
> helpers, but in the BPF verifier we cannot determine which address space
> the program wants to access. The prog writer needs to manually analyze the program
> to use correct one. But mistakes are possible and cannot be fatal.
> On the kernel side we have to be safe.
> Both probe_kernel_read and probe_user_read must not panic if a pointer
> from wrong address space was passed.
>
> Hence my preference is to keep probe_kernel_read as "try read any address".
> The function can be renamed to indicate so.
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:11                                         ` Jann Horn
@ 2019-02-22 23:16                                           ` David Miller
  0 siblings, 0 replies; 85+ messages in thread
From: David Miller @ 2019-02-22 23:16 UTC (permalink / raw)
  To: jannh
  Cc: alexei.starovoitov, torvalds, mhiramat, rostedt, luto,
	linux-kernel, mingo, akpm, stable, changbin.du, keescook, luto,
	daniel, netdev, bpf

From: Jann Horn <jannh@google.com>
Date: Sat, 23 Feb 2019 00:11:58 +0100

> I think what Linus is saying is: There are some scenarios (like a
> system with the old 4G/4G X86 patch) where *the same* address can
> refer to two different pieces of memory, depending on whether you
> interpret it as a kernel pointer or a user pointer.

Exactly.

On sparc64 the kernel is mapped exactly at the same virtual addresses
as userspace processes usually are mapped, even 32-bit ones.  The
difference is the MMU context only.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:51                                       ` Alexei Starovoitov
  2019-02-22 23:11                                         ` Jann Horn
@ 2019-02-22 23:16                                         ` Linus Torvalds
  2019-02-22 23:56                                           ` Alexei Starovoitov
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-22 23:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 2:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> That's all fine. I'm missing rationale for making probe_kernel_read()
> fail on user addresses.

Because it already WON'T WORK in general!

> What is fundamentally wrong with a function probe_any_address_read() ?

What part of "the same pointer value can be a user address and a
kernel address" is not getting through?

The user address space and the kernel address space have separate page
tables on some architectures. We used to avoid it on x86, because
switching address spaces was expensive, but even on x86 some vendors
did it on 32-bit simply to get 4GB of user (and kernel) address space.
And now we end up doing it anyway just because of meltdown.

So a kernel pointer value of 0x12345678 could be a value kernel
pointer pointing to some random kmalloc'ed kernel memory, and a user
pointer value of 0x12345678 could be a valid _user_ pointer pointing
to some user mapping.

See?

If you access a user pointer, you need to use a user accessor function
(eg "get_user()"), while if you access a kernel pointer you need to
just dereference it directly (unless you can't trust it, in which case
you need to use a _different_ accessor function).

The fact that user and kernel pointers happen to be distinct on x86-64
(right now) is just a random implementation detail.

Really.

I didn't realize how many people seem to have been confused about
this. But it's always been true. It's just that the common
architectures have had that "one single address space for both kernel
and user pointers" in practice.

In fact, the *very* first kernel version had separate address spaces
for kernel and user mode even on x86 (using segments, not paging). So
it has literally been true since day one in Linux that a kernel
address can be indistinguishable from a user address from a pure value
standpoint.

                 Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:02                                                 ` Jann Horn
@ 2019-02-22 23:22                                                   ` Nadav Amit
  2019-02-22 23:59                                                   ` Andy Lutomirski
  1 sibling, 0 replies; 85+ messages in thread
From: Nadav Amit @ 2019-02-22 23:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 3:02 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>>>> 
>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>>>> 
>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>>>> 
>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>>>> on a kernel access to user space.
>>>>>>>>>> 
>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>>>> 
>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>>>> it will break the expectations.
>>>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>>>> in the loop?
>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>>>> and shows up in their perf report.
>>>>>>>>> 
>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>>>> 
>>>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>>>> I don't think it's a good idea.
>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>>>> 
>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>>>> 
>>>>>>> What to do about existing scripts is a different question.
>>>>>> 
>>>>>> This lack of logical separation between user and kernel addresses
>>>>>> might interact interestingly with the text_poke series, specifically
>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd03df2db76624da8eb2008d69919e41a%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864733821233906&amp;sdata=ky5iTrsCceoPwVW5N9FB4sDspwGEQ8MTlRE4b1Bqn54%3D&amp;reserved=0)
>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cd03df2db76624da8eb2008d69919e41a%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864733821233906&amp;sdata=EJs8doLrfFfMTKiVHmWjmpnpWolmuuZ5pxOmEMcI0ew%3D&amp;reserved=0),
>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>>>> task that has switched to the patching mm, could they use
>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>>>> overwrite kernel text that is mapped writable in the patching mm?
>>>>> 
>>>>> Yes, this is a good point. I guess text_poke() should be defined with
>>>>> “__kprobes” and open-code memcpy.
>>>>> 
>>>>> Does it sound reasonable?
>>>> 
>>>> Doesn't __text_poke() as implemented in the proposed patch use a
>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>>>> pte_clear() (which can be a call into a separate function on paravirt
>>>> kernels)?
>>> 
>>> I will move the pte_clear() to be done after the poking mm was unloaded.
>>> Give me a few minutes to send a sketch of what I think should be done.
>> 
>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>> 
>> The reasonable solution seems to me as taking all the relevant pieces of
>> code (and data) that might be used during text-poking and encapsulating them, so they
>> will be set in a memory area which cannot be kprobe'd. This can also be
>> useful to write-protect data structures of code that calls text_poke(),
>> e.g., static-keys. It can also protect data on that stack that is used
>> during text_poke() from being overwritten from another core.
>> 
>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
>> when doing write-rarely operations.
>> 
>> Right now, I think that text_poke() will keep being susceptible to such
>> an attack, unless you have a better suggestion.
> 
> A relatively simple approach might be to teach BPF not to run kprobe
> programs and such in contexts where current->mm isn't the active mm?
> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> perf_callchain_user() also already uses that. Except that a lot of
> this code is x86-specific…

I keep having in mind how to reduce the TCB that is used while text_poke()
is running, but for the specific issue here, I think your approach would
be fine, and trace_call_bpf() can be modified to do what you ask for.

But, I am not sure that relying on current->mm gets us any more security,
relatively to having another unrelated explicit kprobe-disable indication,
which is cleaner from design point-of-view.

I can see how we get “some more security” if our decision whether kprobes
should be enabled was purely based on some hardware register (e.g., CR3) and
we could unequivocally realize whether kprobes eBPF should be on/off without
memory accesses (e.g., PCID bit set). Yet, I am not sure it worth it.

What do you say?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:16                                         ` Linus Torvalds
@ 2019-02-22 23:56                                           ` Alexei Starovoitov
  2019-02-23  0:08                                             ` Linus Torvalds
  2019-02-23  4:51                                             ` Masami Hiramatsu
  0 siblings, 2 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
> 
> So a kernel pointer value of 0x12345678 could be a value kernel
> pointer pointing to some random kmalloc'ed kernel memory, and a user
> pointer value of 0x12345678 could be a valid _user_ pointer pointing
> to some user mapping.
> 
> See?
> 
> If you access a user pointer, you need to use a user accessor function
> (eg "get_user()"), while if you access a kernel pointer you need to
> just dereference it directly (unless you can't trust it, in which case
> you need to use a _different_ accessor function).

that was clear already.
Reading 0x12345678 via probe_kernel_read can return valid value
and via get_user() can return another valid value on _some_ architectures.

> The fact that user and kernel pointers happen to be distinct on x86-64
> (right now) is just a random implementation detail.

yes and my point that people already rely on this implementation detail.
Say we implement 
int bpf_probe_read(void *val, void *unsafe_ptr)
{
  if (probe_kernel_read(val, unsafe_ptr) == OK) {
     return 0;
  } else (get_user(val, unsafe_ptr) == OK) {
     return 0;
  } else {
     *val = 0;
     return -EFAULT;
  }
}
It will preserve existing bpf_probe_read() behavior on x86.
If x86 implementation changes tomorrow then progs that read user
addresses may start failing randomly because first probe_kernel_read()
will be returning random values from kernel memory and that's no good,
but at least we won't be breaking them today, so we have time to
introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.

Imo that's much better than making current bpf_probe_read() fail
on user addresses today and not providing a non disruptive path forward.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:02                                                 ` Jann Horn
  2019-02-22 23:22                                                   ` Nadav Amit
@ 2019-02-22 23:59                                                   ` Andy Lutomirski
  2019-02-23  0:03                                                     ` Alexei Starovoitov
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-22 23:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Nadav Amit, Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> > >
> > >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> > >>
> > >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> > >>>>
> > >>>> (adding some people from the text_poke series to the thread, removing stable@)
> > >>>>
> > >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> > >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> > >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > >>>>>>>>>
> > >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> > >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> > >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> > >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> > >>>>>>>>> on a kernel access to user space.
> > >>>>>>>>
> > >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > >>>>>>>> and users pass both user and kernel addresses into it and expect
> > >>>>>>>> that the helper will actually try to read from that address.
> > >>>>>>>>
> > >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> > >>>>>>>> it will break the expectations.
> > >>>>>>>> How do we solve it in bpf_probe_read?
> > >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > >>>>>>>> in the loop?
> > >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> > >>>>>>>> and shows up in their perf report.
> > >>>>>>>
> > >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> > >>>>>>> differentiate between kernel or user reads. Can this be done with
> > >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> > >>>>>>
> > >>>>>> so you're saying you will break existing kprobe scripts?
> > >>>>>> I don't think it's a good idea.
> > >>>>>> It's not acceptable to break bpf_probe_read uapi.
> > >>>>>
> > >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> > >>>>>
> > >>>>> What to do about existing scripts is a different question.
> > >>>>
> > >>>> This lack of logical separation between user and kernel addresses
> > >>>> might interact interestingly with the text_poke series, specifically
> > >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > >>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
> > >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > >>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
> > >>>> right? If someone manages to get a tracing BPF program to trigger in a
> > >>>> task that has switched to the patching mm, could they use
> > >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> > >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> > >>>> overwrite kernel text that is mapped writable in the patching mm?
> > >>>
> > >>> Yes, this is a good point. I guess text_poke() should be defined with
> > >>> “__kprobes” and open-code memcpy.
> > >>>
> > >>> Does it sound reasonable?
> > >>
> > >> Doesn't __text_poke() as implemented in the proposed patch use a
> > >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> > >> pte_clear() (which can be a call into a separate function on paravirt
> > >> kernels)?
> > >
> > > I will move the pte_clear() to be done after the poking mm was unloaded.
> > > Give me a few minutes to send a sketch of what I think should be done.
> >
> > Err.. You are right, I don’t see an easy way of preventing a kprobe from
> > being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> >
> > The reasonable solution seems to me as taking all the relevant pieces of
> > code (and data) that might be used during text-poking and encapsulating them, so they
> > will be set in a memory area which cannot be kprobe'd. This can also be
> > useful to write-protect data structures of code that calls text_poke(),
> > e.g., static-keys. It can also protect data on that stack that is used
> > during text_poke() from being overwritten from another core.
> >
> > This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> > when doing write-rarely operations.
> >
> > Right now, I think that text_poke() will keep being susceptible to such
> > an attack, unless you have a better suggestion.
>
> A relatively simple approach might be to teach BPF not to run kprobe
> programs and such in contexts where current->mm isn't the active mm?
> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> perf_callchain_user() also already uses that. Except that a lot of
> this code is x86-specific...

This sounds like exactly the right solution.  If you're running from
some unknown context (like NMI or tracing), then you should check
nmi_uaccess_okay().  I think we should just promote that to be a
non-arch-specific function (that returns true by default) and check it
the relevant bpf_probe_xyz() functions.

Alexei, does that seem reasonable?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                                   ` Andy Lutomirski
@ 2019-02-23  0:03                                                     ` Alexei Starovoitov
  2019-02-23  0:15                                                     ` Nadav Amit
  2019-02-25 13:36                                                     ` Masami Hiramatsu
  2 siblings, 0 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-23  0:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Nadav Amit, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Daniel Borkmann,
	Network Development, bpf, Rick Edgecombe, Dave Hansen,
	Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 03:59:30PM -0800, Andy Lutomirski wrote:
> >
> > A relatively simple approach might be to teach BPF not to run kprobe
> > programs and such in contexts where current->mm isn't the active mm?
> > Maybe using nmi_uaccess_okay(), or something like that? It looks like
> > perf_callchain_user() also already uses that. Except that a lot of
> > this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> the relevant bpf_probe_xyz() functions.
> 
> Alexei, does that seem reasonable?

yep. I think that should work.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:56                                           ` Alexei Starovoitov
@ 2019-02-23  0:08                                             ` Linus Torvalds
  2019-02-23  2:28                                               ` Alexei Starovoitov
  2019-02-23  4:51                                             ` Masami Hiramatsu
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2019-02-23  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> It will preserve existing bpf_probe_read() behavior on x86.

... but that's the worst possible situation.

It appears that people haven't understood that kernel and user
addresses are distinct, and may have written programs that are
fundamentally buggy.

And we _want_ to make it clear that they are buggy on x86-64, exactly
because x86-64 is the one that gets the most testing - by far.

So if x86-64 continues working for buggy programs, then that only
means that those bugs never get fixed.

It would be much better to try to get those things fixed, and make the
x86-64 implementation stricter, exactly so that people end up
_realizing_ that they can't just think "a pointer is a pointer, and
the context doesn't matter".

From a pure functional safety standpoint, I thought bpf already knew
what kind of a pointer it had?

                   Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                                   ` Andy Lutomirski
  2019-02-23  0:03                                                     ` Alexei Starovoitov
@ 2019-02-23  0:15                                                     ` Nadav Amit
  2019-02-24 19:35                                                       ` Andy Lutomirski
  2019-02-25 13:36                                                     ` Masami Hiramatsu
  2 siblings, 1 reply; 85+ messages in thread
From: Nadav Amit @ 2019-02-23  0:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Daniel Borkmann,
	Network Development, bpf, Rick Edgecombe, Dave Hansen,
	Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 3:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
>> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
>>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>>>>> 
>>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>>>>> 
>>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>>>>> 
>>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>>>>> on a kernel access to user space.
>>>>>>>>>>> 
>>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>>>>> 
>>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>>>>> it will break the expectations.
>>>>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>>>>> in the loop?
>>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>>>>> and shows up in their perf report.
>>>>>>>>>> 
>>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>>>>> 
>>>>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>>>>> I don't think it's a good idea.
>>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>>>>> 
>>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>>>>> 
>>>>>>>> What to do about existing scripts is a different question.
>>>>>>> 
>>>>>>> This lack of logical separation between user and kernel addresses
>>>>>>> might interact interestingly with the text_poke series, specifically
>>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cbab53e52cc5c4ac4419008d69921d1f1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864767879731941&amp;sdata=2tqD7udTCfNbcNLcj5SFpZt8WwK5NwtgaWMKm1Ye1EE%3D&amp;reserved=0)
>>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cbab53e52cc5c4ac4419008d69921d1f1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864767879731941&amp;sdata=7%2BLShgLKnra6xzSkxdJrCclCacfdE5IdczwScW83nuE%3D&amp;reserved=0),
>>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>>>>> task that has switched to the patching mm, could they use
>>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>>>>> overwrite kernel text that is mapped writable in the patching mm?
>>>>>> 
>>>>>> Yes, this is a good point. I guess text_poke() should be defined with
>>>>>> “__kprobes” and open-code memcpy.
>>>>>> 
>>>>>> Does it sound reasonable?
>>>>> 
>>>>> Doesn't __text_poke() as implemented in the proposed patch use a
>>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>>>>> pte_clear() (which can be a call into a separate function on paravirt
>>>>> kernels)?
>>>> 
>>>> I will move the pte_clear() to be done after the poking mm was unloaded.
>>>> Give me a few minutes to send a sketch of what I think should be done.
>>> 
>>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
>>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>>> 
>>> The reasonable solution seems to me as taking all the relevant pieces of
>>> code (and data) that might be used during text-poking and encapsulating them, so they
>>> will be set in a memory area which cannot be kprobe'd. This can also be
>>> useful to write-protect data structures of code that calls text_poke(),
>>> e.g., static-keys. It can also protect data on that stack that is used
>>> during text_poke() from being overwritten from another core.
>>> 
>>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
>>> when doing write-rarely operations.
>>> 
>>> Right now, I think that text_poke() will keep being susceptible to such
>>> an attack, unless you have a better suggestion.
>> 
>> A relatively simple approach might be to teach BPF not to run kprobe
>> programs and such in contexts where current->mm isn't the active mm?
>> Maybe using nmi_uaccess_okay(), or something like that? It looks like
>> perf_callchain_user() also already uses that. Except that a lot of
>> this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> the relevant bpf_probe_xyz() functions.

I can do that, but notice that switch_mm_irqs_off() writes to
cpu_tlbstate.loaded_mm before it actually writes to CR3. So there are still
a couple of instructions (and the load_new_mm_cr3()) in between that a
kprobe can be set on, no?

I can mark them as non-kprobable.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  0:08                                             ` Linus Torvalds
@ 2019-02-23  2:28                                               ` Alexei Starovoitov
  2019-02-23  2:32                                                 ` Linus Torvalds
  2019-02-23  3:02                                                 ` Steven Rostedt
  0 siblings, 2 replies; 85+ messages in thread
From: Alexei Starovoitov @ 2019-02-23  2:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 04:08:59PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > It will preserve existing bpf_probe_read() behavior on x86.
> 
> ... but that's the worst possible situation.
> 
> It appears that people haven't understood that kernel and user
> addresses are distinct, and may have written programs that are
> fundamentally buggy.
> 
> And we _want_ to make it clear that they are buggy on x86-64, exactly
> because x86-64 is the one that gets the most testing - by far.
> 
> So if x86-64 continues working for buggy programs, then that only
> means that those bugs never get fixed.
> 
> It would be much better to try to get those things fixed, and make the
> x86-64 implementation stricter, exactly so that people end up
> _realizing_ that they can't just think "a pointer is a pointer, and
> the context doesn't matter".
>
> From a pure functional safety standpoint, I thought bpf already knew
> what kind of a pointer it had?

when bpf verifier knows the type of pointer it allows direct access to it.
That's the case for skb, socket, packet data, hash maps, arrays, stack, etc.
Networking progs cannot call bpf_probe_read().
It's available to tracing progs only and their goal is to walk the kernel and
user memory with addresses that cannot be statically verified
at program load time.
We are working on adding type information (BTF) to vmlinux.
Soon we'll be able to tell the type of every kernel function argument.
Right now arg1 and arg2 in a kprobed function are just u64 pt_regs->di, si.
Soon we'll be able to precisely identify their C type.

I completely agree on the direction that x86 is the architecture that
sets an example and users need to learn the difference in pointers.
I only disagree on timing.
Right now users don't have an alternative.
In our repo I counted ~400 calls to bpf_probe_read and about 10 times more
'indirect' calls. What's happening with 'indirect' calls is BCC toolchain
using clang to automatically replace struct_a->field_foo access with
bpf_probe_read(struct_a + offsetof(typeof(struct_a), field_foo));

If we had __user vs __kernel annotation available to clang we could have
taught BCC to replace this '->' dereference with appropriate kernel vs user
helper. Also we need to teach GCC to recognize __user and store into dwarf,
so we can take it further into BTF and verify later.

Also I think disallowing bpf_probe_read() to read user pointer will not
make a desired teaching effect. It will only cause painful debugging to folks
when their progs will stop working. It's better to remove bpf_probe_read()
completely.
imo the process of teaching the users of kernel vs user pointer difference
needs to be gradual.
First we introduce bpf_probe_kernel_read and bpf_probe_user_read and
introduce clang/gcc tooling to catch the mistakes.
Going over this 400+ places and manually grepping kernel sources
for __user keyword is not a great proposal if we want to keep those users.
Once we have this working we can remove bpf_probe_read() altogether.
Rejecting bpf prog at load time is a clear signal that user has to fix it
(instead of changing run-time behavior).
When the verifier gets even smarter it could potentially replace prob_read
with probe_kernel_read and probe_user_read when it has that type info.

imo this kernel release should finish as-is and in the next cycle we can
change probe_kernel_read() to reject user address, have temporary
workaround in bpf_probe_read() with probe_kernel_read+get_user hack,
introduce new bpf helpers, new tooling and eventually remove
buggy bpf_probe_read.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  2:28                                               ` Alexei Starovoitov
@ 2019-02-23  2:32                                                 ` Linus Torvalds
  2019-02-23  3:02                                                 ` Steven Rostedt
  1 sibling, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2019-02-23  2:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 6:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> imo this kernel release should finish as-is and in the next cycle we can
> change probe_kernel_read() to reject user address [..]

Absolutely. Nothing is going to change right now for 5.0, which is imminent.

It's really a "long-term we really need to fix this", where the only
question is how soon "long-term" is.

                   Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  2:28                                               ` Alexei Starovoitov
  2019-02-23  2:32                                                 ` Linus Torvalds
@ 2019-02-23  3:02                                                 ` Steven Rostedt
  1 sibling, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2019-02-23  3:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, 22 Feb 2019 18:28:53 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> First we introduce bpf_probe_kernel_read and bpf_probe_user_read and
> introduce clang/gcc tooling to catch the mistakes.
> Going over this 400+ places and manually grepping kernel sources
> for __user keyword is not a great proposal if we want to keep those users.
> Once we have this working we can remove bpf_probe_read() altogether.
> Rejecting bpf prog at load time is a clear signal that user has to fix it
> (instead of changing run-time behavior).
> When the verifier gets even smarter it could potentially replace prob_read
> with probe_kernel_read and probe_user_read when it has that type info.

I was about to suggest this approach. Document that bpf_probe_read() is
known to be buggy and will be deprecated in the future, and that all
new bpf scripts should start using bpf_probe_kernel/user_read()
instead (after they have been implemented of course). And give time for
people to fix their current scripts.

Perhaps in the near future, trigger some kind of warning for users that
use bpf_probe_read().

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 17:43                             ` Linus Torvalds
  2019-02-22 17:48                               ` Andy Lutomirski
  2019-02-22 19:27                               ` Alexei Starovoitov
@ 2019-02-23  3:47                               ` Masami Hiramatsu
  2019-02-24  0:44                                 ` Steven Rostedt
  2 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-23  3:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Fri, 22 Feb 2019 09:43:14 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 22, 2019 at 12:35 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > Or, can we do this?
> >
> > long __probe_user_read(void *dst, const void *src, size_t size)
> > {
> 
> Add a
> 
>         if (!access_ok(src, size))
>                 ret = -EFAULT;
>         else {
> .. do the pagefault_disable() etc ..
>         }

Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
(only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)

In arch/x86/include/asm/uaccess.h:

#define access_ok(addr, size)                                   \
({                                                                      \
        WARN_ON_IN_IRQ();                                               \
        likely(!__range_not_ok(addr, size, user_addr_max()));           \
})

Do we need acccess_ok_inatomic()?

BTW, it seems a bit strange that this WARN_ON_IN_IRQ() is only in x86
access_ok() implementation, since CONFIG_DEBUG_ATOMIC_SLEEP(which defines
WARN_ON_IN_IRQ) doesn't depend on x86, and access_ok() is widely used
in kernel. I think it would be better that each arch provides __access_ok()
and include/linux/uaccess.h provides access_ok() with WARN_ON_IN_IRQ().


> to after the "set_fs()", and it looks good to me. Make it clear that
> yes, this works _only_ for user reads.
> 
> Adn that makes all the games with "kernel_uaccess_faults_ok"
> pointless, so you can just remove them.

OK.

> 
> (note that the "access_ok()" has to come after we've done "set_fs()",
> because it takes the address limit from that).
> 
> Also, since normally we'd expect that we already have USER_DS, it
> might be worthwhile to do this with a wrapper, something along the
> lines of
> 
>         mm_segment_t old_fs = get_fs();
> 
>         if (segment_eq(old_fs, USER_DS))
>               return __normal_probe_user_read();
>         set_fs(USER_DS);
>         ret = __normal_probe_user_read();
>         set_fs(old_fs);
>         return ret;
> 
> and have that __normal_probe_user_read() just do
> 
>         if (!access_ok(src, size))
>                 return -EFAULT;
>         pagefault_disable();
>         ret = __copy_from_user_inatomic(dst, ...);
>         pagefault_enable();
>         return ret ? -EFAULT : 0;
> 
> which looks more obvious.

OK.

> 
> Also, I would suggest that you just make the argument type be "const
> void __user *", since the whole point is that this takes a user
> pointer, and nothing else.

Ah, right.

> Then we should still probably fix up "__probe_kernel_read()" to not
> allow user accesses. The easiest way to do that is actually likely to
> use the "unsafe_get_user()" functions *without* doing a
> uaccess_begin(), which will mean that modern CPU's will simply fault
> on a kernel access to user space.

Or, use __chk_user_ptr(ptr) to check it?

Thank you,

> 
> The nice thing about that is that usually developers will have access
> to exactly those modern boxes, so the people who notice that it
> doesn't work are the right people.
> 
> Alternatively, we should just make it be architecture-specific, so
> that architectures can decide "this address cannot be a kernel
> address" and refuse to do it.
> 
>                 Linus


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:56                                           ` Alexei Starovoitov
  2019-02-23  0:08                                             ` Linus Torvalds
@ 2019-02-23  4:51                                             ` Masami Hiramatsu
  1 sibling, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-23  4:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, 22 Feb 2019 15:56:20 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
> > 
> > So a kernel pointer value of 0x12345678 could be a value kernel
> > pointer pointing to some random kmalloc'ed kernel memory, and a user
> > pointer value of 0x12345678 could be a valid _user_ pointer pointing
> > to some user mapping.
> > 
> > See?
> > 
> > If you access a user pointer, you need to use a user accessor function
> > (eg "get_user()"), while if you access a kernel pointer you need to
> > just dereference it directly (unless you can't trust it, in which case
> > you need to use a _different_ accessor function).
> 
> that was clear already.
> Reading 0x12345678 via probe_kernel_read can return valid value
> and via get_user() can return another valid value on _some_ architectures.
> 
> > The fact that user and kernel pointers happen to be distinct on x86-64
> > (right now) is just a random implementation detail.
> 
> yes and my point that people already rely on this implementation detail.
> Say we implement 
> int bpf_probe_read(void *val, void *unsafe_ptr)
> {
>   if (probe_kernel_read(val, unsafe_ptr) == OK) {
>      return 0;
>   } else (get_user(val, unsafe_ptr) == OK) {
>      return 0;
>   } else {
>      *val = 0;
>      return -EFAULT;
>   }
> }

Note that we can not use get_user() form kprobe handler. If you use it,
you have to prepare fault_handler() and make bpf itself can be aborted.
So, maybe you can use probe_user_read().

Hmm, however, it still doesn't work correctly on "some" architecture,
since whether a pointer (address) points user-space or kernel-space
depends on the context. In kprobe/bpf, the context means where you
put the probe and which pointer you record.

I think only "__user" tag tells us which one is user-space. But 
unfortunately, that "__user" tag is only for compiler or checker, not
for runtime binary. Such useful attribute goes away when we execute it.

So, even if we introduce "ustring", ftrace/perf users has to decide to use
it by themselves. As far as I know, DWARF(debuginfo) also doesn't have
that attribute. So perf-probe can not help it from debuginfo.
(Maybe if we introduce C parser, it might be detected...)

> It will preserve existing bpf_probe_read() behavior on x86.
> If x86 implementation changes tomorrow then progs that read user
> addresses may start failing randomly because first probe_kernel_read()
> will be returning random values from kernel memory and that's no good,
> but at least we won't be breaking them today, so we have time to
> introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.

I see. I think bpf also has to introduce new bpf_probe_read_user() and
keep bpf_probe_read() for kernel dataa only.

> Imo that's much better than making current bpf_probe_read() fail
> on user addresses today and not providing a non disruptive path forward.

Agreed.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-21  7:52   ` Masami Hiramatsu
  2019-02-21 14:36     ` Steven Rostedt
@ 2019-02-23 14:48     ` Masami Hiramatsu
  1 sibling, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-23 14:48 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, stable, Changbin Du

Hi,

On Thu, 21 Feb 2019 16:52:52 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> > ---
> >  kernel/trace/trace_kprobe.c | 10 +---------
> >  1 file changed, 1 insertion(+), 9 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index d5fb09ebba8b..9eaf07f99212 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -861,22 +861,14 @@ static const struct file_operations kprobe_profile_ops = {
> >  static nokprobe_inline int
> >  fetch_store_strlen(unsigned long addr)
> >  {
> > -	mm_segment_t old_fs;
> >  	int ret, len = 0;
> >  	u8 c;
> >  
> > -	old_fs = get_fs();
> > -	set_fs(KERNEL_DS);
> > -	pagefault_disable();
> > -

BTW, compared with probe_kernel_read() implementation, this function
lacks current->kernel_uaccess_faults_ok modification here.

I would like to know whether we can avoid this issue if we tweak this flag.

Thank you,

> >  	do {
> > -		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
> > +		ret = probe_mem_read(&c, (u8 *)addr + len, 1);
> >  		len++;
> >  	} while (c && ret == 0 && len < MAX_STRING_SIZE);
> >  
> > -	pagefault_enable();
> > -	set_fs(old_fs);
> > -
> >  	return (ret < 0) ? ret : len;
> >  }
> >  
> > -- 
> > 2.20.1
> > 
> > 
> 
> 
> -- 
> Masami Hiramatsu <mhiramat@kernel.org>


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  3:47                               ` Masami Hiramatsu
@ 2019-02-24  0:44                                 ` Steven Rostedt
  2019-02-24  4:38                                   ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-24  0:44 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Sat, 23 Feb 2019 12:47:46 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
> (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)

Is it really IRQ context or exception context? That is, one
(interrupts) happen for any task, but exceptions happen because of the
software that is executed (like a breakpoint). Although you can have a
kprobe trigger in an interrupt handler (where user access wouldn't make
sense anyway). But there should be no problem with user access from an
exception handler.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24  0:44                                 ` Steven Rostedt
@ 2019-02-24  4:38                                   ` Andy Lutomirski
  2019-02-24 15:17                                     ` Masami Hiramatsu
  2019-02-25 16:48                                     ` Kees Cook
  0 siblings, 2 replies; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-24  4:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Linus Torvalds, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski

On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sat, 23 Feb 2019 12:47:46 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> > Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
> > (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
>
> Is it really IRQ context or exception context? That is, one
> (interrupts) happen for any task, but exceptions happen because of the
> software that is executed (like a breakpoint). Although you can have a
> kprobe trigger in an interrupt handler (where user access wouldn't make
> sense anyway). But there should be no problem with user access from an
> exception handler.
>

Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
as far as I know.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24  4:38                                   ` Andy Lutomirski
@ 2019-02-24 15:17                                     ` Masami Hiramatsu
  2019-02-24 17:26                                       ` Linus Torvalds
  2019-02-25 16:48                                     ` Kees Cook
  1 sibling, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-24 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Masami Hiramatsu, Linus Torvalds,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Peter Zijlstra

On Sat, 23 Feb 2019 20:38:03 -0800
Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Sat, 23 Feb 2019 12:47:46 +0900
> > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > > Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
> > > (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
> >
> > Is it really IRQ context or exception context? That is, one
> > (interrupts) happen for any task, but exceptions happen because of the
> > software that is executed (like a breakpoint). Although you can have a
> > kprobe trigger in an interrupt handler (where user access wouldn't make
> > sense anyway). But there should be no problem with user access from an
> > exception handler.
> >
> 
> Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> as far as I know.

Hmm, which might_sleep() would you pointed? What I talked was a 
WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just
checks preempt_count.

I guess PeterZ assumed that access_ok() is used only with user space access
APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might
sleep :)), but now we are trying to use access_ok() with new functions which
disables page-fault and just return -EFAULT.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24 15:17                                     ` Masami Hiramatsu
@ 2019-02-24 17:26                                       ` Linus Torvalds
  2019-02-25  2:40                                         ` Masami Hiramatsu
  2019-02-25  8:33                                         ` Peter Zijlstra
  0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2019-02-24 17:26 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andy Lutomirski, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Peter Zijlstra

On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Sat, 23 Feb 2019 20:38:03 -0800
> Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> > as far as I know.
>
> Hmm, which might_sleep() would you pointed? What I talked was a
> WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just
> checks preempt_count.

So the in_task() check does kind of make sense. Using "access_ok()"
outside of task context is certainly an odd thing, for several
reasons. The main one being simply that outside of task context, the
whole "which task" question is open, and you don't know if the task is
the active one, and so it's not clear if whatever task you interrupt
might have done "set_fs()" or not.

So PeterZ isn't wrong:

> I guess PeterZ assumed that access_ok() is used only with user space access
> APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might
> sleep :)), but now we are trying to use access_ok() with new functions which
> disables page-fault and just return -EFAULT.

.. but in this case, if we do it all *within* code that saves and
restores the user access flag with get_fs/set_fs, access_ok() would be
ok and it doesn't have the above issue.

So access_ok() in _general_ is absolutely not safe to do from
interrupts, but within the context of probing user memory from a
tracing event it just happens to be ok.

It would be lovely to have a special macro for this, and keep the
warning for the general case, but because this is a "every
architecture needs to build their own" it's probably too painful.

PeterZ, do you remember the particular use case that triggered that
commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok()
context")?

                    Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  0:15                                                     ` Nadav Amit
@ 2019-02-24 19:35                                                       ` Andy Lutomirski
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-24 19:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Jann Horn, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Sat, Feb 23, 2019 at 12:30 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Feb 22, 2019, at 3:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
> >> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>
> >>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> >>>>>
> >>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> >>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >>>>>>>
> >>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
> >>>>>>>
> >>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>>>>>>>> on a kernel access to user space.
> >>>>>>>>>>>
> >>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>>>>>>>> and users pass both user and kernel addresses into it and expect
> >>>>>>>>>>> that the helper will actually try to read from that address.
> >>>>>>>>>>>
> >>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>>>>>>>> it will break the expectations.
> >>>>>>>>>>> How do we solve it in bpf_probe_read?
> >>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>>>>>>>> in the loop?
> >>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>>>>>>>> and shows up in their perf report.
> >>>>>>>>>>
> >>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
> >>>>>>>>>> differentiate between kernel or user reads. Can this be done with
> >>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>>>>>>>
> >>>>>>>>> so you're saying you will break existing kprobe scripts?
> >>>>>>>>> I don't think it's a good idea.
> >>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
> >>>>>>>>
> >>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>>>>>>>
> >>>>>>>> What to do about existing scripts is a different question.
> >>>>>>>
> >>>>>>> This lack of logical separation between user and kernel addresses
> >>>>>>> might interact interestingly with the text_poke series, specifically
> >>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> >>>>>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cbab53e52cc5c4ac4419008d69921d1f1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864767879731941&amp;sdata=2tqD7udTCfNbcNLcj5SFpZt8WwK5NwtgaWMKm1Ye1EE%3D&amp;reserved=0)
> >>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> >>>>>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cbab53e52cc5c4ac4419008d69921d1f1%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864767879731941&amp;sdata=7%2BLShgLKnra6xzSkxdJrCclCacfdE5IdczwScW83nuE%3D&amp;reserved=0),
> >>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
> >>>>>>> task that has switched to the patching mm, could they use
> >>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> >>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> >>>>>>> overwrite kernel text that is mapped writable in the patching mm?
> >>>>>>
> >>>>>> Yes, this is a good point. I guess text_poke() should be defined with
> >>>>>> “__kprobes” and open-code memcpy.
> >>>>>>
> >>>>>> Does it sound reasonable?
> >>>>>
> >>>>> Doesn't __text_poke() as implemented in the proposed patch use a
> >>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
> >>>>> pte_clear() (which can be a call into a separate function on paravirt
> >>>>> kernels)?
> >>>>
> >>>> I will move the pte_clear() to be done after the poking mm was unloaded.
> >>>> Give me a few minutes to send a sketch of what I think should be done.
> >>>
> >>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
> >>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> >>>
> >>> The reasonable solution seems to me as taking all the relevant pieces of
> >>> code (and data) that might be used during text-poking and encapsulating them, so they
> >>> will be set in a memory area which cannot be kprobe'd. This can also be
> >>> useful to write-protect data structures of code that calls text_poke(),
> >>> e.g., static-keys. It can also protect data on that stack that is used
> >>> during text_poke() from being overwritten from another core.
> >>>
> >>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> >>> when doing write-rarely operations.
> >>>
> >>> Right now, I think that text_poke() will keep being susceptible to such
> >>> an attack, unless you have a better suggestion.
> >>
> >> A relatively simple approach might be to teach BPF not to run kprobe
> >> programs and such in contexts where current->mm isn't the active mm?
> >> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> >> perf_callchain_user() also already uses that. Except that a lot of
> >> this code is x86-specific...
> >
> > This sounds like exactly the right solution.  If you're running from
> > some unknown context (like NMI or tracing), then you should check
> > nmi_uaccess_okay().  I think we should just promote that to be a
> > non-arch-specific function (that returns true by default) and check it
> > the relevant bpf_probe_xyz() functions.
>
> I can do that, but notice that switch_mm_irqs_off() writes to
> cpu_tlbstate.loaded_mm before it actually writes to CR3. So there are still
> a couple of instructions (and the load_new_mm_cr3()) in between that a
> kprobe can be set on, no?

But you can't mark then as no-nmi :)  See the comment in
nmi_uaccess_ok() -- the code is intended to work correctly during this
window.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24 17:26                                       ` Linus Torvalds
@ 2019-02-25  2:40                                         ` Masami Hiramatsu
  2019-02-25  4:49                                           ` Andy Lutomirski
  2019-02-25  8:33                                         ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-25  2:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Peter Zijlstra

On Sun, 24 Feb 2019 09:26:45 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Sat, 23 Feb 2019 20:38:03 -0800
> > Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> > > as far as I know.
> >
> > Hmm, which might_sleep() would you pointed? What I talked was a
> > WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just
> > checks preempt_count.
> 
> So the in_task() check does kind of make sense. Using "access_ok()"
> outside of task context is certainly an odd thing, for several
> reasons. The main one being simply that outside of task context, the
> whole "which task" question is open, and you don't know if the task is
> the active one, and so it's not clear if whatever task you interrupt
> might have done "set_fs()" or not.

Ah I got it. Usual case access_ok() in IRQ handler is strange.

> 
> So PeterZ isn't wrong:
> 
> > I guess PeterZ assumed that access_ok() is used only with user space access
> > APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might
> > sleep :)), but now we are trying to use access_ok() with new functions which
> > disables page-fault and just return -EFAULT.
> 
> .. but in this case, if we do it all *within* code that saves and
> restores the user access flag with get_fs/set_fs, access_ok() would be
> ok and it doesn't have the above issue.
> 
> So access_ok() in _general_ is absolutely not safe to do from
> interrupts, but within the context of probing user memory from a
> tracing event it just happens to be ok.

Hmm, but user can specify user-memory access from the tracing event
which is located in interrupt handler. So I understand that it is safe
only if we correctly setup access flag with get_fs/set_fs, is that
correct?

> It would be lovely to have a special macro for this, and keep the
> warning for the general case, but because this is a "every
> architecture needs to build their own" it's probably too painful.

Agreed.

> 
> PeterZ, do you remember the particular use case that triggered that
> commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok()
> context")?
> 
>                     Linus


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25  2:40                                         ` Masami Hiramatsu
@ 2019-02-25  4:49                                           ` Andy Lutomirski
  2019-02-25  8:09                                             ` Masami Hiramatsu
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-25  4:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Peter Zijlstra

On Sun, Feb 24, 2019 at 6:40 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Sun, 24 Feb 2019 09:26:45 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Sat, 23 Feb 2019 20:38:03 -0800
> > > Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> > > > as far as I know.
> > >
> > > Hmm, which might_sleep() would you pointed? What I talked was a
> > > WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just
> > > checks preempt_count.
> >
> > So the in_task() check does kind of make sense. Using "access_ok()"
> > outside of task context is certainly an odd thing, for several
> > reasons. The main one being simply that outside of task context, the
> > whole "which task" question is open, and you don't know if the task is
> > the active one, and so it's not clear if whatever task you interrupt
> > might have done "set_fs()" or not.
>
> Ah I got it. Usual case access_ok() in IRQ handler is strange.
>
> >
> > So PeterZ isn't wrong:
> >
> > > I guess PeterZ assumed that access_ok() is used only with user space access
> > > APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might
> > > sleep :)), but now we are trying to use access_ok() with new functions which
> > > disables page-fault and just return -EFAULT.
> >
> > .. but in this case, if we do it all *within* code that saves and
> > restores the user access flag with get_fs/set_fs, access_ok() would be
> > ok and it doesn't have the above issue.
> >
> > So access_ok() in _general_ is absolutely not safe to do from
> > interrupts, but within the context of probing user memory from a
> > tracing event it just happens to be ok.
>
> Hmm, but user can specify user-memory access from the tracing event
> which is located in interrupt handler. So I understand that it is safe
> only if we correctly setup access flag with get_fs/set_fs, is that
> correct?
>
> > It would be lovely to have a special macro for this, and keep the
> > warning for the general case, but because this is a "every
> > architecture needs to build their own" it's probably too painful.
>
> Agreed.

This should probably go with whatever effort makes nmi_uaccess_ok()
available on all architectures.  That being said, how about just
making copy_from_user_nmi() work on all architectures, even if it just
fails unconditionally on some of them?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25  4:49                                           ` Andy Lutomirski
@ 2019-02-25  8:09                                             ` Masami Hiramatsu
  2019-02-25 16:40                                               ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-25  8:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Peter Zijlstra

On Sun, 24 Feb 2019 20:49:45 -0800
Andy Lutomirski <luto@kernel.org> wrote:

> On Sun, Feb 24, 2019 at 6:40 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Sun, 24 Feb 2019 09:26:45 -0800
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > On Sun, Feb 24, 2019 at 7:18 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > >
> > > > On Sat, 23 Feb 2019 20:38:03 -0800
> > > > Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> > > > > as far as I know.
> > > >
> > > > Hmm, which might_sleep() would you pointed? What I talked was a
> > > > WARN_ON_ONCE(!in_task()) in access_ok() on x86 (only!), and in_task() just
> > > > checks preempt_count.
> > >
> > > So the in_task() check does kind of make sense. Using "access_ok()"
> > > outside of task context is certainly an odd thing, for several
> > > reasons. The main one being simply that outside of task context, the
> > > whole "which task" question is open, and you don't know if the task is
> > > the active one, and so it's not clear if whatever task you interrupt
> > > might have done "set_fs()" or not.
> >
> > Ah I got it. Usual case access_ok() in IRQ handler is strange.
> >
> > >
> > > So PeterZ isn't wrong:
> > >
> > > > I guess PeterZ assumed that access_ok() is used only with user space access
> > > > APIs (e.g. copy_from_user) which can cause page-fault and locks mm (and might
> > > > sleep :)), but now we are trying to use access_ok() with new functions which
> > > > disables page-fault and just return -EFAULT.
> > >
> > > .. but in this case, if we do it all *within* code that saves and
> > > restores the user access flag with get_fs/set_fs, access_ok() would be
> > > ok and it doesn't have the above issue.
> > >
> > > So access_ok() in _general_ is absolutely not safe to do from
> > > interrupts, but within the context of probing user memory from a
> > > tracing event it just happens to be ok.
> >
> > Hmm, but user can specify user-memory access from the tracing event
> > which is located in interrupt handler. So I understand that it is safe
> > only if we correctly setup access flag with get_fs/set_fs, is that
> > correct?
> >
> > > It would be lovely to have a special macro for this, and keep the
> > > warning for the general case, but because this is a "every
> > > architecture needs to build their own" it's probably too painful.
> >
> > Agreed.
> 
> This should probably go with whatever effort makes nmi_uaccess_ok()
> available on all architectures.  That being said, how about just
> making copy_from_user_nmi() work on all architectures, even if it just
> fails unconditionally on some of them?

I think even if we have copy_from_user_nmi(), we need something like
nmi_uaccess_ok() because without it we can not correctly use
__copy_from_user_inatomic()...


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24 17:26                                       ` Linus Torvalds
  2019-02-25  2:40                                         ` Masami Hiramatsu
@ 2019-02-25  8:33                                         ` Peter Zijlstra
  2019-02-25 14:52                                           ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2019-02-25  8:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Andy Lutomirski, Steven Rostedt,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook

On Sun, Feb 24, 2019 at 09:26:45AM -0800, Linus Torvalds wrote:
> PeterZ, do you remember the particular use case that triggered that
> commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok()
> context")?

This one, if I'm not mistaken.

---

commit ae31fe51a3cceaa0cabdb3058f69669ecb47f12e
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue Nov 22 10:57:42 2016 +0100

    perf/x86: Restore TASK_SIZE check on frame pointer
    
    The following commit:
    
      75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses")
    
    ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual
    access_ok() check.
    
    Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE,
    whereas the access_ok() uses whatever the current address limit of the task is.
    
    We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and
    then see vmalloc faults when we access what looks like pointers into vmalloc
    space:
    
      [] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290
      [] CPU: 3 PID: 3685731 Comm: sh Tainted: G        W       4.6.0-5_fbk1_223_gdbf0f40 #1
      [] Call Trace:
      []  <NMI>  [<ffffffff814717d1>] dump_stack+0x4d/0x6c
      []  [<ffffffff81076e43>] __warn+0xd3/0xf0
      []  [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20
      []  [<ffffffff8104a899>] vmalloc_fault+0x289/0x290
      []  [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490
      []  [<ffffffff8104b70c>] do_page_fault+0xc/0x10
      []  [<ffffffff81794e82>] page_fault+0x22/0x30
      []  [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0
      []  [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190
      []  [<ffffffff811512c7>] perf_callchain+0x67/0x80
      []  [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370
      []  [<ffffffff8114e840>] perf_event_output+0x20/0x60
      []  [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130
      []  [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0
      []  [<ffffffff8114f484>] perf_event_overflow+0x14/0x20
      []  [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490
      []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
      []  [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0
      []  [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20
      []  [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0
      []  [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20
      []  [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50
      []  [<ffffffff8101ea31>] nmi_handle+0x61/0x110
      []  [<ffffffff8101ef94>] default_do_nmi+0x44/0x110
      []  [<ffffffff8101f13b>] do_nmi+0xdb/0x150
      []  [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e
      []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
      []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
      []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
      []  <<EOE>>  <IRQ>  [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0
    
    Fix this by moving the valid_user_frame() check to before the uaccess
    that loads the return address and the pointer to the next frame.
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: linux-kernel@vger.kernel.org
    Fixes: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses")
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d31735f37ed7..9d4bf3ab049e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2352,7 +2352,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 		frame.next_frame     = 0;
 		frame.return_address = 0;
 
-		if (!access_ok(VERIFY_READ, fp, 8))
+		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
 
 		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4);
@@ -2362,9 +2362,6 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 		if (bytes != 0)
 			break;
 
-		if (!valid_user_frame(fp, sizeof(frame)))
-			break;
-
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
@@ -2413,7 +2410,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		frame.next_frame	     = NULL;
 		frame.return_address = 0;
 
-		if (!access_ok(VERIFY_READ, fp, sizeof(*fp) * 2))
+		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
 
 		bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
@@ -2423,9 +2420,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		if (bytes != 0)
 			break;
 
-		if (!valid_user_frame(fp, sizeof(frame)))
-			break;
-
 		perf_callchain_store(entry, frame.return_address);
 		fp = (void __user *)frame.next_frame;
 	}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                                   ` Andy Lutomirski
  2019-02-23  0:03                                                     ` Alexei Starovoitov
  2019-02-23  0:15                                                     ` Nadav Amit
@ 2019-02-25 13:36                                                     ` Masami Hiramatsu
  2 siblings, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-25 13:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Nadav Amit, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, 22 Feb 2019 15:59:30 -0800
Andy Lutomirski <luto@kernel.org> wrote:

> On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > > > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> > > >
> > > >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> > > >>
> > > >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > > >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> > > >>>>
> > > >>>> (adding some people from the text_poke series to the thread, removing stable@)
> > > >>>>
> > > >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > > >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> > > >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> > > >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> > > >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> > > >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> > > >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> > > >>>>>>>>> on a kernel access to user space.
> > > >>>>>>>>
> > > >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > > >>>>>>>> and users pass both user and kernel addresses into it and expect
> > > >>>>>>>> that the helper will actually try to read from that address.
> > > >>>>>>>>
> > > >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> > > >>>>>>>> it will break the expectations.
> > > >>>>>>>> How do we solve it in bpf_probe_read?
> > > >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > > >>>>>>>> in the loop?
> > > >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> > > >>>>>>>> and shows up in their perf report.
> > > >>>>>>>
> > > >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> > > >>>>>>> differentiate between kernel or user reads. Can this be done with
> > > >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> > > >>>>>>
> > > >>>>>> so you're saying you will break existing kprobe scripts?
> > > >>>>>> I don't think it's a good idea.
> > > >>>>>> It's not acceptable to break bpf_probe_read uapi.
> > > >>>>>
> > > >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> > > >>>>>
> > > >>>>> What to do about existing scripts is a different question.
> > > >>>>
> > > >>>> This lack of logical separation between user and kernel addresses
> > > >>>> might interact interestingly with the text_poke series, specifically
> > > >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > > >>>> patching" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&amp;reserved=0)
> > > >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > > >>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
> > > >>>> right? If someone manages to get a tracing BPF program to trigger in a
> > > >>>> task that has switched to the patching mm, could they use
> > > >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> > > >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> > > >>>> overwrite kernel text that is mapped writable in the patching mm?
> > > >>>
> > > >>> Yes, this is a good point. I guess text_poke() should be defined with
> > > >>> “__kprobes” and open-code memcpy.
> > > >>>
> > > >>> Does it sound reasonable?
> > > >>
> > > >> Doesn't __text_poke() as implemented in the proposed patch use a
> > > >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> > > >> pte_clear() (which can be a call into a separate function on paravirt
> > > >> kernels)?
> > > >
> > > > I will move the pte_clear() to be done after the poking mm was unloaded.
> > > > Give me a few minutes to send a sketch of what I think should be done.
> > >
> > > Err.. You are right, I don’t see an easy way of preventing a kprobe from
> > > being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> > >
> > > The reasonable solution seems to me as taking all the relevant pieces of
> > > code (and data) that might be used during text-poking and encapsulating them, so they
> > > will be set in a memory area which cannot be kprobe'd. This can also be
> > > useful to write-protect data structures of code that calls text_poke(),
> > > e.g., static-keys. It can also protect data on that stack that is used
> > > during text_poke() from being overwritten from another core.
> > >
> > > This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> > > when doing write-rarely operations.
> > >
> > > Right now, I think that text_poke() will keep being susceptible to such
> > > an attack, unless you have a better suggestion.
> >
> > A relatively simple approach might be to teach BPF not to run kprobe
> > programs and such in contexts where current->mm isn't the active mm?
> > Maybe using nmi_uaccess_okay(), or something like that? It looks like
> > perf_callchain_user() also already uses that. Except that a lot of
> > this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> the relevant bpf_probe_xyz() functions.

This treat may also need for my work, like probe_user_read() we should
fail if nmi_uaccess_okay().

Thank you,

> 
> Alexei, does that seem reasonable?


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25  8:33                                         ` Peter Zijlstra
@ 2019-02-25 14:52                                           ` Peter Zijlstra
  0 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2019-02-25 14:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Andy Lutomirski, Steven Rostedt,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook

On Mon, Feb 25, 2019 at 09:33:09AM +0100, Peter Zijlstra wrote:
> On Sun, Feb 24, 2019 at 09:26:45AM -0800, Linus Torvalds wrote:
> > PeterZ, do you remember the particular use case that triggered that
> > commit 7c4788950ba5 ("x86/uaccess, sched/preempt: Verify access_ok()
> > context")?
> 
> This one, if I'm not mistaken.
> 
> ---
> 
> commit ae31fe51a3cceaa0cabdb3058f69669ecb47f12e
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Tue Nov 22 10:57:42 2016 +0100
> 
>     perf/x86: Restore TASK_SIZE check on frame pointer
>     
>     The following commit:
>     
>       75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses")
>     
>     ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual
>     access_ok() check.
>     
>     Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE,
>     whereas the access_ok() uses whatever the current address limit of the task is.
>     
>     We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and
>     then see vmalloc faults when we access what looks like pointers into vmalloc
>     space:

Also note that this was before we did:

commit 88b0193d9418c00340e45e0a913a0813bc6c8c96
Author: Will Deacon <will.deacon@arm.com>
Date:   Tue May 9 18:00:04 2017 +0100

    perf/callchain: Force USER_DS when invoking perf_callchain_user()
    
    Perf can generate and record a user callchain in response to a synchronous
    request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS),
    then we can end up walking the user stack (and dereferencing/saving whatever we
    find there) without the protections usually afforded by checks such as
    access_ok.
    
    Rather than play whack-a-mole with each architecture's stack unwinding
    implementation, fix the root of the problem by ensuring that we force USER_DS
    when invoking perf_callchain_user from the perf core.
    
    Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index c04917cad1bf..1b2be63c8528 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -229,12 +229,18 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 		}
 
 		if (regs) {
+			mm_segment_t fs;
+
 			if (crosstask)
 				goto exit_put;
 
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+
+			fs = get_fs();
+			set_fs(USER_DS);
 			perf_callchain_user(&ctx, regs);
+			set_fs(fs);
 		}
 	}
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25  8:09                                             ` Masami Hiramatsu
@ 2019-02-25 16:40                                               ` Steven Rostedt
  2019-02-26  1:35                                                 ` Masami Hiramatsu
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2019-02-25 16:40 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andy Lutomirski, Linus Torvalds, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Peter Zijlstra

On Mon, 25 Feb 2019 17:09:45 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> > This should probably go with whatever effort makes nmi_uaccess_ok()
> > available on all architectures.  That being said, how about just
> > making copy_from_user_nmi() work on all architectures, even if it just
> > fails unconditionally on some of them?  
> 
> I think even if we have copy_from_user_nmi(), we need something like
> nmi_uaccess_ok() because without it we can not correctly use
> __copy_from_user_inatomic()...

But wouldn't that just be part of the implementation of
"copy_from_user_nmi()" as being in an NMI just assumes being inatomic?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-24  4:38                                   ` Andy Lutomirski
  2019-02-24 15:17                                     ` Masami Hiramatsu
@ 2019-02-25 16:48                                     ` Kees Cook
  2019-02-25 16:58                                       ` Andy Lutomirski
  1 sibling, 1 reply; 85+ messages in thread
From: Kees Cook @ 2019-02-25 16:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Masami Hiramatsu, Linus Torvalds,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn

On Sat, Feb 23, 2019 at 8:38 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Sat, 23 Feb 2019 12:47:46 +0900
> > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > > Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
> > > (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
> >
> > Is it really IRQ context or exception context? That is, one
> > (interrupts) happen for any task, but exceptions happen because of the
> > software that is executed (like a breakpoint). Although you can have a
> > kprobe trigger in an interrupt handler (where user access wouldn't make
> > sense anyway). But there should be no problem with user access from an
> > exception handler.
> >
>
> Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> as far as I know.

We do need to be aware of the userfaultfd case of getting held by
userspace in the middle of a copy_*_user()... that's a whole other
problem.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25 16:48                                     ` Kees Cook
@ 2019-02-25 16:58                                       ` Andy Lutomirski
  2019-02-25 17:07                                         ` Kees Cook
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2019-02-25 16:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Steven Rostedt, Masami Hiramatsu,
	Linus Torvalds, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn

On Mon, Feb 25, 2019 at 8:48 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Sat, Feb 23, 2019 at 8:38 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Sat, Feb 23, 2019 at 4:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Sat, 23 Feb 2019 12:47:46 +0900
> > > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > > Since kprobes handler runs in IRQ context, we can not use access_ok() in it.
> > > > (only on x86 + CONFIG_DEBUG_ATOMIC_SLEEP=y)
> > >
> > > Is it really IRQ context or exception context? That is, one
> > > (interrupts) happen for any task, but exceptions happen because of the
> > > software that is executed (like a breakpoint). Although you can have a
> > > kprobe trigger in an interrupt handler (where user access wouldn't make
> > > sense anyway). But there should be no problem with user access from an
> > > exception handler.
> > >
> >
> > Can we just get rid of this might_sleep()?  access_ok() doesn't sleep
> > as far as I know.
>
> We do need to be aware of the userfaultfd case of getting held by
> userspace in the middle of a copy_*_user()... that's a whole other
> problem.
>

I sure hope that pagefault_disable() already takes care of this.
Otherwise we have major problems already.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25 16:58                                       ` Andy Lutomirski
@ 2019-02-25 17:07                                         ` Kees Cook
  0 siblings, 0 replies; 85+ messages in thread
From: Kees Cook @ 2019-02-25 17:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Masami Hiramatsu, Linus Torvalds,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn

On Mon, Feb 25, 2019 at 8:58 AM Andy Lutomirski <luto@kernel.org> wrote:
> I sure hope that pagefault_disable() already takes care of this.
> Otherwise we have major problems already.

Okay, cool. I missed that bit. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-25 16:40                                               ` Steven Rostedt
@ 2019-02-26  1:35                                                 ` Masami Hiramatsu
  0 siblings, 0 replies; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-26  1:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Linus Torvalds, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Peter Zijlstra

On Mon, 25 Feb 2019 11:40:18 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 25 Feb 2019 17:09:45 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > > This should probably go with whatever effort makes nmi_uaccess_ok()
> > > available on all architectures.  That being said, how about just
> > > making copy_from_user_nmi() work on all architectures, even if it just
> > > fails unconditionally on some of them?  
> > 
> > I think even if we have copy_from_user_nmi(), we need something like
> > nmi_uaccess_ok() because without it we can not correctly use
> > __copy_from_user_inatomic()...
> 
> But wouldn't that just be part of the implementation of
> "copy_from_user_nmi()" as being in an NMI just assumes being inatomic?

Yes for copy_from_user_nmi(). But there are some other fundamental
functions, like __get_user(). And when we optimize the loop in
strncpy/strnlen from user in atomic, I think one nmi_access_ok() at
entry is enough.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:59                                     ` Linus Torvalds
  2019-02-22 22:51                                       ` Alexei Starovoitov
@ 2019-02-26  3:57                                       ` Christoph Hellwig
  1 sibling, 0 replies; 85+ messages in thread
From: Christoph Hellwig @ 2019-02-26  3:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Alexei Starovoitov, Masami Hiramatsu,
	Steven Rostedt, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> >
> > Don't be surprised if we see more separation like this in the future too.
> 
> Yes, with the whole meltdown fiasco, there's actually more pressure to
> add more support for separation of kernel/user address spaces. As Andy
> pointed out, it's been discussed as a future wish-list for x86-64 too.

S/390 is another example.

I've also proposed a RISC-V extension for it, including prototypes
for Rocketchip and Qemu, and a Linux kernel support, but it didn't go
any way.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                               ` Alexei Starovoitov
  2019-02-22 19:30                                 ` Steven Rostedt
  2019-02-22 21:20                                 ` Linus Torvalds
@ 2019-02-26 15:24                                 ` Joel Fernandes
  2019-02-28 12:29                                   ` Masami Hiramatsu
  2 siblings, 1 reply; 85+ messages in thread
From: Joel Fernandes @ 2019-02-26 15:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, will.deacon, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski, daniel, netdev, bpf

On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > 
> > Then we should still probably fix up "__probe_kernel_read()" to not
> > allow user accesses. The easiest way to do that is actually likely to
> > use the "unsafe_get_user()" functions *without* doing a
> > uaccess_begin(), which will mean that modern CPU's will simply fault
> > on a kernel access to user space.
> 
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.

Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
kprobe on do_sys_open to monitor calls to the open syscall globally.

do_sys_open() has prototype:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);

This causes a "blank" filename to be displayed by opensnoop when I run it on
my Pixel 3 (arm64), possibly because this is a user pointer. However, it
works fine on x86-64.

So it seems to me that on arm64, reading user pointers directly still doesn't
work even if there is a distinction between user/kernel addresses. In that
case reading the user pointer using user accessors (possibly using
bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
also privately discussed with me).

[1] https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py#L140

thanks!

 - Joel


> 
> If __probe_kernel_read will suddenly start failing on all user addresses
> it will break the expectations.
> How do we solve it in bpf_probe_read?
> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> in the loop?
> That's doable, but people already complain that bpf_probe_read() is slow
> and shows up in their perf report.
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-26 15:24                                 ` Joel Fernandes
@ 2019-02-28 12:29                                   ` Masami Hiramatsu
  2019-02-28 15:18                                     ` Joel Fernandes
  0 siblings, 1 reply; 85+ messages in thread
From: Masami Hiramatsu @ 2019-02-28 12:29 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Alexei Starovoitov, Linus Torvalds, Masami Hiramatsu,
	Steven Rostedt, Andy Lutomirski, will.deacon,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Tue, 26 Feb 2019 10:24:47 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > 
> > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > allow user accesses. The easiest way to do that is actually likely to
> > > use the "unsafe_get_user()" functions *without* doing a
> > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > on a kernel access to user space.
> > 
> > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > and users pass both user and kernel addresses into it and expect
> > that the helper will actually try to read from that address.
> 
> Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
> kprobe on do_sys_open to monitor calls to the open syscall globally.
> 
> do_sys_open() has prototype:
> 
> long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
> 
> This causes a "blank" filename to be displayed by opensnoop when I run it on
> my Pixel 3 (arm64), possibly because this is a user pointer. However, it
> works fine on x86-64.
> 
> So it seems to me that on arm64, reading user pointers directly still doesn't
> work even if there is a distinction between user/kernel addresses. In that
> case reading the user pointer using user accessors (possibly using
> bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
> also privately discussed with me).

OK, it sounds like the same issue. Please add a bpf_user_read() and use it
for __user pointer.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-28 12:29                                   ` Masami Hiramatsu
@ 2019-02-28 15:18                                     ` Joel Fernandes
  0 siblings, 0 replies; 85+ messages in thread
From: Joel Fernandes @ 2019-02-28 15:18 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Linus Torvalds, Steven Rostedt,
	Andy Lutomirski, will.deacon, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski, daniel, netdev, bpf, yhs

On Thu, Feb 28, 2019 at 09:29:13PM +0900, Masami Hiramatsu wrote:
> On Tue, 26 Feb 2019 10:24:47 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> > > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > > 
> > > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > > allow user accesses. The easiest way to do that is actually likely to
> > > > use the "unsafe_get_user()" functions *without* doing a
> > > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > > on a kernel access to user space.
> > > 
> > > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > > and users pass both user and kernel addresses into it and expect
> > > that the helper will actually try to read from that address.
> > 
> > Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
> > kprobe on do_sys_open to monitor calls to the open syscall globally.
> > 
> > do_sys_open() has prototype:
> > 
> > long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
> > 
> > This causes a "blank" filename to be displayed by opensnoop when I run it on
> > my Pixel 3 (arm64), possibly because this is a user pointer. However, it
> > works fine on x86-64.
> > 
> > So it seems to me that on arm64, reading user pointers directly still doesn't
> > work even if there is a distinction between user/kernel addresses. In that
> > case reading the user pointer using user accessors (possibly using
> > bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
> > also privately discussed with me).
> 
> OK, it sounds like the same issue. Please add a bpf_user_read() and use it
> for __user pointer.

CC'd Yonghong who said eariler to me he would add it, but I could add it too
if he wants me to.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2019-02-28 15:18 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-15 17:47 [PATCH 0/2 v2] [GIT PULL (take two)] tracing: Two more fixes Steven Rostedt
2019-02-15 17:47 ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Steven Rostedt
2019-02-15 17:55   ` Linus Torvalds
2019-02-15 22:15     ` Steven Rostedt
2019-02-15 23:49       ` Andy Lutomirski
2019-02-16  0:19         ` Steven Rostedt
2019-02-16  1:32           ` Andy Lutomirski
2019-02-16  2:08             ` Steven Rostedt
2019-02-16  2:14               ` Andy Lutomirski
2019-02-16  2:21                 ` Steven Rostedt
2019-02-18 17:58           ` Linus Torvalds
2019-02-18 18:23             ` Linus Torvalds
2019-02-19 16:18               ` Steven Rostedt
2019-02-19 18:43                 ` Linus Torvalds
2019-02-19 19:03                   ` Steven Rostedt
2019-02-20  8:10                     ` Masami Hiramatsu
2019-02-20 13:57                       ` Jann Horn
2019-02-20 14:47                         ` Steven Rostedt
2019-02-20 15:08                         ` Masami Hiramatsu
2019-02-20 14:49                       ` Steven Rostedt
2019-02-20 16:04                         ` Masami Hiramatsu
2019-02-20 16:42                           ` Steven Rostedt
2019-02-21  7:37                             ` Masami Hiramatsu
2019-02-22  8:27                         ` Masami Hiramatsu
2019-02-22  8:35                           ` Masami Hiramatsu
2019-02-22 17:43                             ` Linus Torvalds
2019-02-22 17:48                               ` Andy Lutomirski
2019-02-22 18:28                                 ` Linus Torvalds
2019-02-22 19:52                                   ` Andy Lutomirski
2019-02-22 19:27                               ` Alexei Starovoitov
2019-02-22 19:30                                 ` Steven Rostedt
2019-02-22 19:34                                   ` Alexei Starovoitov
2019-02-22 19:39                                     ` Steven Rostedt
2019-02-22 19:55                                     ` Andy Lutomirski
2019-02-22 21:43                                       ` Jann Horn
2019-02-22 22:08                                         ` Nadav Amit
2019-02-22 22:17                                           ` Jann Horn
2019-02-22 22:21                                             ` Nadav Amit
2019-02-22 22:39                                               ` Nadav Amit
2019-02-22 23:02                                                 ` Jann Horn
2019-02-22 23:22                                                   ` Nadav Amit
2019-02-22 23:59                                                   ` Andy Lutomirski
2019-02-23  0:03                                                     ` Alexei Starovoitov
2019-02-23  0:15                                                     ` Nadav Amit
2019-02-24 19:35                                                       ` Andy Lutomirski
2019-02-25 13:36                                                     ` Masami Hiramatsu
2019-02-22 21:20                                 ` Linus Torvalds
2019-02-22 21:38                                   ` David Miller
2019-02-22 21:59                                     ` Linus Torvalds
2019-02-22 22:51                                       ` Alexei Starovoitov
2019-02-22 23:11                                         ` Jann Horn
2019-02-22 23:16                                           ` David Miller
2019-02-22 23:16                                         ` Linus Torvalds
2019-02-22 23:56                                           ` Alexei Starovoitov
2019-02-23  0:08                                             ` Linus Torvalds
2019-02-23  2:28                                               ` Alexei Starovoitov
2019-02-23  2:32                                                 ` Linus Torvalds
2019-02-23  3:02                                                 ` Steven Rostedt
2019-02-23  4:51                                             ` Masami Hiramatsu
2019-02-26  3:57                                       ` Christoph Hellwig
2019-02-26 15:24                                 ` Joel Fernandes
2019-02-28 12:29                                   ` Masami Hiramatsu
2019-02-28 15:18                                     ` Joel Fernandes
2019-02-23  3:47                               ` Masami Hiramatsu
2019-02-24  0:44                                 ` Steven Rostedt
2019-02-24  4:38                                   ` Andy Lutomirski
2019-02-24 15:17                                     ` Masami Hiramatsu
2019-02-24 17:26                                       ` Linus Torvalds
2019-02-25  2:40                                         ` Masami Hiramatsu
2019-02-25  4:49                                           ` Andy Lutomirski
2019-02-25  8:09                                             ` Masami Hiramatsu
2019-02-25 16:40                                               ` Steven Rostedt
2019-02-26  1:35                                                 ` Masami Hiramatsu
2019-02-25  8:33                                         ` Peter Zijlstra
2019-02-25 14:52                                           ` Peter Zijlstra
2019-02-25 16:48                                     ` Kees Cook
2019-02-25 16:58                                       ` Andy Lutomirski
2019-02-25 17:07                                         ` Kees Cook
2019-02-21  7:52   ` Masami Hiramatsu
2019-02-21 14:36     ` Steven Rostedt
2019-02-21 15:58       ` Masami Hiramatsu
2019-02-21 16:16         ` Masami Hiramatsu
2019-02-21 16:32           ` Steven Rostedt
2019-02-23 14:48     ` Masami Hiramatsu
2019-02-15 17:47 ` [PATCH 2/2 v2] tracing: Fix number of entries in trace header Steven Rostedt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.