Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault

netdev.vger.kernel.org archive mirror

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
       [not found]                 ` <CAHk-=whNf_n1WXWW+ugAVeL5ZK0GcEP3cTYocju1nS85VtMjjQ@mail.gmail.com>
@ 2019-02-22 19:27                   ` Alexei Starovoitov
  2019-02-22 19:30                     ` Steven Rostedt
                                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> 
> Then we should still probably fix up "__probe_kernel_read()" to not
> allow user accesses. The easiest way to do that is actually likely to
> use the "unsafe_get_user()" functions *without* doing a
> uaccess_begin(), which will mean that modern CPU's will simply fault
> on a kernel access to user space.

On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
and users pass both user and kernel addresses into it and expect
that the helper will actually try to read from that address.

If __probe_kernel_read will suddenly start failing on all user addresses
it will break the expectations.
How do we solve it in bpf_probe_read?
Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
in the loop?
That's doable, but people already complain that bpf_probe_read() is slow
and shows up in their perf report.
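
The fallback described above can be sketched as a small user-space model in C.
Everything in it is an assumption made for illustration: the two arrays stand
in for kernel and user memory, and model_probe_kernel_read(),
model_get_user_byte() and model_bpf_probe_read() are hypothetical names, not
kernel functions. It only shows the control flow (try the kernel read first,
then retry one byte at a time on the user side) and says nothing about the
real cost of either path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: kernel addresses start at MODEL_KERNEL_BASE,
 * user addresses are [0, MODEL_SPACE_SIZE). Not kernel code. */
#define MODEL_KERNEL_BASE 0x8000u
#define MODEL_SPACE_SIZE  0x100u

static uint8_t model_kernel_mem[MODEL_SPACE_SIZE];
static uint8_t model_user_mem[MODEL_SPACE_SIZE];

/* Stand-in for a kernel-only read: refuses anything outside the
 * modelled kernel range, which includes every user address. */
static int model_probe_kernel_read(void *dst, uintptr_t addr, size_t size)
{
    uintptr_t off;

    assert(dst != NULL);
    if (addr < MODEL_KERNEL_BASE)
        return -1;
    off = addr - MODEL_KERNEL_BASE;
    if (off > MODEL_SPACE_SIZE || size > MODEL_SPACE_SIZE - off)
        return -1;
    memcpy(dst, &model_kernel_mem[off], size);
    return 0;
}

/* Stand-in for a single-byte user access that can fail. */
static int model_get_user_byte(uint8_t *dst, uintptr_t addr)
{
    if (addr >= MODEL_SPACE_SIZE)
        return -1;
    *dst = model_user_mem[addr];
    return 0;
}

/* Kernel read first; on failure, retry byte by byte on the user side. */
static int model_bpf_probe_read(void *dst, uintptr_t addr, size_t size)
{
    uint8_t *out = dst;
    size_t i;

    if (model_probe_kernel_read(dst, addr, size) == 0)
        return 0;
    for (i = 0; i < size; i++) {
        if (model_get_user_byte(&out[i], addr + i) != 0) {
            /* Zero the buffer so the caller never sees partial data. */
            memset(dst, 0, size);
            return -1;
        }
    }
    return 0;
}
```

In this model every user read pays for a failed kernel attempt plus one call
per byte, which is the overhead the message above is worried about.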


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                   ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Alexei Starovoitov
@ 2019-02-22 19:30                     ` Steven Rostedt
  2019-02-22 19:34                       ` Alexei Starovoitov
  2019-02-22 21:20                     ` Linus Torvalds
  2019-02-26 15:24                     ` Joel Fernandes
  2 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2019-02-22 19:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, 22 Feb 2019 11:27:05 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > 
> > Then we should still probably fix up "__probe_kernel_read()" to not
> > allow user accesses. The easiest way to do that is actually likely to
> > use the "unsafe_get_user()" functions *without* doing a
> > uaccess_begin(), which will mean that modern CPU's will simply fault
> > on a kernel access to user space.  
> 
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.
> 
> If __probe_kernel_read will suddenly start failing on all user addresses
> it will break the expectations.
> How do we solve it in bpf_probe_read?
> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> in the loop?
> That's doable, but people already complain that bpf_probe_read() is slow
> and shows up in their perf report.

We're changing kprobes to add a specific flag to say that we want to
differentiate between kernel or user reads. Can this be done with
bpf_probe_read()? If it's showing up in perf report, I doubt a single
check is going to cause an issue. In fact, it may actually help speed
things up as the read will be optimized for either user or kernel
address reading.

-- Steve


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:30                     ` Steven Rostedt
@ 2019-02-22 19:34                       ` Alexei Starovoitov
  2019-02-22 19:39                         ` Steven Rostedt
  2019-02-22 19:55                         ` Andy Lutomirski
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 19:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> On Fri, 22 Feb 2019 11:27:05 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > 
> > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > allow user accesses. The easiest way to do that is actually likely to
> > > use the "unsafe_get_user()" functions *without* doing a
> > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > on a kernel access to user space.  
> > 
> > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > and users pass both user and kernel addresses into it and expect
> > that the helper will actually try to read from that address.
> > 
> > If __probe_kernel_read will suddenly start failing on all user addresses
> > it will break the expectations.
> > How do we solve it in bpf_probe_read?
> > Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > in the loop?
> > That's doable, but people already complain that bpf_probe_read() is slow
> > and shows up in their perf report.
> 
> We're changing kprobes to add a specific flag to say that we want to
> differentiate between kernel or user reads. Can this be done with
> bpf_probe_read()? If it's showing up in perf report, I doubt a single

so you're saying you will break existing kprobe scripts?
I don't think it's a good idea.
It's not acceptable to break bpf_probe_read uapi.



* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:34                       ` Alexei Starovoitov
@ 2019-02-22 19:39                         ` Steven Rostedt
  2019-02-22 19:55                         ` Andy Lutomirski
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2019-02-22 19:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Fri, 22 Feb 2019 11:34:58 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> so you're saying you will break existing kprobe scripts?

Yes we may.

> I don't think it's a good idea.
> It's not acceptable to break bpf_probe_read uapi.

Then you may need to add more code to determine if the address is user
space or not in the kernel, and then go the appropriate route, before
calling probe_kernel_read().

-- Steve
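
A minimal sketch of that routing, written as a user-space model in C. It
assumes a single shared address space in which the numeric value alone decides
whether an address is user or kernel; MODEL_USER_LIMIT, model_classify() and
model_read() are hypothetical names chosen for the example, not kernel
interfaces. Ranges that straddle the boundary are refused rather than guessed
at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of ONE shared address space: user memory is
 * [0, MODEL_USER_LIMIT), kernel memory is the rest. Not kernel code. */
#define MODEL_USER_LIMIT 0x80u
#define MODEL_MEM_SIZE   0x100u

static uint8_t model_mem[MODEL_MEM_SIZE];

enum model_route { MODEL_ROUTE_USER, MODEL_ROUTE_KERNEL, MODEL_ROUTE_FAULT };

/* Classify a range purely by its numeric value. */
static enum model_route model_classify(uintptr_t addr, size_t size)
{
    if (addr > MODEL_MEM_SIZE || size > MODEL_MEM_SIZE - addr)
        return MODEL_ROUTE_FAULT;   /* runs off the end of memory */
    if (addr + size <= MODEL_USER_LIMIT)
        return MODEL_ROUTE_USER;    /* entirely below the limit */
    if (addr >= MODEL_USER_LIMIT)
        return MODEL_ROUTE_KERNEL;  /* entirely above the limit */
    return MODEL_ROUTE_FAULT;       /* straddles the boundary */
}

/* Pick the route first, then copy. Real code would call a user-copy
 * or a kernel-copy primitive here; the model has a single array, so
 * both routes end in the same memcpy(). */
static int model_read(void *dst, uintptr_t addr, size_t size,
                      enum model_route *route)
{
    assert(dst != NULL && route != NULL);
    *route = model_classify(addr, size);
    if (*route == MODEL_ROUTE_FAULT)
        return -1;
    memcpy(dst, &model_mem[addr], size);
    return 0;
}
```

The check is a couple of comparisons per call, but it is only meaningful when
user and kernel addresses never share a numeric value.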


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:34                       ` Alexei Starovoitov
  2019-02-22 19:39                         ` Steven Rostedt
@ 2019-02-22 19:55                         ` Andy Lutomirski
  2019-02-22 21:43                           ` Jann Horn
  1 sibling, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2019-02-22 19:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Linus Torvalds, Masami Hiramatsu,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf



> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>> On Fri, 22 Feb 2019 11:27:05 -0800
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>> 
>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>> 
>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>> allow user accesses. The easiest way to do that is actually likely to
>>>> use the "unsafe_get_user()" functions *without* doing a
>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>> on a kernel access to user space.  
>>> 
>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>> and users pass both user and kernel addresses into it and expect
>>> that the helper will actually try to read from that address.
>>> 
>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>> it will break the expectations.
>>> How do we solve it in bpf_probe_read?
>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>> in the loop?
>>> That's doable, but people already complain that bpf_probe_read() is slow
>>> and shows up in their perf report.
>> 
>> We're changing kprobes to add a specific flag to say that we want to
>> differentiate between kernel or user reads. Can this be done with
>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> 
> so you're saying you will break existing kprobe scripts?
> I don't think it's a good idea.
> It's not acceptable to break bpf_probe_read uapi.
> 

If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.

What to do about existing scripts is a different question.


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                   ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Alexei Starovoitov
  2019-02-22 19:30                     ` Steven Rostedt
@ 2019-02-22 21:20                     ` Linus Torvalds
  2019-02-22 21:38                       ` David Miller
  2019-02-26 15:24                     ` Joel Fernandes
  2 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2019-02-22 21:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.

As mentioned earlier in the thread, that's actually fundamentally broken.

There are architectures that have physically separate address spaces,
with the same pointer value in both kernel and user space.

They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.

So a pointer really should always be unambiguously and explicitly
_either_ a kernel pointer, or a user pointer. You can't have "this is
a pointer", and then try to figure it out by looking at the value.
That may happen to work on x86-64, but it's literally a "happen to
work on the most common architectures", not a design thing.

                Linus
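
The point about separate address spaces can be illustrated with a small
user-space model in C. It assumes two independent memories that both start at
address 0, so one numeric value names a different byte in each; enum
model_space and model_read_space() are hypothetical names for this sketch, not
an existing kernel API. The caller has to say which space it means, because
the value cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: two separate address spaces, each covering
 * [0, MODEL_SPACE_SIZE), so numeric addresses overlap. Not kernel code. */
#define MODEL_SPACE_SIZE 0x100u

enum model_space { MODEL_SPACE_KERNEL, MODEL_SPACE_USER };

static uint8_t model_kernel_mem[MODEL_SPACE_SIZE];
static uint8_t model_user_mem[MODEL_SPACE_SIZE];

/* The space is an explicit argument; nothing is inferred from addr. */
static int model_read_space(enum model_space space, void *dst,
                            uintptr_t addr, size_t size)
{
    const uint8_t *base;

    assert(dst != NULL);
    if (addr > MODEL_SPACE_SIZE || size > MODEL_SPACE_SIZE - addr)
        return -1;
    base = (space == MODEL_SPACE_KERNEL) ? model_kernel_mem
                                         : model_user_mem;
    memcpy(dst, base + addr, size);
    return 0;
}
```

With both memories holding different data at the same address, a reader that
takes only the address has no way to pick the right one.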


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:20                     ` Linus Torvalds
@ 2019-02-22 21:38                       ` David Miller
  2019-02-22 21:59                         ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2019-02-22 21:38 UTC (permalink / raw)
  To: torvalds
  Cc: alexei.starovoitov, mhiramat, rostedt, luto, linux-kernel, mingo,
	akpm, stable, changbin.du, jannh, keescook, luto, daniel, netdev,
	bpf

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 22 Feb 2019 13:20:58 -0800

> On Fri, Feb 22, 2019 at 11:27 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>> and users pass both user and kernel addresses into it and expect
>> that the helper will actually try to read from that address.
> 
> As mentioned earlier in the thread, that's actually fundamentally broken.
> 
> There are architectures that have physically separate address spaces,
> with the same pointer value in both kernel and user space.
> 
> They are rare, but they exist. At least sparc32 and the old 4G:4G split x86.

And sparc64.

> So a pointer really should always be unambiguously and explicitly
> _either_ a kernel pointer, or a user pointer. You can't have "this is
> a pointer", and then try to figure it out by looking at the value.
> That may happen to work on x86-64, but it's literally a "happen to
> work on the most common architectures", not a design thing.

Don't be surprised if we see more separation like this in the future too.

So it's not a smart thing to code against even if you can discount all of
the examples Linus gives above.


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:55                         ` Andy Lutomirski
@ 2019-02-22 21:43                           ` Jann Horn
  2019-02-22 22:08                             ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Jann Horn @ 2019-02-22 21:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Network Development, bpf, Nadav Amit,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

(adding some people from the text_poke series to the thread, removing stable@)

On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >> On Fri, 22 Feb 2019 11:27:05 -0800
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>
> >>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>> allow user accesses. The easiest way to do that is actually likely to
> >>>> use the "unsafe_get_user()" functions *without* doing a
> >>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>> on a kernel access to user space.
> >>>
> >>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>> and users pass both user and kernel addresses into it and expect
> >>> that the helper will actually try to read from that address.
> >>>
> >>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>> it will break the expectations.
> >>> How do we solve it in bpf_probe_read?
> >>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>> in the loop?
> >>> That's doable, but people already complain that bpf_probe_read() is slow
> >>> and shows up in their perf report.
> >>
> >> We're changing kprobes to add a specific flag to say that we want to
> >> differentiate between kernel or user reads. Can this be done with
> >> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >
> > so you're saying you will break existing kprobe scripts?
> > I don't think it's a good idea.
> > It's not acceptable to break bpf_probe_read uapi.
> >
>
> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>
> What to do about existing scripts is a different question.

This lack of logical separation between user and kernel addresses
might interact interestingly with the text_poke series, specifically
"[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
right? If someone manages to get a tracing BPF program to trigger in a
task that has switched to the patching mm, could they use
bpf_probe_write_user() - which uses probe_kernel_write() after
checking that KERNEL_DS isn't active and that access_ok() passes - to
overwrite kernel text that is mapped writable in the patching mm?


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:38                       ` David Miller
@ 2019-02-22 21:59                         ` Linus Torvalds
  2019-02-22 22:51                           ` Alexei Starovoitov
  2019-02-26  3:57                           ` Christoph Hellwig
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2019-02-22 21:59 UTC (permalink / raw)
  To: David Miller
  Cc: Alexei Starovoitov, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
>
> Don't be surprised if we see more separation like this in the future too.

Yes, with the whole meltdown fiasco, there's actually more pressure to
add more support for separation of kernel/user address spaces. As Andy
pointed out, it's been discussed as a future wish-list for x86-64 too.

But yeah, right now the *common* architectures all distinguish kernel
and user space by pointers (ie x86-64, arm64 and powerpc).

                 Linus


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:43                           ` Jann Horn
@ 2019-02-22 22:08                             ` Nadav Amit
  2019-02-22 22:17                               ` Jann Horn
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2019-02-22 22:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> 
> (adding some people from the text_poke series to the thread, removing stable@)
> 
> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 
>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>> 
>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>> on a kernel access to user space.
>>>>> 
>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>> and users pass both user and kernel addresses into it and expect
>>>>> that the helper will actually try to read from that address.
>>>>> 
>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>> it will break the expectations.
>>>>> How do we solve it in bpf_probe_read?
>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>> in the loop?
>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>> and shows up in their perf report.
>>>> 
>>>> We're changing kprobes to add a specific flag to say that we want to
>>>> differentiate between kernel or user reads. Can this be done with
>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>> 
>>> so you're saying you will break existing kprobe scripts?
>>> I don't think it's a good idea.
>>> It's not acceptable to break bpf_probe_read uapi.
>> 
>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>> 
>> What to do about existing scripts is a different question.
> 
> This lack of logical separation between user and kernel addresses
> might interact interestingly with the text_poke series, specifically
> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> right? If someone manages to get a tracing BPF program to trigger in a
> task that has switched to the patching mm, could they use
> bpf_probe_write_user() - which uses probe_kernel_write() after
> checking that KERNEL_DS isn't active and that access_ok() passes - to
> overwrite kernel text that is mapped writable in the patching mm?

Yes, this is a good point. I guess text_poke() should be defined with
“__kprobes” and open-code memcpy.

Does it sound reasonable?



* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:08                             ` Nadav Amit
@ 2019-02-22 22:17                               ` Jann Horn
  2019-02-22 22:21                                 ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Jann Horn @ 2019-02-22 22:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >
> > (adding some people from the text_poke series to the thread, removing stable@)
> >
> > On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>
> >>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>
> >>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>> on a kernel access to user space.
> >>>>>
> >>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>> and users pass both user and kernel addresses into it and expect
> >>>>> that the helper will actually try to read from that address.
> >>>>>
> >>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>> it will break the expectations.
> >>>>> How do we solve it in bpf_probe_read?
> >>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>> in the loop?
> >>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>> and shows up in their perf report.
> >>>>
> >>>> We're changing kprobes to add a specific flag to say that we want to
> >>>> differentiate between kernel or user reads. Can this be done with
> >>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>
> >>> so you're saying you will break existing kprobe scripts?
> >>> I don't think it's a good idea.
> >>> It's not acceptable to break bpf_probe_read uapi.
> >>
> >> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>
> >> What to do about existing scripts is a different question.
> >
> > This lack of logical separation between user and kernel addresses
> > might interact interestingly with the text_poke series, specifically
> > "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> > and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> > right? If someone manages to get a tracing BPF program to trigger in a
> > task that has switched to the patching mm, could they use
> > bpf_probe_write_user() - which uses probe_kernel_write() after
> > checking that KERNEL_DS isn't active and that access_ok() passes - to
> > overwrite kernel text that is mapped writable in the patching mm?
>
> Yes, this is a good point. I guess text_poke() should be defined with
> “__kprobes” and open-code memcpy.
>
> Does it sound reasonable?

Doesn't __text_poke() as implemented in the proposed patch use a
couple other kernel functions, too? Like switch_mm_irqs_off() and
pte_clear() (which can be a call into a separate function on paravirt
kernels)?


* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:17                               ` Jann Horn
@ 2019-02-22 22:21                                 ` Nadav Amit
  2019-02-22 22:39                                   ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2019-02-22 22:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel)

> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>> 
>>> (adding some people from the text_poke series to the thread, removing stable@)
>>> 
>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>> 
>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>> 
>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>> on a kernel access to user space.
>>>>>>> 
>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>> that the helper will actually try to read from that address.
>>>>>>> 
>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>> it will break the expectations.
>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>> in the loop?
>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>> and shows up in their perf report.
>>>>>> 
>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>> 
>>>>> so you're saying you will break existing kprobe scripts?
>>>>> I don't think it's a good idea.
>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>> 
>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>> 
>>>> What to do about existing scripts is a different question.
>>> 
>>> This lack of logical separation between user and kernel addresses
>>> might interact interestingly with the text_poke series, specifically
>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>> poking" (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&amp;sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&amp;reserved=0),
>>> right? If someone manages to get a tracing BPF program to trigger in a
>>> task that has switched to the patching mm, could they use
>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>> overwrite kernel text that is mapped writable in the patching mm?
>> 
>> Yes, this is a good point. I guess text_poke() should be defined with
>> “__kprobes” and open-code memcpy.
>> 
>> Does it sound reasonable?
> 
> Doesn't __text_poke() as implemented in the proposed patch use a
> couple other kernel functions, too? Like switch_mm_irqs_off() and
> pte_clear() (which can be a call into a separate function on paravirt
> kernels)?

I will move the pte_clear() to be done after the poking mm was unloaded.
Give me a few minutes to send a sketch of what I think should be done.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:21                                 ` Nadav Amit
@ 2019-02-22 22:39                                   ` Nadav Amit
  2019-02-22 23:02                                     ` Jann Horn
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2019-02-22 22:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>> 
>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>> 
>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>> 
>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>> 
>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>> 
>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>> on a kernel access to user space.
>>>>>>>> 
>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>> 
>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>> it will break the expectations.
>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>> in the loop?
>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>> and shows up in their perf report.
>>>>>>> 
>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>> 
>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>> I don't think it's a good idea.
>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>> 
>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>> 
>>>>> What to do about existing scripts is a different question.
>>>> 
>>>> This lack of logical separation between user and kernel addresses
>>>> might interact interestingly with the text_poke series, specifically
>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>> task that has switched to the patching mm, could they use
>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>> overwrite kernel text that is mapped writable in the patching mm?
>>> 
>>> Yes, this is a good point. I guess text_poke() should be defined with
>>> “__kprobes” and open-code memcpy.
>>> 
>>> Does it sound reasonable?
>> 
>> Doesn't __text_poke() as implemented in the proposed patch use a
>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>> pte_clear() (which can be a call into a separate function on paravirt
>> kernels)?
> 
> I will move the pte_clear() to be done after the poking mm was unloaded.
> Give me a few minutes to send a sketch of what I think should be done.

Err.. You are right, I don’t see an easy way of preventing a kprobe from
being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.

The reasonable solution, it seems to me, is to take all the relevant pieces of
code (and data) that might be used during text-poking and encapsulate them, so
that they are placed in a memory area which cannot be kprobe'd. This can also be
useful to write-protect data structures of code that calls text_poke(),
e.g., static-keys. It can also protect data on the stack that is used
during text_poke() from being overwritten from another core.

This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
when doing write-rarely operations.

Right now, I think that text_poke() will keep being susceptible to such
an attack, unless you have a better suggestion.
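
[Editorial sketch, not part of the original message.] The idea of placing
text-poking code in an area that cannot be kprobe'd can be modelled in plain
userspace C as a range check at probe-registration time. Everything below
(model_region, model_register_probe, the address values) is hypothetical and
only illustrates the shape of the check; it is not the kernel's kprobe code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical no-probe region: the half-open address range [start, end). */
struct model_region {
    uintptr_t start;
    uintptr_t end;
};

/* True when addr falls inside the given region. */
static bool model_in_region(const struct model_region *r, uintptr_t addr)
{
    return addr >= r->start && addr < r->end;
}

/*
 * Refuse to place a probe on any address inside a protected region.
 * Returns 0 when the probe would be allowed, -EINVAL when it is rejected.
 */
int model_register_probe(const struct model_region *noprobe, size_t count,
                         uintptr_t addr)
{
    assert(noprobe != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (model_in_region(&noprobe[i], addr))
            return -EINVAL;
    }
    return 0;
}
```

In this model, protecting a whole helper such as switch_mm_irqs_off() would
mean putting its full address range into the region table, which is the
"encapsulation" step the message describes.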


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:59                         ` Linus Torvalds
@ 2019-02-22 22:51                           ` Alexei Starovoitov
  2019-02-22 23:11                             ` Jann Horn
  2019-02-22 23:16                             ` Linus Torvalds
  2019-02-26  3:57                           ` Christoph Hellwig
  1 sibling, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 22:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> >
> > Don't be surprised if we see more separation like this in the future too.
> 
> Yes, with the whole meltdown fiasco, there's actually more pressure to
> add more support for separation of kernel/user address spaces. As Andy
> pointed out, it's been discussed as a future wish-list for x86-64 too.
> 
> But yeah, right now the *common* architectures all distinguish kernel
> and user space by pointers (ie x86-64, arm64 and powerpc).

That's all fine. I'm missing rationale for making probe_kernel_read()
fail on user addresses.
What is fundamentally wrong with a function probe_any_address_read() ?

For context, typical bpf kprobe program looks like this:
#define probe_read(P) \
    ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
SEC("kprobe/__set_task_comm")
int bpf_prog(struct pt_regs *ctx)
{
    struct signal_struct *signal;
    struct task_struct *tsk;
    char oldcomm[16] = {};
    char newcomm[16] = {};
    u16 oom_score_adj;
    u32 pid;

    tsk = (void *)PT_REGS_PARM1(ctx);
    pid = probe_read(tsk->pid);
    bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
    bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
    signal = probe_read(tsk->signal);
    oom_score_adj = probe_read(signal->oom_score_adj);
    ...
}

where PT_REGS_PARMx macros are defined per architecture.
On x86 it's #define PT_REGS_PARM1(x) ((x)->di)

The program writer has to know the meaning of function arguments.
In this example they need to know that __set_task_comm is defined as
void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.

Right now these programs just call bpf_probe_read() on whatever data
they need to access, without differentiating whether it's user or kernel.

One idea we discussed is to split bpf_probe_read() into kernel_read and user_read
helpers, but in the BPF verifier we cannot determine which address space
the program wants to access. The prog writer needs to manually analyze the program
to use the correct one. But mistakes are possible and must not be fatal.
On the kernel side we have to be safe.
Both probe_kernel_read and probe_user_read must not panic if a pointer
from the wrong address space was passed.

Hence my preference is to keep probe_kernel_read as "try read any address".
The function can be renamed to indicate so.
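
[Editorial sketch, not part of the original message.] The requirement that
split helpers fail safely, rather than panic, when handed a pointer from the
wrong address space can be illustrated with a small userspace model. The
names model_probe_user_read and model_probe_kernel_read, the flat
model_memory array and the address layout are all assumptions made for
illustration; the real kernel helpers work on page tables and fault handling,
not on an array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: [0, 64) is "user", [64, 128) is "kernel". */
#define MODEL_USER_END   64u
#define MODEL_KERNEL_END 128u

static uint8_t model_memory[MODEL_KERNEL_END];

/* True when [addr, addr + len) lies entirely inside [start, end). */
static int model_range_ok(uint32_t addr, size_t len,
                          uint32_t start, uint32_t end)
{
    return addr >= start && addr <= end && len <= (size_t)(end - addr);
}

/* Read from the "user" range only; a kernel address yields -EFAULT. */
int model_probe_user_read(void *dst, uint32_t addr, size_t len)
{
    assert(dst != NULL);
    if (!model_range_ok(addr, len, 0, MODEL_USER_END))
        return -EFAULT;
    memcpy(dst, &model_memory[addr], len);
    return 0;
}

/* Read from the "kernel" range only; a user address yields -EFAULT. */
int model_probe_kernel_read(void *dst, uint32_t addr, size_t len)
{
    assert(dst != NULL);
    if (!model_range_ok(addr, len, MODEL_USER_END, MODEL_KERNEL_END))
        return -EFAULT;
    memcpy(dst, &model_memory[addr], len);
    return 0;
}
```

A program that picks the wrong helper gets an error code back and can carry
on, which is the "mistakes must not be fatal" property. Note that this model
assumes disjoint user and kernel ranges, as on x86-64 today; it does not
cover architectures where the two spaces overlap.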

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:39                                   ` Nadav Amit
@ 2019-02-22 23:02                                     ` Jann Horn
  2019-02-22 23:22                                       ` Nadav Amit
  2019-02-22 23:59                                       ` Andy Lutomirski
  0 siblings, 2 replies; 34+ messages in thread
From: Jann Horn @ 2019-02-22 23:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> >
> >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> >>
> >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >>>>
> >>>> (adding some people from the text_poke series to the thread, removing stable@)
> >>>>
> >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>
> >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>>>>
> >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>>>>> on a kernel access to user space.
> >>>>>>>>
> >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>>>>> and users pass both user and kernel addresses into it and expect
> >>>>>>>> that the helper will actually try to read from that address.
> >>>>>>>>
> >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>>>>> it will break the expectations.
> >>>>>>>> How do we solve it in bpf_probe_read?
> >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>>>>> in the loop?
> >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>>>>> and shows up in their perf report.
> >>>>>>>
> >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> >>>>>>> differentiate between kernel or user reads. Can this be done with
> >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>>>>
> >>>>>> so you're saying you will break existing kprobe scripts?
> >>>>>> I don't think it's a good idea.
> >>>>>> It's not acceptable to break bpf_probe_read uapi.
> >>>>>
> >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>>>>
> >>>>> What to do about existing scripts is a different question.
> >>>>
> >>>> This lack of logical separation between user and kernel addresses
> >>>> might interact interestingly with the text_poke series, specifically
> >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> >>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> >>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> >>>> right? If someone manages to get a tracing BPF program to trigger in a
> >>>> task that has switched to the patching mm, could they use
> >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> >>>> overwrite kernel text that is mapped writable in the patching mm?
> >>>
> >>> Yes, this is a good point. I guess text_poke() should be defined with
> >>> “__kprobes” and open-code memcpy.
> >>>
> >>> Does it sound reasonable?
> >>
> >> Doesn't __text_poke() as implemented in the proposed patch use a
> >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> >> pte_clear() (which can be a call into a separate function on paravirt
> >> kernels)?
> >
> > I will move the pte_clear() to be done after the poking mm was unloaded.
> > Give me a few minutes to send a sketch of what I think should be done.
>
> Err.. You are right, I don’t see an easy way of preventing a kprobe from
> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>
> The reasonable solution seems to me as taking all the relevant pieces of
> code (and data) that might be used during text-poking and encapsulating them, so they
> will be set in a memory area which cannot be kprobe'd. This can also be
> useful to write-protect data structures of code that calls text_poke(),
> e.g., static-keys. It can also protect data on that stack that is used
> during text_poke() from being overwritten from another core.
>
> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> when doing write-rarely operations.
>
> Right now, I think that text_poke() will keep being susceptible to such
> an attack, unless you have a better suggestion.

A relatively simple approach might be to teach BPF not to run kprobe
programs and such in contexts where current->mm isn't the active mm?
Maybe using nmi_uaccess_okay(), or something like that? It looks like
perf_callchain_user() also already uses that. Except that a lot of
this code is x86-specific...
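
[Editorial sketch, not part of the original message.] The suggestion above,
skipping tracing programs whenever the active mm is not current->mm, can be
modelled in userspace C as follows. The types model_mm and model_cpu and the
functions model_uaccess_okay and model_run_probe_prog are hypothetical
stand-ins; they are not the kernel's nmi_uaccess_okay() or trace_call_bpf(),
and only show the shape of the gate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address space descriptor. */
struct model_mm {
    int id;
};

/* Hypothetical per-CPU state. */
struct model_cpu {
    const struct model_mm *current_mm; /* mm owned by the running task */
    const struct model_mm *loaded_mm;  /* mm whose page tables are active */
};

/* True only when the active page tables belong to the running task. */
static bool model_uaccess_okay(const struct model_cpu *cpu)
{
    return cpu->current_mm == cpu->loaded_mm;
}

typedef int (*model_prog_fn)(void *ctx);

/* Example program: counts how many times it actually ran. */
int model_count_hits(void *ctx)
{
    ++*(int *)ctx;
    return 1;
}

/*
 * Run the program unless a temporary mm (e.g. a patching mm) is loaded.
 * Returns the program's result, or 0 when the program was skipped.
 */
int model_run_probe_prog(const struct model_cpu *cpu, model_prog_fn prog,
                         void *ctx)
{
    assert(cpu != NULL && prog != NULL);
    if (!model_uaccess_okay(cpu))
        return 0;
    return prog(ctx);
}
```

With the task's own mm loaded the program runs; once loaded_mm points at a
different mm, the call returns 0 without touching ctx.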

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:51                           ` Alexei Starovoitov
@ 2019-02-22 23:11                             ` Jann Horn
  2019-02-22 23:16                               ` David Miller
  2019-02-22 23:16                             ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Jann Horn @ 2019-02-22 23:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 11:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> > >
> > > Don't be surprised if we see more separation like this in the future too.
> >
> > Yes, with the whole meltdown fiasco, there's actually more pressure to
> > add more support for separation of kernel/user address spaces. As Andy
> > pointed out, it's been discussed as a future wish-list for x86-64 too.
> >
> > But yeah, right now the *common* architectures all distinguish kernel
> > and user space by pointers (ie x86-64, arm64 and powerpc).
>
> That's all fine. I'm missing rationale for making probe_kernel_read()
> fail on user addresses.
> What is fundamentally wrong with a function probe_any_address_read() ?

I think what Linus is saying is: There are some scenarios (like a
system with the old 4G/4G X86 patch) where *the same* address can
refer to two different pieces of memory, depending on whether you
interpret it as a kernel pointer or a user pointer. So for example, if
your BPF program tries to read tsk->comm, it works, but if the BPF
program then tries to read from PT_REGS_PARM2(ctx), it's going to
actually interpret the userspace address as a kernel address and read
completely different memory.

On top of that, from the security angle, this means that if a user
passes a kernel pointer into a syscall, they can trick a tracing BPF
program into looking at random kernel memory instead of the user's
memory. That may or may not be problematic, depending on what you do
afterwards with the data you've read. (For example, if this is a
service that collects performance data and then saves it to some
world-readable location on disk because the data it is collecting
(including comm strings) isn't supposed to be sensitive, you might
have a problem.)

> For context, typical bpf kprobe program looks like this:
> #define probe_read(P) \
>     ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
> SEC("kprobe/__set_task_comm")
> int bpf_prog(struct pt_regs *ctx)
> {
>     struct signal_struct *signal;
>     struct task_struct *tsk;
>     char oldcomm[16] = {};
>     char newcomm[16] = {};
>     u16 oom_score_adj;
>     u32 pid;
>
>     tsk = (void *)PT_REGS_PARM1(ctx);
>     pid = probe_read(tsk->pid);
>     bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
>     bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
>     signal = probe_read(tsk->signal);
>     oom_score_adj = probe_read(signal->oom_score_adj);
>     ...
> }
>
> where PT_REGS_PARMx macros are defined per architecture.
> On x86 it's #define PT_REGS_PARM1(x) ((x)->di)
>
> The program writer has to know the meaning of function arguments.
> In this example they need to know that __set_task_comm is defined as
> void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) in the kernel.
>
> Right now these programs just call bpf_probe_read() on whatever data
> they need to access and not differentiating whether it's user or kernel.
>
> One idea we discussed is to split bpf_probe_read() into kernel_read and user_read
> helpers, but in the BPF verifier we cannot determine which address space
> the program wants to access. The prog writer needs to manually analyze the program
> to use correct one. But mistakes are possible and cannot be fatal.
> On the kernel side we have to be safe.
> Both probe_kernel_read and probe_user_read must not panic if a pointer
> from wrong address space was passed.
>
> Hence my preference is to keep probe_kernel_read as "try read any address".
> The function can be renamed to indicate so.
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:11                             ` Jann Horn
@ 2019-02-22 23:16                               ` David Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2019-02-22 23:16 UTC (permalink / raw)
  To: jannh
  Cc: alexei.starovoitov, torvalds, mhiramat, rostedt, luto,
	linux-kernel, mingo, akpm, stable, changbin.du, keescook, luto,
	daniel, netdev, bpf

From: Jann Horn <jannh@google.com>
Date: Sat, 23 Feb 2019 00:11:58 +0100

> I think what Linus is saying is: There are some scenarios (like a
> system with the old 4G/4G X86 patch) where *the same* address can
> refer to two different pieces of memory, depending on whether you
> interpret it as a kernel pointer or a user pointer.

Exactly.

On sparc64 the kernel is mapped exactly at the same virtual addresses
as userspace processes usually are mapped, even 32-bit ones.  The
difference is the MMU context only.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 22:51                           ` Alexei Starovoitov
  2019-02-22 23:11                             ` Jann Horn
@ 2019-02-22 23:16                             ` Linus Torvalds
  2019-02-22 23:56                               ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2019-02-22 23:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 2:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> That's all fine. I'm missing rationale for making probe_kernel_read()
> fail on user addresses.

Because it already WON'T WORK in general!

> What is fundamentally wrong with a function probe_any_address_read() ?

What part of "the same pointer value can be a user address and a
kernel address" is not getting through?

The user address space and the kernel address space have separate page
tables on some architectures. We used to avoid it on x86, because
switching address spaces was expensive, but even on x86 some vendors
did it on 32-bit simply to get 4GB of user (and kernel) address space.
And now we end up doing it anyway just because of meltdown.

So a kernel pointer value of 0x12345678 could be a valid kernel
pointer pointing to some random kmalloc'ed kernel memory, and a user
pointer value of 0x12345678 could be a valid _user_ pointer pointing
to some user mapping.

See?

If you access a user pointer, you need to use a user accessor function
(eg "get_user()"), while if you access a kernel pointer you need to
just dereference it directly (unless you can't trust it, in which case
you need to use a _different_ accessor function).

The fact that user and kernel pointers happen to be distinct on x86-64
(right now) is just a random implementation detail.

Really.

I didn't realize how many people seem to have been confused about
this. But it's always been true. It's just that the common
architectures have had that "one single address space for both kernel
and user pointers" in practice.

In fact, the *very* first kernel version had separate address spaces
for kernel and user mode even on x86 (using segments, not paging). So
it has literally been true since day one in Linux that a kernel
address can be indistinguishable from a user address from a pure value
standpoint.

                 Linus
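
[Editorial sketch, not part of the original message.] The point that one
pointer value can name two different pieces of memory can be made concrete
with a small userspace model in which each address space has its own lookup
table. The names model_space, model_read_byte and the byte values are
hypothetical; real address spaces are page tables, not arrays of pairs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One mapped byte in a hypothetical address space. */
struct model_mapping {
    uint32_t addr;
    uint8_t value;
};

/* A hypothetical address space: a tiny table of mapped bytes. */
struct model_space {
    const struct model_mapping *maps;
    size_t count;
};

/* Read one byte at addr in the given space; -EFAULT when unmapped. */
int model_read_byte(const struct model_space *space, uint32_t addr,
                    uint8_t *out)
{
    assert(space != NULL && out != NULL);
    for (size_t i = 0; i < space->count; i++) {
        if (space->maps[i].addr == addr) {
            *out = space->maps[i].value;
            return 0;
        }
    }
    return -EFAULT;
}

/* The same numeric address is mapped in both spaces, to different data. */
static const struct model_mapping model_kernel_maps[] = {
    { 0x12345678u, 0x4B }, /* 'K' */
};
static const struct model_mapping model_user_maps[] = {
    { 0x12345678u, 0x55 }, /* 'U' */
};

const struct model_space model_kernel_space = { model_kernel_maps, 1 };
const struct model_space model_user_space = { model_user_maps, 1 };
```

The caller has to say which space it means: the address alone does not
determine the result, which is why a single "read any address" helper cannot
be correct on such a system.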

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:02                                     ` Jann Horn
@ 2019-02-22 23:22                                       ` Nadav Amit
  2019-02-22 23:59                                       ` Andy Lutomirski
  1 sibling, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2019-02-22 23:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Andy Lutomirski, Daniel Borkmann, Network Development, bpf,
	Rick Edgecombe, Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 3:02 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>>>> 
>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>>>> 
>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>>>> 
>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>>>> on a kernel access to user space.
>>>>>>>>>> 
>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>>>> 
>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>>>> it will break the expectations.
>>>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>>>> in the loop?
>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>>>> and shows up in their perf report.
>>>>>>>>> 
>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>>>> 
>>>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>>>> I don't think it's a good idea.
>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>>>> 
>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>>>> 
>>>>>>> What to do about existing scripts is a different question.
>>>>>> 
>>>>>> This lack of logical separation between user and kernel addresses
>>>>>> might interact interestingly with the text_poke series, specifically
>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>>>> task that has switched to the patching mm, could they use
>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>>>> overwrite kernel text that is mapped writable in the patching mm?
>>>>> 
>>>>> Yes, this is a good point. I guess text_poke() should be defined with
>>>>> “__kprobes” and open-code memcpy.
>>>>> 
>>>>> Does it sound reasonable?
>>>> 
>>>> Doesn't __text_poke() as implemented in the proposed patch use a
>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>>>> pte_clear() (which can be a call into a separate function on paravirt
>>>> kernels)?
>>> 
>>> I will move the pte_clear() to be done after the poking mm was unloaded.
>>> Give me a few minutes to send a sketch of what I think should be done.
>> 
>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>> 
>> The reasonable solution seems to me as taking all the relevant pieces of
>> code (and data) that might be used during text-poking and encapsulating them, so they
>> will be set in a memory area which cannot be kprobe'd. This can also be
>> useful to write-protect data structures of code that calls text_poke(),
>> e.g., static-keys. It can also protect data on that stack that is used
>> during text_poke() from being overwritten from another core.
>> 
>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
>> when doing write-rarely operations.
>> 
>> Right now, I think that text_poke() will keep being susceptible to such
>> an attack, unless you have a better suggestion.
> 
> A relatively simple approach might be to teach BPF not to run kprobe
> programs and such in contexts where current->mm isn't the active mm?
> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> perf_callchain_user() also already uses that. Except that a lot of
> this code is x86-specific…

I keep thinking about how to reduce the TCB that is used while text_poke()
is running, but for the specific issue here, I think your approach would
be fine, and trace_call_bpf() can be modified to do what you ask for.

But I am not sure that relying on current->mm gets us any more security
relative to having another, unrelated, explicit kprobe-disable indication,
which is cleaner from a design point of view.

I can see how we would get “some more security” if our decision whether
kprobes should be enabled were based purely on some hardware register (e.g.,
CR3) and we could unequivocally tell whether kprobes eBPF should be on/off
without memory accesses (e.g., PCID bit set). Yet, I am not sure it is worth it.

What do you say?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:16                             ` Linus Torvalds
@ 2019-02-22 23:56                               ` Alexei Starovoitov
  2019-02-23  0:08                                 ` Linus Torvalds
  2019-02-23  4:51                                 ` Masami Hiramatsu
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-22 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
> 
> So a kernel pointer value of 0x12345678 could be a valid kernel
> pointer pointing to some random kmalloc'ed kernel memory, and a user
> pointer value of 0x12345678 could be a valid _user_ pointer pointing
> to some user mapping.
> 
> See?
> 
> If you access a user pointer, you need to use a user accessor function
> (eg "get_user()"), while if you access a kernel pointer you need to
> just dereference it directly (unless you can't trust it, in which case
> you need to use a _different_ accessor function).

that was clear already.
Reading 0x12345678 via probe_kernel_read can return valid value
and via get_user() can return another valid value on _some_ architectures.

> The fact that user and kernel pointers happen to be distinct on x86-64
> (right now) is just a random implementation detail.

yes, and my point is that people already rely on this implementation detail.
Say we implement
int bpf_probe_read(void *val, void *unsafe_ptr)
{
  if (probe_kernel_read(val, unsafe_ptr) == OK) {
     return 0;
  } else if (get_user(val, unsafe_ptr) == OK) {
     return 0;
  } else {
     *val = 0;
     return -EFAULT;
  }
}
It will preserve existing bpf_probe_read() behavior on x86.
If x86 implementation changes tomorrow then progs that read user
addresses may start failing randomly because first probe_kernel_read()
will be returning random values from kernel memory and that's no good,
but at least we won't be breaking them today, so we have time to
introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.

Imo that's much better than making current bpf_probe_read() fail
on user addresses today and not providing a non disruptive path forward.
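
A minimal userspace sketch of the fallback order described above, under
loud assumptions: each toy "address space" is just a base, a length and a
byte array, and model_read() and model_bpf_probe_read() are hypothetical
names, not kernel APIs. It only illustrates that trying the kernel read
first is unambiguous while the two ranges are disjoint, and silently picks
the kernel value once they overlap.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy address space: a base address, a length, and one byte per address. */
struct model_space {
    uintptr_t base;
    size_t len;
    const uint8_t *bytes;
};

/* Read one byte if addr is mapped in the given space, else fail. */
static int model_read(const struct model_space *s, uintptr_t addr, uint8_t *val)
{
    if (addr < s->base || addr - s->base >= s->len)
        return -EFAULT;
    *val = s->bytes[addr - s->base];
    return 0;
}

/* Fallback order from the message: kernel first, then user, else 0 and -EFAULT. */
static int model_bpf_probe_read(const struct model_space *kernel,
                                const struct model_space *user,
                                uintptr_t addr, uint8_t *val)
{
    if (model_read(kernel, addr, val) == 0)
        return 0;
    if (model_read(user, addr, val) == 0)
        return 0;
    *val = 0;
    return -EFAULT;
}
```

With disjoint ranges a user address falls through to the user read. With
overlapping ranges the same address returns the kernel byte and the caller
has no way to notice, which is the "random values from kernel memory" case.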


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:02                                     ` Jann Horn
  2019-02-22 23:22                                       ` Nadav Amit
@ 2019-02-22 23:59                                       ` Andy Lutomirski
  2019-02-23  0:03                                         ` Alexei Starovoitov
                                                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Andy Lutomirski @ 2019-02-22 23:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Nadav Amit, Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Andy Lutomirski,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> > >
> > >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> > >>
> > >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> > >>>>
> > >>>> (adding some people from the text_poke series to the thread, removing stable@)
> > >>>>
> > >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> > >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> > >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > >>>>>>>>>
> > >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> > >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> > >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> > >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> > >>>>>>>>> on a kernel access to user space.
> > >>>>>>>>
> > >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > >>>>>>>> and users pass both user and kernel addresses into it and expect
> > >>>>>>>> that the helper will actually try to read from that address.
> > >>>>>>>>
> > >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> > >>>>>>>> it will break the expectations.
> > >>>>>>>> How do we solve it in bpf_probe_read?
> > >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > >>>>>>>> in the loop?
> > >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> > >>>>>>>> and shows up in their perf report.
> > >>>>>>>
> > >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> > >>>>>>> differentiate between kernel or user reads. Can this be done with
> > >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> > >>>>>>
> > >>>>>> so you're saying you will break existing kprobe scripts?
> > >>>>>> I don't think it's a good idea.
> > >>>>>> It's not acceptable to break bpf_probe_read uapi.
> > >>>>>
> > >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> > >>>>>
> > >>>>> What to do about existing scripts is a different question.
> > >>>>
> > >>>> This lack of logical separation between user and kernel addresses
> > >>>> might interact interestingly with the text_poke series, specifically
> > >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > >>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> > >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > >>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> > >>>> right? If someone manages to get a tracing BPF program to trigger in a
> > >>>> task that has switched to the patching mm, could they use
> > >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> > >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> > >>>> overwrite kernel text that is mapped writable in the patching mm?
> > >>>
> > >>> Yes, this is a good point. I guess text_poke() should be defined with
> > >>> “__kprobes” and open-code memcpy.
> > >>>
> > >>> Does it sound reasonable?
> > >>
> > >> Doesn't __text_poke() as implemented in the proposed patch use a
> > >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> > >> pte_clear() (which can be a call into a separate function on paravirt
> > >> kernels)?
> > >
> > > I will move the pte_clear() to be done after the poking mm was unloaded.
> > > Give me a few minutes to send a sketch of what I think should be done.
> >
> > Err.. You are right, I don’t see an easy way of preventing a kprobe from
> > being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> >
> > The reasonable solution seems to me as taking all the relevant pieces of
> > code (and data) that might be used during text-poking and encapsulating them, so they
> > will be set in a memory area which cannot be kprobe'd. This can also be
> > useful to write-protect data structures of code that calls text_poke(),
> > e.g., static-keys. It can also protect data on that stack that is used
> > during text_poke() from being overwritten from another core.
> >
> > This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> > when doing write-rarely operations.
> >
> > Right now, I think that text_poke() will keep being susceptible to such
> > an attack, unless you have a better suggestion.
>
> A relatively simple approach might be to teach BPF not to run kprobe
> programs and such in contexts where current->mm isn't the active mm?
> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> perf_callchain_user() also already uses that. Except that a lot of
> this code is x86-specific...

This sounds like exactly the right solution.  If you're running from
some unknown context (like NMI or tracing), then you should check
nmi_uaccess_okay().  I think we should just promote that to be a
non-arch-specific function (that returns true by default) and check it
in the relevant bpf_probe_xyz() functions.

Alexei, does that seem reasonable?
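
A rough userspace sketch of that shape, with everything in it assumed for
illustration: struct model_cpu, the two check functions and
model_probe_write() are hypothetical stand-ins, and -EPERM is an arbitrary
choice of error code. The generic check always says yes; an arch-style
check says yes only while the task's own mm is loaded; the helper consults
whichever check it is given before touching memory.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy per-CPU view: which mm is loaded and which one the task owns. */
struct model_cpu {
    const void *loaded_mm;
    const void *current_mm;
};

typedef bool (*model_uaccess_check)(const struct model_cpu *cpu);

/* Generic default: architectures without a check always allow the access. */
static bool model_generic_uaccess_okay(const struct model_cpu *cpu)
{
    (void)cpu;
    return true;
}

/* Arch-style override: allow only while the task's own mm is loaded. */
static bool model_arch_uaccess_okay(const struct model_cpu *cpu)
{
    return cpu->loaded_mm == cpu->current_mm;
}

/* Hypothetical probe helper that consults the check before touching memory. */
static int model_probe_write(const struct model_cpu *cpu,
                             model_uaccess_check okay,
                             void *dst, const void *src, size_t size)
{
    if (!okay(cpu))
        return -EPERM;  /* error code chosen for illustration */
    memcpy(dst, src, size);
    return 0;
}
```

In this model a write attempted while a temporary mm is loaded is refused
by the arch-style check and leaves the destination untouched, while the
generic default keeps today's behavior for everyone else.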

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                       ` Andy Lutomirski
@ 2019-02-23  0:03                                         ` Alexei Starovoitov
  2019-02-23  0:15                                         ` Nadav Amit
  2019-02-25 13:36                                         ` Masami Hiramatsu
  2 siblings, 0 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-23  0:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Nadav Amit, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Daniel Borkmann,
	Network Development, bpf, Rick Edgecombe, Dave Hansen,
	Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, Feb 22, 2019 at 03:59:30PM -0800, Andy Lutomirski wrote:
> >
> > A relatively simple approach might be to teach BPF not to run kprobe
> > programs and such in contexts where current->mm isn't the active mm?
> > Maybe using nmi_uaccess_okay(), or something like that? It looks like
> > perf_callchain_user() also already uses that. Except that a lot of
> > this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> > in the relevant bpf_probe_xyz() functions.
> 
> Alexei, does that seem reasonable?

yep. I think that should work.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:56                               ` Alexei Starovoitov
@ 2019-02-23  0:08                                 ` Linus Torvalds
  2019-02-23  2:28                                   ` Alexei Starovoitov
  2019-02-23  4:51                                 ` Masami Hiramatsu
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2019-02-23  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> It will preserve existing bpf_probe_read() behavior on x86.

... but that's the worst possible situation.

It appears that people haven't understood that kernel and user
addresses are distinct, and may have written programs that are
fundamentally buggy.

And we _want_ to make it clear that they are buggy on x86-64, exactly
because x86-64 is the one that gets the most testing - by far.

So if x86-64 continues working for buggy programs, then that only
means that those bugs never get fixed.

It would be much better to try to get those things fixed, and make the
x86-64 implementation stricter, exactly so that people end up
_realizing_ that they can't just think "a pointer is a pointer, and
the context doesn't matter".

From a pure functional safety standpoint, I thought bpf already knew
what kind of a pointer it had?

                   Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                       ` Andy Lutomirski
  2019-02-23  0:03                                         ` Alexei Starovoitov
@ 2019-02-23  0:15                                         ` Nadav Amit
  2019-02-24 19:35                                           ` Andy Lutomirski
  2019-02-25 13:36                                         ` Masami Hiramatsu
  2 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2019-02-23  0:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Alexei Starovoitov, Steven Rostedt, Linus Torvalds,
	Masami Hiramatsu, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, Changbin Du, Kees Cook, Daniel Borkmann,
	Network Development, bpf, Rick Edgecombe, Dave Hansen,
	Peter Zijlstra (Intel),
	Igor Stoppa

> On Feb 22, 2019, at 3:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
>> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
>>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
>>>>> 
>>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
>>>>>>> 
>>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
>>>>>>> 
>>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
>>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
>>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
>>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
>>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
>>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
>>>>>>>>>>>> on a kernel access to user space.
>>>>>>>>>>> 
>>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
>>>>>>>>>>> and users pass both user and kernel addresses into it and expect
>>>>>>>>>>> that the helper will actually try to read from that address.
>>>>>>>>>>> 
>>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
>>>>>>>>>>> it will break the expectations.
>>>>>>>>>>> How do we solve it in bpf_probe_read?
>>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
>>>>>>>>>>> in the loop?
>>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
>>>>>>>>>>> and shows up in their perf report.
>>>>>>>>>> 
>>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
>>>>>>>>>> differentiate between kernel or user reads. Can this be done with
>>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
>>>>>>>>> 
>>>>>>>>> so you're saying you will break existing kprobe scripts?
>>>>>>>>> I don't think it's a good idea.
>>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
>>>>>>>> 
>>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
>>>>>>>> 
>>>>>>>> What to do about existing scripts is a different question.
>>>>>>> 
>>>>>>> This lack of logical separation between user and kernel addresses
>>>>>>> might interact interestingly with the text_poke series, specifically
>>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
>>>>>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
>>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
>>>>>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
>>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
>>>>>>> task that has switched to the patching mm, could they use
>>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
>>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
>>>>>>> overwrite kernel text that is mapped writable in the patching mm?
>>>>>> 
>>>>>> Yes, this is a good point. I guess text_poke() should be defined with
>>>>>> “__kprobes” and open-code memcpy.
>>>>>> 
>>>>>> Does it sound reasonable?
>>>>> 
>>>>> Doesn't __text_poke() as implemented in the proposed patch use a
>>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
>>>>> pte_clear() (which can be a call into a separate function on paravirt
>>>>> kernels)?
>>>> 
>>>> I will move the pte_clear() to be done after the poking mm was unloaded.
>>>> Give me a few minutes to send a sketch of what I think should be done.
>>> 
>>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
>>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
>>> 
>>> The reasonable solution seems to me as taking all the relevant pieces of
>>> code (and data) that might be used during text-poking and encapsulating them, so they
>>> will be set in a memory area which cannot be kprobe'd. This can also be
>>> useful to write-protect data structures of code that calls text_poke(),
>>> e.g., static-keys. It can also protect data on that stack that is used
>>> during text_poke() from being overwritten from another core.
>>> 
>>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
>>> when doing write-rarely operations.
>>> 
>>> Right now, I think that text_poke() will keep being susceptible to such
>>> an attack, unless you have a better suggestion.
>> 
>> A relatively simple approach might be to teach BPF not to run kprobe
>> programs and such in contexts where current->mm isn't the active mm?
>> Maybe using nmi_uaccess_okay(), or something like that? It looks like
>> perf_callchain_user() also already uses that. Except that a lot of
>> this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> in the relevant bpf_probe_xyz() functions.

I can do that, but notice that switch_mm_irqs_off() writes to
cpu_tlbstate.loaded_mm before it actually writes to CR3. So there are still
a couple of instructions (and the load_new_mm_cr3()) in between that a
kprobe can be set on, no?

I can mark them as non-kprobable.
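
To make the window concrete, here is a toy userspace model of the ordering
as described in this message only; it is not the real
switch_mm_irqs_off(), and every name in it is hypothetical. loaded_mm
stands in for the software bookkeeping and hw_mm for what CR3 would point
at. A check that looks only at the bookkeeping passes between the two
steps even though the hardware is still on the other mm.

```c
#include <stdbool.h>

/* Toy per-CPU state; mm identities are plain integers here. */
struct model_cpu {
    int loaded_mm;  /* software bookkeeping */
    int hw_mm;      /* what the hardware is actually using */
};

/* Step 1 of the described ordering: update the bookkeeping first. */
static void model_switch_step1(struct model_cpu *cpu, int next_mm)
{
    cpu->loaded_mm = next_mm;
}

/* Step 2: only now does the hardware start using the new page tables. */
static void model_switch_step2(struct model_cpu *cpu, int next_mm)
{
    cpu->hw_mm = next_mm;
}

/* A check that looks only at the bookkeeping. */
static bool model_uaccess_okay(const struct model_cpu *cpu, int current_mm)
{
    return cpu->loaded_mm == current_mm;
}

/* True when the check passes although the hardware still uses another mm. */
static bool model_in_unsafe_window(const struct model_cpu *cpu, int current_mm)
{
    return model_uaccess_okay(cpu, current_mm) && cpu->hw_mm != current_mm;
}
```

When switching back from the poking mm to the task's mm, the model is in
the unsafe window after step 1 and leaves it after step 2, so anything
that can run a probe between the two steps would need to be excluded.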


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  0:08                                 ` Linus Torvalds
@ 2019-02-23  2:28                                   ` Alexei Starovoitov
  2019-02-23  2:32                                     ` Linus Torvalds
  2019-02-23  3:02                                     ` Steven Rostedt
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2019-02-23  2:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 04:08:59PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > It will preserve existing bpf_probe_read() behavior on x86.
> 
> ... but that's the worst possible situation.
> 
> It appears that people haven't understood that kernel and user
> addresses are distinct, and may have written programs that are
> fundamentally buggy.
> 
> And we _want_ to make it clear that they are buggy on x86-64, exactly
> because x86-64 is the one that gets the most testing - by far.
> 
> So if x86-64 continues working for buggy programs, then that only
> means that those bugs never get fixed.
> 
> It would be much better to try to get those things fixed, and make the
> x86-64 implementation stricter, exactly so that people end up
> _realizing_ that they can't just think "a pointer is a pointer, and
> the context doesn't matter".
>
> From a pure functional safety standpoint, I thought bpf already knew
> what kind of a pointer it had?

when bpf verifier knows the type of pointer it allows direct access to it.
That's the case for skb, socket, packet data, hash maps, arrays, stack, etc.
Networking progs cannot call bpf_probe_read().
It's available to tracing progs only and their goal is to walk the kernel and
user memory with addresses that cannot be statically verified
at program load time.
We are working on adding type information (BTF) to vmlinux.
Soon we'll be able to tell the type of every kernel function argument.
Right now arg1 and arg2 in a kprobed function are just u64 pt_regs->di, si.
Soon we'll be able to precisely identify their C type.

I completely agree on the direction that x86 is the architecture that
sets an example and users need to learn the difference in pointers.
I only disagree on timing.
Right now users don't have an alternative.
In our repo I counted ~400 calls to bpf_probe_read and about 10 times more
'indirect' calls. What's happening with 'indirect' calls is BCC toolchain
using clang to automatically replace struct_a->field_foo access with
bpf_probe_read(struct_a + offsetof(typeof(struct_a), field_foo));
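
As a plain C illustration of that rewrite (a userspace sketch, not BCC
output): model_probe_read() is a hypothetical stand-in that just does a
memcpy, and the layout of struct struct_a is invented for the example. The
argument order (dst, size, unsafe_ptr) follows the helper's.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a probe-read helper: plain memcpy, always succeeds. */
static int model_probe_read(void *dst, size_t size, const void *unsafe_ptr)
{
    memcpy(dst, unsafe_ptr, size);
    return 0;
}

/* Example structure; the names echo the text above and are illustrative only. */
struct struct_a {
    int field_bar;
    long field_foo;
};

/* What the script author writes: a direct dereference. */
static long read_foo_direct(const struct struct_a *a)
{
    return a->field_foo;
}

/* What a rewrite along the described lines would turn it into. */
static long read_foo_rewritten(const struct struct_a *a)
{
    long val = 0;

    model_probe_read(&val, sizeof(val),
                     (const char *)a + offsetof(struct struct_a, field_foo));
    return val;
}
```

Both functions return the same value here; the difference is that the
rewritten form goes through a helper, so whether that helper is the kernel
or the user variant has to be decided at every such dereference.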

If we had __user vs __kernel annotation available to clang we could have
taught BCC to replace this '->' dereference with appropriate kernel vs user
helper. Also we need to teach GCC to recognize __user and store into dwarf,
so we can take it further into BTF and verify later.

Also I think disallowing bpf_probe_read() to read user pointers will not
have the desired teaching effect. It will only cause painful debugging for folks
when their progs stop working. It's better to remove bpf_probe_read()
completely.
imo the process of teaching users the kernel vs user pointer difference
needs to be gradual.
First we introduce bpf_probe_kernel_read and bpf_probe_user_read and
introduce clang/gcc tooling to catch the mistakes.
Going over these 400+ places and manually grepping kernel sources
for the __user keyword is not a great proposal if we want to keep those users.
Once we have this working we can remove bpf_probe_read() altogether.
Rejecting a bpf prog at load time is a clear signal that the user has to fix it
(instead of changing run-time behavior).
When the verifier gets even smarter it could potentially replace probe_read
with probe_kernel_read and probe_user_read when it has that type info.

imo this kernel release should finish as-is and in the next cycle we can
change probe_kernel_read() to reject user address, have temporary
workaround in bpf_probe_read() with probe_kernel_read+get_user hack,
introduce new bpf helpers, new tooling and eventually remove
buggy bpf_probe_read.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  2:28                                   ` Alexei Starovoitov
@ 2019-02-23  2:32                                     ` Linus Torvalds
  2019-02-23  3:02                                     ` Steven Rostedt
  1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2019-02-23  2:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Masami Hiramatsu, Steven Rostedt, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 6:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> imo this kernel release should finish as-is and in the next cycle we can
> change probe_kernel_read() to reject user address [..]

Absolutely. Nothing is going to change right now for 5.0, which is imminent.

It's really a "long-term we really need to fix this", where the only
question is how soon "long-term" is.

                   Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  2:28                                   ` Alexei Starovoitov
  2019-02-23  2:32                                     ` Linus Torvalds
@ 2019-02-23  3:02                                     ` Steven Rostedt
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2019-02-23  3:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Andy Lutomirski,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andrew Lutomirski,
	Daniel Borkmann, Netdev, bpf

On Fri, 22 Feb 2019 18:28:53 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> First we introduce bpf_probe_kernel_read and bpf_probe_user_read and
> introduce clang/gcc tooling to catch the mistakes.
> Going over these 400+ places and manually grepping kernel sources
> for the __user keyword is not a great proposal if we want to keep those users.
> Once we have this working we can remove bpf_probe_read() altogether.
> Rejecting a bpf prog at load time is a clear signal that the user has to fix it
> (instead of changing run-time behavior).
> When the verifier gets even smarter it could potentially replace probe_read
> with probe_kernel_read and probe_user_read when it has that type info.

I was about to suggest this approach. Document that bpf_probe_read() is
known to be buggy and will be deprecated in the future, and that all
new bpf scripts should start using bpf_probe_kernel/user_read()
instead (after they have been implemented of course). And give time for
people to fix their current scripts.

Perhaps in the near future, trigger some kind of warning for users that
use bpf_probe_read().

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:56                               ` Alexei Starovoitov
  2019-02-23  0:08                                 ` Linus Torvalds
@ 2019-02-23  4:51                                 ` Masami Hiramatsu
  1 sibling, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2019-02-23  4:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, David Miller, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, Linux List Kernel Mailing, Ingo Molnar,
	Andrew Morton, stable, Changbin Du, Jann Horn, Kees Cook,
	Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, 22 Feb 2019 15:56:20 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 03:16:35PM -0800, Linus Torvalds wrote:
> > 
> > So a kernel pointer value of 0x12345678 could be a valid kernel
> > pointer pointing to some random kmalloc'ed kernel memory, and a user
> > pointer value of 0x12345678 could be a valid _user_ pointer pointing
> > to some user mapping.
> > 
> > See?
> > 
> > If you access a user pointer, you need to use a user accessor function
> > (eg "get_user()"), while if you access a kernel pointer you need to
> > just dereference it directly (unless you can't trust it, in which case
> > you need to use a _different_ accessor function).
> 
> that was clear already.
> Reading 0x12345678 via probe_kernel_read can return valid value
> and via get_user() can return another valid value on _some_ architectures.
> 
> > The fact that user and kernel pointers happen to be distinct on x86-64
> > (right now) is just a random implementation detail.
> 
> yes and my point that people already rely on this implementation detail.
> Say we implement 
> int bpf_probe_read(void *val, void *unsafe_ptr)
> {
>   if (probe_kernel_read(val, unsafe_ptr) == OK) {
>      return 0;
>   } else if (get_user(val, unsafe_ptr) == OK) {
>      return 0;
>   } else {
>      *val = 0;
>      return -EFAULT;
>   }
> }

Note that we cannot use get_user() from a kprobe handler. If you use it,
you have to prepare a fault_handler() and make it possible to abort bpf
itself. So maybe you can use probe_user_read() instead.

Hmm, however, it still doesn't work correctly on "some" architectures,
since whether a pointer (address) points to user space or kernel space
depends on the context. In kprobe/bpf, the context means where you
put the probe and which pointer you record.

I think only the "__user" tag tells us which pointers are user-space ones.
Unfortunately, that "__user" tag is only for the compiler or checker, not
for the runtime binary. Such a useful attribute goes away when we execute it.

So, even if we introduce "ustring", ftrace/perf users have to decide to use
it by themselves. As far as I know, DWARF (debuginfo) also doesn't have
that attribute, so perf-probe cannot detect it from debuginfo.
(Maybe if we introduced a C parser, it could be detected...)

> It will preserve existing bpf_probe_read() behavior on x86.
> If x86 implementation changes tomorrow then progs that read user
> addresses may start failing randomly because first probe_kernel_read()
> will be returning random values from kernel memory and that's no good,
> but at least we won't be breaking them today, so we have time to
> introduce bpf_user_read and bpf_kernel_read and folks have time to adopt them.

I see. I think bpf also has to introduce a new bpf_probe_read_user() and
keep bpf_probe_read() for kernel data only.
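
To make the concern concrete, here is a minimal user-space sketch. It is
only a toy model, not kernel code: the two "address spaces" are small
arrays, and every name in it (model_map, model_read_kernel,
model_read_user, model_read_legacy, SPACE_SIZE) is hypothetical. It assumes
an architecture where the same numeric address can be valid in both spaces,
and it shows why a "try kernel first, then user" fallback silently returns
the kernel byte for a user pointer, while separate helpers do not.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPACE_SIZE 64	/* size of each toy address space, in bytes */

/* One toy address space: backing bytes plus a per-byte "mapped" flag. */
struct addr_space {
	uint8_t mem[SPACE_SIZE];
	uint8_t mapped[SPACE_SIZE];
};

struct addr_space kernel_space;	/* hypothetical kernel mappings */
struct addr_space user_space;	/* hypothetical user mappings */

/* Map size bytes from src at addr; returns 0, or -EINVAL if out of range. */
int model_map(struct addr_space *as, uintptr_t addr, const void *src, size_t size)
{
	if (addr >= SPACE_SIZE || size > SPACE_SIZE - addr)
		return -EINVAL;
	memcpy(&as->mem[addr], src, size);
	memset(&as->mapped[addr], 1, size);
	return 0;
}

/* Copy size bytes out of one space; zero dst and return -EFAULT on a "fault". */
static int space_read(const struct addr_space *as, void *dst, size_t size,
		      uintptr_t addr)
{
	size_t i;

	if (addr >= SPACE_SIZE || size > SPACE_SIZE - addr)
		goto fault;
	for (i = 0; i < size; i++)
		if (!as->mapped[addr + i])
			goto fault;
	memcpy(dst, &as->mem[addr], size);
	return 0;
fault:
	memset(dst, 0, size);
	return -EFAULT;
}

/* Stand-in for a kernel-only read helper. */
int model_read_kernel(void *dst, size_t size, uintptr_t addr)
{
	return space_read(&kernel_space, dst, size, addr);
}

/* Stand-in for a user-only read helper. */
int model_read_user(void *dst, size_t size, uintptr_t addr)
{
	return space_read(&user_space, dst, size, addr);
}

/*
 * Stand-in for the fallback discussed above: kernel first, then user.
 * If addr is mapped in both spaces, the kernel value always wins.
 */
int model_read_legacy(void *dst, size_t size, uintptr_t addr)
{
	if (model_read_kernel(dst, size, addr) == 0)
		return 0;
	return model_read_user(dst, size, addr);
}
```

With 'K' mapped at 0x10 in the kernel space and 'U' at the same address in
the user space, model_read_user() returns 'U' but model_read_legacy()
returns 'K'. Only the caller knows which one it meant, which is why the
helper has to be split.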

> Imo that's much better than making current bpf_probe_read() fail
> on user addresses today and not providing a non disruptive path forward.

Agreed.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-23  0:15                                         ` Nadav Amit
@ 2019-02-24 19:35                                           ` Andy Lutomirski
  0 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2019-02-24 19:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Jann Horn, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Sat, Feb 23, 2019 at 12:30 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Feb 22, 2019, at 3:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
> >> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>
> >>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> >>>>>
> >>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> >>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> >>>>>>>
> >>>>>>> (adding some people from the text_poke series to the thread, removing stable@)
> >>>>>>>
> >>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>>>>>>>> on a kernel access to user space.
> >>>>>>>>>>>
> >>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>>>>>>>> and users pass both user and kernel addresses into it and expect
> >>>>>>>>>>> that the helper will actually try to read from that address.
> >>>>>>>>>>>
> >>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>>>>>>>> it will break the expectations.
> >>>>>>>>>>> How do we solve it in bpf_probe_read?
> >>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>>>>>>>> in the loop?
> >>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>>>>>>>> and shows up in their perf report.
> >>>>>>>>>>
> >>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
> >>>>>>>>>> differentiate between kernel or user reads. Can this be done with
> >>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>>>>>>>
> >>>>>>>>> so you're saying you will break existing kprobe scripts?
> >>>>>>>>> I don't think it's a good idea.
> >>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
> >>>>>>>>
> >>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> >>>>>>>>
> >>>>>>>> What to do about existing scripts is a different question.
> >>>>>>>
> >>>>>>> This lack of logical separation between user and kernel addresses
> >>>>>>> might interact interestingly with the text_poke series, specifically
> >>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> >>>>>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> >>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> >>>>>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> >>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
> >>>>>>> task that has switched to the patching mm, could they use
> >>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> >>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> >>>>>>> overwrite kernel text that is mapped writable in the patching mm?
> >>>>>>
> >>>>>> Yes, this is a good point. I guess text_poke() should be defined with
> >>>>>> “__kprobes” and open-code memcpy.
> >>>>>>
> >>>>>> Does it sound reasonable?
> >>>>>
> >>>>> Doesn't __text_poke() as implemented in the proposed patch use a
> >>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
> >>>>> pte_clear() (which can be a call into a separate function on paravirt
> >>>>> kernels)?
> >>>>
> >>>> I will move the pte_clear() to be done after the poking mm was unloaded.
> >>>> Give me a few minutes to send a sketch of what I think should be done.
> >>>
> >>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
> >>> being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> >>>
> >>> The reasonable solution seems to me as taking all the relevant pieces of
> >>> code (and data) that might be used during text-poking and encapsulating them, so they
> >>> will be set in a memory area which cannot be kprobe'd. This can also be
> >>> useful to write-protect data structures of code that calls text_poke(),
> >>> e.g., static-keys. It can also protect data on that stack that is used
> >>> during text_poke() from being overwritten from another core.
> >>>
> >>> This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> >>> when doing write-rarely operations.
> >>>
> >>> Right now, I think that text_poke() will keep being susceptible to such
> >>> an attack, unless you have a better suggestion.
> >>
> >> A relatively simple approach might be to teach BPF not to run kprobe
> >> programs and such in contexts where current->mm isn't the active mm?
> >> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> >> perf_callchain_user() also already uses that. Except that a lot of
> >> this code is x86-specific...
> >
> > This sounds like exactly the right solution.  If you're running from
> > some unknown context (like NMI or tracing), then you should check
> > nmi_uaccess_okay().  I think we should just promote that to be a
> > non-arch-specific function (that returns true by default) and check it
> > the relevant bpf_probe_xyz() functions.
>
> I can do that, but notice that switch_mm_irqs_off() writes to
> cpu_tlbstate.loaded_mm before it actually writes to CR3. So there are still
> a couple of instructions (and the load_new_mm_cr3()) in between that a
> kprobe can be set on, no?

But you can't mark them as no-nmi :)  See the comment in
nmi_uaccess_okay() -- the code is intended to work correctly during this
window.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 23:59                                       ` Andy Lutomirski
  2019-02-23  0:03                                         ` Alexei Starovoitov
  2019-02-23  0:15                                         ` Nadav Amit
@ 2019-02-25 13:36                                         ` Masami Hiramatsu
  2 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2019-02-25 13:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Nadav Amit, Alexei Starovoitov, Steven Rostedt,
	Linus Torvalds, Masami Hiramatsu, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, Changbin Du, Kees Cook,
	Daniel Borkmann, Network Development, bpf, Rick Edgecombe,
	Dave Hansen, Peter Zijlstra (Intel),
	Igor Stoppa

On Fri, 22 Feb 2019 15:59:30 -0800
Andy Lutomirski <luto@kernel.org> wrote:

> On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <namit@vmware.com> wrote:
> > > > On Feb 22, 2019, at 2:21 PM, Nadav Amit <namit@vmware.com> wrote:
> > > >
> > > >> On Feb 22, 2019, at 2:17 PM, Jann Horn <jannh@google.com> wrote:
> > > >>
> > > >> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <namit@vmware.com> wrote:
> > > >>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <jannh@google.com> wrote:
> > > >>>>
> > > >>>> (adding some people from the text_poke series to the thread, removing stable@)
> > > >>>>
> > > >>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > > >>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > >>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> > > >>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> > > >>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> > > >>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> > > >>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> > > >>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> > > >>>>>>>>> on a kernel access to user space.
> > > >>>>>>>>
> > > >>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > > >>>>>>>> and users pass both user and kernel addresses into it and expect
> > > >>>>>>>> that the helper will actually try to read from that address.
> > > >>>>>>>>
> > > >>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> > > >>>>>>>> it will break the expectations.
> > > >>>>>>>> How do we solve it in bpf_probe_read?
> > > >>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> > > >>>>>>>> in the loop?
> > > >>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> > > >>>>>>>> and shows up in their perf report.
> > > >>>>>>>
> > > >>>>>>> We're changing kprobes to add a specific flag to say that we want to
> > > >>>>>>> differentiate between kernel or user reads. Can this be done with
> > > >>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> > > >>>>>>
> > > >>>>>> so you're saying you will break existing kprobe scripts?
> > > >>>>>> I don't think it's a good idea.
> > > >>>>>> It's not acceptable to break bpf_probe_read uapi.
> > > >>>>>
> > > >>>>> If so, the uapi is wrong: a long-sized number does not reliably identify an address if you don’t separately know whether it’s a user or kernel address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied for RISC-V and future x86_64 to join the crowd.  I don’t know whether I’ll win this fight, but the uapi will probably have to change for at least s390x.
> > > >>>>>
> > > >>>>> What to do about existing scripts is a different question.
> > > >>>>
> > > >>>> This lack of logical separation between user and kernel addresses
> > > >>>> might interact interestingly with the text_poke series, specifically
> > > >>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for
> > > >>>> patching" (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> > > >>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text
> > > >>>> poking" (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> > > >>>> right? If someone manages to get a tracing BPF program to trigger in a
> > > >>>> task that has switched to the patching mm, could they use
> > > >>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> > > >>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> > > >>>> overwrite kernel text that is mapped writable in the patching mm?
> > > >>>
> > > >>> Yes, this is a good point. I guess text_poke() should be defined with
> > > >>> “__kprobes” and open-code memcpy.
> > > >>>
> > > >>> Does it sound reasonable?
> > > >>
> > > >> Doesn't __text_poke() as implemented in the proposed patch use a
> > > >> couple other kernel functions, too? Like switch_mm_irqs_off() and
> > > >> pte_clear() (which can be a call into a separate function on paravirt
> > > >> kernels)?
> > > >
> > > > I will move the pte_clear() to be done after the poking mm was unloaded.
> > > > Give me a few minutes to send a sketch of what I think should be done.
> > >
> > > Err.. You are right, I don’t see an easy way of preventing a kprobe from
> > > being set on switch_mm_irqs_off(), and open-coding this monster is too ugly.
> > >
> > > The reasonable solution seems to me as taking all the relevant pieces of
> > > code (and data) that might be used during text-poking and encapsulating them, so they
> > > will be set in a memory area which cannot be kprobe'd. This can also be
> > > useful to write-protect data structures of code that calls text_poke(),
> > > e.g., static-keys. It can also protect data on that stack that is used
> > > during text_poke() from being overwritten from another core.
> > >
> > > This solution is somewhat similar to Igor Stoppa’s idea of using “enclaves”
> > > when doing write-rarely operations.
> > >
> > > Right now, I think that text_poke() will keep being susceptible to such
> > > an attack, unless you have a better suggestion.
> >
> > A relatively simple approach might be to teach BPF not to run kprobe
> > programs and such in contexts where current->mm isn't the active mm?
> > Maybe using nmi_uaccess_okay(), or something like that? It looks like
> > perf_callchain_user() also already uses that. Except that a lot of
> > this code is x86-specific...
> 
> This sounds like exactly the right solution.  If you're running from
> some unknown context (like NMI or tracing), then you should check
> nmi_uaccess_okay().  I think we should just promote that to be a
> non-arch-specific function (that returns true by default) and check it
> the relevant bpf_probe_xyz() functions.

This treatment may also be needed for my work: probe_user_read() should
likewise fail if nmi_uaccess_okay() returns false.
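
As a rough illustration of that guard, here is a user-space sketch. It is a
toy model under stated assumptions, not the real implementation: struct
model_mm, struct model_task, struct model_cpu and both functions are
hypothetical names, and the check only mirrors the idea discussed above
(refuse user access when current->mm is not the mm loaded on this CPU, e.g.
while a temporary patching mm is active).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy mm: just a window of "user memory". */
struct model_mm {
	const unsigned char *user_mem;
	size_t user_len;
};

/* Toy task and per-CPU state, holding only what the check needs. */
struct model_task {
	const struct model_mm *mm;		/* stands in for current->mm */
};

struct model_cpu {
	const struct model_mm *loaded_mm;	/* stands in for the loaded mm */
};

/* User access is only okay when the task's mm is the one actually loaded. */
bool model_nmi_uaccess_okay(const struct model_task *cur,
			    const struct model_cpu *cpu)
{
	return cur->mm != NULL && cpu->loaded_mm == cur->mm;
}

/*
 * Guarded user read: fail with -EFAULT (and a zeroed buffer) when the
 * check above fails or the range is outside the toy user window.
 */
int model_probe_user_read(void *dst, size_t size, size_t off,
			  const struct model_task *cur,
			  const struct model_cpu *cpu)
{
	const struct model_mm *mm = cur->mm;

	if (!model_nmi_uaccess_okay(cur, cpu))
		goto fault;
	if (off >= mm->user_len || size > mm->user_len - off)
		goto fault;
	memcpy(dst, mm->user_mem + off, size);
	return 0;
fault:
	memset(dst, 0, size);
	return -EFAULT;
}
```

In this model, a read succeeds while the CPU has the task's own mm loaded
and fails as soon as loaded_mm points at a different (e.g. patching) mm,
even though the task and the address are unchanged.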

Thank you,

> 
> Alexei, does that seem reasonable?


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 21:59                         ` Linus Torvalds
  2019-02-22 22:51                           ` Alexei Starovoitov
@ 2019-02-26  3:57                           ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2019-02-26  3:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Alexei Starovoitov, Masami Hiramatsu,
	Steven Rostedt, Andy Lutomirski, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andrew Lutomirski, Daniel Borkmann, Netdev, bpf

On Fri, Feb 22, 2019 at 01:59:10PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 1:38 PM David Miller <davem@davemloft.net> wrote:
> >
> > Don't be surprised if we see more separation like this in the future too.
> 
> Yes, with the whole meltdown fiasco, there's actually more pressure to
> add more support for separation of kernel/user address spaces. As Andy
> pointed out, it's been discussed as a future wish-list for x86-64 too.

S/390 is another example.

I've also proposed a RISC-V extension for it, including prototypes
for Rocketchip and Qemu, and Linux kernel support, but it didn't go
anywhere.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-22 19:27                   ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Alexei Starovoitov
  2019-02-22 19:30                     ` Steven Rostedt
  2019-02-22 21:20                     ` Linus Torvalds
@ 2019-02-26 15:24                     ` Joel Fernandes
  2019-02-28 12:29                       ` Masami Hiramatsu
  2 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2019-02-26 15:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linus Torvalds, Masami Hiramatsu, Steven Rostedt,
	Andy Lutomirski, will.deacon, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski, daniel, netdev, bpf

On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > 
> > Then we should still probably fix up "__probe_kernel_read()" to not
> > allow user accesses. The easiest way to do that is actually likely to
> > use the "unsafe_get_user()" functions *without* doing a
> > uaccess_begin(), which will mean that modern CPU's will simply fault
> > on a kernel access to user space.
> 
> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> and users pass both user and kernel addresses into it and expect
> that the helper will actually try to read from that address.

Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
kprobe on do_sys_open to monitor calls to the open syscall globally.

do_sys_open() has prototype:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);

This causes a "blank" filename to be displayed by opensnoop when I run it on
my Pixel 3 (arm64), possibly because this is a user pointer. However, it
works fine on x86-64.

So it seems to me that on arm64, reading user pointers directly still doesn't
work even if there is a distinction between user/kernel addresses. In that
case reading the user pointer using user accessors (possibly using
bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
also privately discussed with me).

[1] https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py#L140

thanks!

 - Joel


> 
> If __probe_kernel_read will suddenly start failing on all user addresses
> it will break the expectations.
> How do we solve it in bpf_probe_read?
> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> in the loop?
> That's doable, but people already complain that bpf_probe_read() is slow
> and shows up in their perf report.
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-26 15:24                     ` Joel Fernandes
@ 2019-02-28 12:29                       ` Masami Hiramatsu
  2019-02-28 15:18                         ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2019-02-28 12:29 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Alexei Starovoitov, Linus Torvalds, Masami Hiramatsu,
	Steven Rostedt, Andy Lutomirski, will.deacon,
	Linux List Kernel Mailing, Ingo Molnar, Andrew Morton, stable,
	Changbin Du, Jann Horn, Kees Cook, Andy Lutomirski, daniel,
	netdev, bpf

On Tue, 26 Feb 2019 10:24:47 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > 
> > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > allow user accesses. The easiest way to do that is actually likely to
> > > use the "unsafe_get_user()" functions *without* doing a
> > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > on a kernel access to user space.
> > 
> > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > and users pass both user and kernel addresses into it and expect
> > that the helper will actually try to read from that address.
> 
> Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
> kprobe on do_sys_open to monitor calls to the open syscall globally.
> 
> do_sys_open() has prototype:
> 
> long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
> 
> This causes a "blank" filename to be displayed by opensnoop when I run it on
> my Pixel 3 (arm64), possibly because this is a user pointer. However, it
> works fine on x86-64.
> 
> So it seems to me that on arm64, reading user pointers directly still doesn't
> work even if there is a distinction between user/kernel addresses. In that
> case reading the user pointer using user accessors (possibly using
> bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
> also privately discussed with me).

OK, it sounds like the same issue. Please add a bpf_user_read() and use it
for __user pointer.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
  2019-02-28 12:29                       ` Masami Hiramatsu
@ 2019-02-28 15:18                         ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2019-02-28 15:18 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Linus Torvalds, Steven Rostedt,
	Andy Lutomirski, will.deacon, Linux List Kernel Mailing,
	Ingo Molnar, Andrew Morton, stable, Changbin Du, Jann Horn,
	Kees Cook, Andy Lutomirski, daniel, netdev, bpf, yhs

On Thu, Feb 28, 2019 at 09:29:13PM +0900, Masami Hiramatsu wrote:
> On Tue, 26 Feb 2019 10:24:47 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > On Fri, Feb 22, 2019 at 11:27:05AM -0800, Alexei Starovoitov wrote:
> > > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> > > > 
> > > > Then we should still probably fix up "__probe_kernel_read()" to not
> > > > allow user accesses. The easiest way to do that is actually likely to
> > > > use the "unsafe_get_user()" functions *without* doing a
> > > > uaccess_begin(), which will mean that modern CPU's will simply fault
> > > > on a kernel access to user space.
> > > 
> > > On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> > > and users pass both user and kernel addresses into it and expect
> > > that the helper will actually try to read from that address.
> > 
> > Slightly related and FWIW, BCC's eBPF-based opensnoop tool [1] installs a
> > kprobe on do_sys_open to monitor calls to the open syscall globally.
> > 
> > do_sys_open() has prototype:
> > 
> > long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode);
> > 
> > This causes a "blank" filename to be displayed by opensnoop when I run it on
> > my Pixel 3 (arm64), possibly because this is a user pointer. However, it
> > works fine on x86-64.
> > 
> > So it seems to me that on arm64, reading user pointers directly still doesn't
> > work even if there is a distinction between user/kernel addresses. In that
> > case reading the user pointer using user accessors (possibly using
> > bpf_probe_user_read helper) should be needed to fix this issue (as Yonghong
> > also privately discussed with me).
> 
> OK, it sounds like the same issue. Please add a bpf_user_read() and use it
> for __user pointer.

CC'd Yonghong, who told me earlier that he would add it, but I could add it
too if he wants me to.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-02-28 15:18 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=wgJzNp0R3cVhjBPHTR4X9sOvHdqK4UVFfbsOKQ6L=A_eQ@mail.gmail.com>
     [not found] ` <CAHk-=wh9XrOykA5J9RQ7zaBio-S_D+1AE+rGnBYWSd==pCXh+w@mail.gmail.com>
     [not found]   ` <20190219111802.1d6dbaa3@gandalf.local.home>
     [not found]     ` <CAHk-=wgTuK3kAduP-gr10vykT1uG=B2VpdffvmyBuTQ1UxPpMg@mail.gmail.com>
     [not found]       ` <20190219140330.5dd9e876@gandalf.local.home>
     [not found]         ` <20190220171019.5e81a4946b56982f324f7c45@kernel.org>
     [not found]           ` <20190220094926.0ab575b3@gandalf.local.home>
     [not found]             ` <20190222172745.2c7205d62003c0a858e33278@kernel.org>
     [not found]               ` <20190222173509.88489b7c5d1bf0e2ec2382ee@kernel.org>
     [not found]                 ` <CAHk-=whNf_n1WXWW+ugAVeL5ZK0GcEP3cTYocju1nS85VtMjjQ@mail.gmail.com>
2019-02-22 19:27                   ` [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault Alexei Starovoitov
2019-02-22 19:30                     ` Steven Rostedt
2019-02-22 19:34                       ` Alexei Starovoitov
2019-02-22 19:39                         ` Steven Rostedt
2019-02-22 19:55                         ` Andy Lutomirski
2019-02-22 21:43                           ` Jann Horn
2019-02-22 22:08                             ` Nadav Amit
2019-02-22 22:17                               ` Jann Horn
2019-02-22 22:21                                 ` Nadav Amit
2019-02-22 22:39                                   ` Nadav Amit
2019-02-22 23:02                                     ` Jann Horn
2019-02-22 23:22                                       ` Nadav Amit
2019-02-22 23:59                                       ` Andy Lutomirski
2019-02-23  0:03                                         ` Alexei Starovoitov
2019-02-23  0:15                                         ` Nadav Amit
2019-02-24 19:35                                           ` Andy Lutomirski
2019-02-25 13:36                                         ` Masami Hiramatsu
2019-02-22 21:20                     ` Linus Torvalds
2019-02-22 21:38                       ` David Miller
2019-02-22 21:59                         ` Linus Torvalds
2019-02-22 22:51                           ` Alexei Starovoitov
2019-02-22 23:11                             ` Jann Horn
2019-02-22 23:16                               ` David Miller
2019-02-22 23:16                             ` Linus Torvalds
2019-02-22 23:56                               ` Alexei Starovoitov
2019-02-23  0:08                                 ` Linus Torvalds
2019-02-23  2:28                                   ` Alexei Starovoitov
2019-02-23  2:32                                     ` Linus Torvalds
2019-02-23  3:02                                     ` Steven Rostedt
2019-02-23  4:51                                 ` Masami Hiramatsu
2019-02-26  3:57                           ` Christoph Hellwig
2019-02-26 15:24                     ` Joel Fernandes
2019-02-28 12:29                       ` Masami Hiramatsu
2019-02-28 15:18                         ` Joel Fernandes
