bpf.vger.kernel.org archive mirror
* KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
@ 2020-11-02 11:54 syzbot
  2020-11-11 14:57 ` Dmitry Vyukov
  2020-11-15  5:52 ` [PATCH] bpf: don't fail kmalloc while releasing raw_tp Matt Mullins
  0 siblings, 2 replies; 19+ messages in thread
From: syzbot @ 2020-11-02 11:54 UTC (permalink / raw)
  To: akpm, andrii, ast, bpf, daniel, davem, hawk, john.fastabend,
	kafai, kpsingh, kuba, linux-kernel, mingo, mingo, mmullins,
	netdev, peterz, rostedt, songliubraving, syzkaller-bugs, yhs

Hello,

syzbot found the following issue on:

HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
git tree:       bpf
console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c500000
kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
compiler:       gcc (GCC) 10.1.0-syz 20200507
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b032500000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c500000

The issue was bisected to:

commit 9df1c28bb75217b244257152ab7d788bb2a386d0
Author: Matt Mullins <mmullins@fb.com>
Date:   Fri Apr 26 18:49:47 2019 +0000

    bpf: add writable context for raw tracepoints

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12b6c4da500000
final oops:     https://syzkaller.appspot.com/x/report.txt?x=11b6c4da500000
console output: https://syzkaller.appspot.com/x/log.txt?x=16b6c4da500000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com
Fixes: 9df1c28bb752 ("bpf: add writable context for raw tracepoints")

==================================================================
BUG: KASAN: vmalloc-out-of-bounds in __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
Read of size 8 at addr ffffc90000e6c030 by task kworker/0:3/3754

CPU: 0 PID: 3754 Comm: kworker/0:3 Not tainted 5.9.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue:  0x0 (events)
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
 __kasan_report mm/kasan/report.c:545 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
 __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
 bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
 __bpf_trace_sched_switch+0xdc/0x120 include/trace/events/sched.h:138
 __traceiter_sched_switch+0x64/0xb0 include/trace/events/sched.h:138
 trace_sched_switch include/trace/events/sched.h:138 [inline]
 __schedule+0xeb8/0x2130 kernel/sched/core.c:4520
 schedule+0xcf/0x270 kernel/sched/core.c:4601
 worker_thread+0x14c/0x1120 kernel/workqueue.c:2439
 kthread+0x3af/0x4a0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296


Memory state around the buggy address:
 ffffc90000e6bf00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
 ffffc90000e6bf80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>ffffc90000e6c000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
                                     ^
 ffffc90000e6c080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
 ffffc90000e6c100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches


* Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
  2020-11-02 11:54 KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3 syzbot
@ 2020-11-11 14:57 ` Dmitry Vyukov
  2020-11-13  5:37   ` Matt Mullins
  2020-11-15  5:52 ` [PATCH] bpf: don't fail kmalloc while releasing raw_tp Matt Mullins
  1 sibling, 1 reply; 19+ messages in thread
From: Dmitry Vyukov @ 2020-11-11 14:57 UTC (permalink / raw)
  To: syzbot
  Cc: Andrew Morton, andrii, Alexei Starovoitov, bpf, Daniel Borkmann,
	David Miller, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, KP Singh, Jakub Kicinski, LKML, Ingo Molnar,
	Ingo Molnar, mmullins, netdev, Peter Zijlstra, Steven Rostedt,
	Song Liu, syzkaller-bugs, Yonghong Song

On Mon, Nov 2, 2020 at 12:54 PM syzbot
<syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
> git tree:       bpf
> console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c500000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
> dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
> compiler:       gcc (GCC) 10.1.0-syz 20200507
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b032500000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c500000
>
> The issue was bisected to:
>
> commit 9df1c28bb75217b244257152ab7d788bb2a386d0
> Author: Matt Mullins <mmullins@fb.com>
> Date:   Fri Apr 26 18:49:47 2019 +0000
>
>     bpf: add writable context for raw tracepoints


We have a number of kernel memory corruptions related to bpf_trace_run now:
https://groups.google.com/g/syzkaller-bugs/search?q=kernel%2Ftrace%2Fbpf_trace.c

Can raw tracepoints "legally" corrupt kernel memory (a la /dev/kmem)?
Or shouldn't they?

Looking at the description of Matt's commit, it seems that corruptions
should not be possible (bounded buffer, checked size, etc.). Does that
mean it's a real kernel bug?



> [...]


* Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
  2020-11-11 14:57 ` Dmitry Vyukov
@ 2020-11-13  5:37   ` Matt Mullins
  2020-11-13 16:08     ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Mullins @ 2020-11-13  5:37 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Andrew Morton, andrii, Alexei Starovoitov, bpf,
	Daniel Borkmann, David Miller, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, KP Singh, Jakub Kicinski, LKML,
	Ingo Molnar, Ingo Molnar, mmullins, netdev, Peter Zijlstra,
	Steven Rostedt, Song Liu, syzkaller-bugs, Yonghong Song

On Wed, Nov 11, 2020 at 03:57:50PM +0100, Dmitry Vyukov wrote:
> On Mon, Nov 2, 2020 at 12:54 PM syzbot
> <syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com> wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
> > git tree:       bpf
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c500000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
> > dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
> > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b032500000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c500000
> >
> > The issue was bisected to:
> >
> > commit 9df1c28bb75217b244257152ab7d788bb2a386d0
> > Author: Matt Mullins <mmullins@fb.com>
> > Date:   Fri Apr 26 18:49:47 2019 +0000
> >
> >     bpf: add writable context for raw tracepoints
> 
> 
> We have a number of kernel memory corruptions related to bpf_trace_run now:
> https://groups.google.com/g/syzkaller-bugs/search?q=kernel%2Ftrace%2Fbpf_trace.c
> 
> Can raw tracepoints "legally" corrupt kernel memory (a la /dev/kmem)?
> Or shouldn't they?
> 
> Looking at the description of Matt's commit, it seems that corruptions
> should not be possible (bounded buffer, checked size, etc.). Does that
> mean it's a real kernel bug?

This bug doesn't seem to be related to the writability of the
tracepoint; it bisected to that commit simply because the reproducer
used BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, which EINVALs before that
program type was introduced.  The BPF program it loads is pretty much a
no-op.

The problem here is a kmalloc failure injection into
tracepoint_probe_unregister, but the error is ignored -- so the bpf
program is freed even though the tracepoint is never unregistered.

I have a first pass at a patch to pipe through the error code, but it's
pretty ugly.  It's also called from the file_operations ->release(), for
which errors are solidly ignored in __fput(), so I'm not sure what the
best way to handle ENOMEM is...
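
For illustration, the failure sequence described above looks roughly like
this (a sketch of bpf_raw_tp_link_release() as it appears in the patch
later in this thread; the comments annotate the failure mode):

	static void bpf_raw_tp_link_release(struct bpf_link *link)
	{
		struct bpf_raw_tp_link *raw_tp =
			container_of(link, struct bpf_raw_tp_link, link);

		/* Under kmalloc fault injection this can fail with
		 * -ENOMEM, but the return value is dropped, so the
		 * tracepoint keeps its pointer to the program... */
		bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
		bpf_put_raw_tracepoint(raw_tp->btp);
		/* ...while bpf_link_free() goes on to free the program,
		 * so the next trace_sched_switch() hit calls into freed
		 * memory and KASAN reports the out-of-bounds read. */
	}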

> [...]


* Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
  2020-11-13  5:37   ` Matt Mullins
@ 2020-11-13 16:08     ` Yonghong Song
  2021-02-10 18:23       ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Yonghong Song @ 2020-11-13 16:08 UTC (permalink / raw)
  To: Dmitry Vyukov, syzbot, Andrew Morton, andrii, Alexei Starovoitov,
	bpf, Daniel Borkmann, David Miller, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, KP Singh, Jakub Kicinski, LKML,
	Ingo Molnar, Ingo Molnar, mmullins, netdev, Peter Zijlstra,
	Steven Rostedt, Song Liu, syzkaller-bugs



On 11/12/20 9:37 PM, Matt Mullins wrote:
> On Wed, Nov 11, 2020 at 03:57:50PM +0100, Dmitry Vyukov wrote:
>> On Mon, Nov 2, 2020 at 12:54 PM syzbot
>> <syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com> wrote:
>>>
>>> Hello,
>>>
>>> syzbot found the following issue on:
>>>
>>> HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
>>> git tree:       bpf
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c500000
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
>>> compiler:       gcc (GCC) 10.1.0-syz 20200507
>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b032500000
>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c500000
>>>
>>> The issue was bisected to:
>>>
>>> commit 9df1c28bb75217b244257152ab7d788bb2a386d0
>>> Author: Matt Mullins <mmullins@fb.com>
>>> Date:   Fri Apr 26 18:49:47 2019 +0000
>>>
>>>      bpf: add writable context for raw tracepoints
>>
>>
>> We have a number of kernel memory corruptions related to bpf_trace_run now:
>> https://groups.google.com/g/syzkaller-bugs/search?q=kernel/trace/bpf_trace.c
>>
>> Can raw tracepoints "legally" corrupt kernel memory (a la /dev/kmem)?
>> Or shouldn't they?
>>
>> Looking at the description of Matt's commit, it seems that corruptions
>> should not be possible (bounded buffer, checked size, etc.). Does that
>> mean it's a real kernel bug?
> 
> This bug doesn't seem to be related to the writability of the
> tracepoint; it bisected to that commit simply because the reproducer
> used BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, which EINVALs before that
> program type was introduced.  The BPF program it loads is pretty much a
> no-op.
> 
> The problem here is a kmalloc failure injection into
> tracepoint_probe_unregister, but the error is ignored -- so the bpf
> program is freed even though the tracepoint is never unregistered.
> 
> I have a first pass at a patch to pipe through the error code, but it's
> pretty ugly.  It's also called from the file_operations ->release(), for

Maybe you can still post the patch, so people can review and make 
suggestions which may lead to a *better* solution.

> which errors are solidly ignored in __fput(), so I'm not sure what the
> best way to handle ENOMEM is...
> 
[...]


* [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-02 11:54 KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3 syzbot
  2020-11-11 14:57 ` Dmitry Vyukov
@ 2020-11-15  5:52 ` Matt Mullins
  2020-11-16 17:19   ` Steven Rostedt
  1 sibling, 1 reply; 19+ messages in thread
From: Matt Mullins @ 2020-11-15  5:52 UTC (permalink / raw)
  To: rostedt, mingo, ast, daniel
  Cc: dvyukov, Matt Mullins, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, linux-kernel, netdev,
	bpf

bpf_link_free is always called in process context, including from a
workqueue and from __fput.  Neither of these has the ability to
propagate an -ENOMEM to the caller.

Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com
Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com
Signed-off-by: Matt Mullins <mmullins@mmlx.us>
---
I previously referenced a "pretty ugly" patch.  This is not that one,
because I don't think there's any way I can make the caller of
->release() actually handle errors like ENOMEM.

It also looks like most of the other callers of
tracepoint_probe_unregister don't check the error code (e.g. a quick
grep found blk_unregister_tracepoints).  Should this just be upgraded to
__GFP_NOFAIL across the board instead of passing around a gfp_t?

 include/linux/trace_events.h |  6 ++++--
 include/linux/tracepoint.h   |  7 +++++--
 kernel/bpf/syscall.c         |  2 +-
 kernel/trace/bpf_trace.c     |  6 ++++--
 kernel/trace/trace_events.c  |  6 ++++--
 kernel/tracepoint.c          | 20 ++++++++++----------
 6 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5c6943354049..166ad7646a98 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -625,7 +625,8 @@ int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog,
+			 gfp_t flags);
 struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
 void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
 int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
@@ -654,7 +655,8 @@ static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_p
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp,
+				       struct bpf_prog *p, gfp_t flags)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..7b02f92f3b8f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -12,6 +12,7 @@
  * Heavily inspired from the Linux Kernel Markers.
  */
 
+#include <linux/gfp.h>
 #include <linux/smp.h>
 #include <linux/srcu.h>
 #include <linux/errno.h>
@@ -40,7 +41,8 @@ extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
 			       int prio);
 extern int
-tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
+tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data,
+			    gfp_t flags);
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
 		void *priv);
@@ -260,7 +262,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	unregister_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
 		return tracepoint_probe_unregister(&__tracepoint_##name,\
-						(void *)probe, data);	\
+						(void *)probe, data,	\
+						GFP_KERNEL);		\
 	}								\
 	static inline void						\
 	check_trace_callback_type_##name(void (*cb)(data_proto))	\
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b999e7ff2583..f6876681c4ab 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2601,7 +2601,7 @@ static void bpf_raw_tp_link_release(struct bpf_link *link)
 	struct bpf_raw_tp_link *raw_tp =
 		container_of(link, struct bpf_raw_tp_link, link);
 
-	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
+	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog, GFP_KERNEL | __GFP_NOFAIL);
 	bpf_put_raw_tracepoint(raw_tp->btp);
 }
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a8d4f253ed77..a4ea58c7506d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1955,9 +1955,11 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 	return __bpf_probe_register(btp, prog);
 }
 
-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog,
+			 gfp_t flags)
 {
-	return tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
+	return tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog,
+					   flags);
 }
 
 int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index a85effb2373b..ab1ac89caed2 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -296,7 +296,8 @@ int trace_event_reg(struct trace_event_call *call,
 	case TRACE_REG_UNREGISTER:
 		tracepoint_probe_unregister(call->tp,
 					    call->class->probe,
-					    file);
+					    file,
+					    GFP_KERNEL);
 		return 0;
 
 #ifdef CONFIG_PERF_EVENTS
@@ -307,7 +308,8 @@ int trace_event_reg(struct trace_event_call *call,
 	case TRACE_REG_PERF_UNREGISTER:
 		tracepoint_probe_unregister(call->tp,
 					    call->class->perf_probe,
-					    call);
+					    call,
+					    GFP_KERNEL);
 		return 0;
 	case TRACE_REG_PERF_OPEN:
 	case TRACE_REG_PERF_CLOSE:
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 73956eaff8a9..619666a43c9f 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -53,10 +53,9 @@ struct tp_probes {
 	struct tracepoint_func probes[0];
 };
 
-static inline void *allocate_probes(int count)
+static inline void *allocate_probes(int count, gfp_t flags)
 {
-	struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
-				       GFP_KERNEL);
+	struct tp_probes *p  = kmalloc(struct_size(p, probes, count), flags);
 	return p == NULL ? NULL : p->probes;
 }
 
@@ -150,7 +149,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 		}
 	}
 	/* + 2 : one for new probe, one for NULL func */
-	new = allocate_probes(nr_probes + 2);
+	new = allocate_probes(nr_probes + 2, GFP_KERNEL);
 	if (new == NULL)
 		return ERR_PTR(-ENOMEM);
 	if (old) {
@@ -174,7 +173,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 }
 
 static void *func_remove(struct tracepoint_func **funcs,
-		struct tracepoint_func *tp_func)
+		struct tracepoint_func *tp_func, gfp_t flags)
 {
 	int nr_probes = 0, nr_del = 0, i;
 	struct tracepoint_func *old, *new;
@@ -207,7 +206,7 @@ static void *func_remove(struct tracepoint_func **funcs,
 		int j = 0;
 		/* N -> M, (N > 1, M > 0) */
 		/* + 1 for NULL */
-		new = allocate_probes(nr_probes - nr_del + 1);
+		new = allocate_probes(nr_probes - nr_del + 1, flags);
 		if (new == NULL)
 			return ERR_PTR(-ENOMEM);
 		for (i = 0; old[i].func; i++)
@@ -264,13 +263,13 @@ static int tracepoint_add_func(struct tracepoint *tp,
  * by preempt_disable around the call site.
  */
 static int tracepoint_remove_func(struct tracepoint *tp,
-		struct tracepoint_func *func)
+		struct tracepoint_func *func, gfp_t flags)
 {
 	struct tracepoint_func *old, *tp_funcs;
 
 	tp_funcs = rcu_dereference_protected(tp->funcs,
 			lockdep_is_held(&tracepoints_mutex));
-	old = func_remove(&tp_funcs, func);
+	old = func_remove(&tp_funcs, func, flags);
 	if (IS_ERR(old)) {
 		WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
 		return PTR_ERR(old);
@@ -344,7 +343,8 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register);
  *
  * Returns 0 if ok, error value on error.
  */
-int tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data)
+int tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data,
+				gfp_t flags)
 {
 	struct tracepoint_func tp_func;
 	int ret;
@@ -352,7 +352,7 @@ int tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data)
 	mutex_lock(&tracepoints_mutex);
 	tp_func.func = probe;
 	tp_func.data = data;
-	ret = tracepoint_remove_func(tp, &tp_func);
+	ret = tracepoint_remove_func(tp, &tp_func, flags);
 	mutex_unlock(&tracepoints_mutex);
 	return ret;
 }
-- 
2.17.1



* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-15  5:52 ` [PATCH] bpf: don't fail kmalloc while releasing raw_tp Matt Mullins
@ 2020-11-16 17:19   ` Steven Rostedt
  2020-11-16 20:37     ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2020-11-16 17:19 UTC (permalink / raw)
  To: Matt Mullins
  Cc: mingo, ast, daniel, dvyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf, Mathieu Desnoyers

On Sat, 14 Nov 2020 21:52:55 -0800
Matt Mullins <mmullins@mmlx.us> wrote:

> bpf_link_free is always called in process context, including from a
> workqueue and from __fput.  Neither of these has the ability to
> propagate an -ENOMEM to the caller.
> 

Hmm, I think the real fix is to not have unregistering a tracepoint probe
fail because of allocation. We are removing a probe; perhaps we could just
inject a NULL pointer that gets checked via the DO_TRACE loop?

I bet failing an unregister because of an allocation failure causes
problems in more places than just BPF.

Mathieu,

Can't we do something that would still allow unregistering a probe even if
a new probe array fails to allocate? We could kick off an irq work to try to
clean up the probe at a later time, but still, the unregister itself should
not fail due to memory failure.

-- Steve



> [...]


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 17:19   ` Steven Rostedt
@ 2020-11-16 20:37     ` Mathieu Desnoyers
  2020-11-16 20:44       ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2020-11-16 20:37 UTC (permalink / raw)
  To: rostedt, paulmck
  Cc: Matt Mullins, Ingo Molnar, Alexei Starovoitov, Daniel Borkmann,
	Dmitry Vyukov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, linux-kernel, netdev,
	bpf

----- On Nov 16, 2020, at 12:19 PM, rostedt rostedt@goodmis.org wrote:

> On Sat, 14 Nov 2020 21:52:55 -0800
> Matt Mullins <mmullins@mmlx.us> wrote:
> 
>> bpf_link_free is always called in process context, including from a
>> workqueue and from __fput.  Neither of these has the ability to
>> propagate an -ENOMEM to the caller.
>> 
> 
> Hmm, I think the real fix is to not have unregistering a tracepoint probe
> fail because of allocation. We are removing a probe; perhaps we could just
> inject a NULL pointer that gets checked via the DO_TRACE loop?
> 
> I bet failing an unregister because of an allocation failure causes
> problems in more places than just BPF.
> 
> Mathieu,
> 
> Can't we do something that would still allow unregistering a probe even if
> a new probe array fails to allocate? We could kick off an irq work to try to
> clean up the probe at a later time, but still, the unregister itself should
> not fail due to memory failure.

Currently, the fast path iteration looks like:

                struct tracepoint_func *it_func_ptr;
                void *it_func;

                it_func_ptr =                                           \
                        rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
                do {                                                    \
                        it_func = (it_func_ptr)->func;                  \
                        __data = (it_func_ptr)->data;                   \
                        ((void(*)(void *, proto))(it_func))(__data, args); \
                } while ((++it_func_ptr)->func); 

So we RCU dereference the array, and iterate on the array until we find a NULL
func. So you could not use NULL to skip items, but you could perhaps reserve
a (void *)0x1UL tombstone for this.
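
For illustration, the iteration could then look roughly like this (a
sketch only; TP_TOMBSTONE is a made-up name for the reserved 0x1 value,
and continue in a do/while re-evaluates the condition, so ++it_func_ptr
still advances past the tombstone):

#define TP_TOMBSTONE	((void *)0x1UL)

                it_func_ptr =                                           \
                        rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
                do {                                                    \
                        it_func = (it_func_ptr)->func;                  \
                        if (unlikely(it_func == TP_TOMBSTONE))          \
                                continue;                               \
                        __data = (it_func_ptr)->data;                   \
                        ((void(*)(void *, proto))(it_func))(__data, args); \
                } while ((++it_func_ptr)->func);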

It should ideally be an unlikely branch, and it would be good to benchmark the
change when multiple tracing probes are attached to figure out whether the
overhead is significant when tracing is enabled.

I wonder whether we really mind that much about using slightly more memory
than required after a failed reallocation due to ENOMEM. Perhaps the irq work
is not even needed. Chances are that the irq work would fail again and again if
it's in low memory conditions. So maybe it's better to just keep the tombstone
in place until the next successful callback array reallocation.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 20:37     ` Mathieu Desnoyers
@ 2020-11-16 20:44       ` Steven Rostedt
  2020-11-16 21:02         ` Steven Rostedt
  2020-11-16 21:21         ` Mathieu Desnoyers
  0 siblings, 2 replies; 19+ messages in thread
From: Steven Rostedt @ 2020-11-16 20:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Mon, 16 Nov 2020 15:37:27 -0500 (EST)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > Mathieu,
> > 
> > Can't we do something that would still allow unregistering a probe even if
> > a new probe array fails to allocate? We could kick off an irq work to try to
> > clean up the probe at a later time, but still, the unregister itself should
> > not fail due to memory failure.  
> 
> Currently, the fast path iteration looks like:
> 
>                 struct tracepoint_func *it_func_ptr;
>                 void *it_func;
> 
>                 it_func_ptr =                                           \
>                         rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
>                 do {                                                    \
>                         it_func = (it_func_ptr)->func;                  \
>                         __data = (it_func_ptr)->data;                   \
>                         ((void(*)(void *, proto))(it_func))(__data, args); \
>                 } while ((++it_func_ptr)->func); 
> 
> So we RCU dereference the array, and iterate on the array until we find a NULL
> func. So you could not use NULL to skip items, but you could perhaps reserve
> a (void *)0x1UL tombstone for this.

Actually, you could just set it to a stub callback that does nothing. Then
you don't even need to touch the above macro. Not sure why I didn't
recommend this to begin with, because that's exactly what the function
tracer does with ftrace_stub.


> 
> It should ideally be an unlikely branch, and it would be good to benchmark the
> change when multiple tracing probes are attached to figure out whether the
> overhead is significant when tracing is enabled.

If you use a stub function, it shouldn't affect anything. And the worst
that would happen is that you have a slight overhead of calling the stub
until you can properly remove the callback.

> 
> I wonder whether we really mind that much about using slightly more memory
> than required after a failed reallocation due to ENOMEM. Perhaps the irq work
> is not even needed. Chances are that the irq work would fail again and again if
> it's in low memory conditions. So maybe it's better to just keep the tombstone
> in place until the next successful callback array reallocation.
> 

True. We could just replace the function with a stub on memory failure:
always use __GFP_NOFAIL, and if it still fails to reallocate a new array,
replace the callback with the stub and be done with it. It may require
some more accounting to make sure the tracepoint.c code can handle these
stubs, and remove them on new additions to the code. Heck, if a stub
exists, you could just swap it with a new item. But on any new changes to
the list, the stubs should be purged.


-- Steve


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 20:44       ` Steven Rostedt
@ 2020-11-16 21:02         ` Steven Rostedt
  2020-11-16 21:06           ` Steven Rostedt
  2020-11-16 21:34           ` Mathieu Desnoyers
  2020-11-16 21:21         ` Mathieu Desnoyers
  1 sibling, 2 replies; 19+ messages in thread
From: Steven Rostedt @ 2020-11-16 21:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Mon, 16 Nov 2020 15:44:37 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> If you use a stub function, it shouldn't affect anything. And the worst
> that would happen is that you have a slight overhead of calling the stub
> until you can properly remove the callback.

Something like this:

(haven't compiled it yet, I'm about to though).

-- Steve

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 3f659f855074..8eab40f9d388 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -53,10 +53,16 @@ struct tp_probes {
 	struct tracepoint_func probes[];
 };
 
-static inline void *allocate_probes(int count)
+/* Called in removal of a func but failed to allocate a new tp_funcs */
+static void tp_stub_func(void)
+{
+	return;
+}
+
+static inline void *allocate_probes(int count, gfp_t extra_flags)
 {
 	struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
-				       GFP_KERNEL);
+				       GFP_KERNEL | extra_flags);
 	return p == NULL ? NULL : p->probes;
 }
 
@@ -150,7 +156,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 		}
 	}
 	/* + 2 : one for new probe, one for NULL func */
-	new = allocate_probes(nr_probes + 2);
+	new = allocate_probes(nr_probes + 2, 0);
 	if (new == NULL)
 		return ERR_PTR(-ENOMEM);
 	if (old) {
@@ -188,8 +194,9 @@ static void *func_remove(struct tracepoint_func **funcs,
 	/* (N -> M), (N > 1, M >= 0) probes */
 	if (tp_func->func) {
 		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
-			if (old[nr_probes].func == tp_func->func &&
-			     old[nr_probes].data == tp_func->data)
+			if ((old[nr_probes].func == tp_func->func &&
+			     old[nr_probes].data == tp_func->data) ||
+			    old[nr_probes].func == tp_stub_func)
 				nr_del++;
 		}
 	}
@@ -207,15 +214,20 @@ static void *func_remove(struct tracepoint_func **funcs,
 		int j = 0;
 		/* N -> M, (N > 1, M > 0) */
 		/* + 1 for NULL */
-		new = allocate_probes(nr_probes - nr_del + 1);
-		if (new == NULL)
-			return ERR_PTR(-ENOMEM);
-		for (i = 0; old[i].func; i++)
-			if (old[i].func != tp_func->func
-					|| old[i].data != tp_func->data)
-				new[j++] = old[i];
-		new[nr_probes - nr_del].func = NULL;
-		*funcs = new;
+		new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
+		if (new) {
+			for (i = 0; old[i].func; i++)
+				if (old[i].func != tp_func->func
+				    || old[i].data != tp_func->data)
+					new[j++] = old[i];
+			new[nr_probes - nr_del].func = NULL;
+		} else {
+			for (i = 0; old[i].func; i++)
+				if (old[i].func == tp_func->func &&
+				    old[i].data == tp_func->data)
+					old[i].func = tp_stub_func;
+		}
+		*funcs = old;
 	}
 	debug_print_probes(*funcs);
 	return old;
@@ -300,6 +312,10 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 		return PTR_ERR(old);
 	}
 
+	if (tp_funcs == old)
+		/* Failed allocating new tp_funcs, replaced func with stub */
+		return 0;
+
 	if (!tp_funcs) {
 		/* Removed last function */
 		if (tp->unregfunc && static_key_enabled(&tp->key))


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 21:02         ` Steven Rostedt
@ 2020-11-16 21:06           ` Steven Rostedt
  2020-11-16 21:34           ` Mathieu Desnoyers
  1 sibling, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2020-11-16 21:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Mon, 16 Nov 2020 16:02:18 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> +		if (new) {
> +			for (i = 0; old[i].func; i++)
> +				if (old[i].func != tp_func->func
> +				    || old[i].data != tp_func->data)
> +					new[j++] = old[i];

Oops, need to check for old[i].func != tp_stub_func here too.

-- Steve

> +			new[nr_probes - nr_del].func = NULL;
> +		} else {


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 20:44       ` Steven Rostedt
  2020-11-16 21:02         ` Steven Rostedt
@ 2020-11-16 21:21         ` Mathieu Desnoyers
  1 sibling, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2020-11-16 21:21 UTC (permalink / raw)
  To: rostedt
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

----- On Nov 16, 2020, at 3:44 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 16 Nov 2020 15:37:27 -0500 (EST)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> > 
>> > Mathieu,
>> > 
>> > Can't we do something that would still allow unregistering a probe even if
>> > a new probe array fails to allocate? We could kick off an irq work to try to
>> > clean up the probe at a later time, but still, the unregister itself should
>> > not fail due to memory failure.
>> 
>> Currently, the fast path iteration looks like:
>> 
>>                 struct tracepoint_func *it_func_ptr;
>>                 void *it_func;
>> 
>>                 it_func_ptr =                                           \
>>                         rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
>>                 do {                                                    \
>>                         it_func = (it_func_ptr)->func;                  \
>>                         __data = (it_func_ptr)->data;                   \
>>                         ((void(*)(void *, proto))(it_func))(__data, args); \
>>                 } while ((++it_func_ptr)->func);
>> 
>> So we RCU dereference the array, and iterate on the array until we find a NULL
>> func. So you could not use NULL to skip items, but you could perhaps reserve
>> a (void *)0x1UL tombstone for this.
> 
> Actually, you could just set it to a stub callback that does nothing. Then
> you don't even need to touch the above macro. Not sure why I didn't
> recommend this to begin with, because that's exactly what the function
> tracer does with ftrace_stub.

I like the stub idea.

What prototype should the stub function have? Is it acceptable to pass
arguments to a function expecting (void)? If not, then we may need to
create stub functions for each tracepoint.

> 
> 
>> 
>> It should ideally be an unlikely branch, and it would be good to benchmark the
>> change when multiple tracing probes are attached to figure out whether the
>> overhead is significant when tracing is enabled.
> 
> If you use a stub function, it shouldn't affect anything. And the worse
> that would happen is that you have a slight overhead of calling the stub
> until you can properly remove the callback.
> 
>> 
>> I wonder whether we really mind that much about using slightly more memory
>> than required after a failed reallocation due to ENOMEM. Perhaps the irq work
>> is not even needed. Chances are that the irq work would fail again and again if
>> it's in low memory conditions. So maybe it's better to just keep the tombstone
>> in place until the next successful callback array reallocation.
>> 
> 
> True. We could just replace the function with a stub on memory failure:
> always use __GFP_NOFAIL, and if it still fails to reallocate a new array,
> replace the callback with the stub and be done with it. It may require
> some more accounting to make sure the tracepoint.c code can handle these
> stubs, and remove them on new additions to the code.

Yes.

> Heck, if a stub exists, you could just swap it with a new item.

Not without proper synchronization; otherwise you could end up with a
mismatch between the function callback and its private data. The only
transition valid without waiting for an RCU grace period is changing the
function pointer to a stub function. Anything else (e.g. replacing the
stub by some other callback function) would require waiting for a grace
period beforehand.
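
Concretely (a sketch; some_new_probe/some_new_data are made-up names):

	/* OK without a grace period: a concurrent reader sees either the
	 * old callback or the stub, and either is safe to call with the
	 * existing ->data. */
	WRITE_ONCE(old[i].func, tp_stub_func);

	/* NOT OK without a grace period: a reader could load the new
	 * func together with the old data (or vice versa). */
	old[i].func = some_new_probe;
	old[i].data = some_new_data;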

> But on any new changes to the list, the stubs should be purged.

Yes,

Thanks,

Mathieu

> 
> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 21:02         ` Steven Rostedt
  2020-11-16 21:06           ` Steven Rostedt
@ 2020-11-16 21:34           ` Mathieu Desnoyers
  2020-11-16 22:10             ` Steven Rostedt
  1 sibling, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2020-11-16 21:34 UTC (permalink / raw)
  To: rostedt
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

----- On Nov 16, 2020, at 4:02 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 16 Nov 2020 15:44:37 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> If you use a stub function, it shouldn't affect anything. And the worse
>> that would happen is that you have a slight overhead of calling the stub
>> until you can properly remove the callback.
> 
> Something like this:
> 
> (haven't compiled it yet, I'm about to though).
> 
> -- Steve
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 3f659f855074..8eab40f9d388 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -53,10 +53,16 @@ struct tp_probes {
> 	struct tracepoint_func probes[];
> };
> 
> -static inline void *allocate_probes(int count)
> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)

I'm still not sure whether it's OK to call a (void) function with arguments.

> +{
> +	return;
> +}
> +
> +static inline void *allocate_probes(int count, gfp_t extra_flags)
> {
> 	struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
> -				       GFP_KERNEL);
> +				       GFP_KERNEL | extra_flags);
> 	return p == NULL ? NULL : p->probes;
> }
> 
> @@ -150,7 +156,7 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
> 		}
> 	}
> 	/* + 2 : one for new probe, one for NULL func */
> -	new = allocate_probes(nr_probes + 2);
> +	new = allocate_probes(nr_probes + 2, 0);
> 	if (new == NULL)
> 		return ERR_PTR(-ENOMEM);
> 	if (old) {
> @@ -188,8 +194,9 @@ static void *func_remove(struct tracepoint_func **funcs,
> 	/* (N -> M), (N > 1, M >= 0) probes */
> 	if (tp_func->func) {
> 		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> -			if (old[nr_probes].func == tp_func->func &&
> -			     old[nr_probes].data == tp_func->data)
> +			if ((old[nr_probes].func == tp_func->func &&
> +			     old[nr_probes].data == tp_func->data) ||
> +			    old[nr_probes].func == tp_stub_func)
> 				nr_del++;
> 		}
> 	}
> @@ -207,15 +214,20 @@ static void *func_remove(struct tracepoint_func **funcs,
> 		int j = 0;
> 		/* N -> M, (N > 1, M > 0) */
> 		/* + 1 for NULL */
> -		new = allocate_probes(nr_probes - nr_del + 1);
> -		if (new == NULL)
> -			return ERR_PTR(-ENOMEM);
> -		for (i = 0; old[i].func; i++)
> -			if (old[i].func != tp_func->func
> -					|| old[i].data != tp_func->data)
> -				new[j++] = old[i];
> -		new[nr_probes - nr_del].func = NULL;
> -		*funcs = new;
> +		new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
> +		if (new) {
> +			for (i = 0; old[i].func; i++)
> +				if (old[i].func != tp_func->func
> +				    || old[i].data != tp_func->data)

as you point out in your reply, skip tp_stub_func here too.

> +					new[j++] = old[i];
> +			new[nr_probes - nr_del].func = NULL;
> +		} else {
> +			for (i = 0; old[i].func; i++)
> +				if (old[i].func == tp_func->func &&
> +				    old[i].data == tp_func->data)
> +					old[i].func = tp_stub_func;

I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
func pointer can be updated and loaded concurrently.

> +		}
> +		*funcs = old;

The line above seems wrong for the successful allocate_probes case. You will
likely want *funcs = new on successful allocation, and *funcs = old for the
failure case.

Thanks,

Mathieu

> 	}
> 	debug_print_probes(*funcs);
> 	return old;
> @@ -300,6 +312,10 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> 		return PTR_ERR(old);
> 	}
> 
> +	if (tp_funcs == old)
> +		/* Failed allocating new tp_funcs, replaced func with stub */
> +		return 0;
> +
> 	if (!tp_funcs) {
> 		/* Removed last function */
>  		if (tp->unregfunc && static_key_enabled(&tp->key))

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 21:34           ` Mathieu Desnoyers
@ 2020-11-16 22:10             ` Steven Rostedt
  2020-11-17 23:05               ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2020-11-16 22:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Nov 16, 2020, at 4:02 PM, rostedt rostedt@goodmis.org wrote:
> 
> > On Mon, 16 Nov 2020 15:44:37 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> >> If you use a stub function, it shouldn't affect anything. And the worst
> >> that would happen is that you have a slight overhead of calling the stub
> >> until you can properly remove the callback.  
> > 
> > Something like this:
> > 
> > (haven't compiled it yet, I'm about to though).

Still need more accounting to work on. Almost finished though. ;-)

> > 
> > -- Steve
> > 
> > diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> > index 3f659f855074..8eab40f9d388 100644
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -53,10 +53,16 @@ struct tp_probes {
> > 	struct tracepoint_func probes[];
> > };
> > 
> > -static inline void *allocate_probes(int count)
> > +/* Called in removal of a func but failed to allocate a new tp_funcs */
> > +static void tp_stub_func(void)  
> 
> I'm still not sure whether it's OK to call a (void) function with arguments.

Actually, I've done it. The thing is, what can actually happen? A void
function that simply returns should not do anything. If anything, the only
waste is that the caller would save more registers than necessary.

I can't think of anything that can actually go wrong, but perhaps there is
something. I wouldn't want to make a stub function for every tracepoint
(though it wouldn't be hard to do).

But perhaps we should ask the compiler people to make sure.

> 
> > +{
> > +	return;
> > +}
> > +
> > +static inline void *allocate_probes(int count, gfp_t extra_flags)
> > {
> > 	struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
> > -				       GFP_KERNEL);
> > +				       GFP_KERNEL | extra_flags);
> > 	return p == NULL ? NULL : p->probes;
> > }
> > 
> > @@ -150,7 +156,7 @@ func_add(struct tracepoint_func **funcs, struct
> > tracepoint_func *tp_func,
> > 		}
> > 	}
> > 	/* + 2 : one for new probe, one for NULL func */
> > -	new = allocate_probes(nr_probes + 2);
> > +	new = allocate_probes(nr_probes + 2, 0);
> > 	if (new == NULL)
> > 		return ERR_PTR(-ENOMEM);
> > 	if (old) {
> > @@ -188,8 +194,9 @@ static void *func_remove(struct tracepoint_func **funcs,
> > 	/* (N -> M), (N > 1, M >= 0) probes */
> > 	if (tp_func->func) {
> > 		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> > -			if (old[nr_probes].func == tp_func->func &&
> > -			     old[nr_probes].data == tp_func->data)
> > +			if ((old[nr_probes].func == tp_func->func &&
> > +			     old[nr_probes].data == tp_func->data) ||
> > +			    old[nr_probes].func == tp_stub_func)
> > 				nr_del++;
> > 		}
> > 	}
> > @@ -207,15 +214,20 @@ static void *func_remove(struct tracepoint_func **funcs,
> > 		int j = 0;
> > 		/* N -> M, (N > 1, M > 0) */
> > 		/* + 1 for NULL */
> > -		new = allocate_probes(nr_probes - nr_del + 1);
> > -		if (new == NULL)
> > -			return ERR_PTR(-ENOMEM);
> > -		for (i = 0; old[i].func; i++)
> > -			if (old[i].func != tp_func->func
> > -					|| old[i].data != tp_func->data)
> > -				new[j++] = old[i];
> > -		new[nr_probes - nr_del].func = NULL;
> > -		*funcs = new;
> > +		new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
> > +		if (new) {
> > +			for (i = 0; old[i].func; i++)
> > +				if (old[i].func != tp_func->func
> > +				    || old[i].data != tp_func->data)  
> 
> as you point out in your reply, skip tp_stub_func here too.
> 
> > +					new[j++] = old[i];
> > +			new[nr_probes - nr_del].func = NULL;
> > +		} else {
> > +			for (i = 0; old[i].func; i++)
> > +				if (old[i].func == tp_func->func &&
> > +				    old[i].data == tp_func->data)
> > +					old[i].func = tp_stub_func;  
> 
> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
> func pointer can be updated and loaded concurrently.

I thought about this a little, and the only thing we really should worry
about is synchronizing with those that unregister. Because when we make
this update, there are now two states: __DO_TRACE either reads the
original func or the stub. And either should be OK to call.

Only the func gets updated and not the data. So what exactly are we worried
about here?

> 
> > +		}
> > +		*funcs = old;  
> 
> The line above seems wrong for the successful allocate_probes case. You will
> likely want *funcs = new on successful allocation, and *funcs = old for the
> failure case.

Yeah, it crashed because of this ;-)

Like I said, untested!

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-16 22:10             ` Steven Rostedt
@ 2020-11-17 23:05               ` Mathieu Desnoyers
  2020-11-18  0:42                 ` Matt Mullins
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2020-11-17 23:05 UTC (permalink / raw)
  To: rostedt
  Cc: paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

----- On Nov 16, 2020, at 5:10 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

[...]

>> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
>> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
>> func pointer can be updated and loaded concurrently.
> 
> I thought about this a little, and the only thing we really should worry
> about is synchronizing with those that unregister. Because when we make
> this update, there are now two states: __DO_TRACE either reads the
> original func or the stub. And either should be OK to call.
> 
> Only the func gets updated and not the data. So what exactly are we worried
> about here?

Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

However, if we want to compare the function pointer to some other value and
conditionally do (or skip) the call, I think you'll need the READ_ONCE/WRITE_ONCE
to make sure the pointer is not re-fetched between comparison and call.
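
Roughly, for the __DO_TRACE iteration (a sketch assuming a hypothetical
tracepoint "foo" whose probes take a single argument):

struct tracepoint_func *it_func_ptr;
void *it_func, *__data;

it_func_ptr = rcu_dereference_raw(__tracepoint_foo.funcs);
if (it_func_ptr) {
	do {
		/* Fetch exactly once; compare and call the same value. */
		it_func = READ_ONCE(it_func_ptr->func);
		__data = it_func_ptr->data;
		if (it_func != (void *)tp_stub_func)
			((void (*)(void *))it_func)(__data);
	} while ((++it_func_ptr)->func);
}

Without READ_ONCE, the compiler is allowed to re-load it_func_ptr->func
for the call after the comparison, and a concurrent update to the stub
between the two loads would defeat the check.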

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-17 23:05               ` Mathieu Desnoyers
@ 2020-11-18  0:42                 ` Matt Mullins
  2020-11-18  1:09                   ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Matt Mullins @ 2020-11-18  0:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: rostedt, paulmck, Matt Mullins, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Tue, Nov 17, 2020 at 06:05:51PM -0500, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2020, at 5:10 PM, rostedt rostedt@goodmis.org wrote:
> 
> > On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> [...]
> 
> >> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
> >> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
> >> func pointer can be updated and loaded concurrently.
> > 
> > I thought about this a little, and the only thing we really should worry
> > about is synchronizing with those that unregister. Because when we make
> > this update, there are now two states: __DO_TRACE either reads the
> > original func or the stub. And either should be OK to call.
> > 
> > Only the func gets updated and not the data. So what exactly are we worried
> > about here?
> 
> Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

I'm not sure if this is a practical issue, but without WRITE_ONCE, can't
the write be torn?  A racing __traceiter_ could potentially see a
half-modified function pointer, which wouldn't work out too well.
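
Concretely, the worry is a hypothetical code generation like this (no
particular compiler is known to emit it for this store):

	/* old[i].func = tp_stub_func;  compiled as two 32-bit stores: */
	/*   mov [&old[i].func],     low32(tp_stub_func)               */
	/*   <-- a racing __traceiter_ loads a half-updated pointer    */
	/*   mov [&old[i].func + 4], high32(tp_stub_func)              */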

This was actually my gut instinct before I wrote the __GFP_NOFAIL
instead -- currently that whole array's memory ordering is provided by
RCU and I didn't dive deep enough to evaluate getting too clever with
atomic modifications to it.

> 
> However, if we want to compare the function pointer to some other value and
> conditionally do (or skip) the call, I think you'll need the READ_ONCE/WRITE_ONCE
> to make sure the pointer is not re-fetched between comparison and call.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-18  0:42                 ` Matt Mullins
@ 2020-11-18  1:09                   ` Steven Rostedt
  2020-11-18  4:57                     ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2020-11-18  1:09 UTC (permalink / raw)
  To: Matt Mullins
  Cc: Mathieu Desnoyers, paulmck, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Tue, 17 Nov 2020 16:42:44 -0800
Matt Mullins <mmullins@mmlx.us> wrote:


> > Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.  
> 
> I'm not sure if this is a practical issue, but without WRITE_ONCE, can't
> the write be torn?  A racing __traceiter_ could potentially see a
> half-modified function pointer, which wouldn't work out too well.

This has been discussed before, and Linus said:

"We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
because of some theoretical "compiler is free to do garbage"
arguments. If such garbage happens, we need to fix the compiler"

https://lore.kernel.org/lkml/CAHk-=wi_KeD1M-_-_SU_H92vJ-yNkDnAGhAS=RR1yNNGWKW+aA@mail.gmail.com/

> 
> This was actually my gut instinct before I wrote the __GFP_NOFAIL
> instead -- currently that whole array's memory ordering is provided by
> RCU and I didn't dive deep enough to evaluate getting too clever with
> atomic modifications to it.

The pointers are always going to be the architecture word size (by
definition), and any compiler that tears a write of a long is broken.

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
  2020-11-18  1:09                   ` Steven Rostedt
@ 2020-11-18  4:57                     ` Paul E. McKenney
  0 siblings, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2020-11-18  4:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Matt Mullins, Mathieu Desnoyers, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Dmitry Vyukov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf

On Tue, Nov 17, 2020 at 08:09:22PM -0500, Steven Rostedt wrote:
> On Tue, 17 Nov 2020 16:42:44 -0800
> Matt Mullins <mmullins@mmlx.us> wrote:
> 
> 
> > > Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.  
> > 
> > I'm not sure if this is a practical issue, but without WRITE_ONCE, can't
> > the write be torn?  A racing __traceiter_ could potentially see a
> > half-modified function pointer, which wouldn't work out too well.
> 
> This has been discussed before, and Linus said:
> 
> "We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> because of some theoretical "compiler is free to do garbage"
> arguments. If such garbage happens, we need to fix the compiler"
> 
> https://lore.kernel.org/lkml/CAHk-=wi_KeD1M-_-_SU_H92vJ-yNkDnAGhAS=RR1yNNGWKW+aA@mail.gmail.com/

I have to ask...  Did the ARM compilers get fixed?  As of a few
months ago, they would tear stores of some constant values.

> > This was actually my gut instinct before I wrote the __GFP_NOFAIL
> > instead -- currently that whole array's memory ordering is provided by
> > RCU and I didn't dive deep enough to evaluate getting too clever with
> > atomic modifications to it.
> 
> The pointers are always going to be the architecture word size (by
> definition), and any compiler that tears a write of a long is broken.

But yes, if the write is of a non-constant pointer, the compiler does
have less leverage.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
  2020-11-13 16:08     ` Yonghong Song
@ 2021-02-10 18:23       ` Eric Dumazet
  2021-02-10 19:52         ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2021-02-10 18:23 UTC (permalink / raw)
  To: Yonghong Song, Dmitry Vyukov, syzbot, Andrew Morton, andrii,
	Alexei Starovoitov, bpf, Daniel Borkmann, David Miller,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	KP Singh, Jakub Kicinski, LKML, Ingo Molnar, Ingo Molnar,
	mmullins, netdev, Peter Zijlstra, Steven Rostedt, Song Liu,
	syzkaller-bugs



On 11/13/20 5:08 PM, Yonghong Song wrote:
> 
> 
> On 11/12/20 9:37 PM, Matt Mullins wrote:
>> On Wed, Nov 11, 2020 at 03:57:50PM +0100, Dmitry Vyukov wrote:
>>> On Mon, Nov 2, 2020 at 12:54 PM syzbot
>>> <syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> syzbot found the following issue on:
>>>>
>>>> HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
>>>> git tree:       bpf
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c500000
>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
>>>> compiler:       gcc (GCC) 10.1.0-syz 20200507
>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b032500000
>>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c500000
>>>>
>>>> The issue was bisected to:
>>>>
>>>> commit 9df1c28bb75217b244257152ab7d788bb2a386d0
>>>> Author: Matt Mullins <mmullins@fb.com>
>>>> Date:   Fri Apr 26 18:49:47 2019 +0000
>>>>
>>>>      bpf: add writable context for raw tracepoints
>>>
>>>
>>> We have a number of kernel memory corruptions related to bpf_trace_run now:
>>> https://groups.google.com/g/syzkaller-bugs/search?q=kernel/trace/bpf_trace.c
>>>
>>> Can raw tracepoints "legally" corrupt kernel memory (a-la /dev/kmem)?
>>> Or they shouldn't?
>>>
>>> Looking at the description of Matt's commit, it seems that corruptions
>>> should not be possible (bounded buffer, checked size, etc). Then it
>>> means it's a real kernel bug?
>>
>> This bug doesn't seem to be related to the writability of the
>> tracepoint; it bisected to that commit simply because it used
>> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE for the reproducer and it EINVAL's
>> before that program type was introduced.  The BPF program it loads is
>> pretty much a no-op.
>>
>> The problem here is a kmalloc failure injection into
>> tracepoint_probe_unregister, but the error is ignored -- so the bpf
>> program is freed even though the tracepoint is never unregistered.
>>
>> I have a first pass at a patch to pipe through the error code, but it's
>> pretty ugly.  It's also called from the file_operations ->release(), for
> 
> Maybe you can still post the patch, so people can review and make suggestions which may lead to a *better* solution.


ping

This bug is still there.


> 
>> which errors are solidly ignored in __fput(), so I'm not sure what the
>> best way to handle ENOMEM is...
>>
>>>
>>>
>>>
>>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12b6c4da500000
>>>> final oops:     https://syzkaller.appspot.com/x/report.txt?x=11b6c4da500000
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=16b6c4da500000
>>>>
>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>>> Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com
>>>> Fixes: 9df1c28bb752 ("bpf: add writable context for raw tracepoints")
>>>>
>>>> ==================================================================
>>>> BUG: KASAN: vmalloc-out-of-bounds in __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
>>>> BUG: KASAN: vmalloc-out-of-bounds in bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
>>>> Read of size 8 at addr ffffc90000e6c030 by task kworker/0:3/3754
>>>>
>>>> CPU: 0 PID: 3754 Comm: kworker/0:3 Not tainted 5.9.0-syzkaller #0
>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>> Workqueue:  0x0 (events)
>>>> Call Trace:
>>>>   __dump_stack lib/dump_stack.c:77 [inline]
>>>>   dump_stack+0x107/0x163 lib/dump_stack.c:118
>>>>   print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
>>>>   __kasan_report mm/kasan/report.c:545 [inline]
>>>>   kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
>>>>   __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
>>>>   bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
>>>>   __bpf_trace_sched_switch+0xdc/0x120 include/trace/events/sched.h:138
>>>>   __traceiter_sched_switch+0x64/0xb0 include/trace/events/sched.h:138
>>>>   trace_sched_switch include/trace/events/sched.h:138 [inline]
>>>>   __schedule+0xeb8/0x2130 kernel/sched/core.c:4520
>>>>   schedule+0xcf/0x270 kernel/sched/core.c:4601
>>>>   worker_thread+0x14c/0x1120 kernel/workqueue.c:2439
>>>>   kthread+0x3af/0x4a0 kernel/kthread.c:292
>>>>   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
>>>>
>>>>
>>>> Memory state around the buggy address:
>>>>   ffffc90000e6bf00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>>>>   ffffc90000e6bf80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>>>>> ffffc90000e6c000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>>>>                                       ^
>>>>   ffffc90000e6c080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>>>>   ffffc90000e6c100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
>>>> ==================================================================
> [...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
  2021-02-10 18:23       ` Eric Dumazet
@ 2021-02-10 19:52         ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2021-02-10 19:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yonghong Song, Dmitry Vyukov, syzbot, Andrew Morton, andrii,
	Alexei Starovoitov, bpf, Daniel Borkmann, David Miller,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	KP Singh, Jakub Kicinski, LKML, Ingo Molnar, Ingo Molnar,
	mmullins, netdev, Peter Zijlstra, Song Liu, syzkaller-bugs

On Wed, 10 Feb 2021 19:23:38 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> >> The problem here is a kmalloc failure injection into
> >> tracepoint_probe_unregister, but the error is ignored -- so the bpf
> >> program is freed even though the tracepoint is never unregistered.
> >>
> >> I have a first pass at a patch to pipe through the error code, but it's
> >> pretty ugly.  It's also called from the file_operations ->release(), for  
> > 
> > Maybe you can still post the patch, so people can review and make suggestions which may lead to a *better* solution.  
> 
> 
> ping
> 
> This bug is still there.

Is this a bug reported via syzkaller?

I have this fix in linux-next:

  befe6d946551 ("tracepoint: Do not fail unregistering a probe due to memory failure")

But because it relies on undefined behavior (calling a stub that simply
returns from a call site that passes parameters, which Peter Zijlstra says
is OK), I'm hesitant to send it to Linus now or tag it for stable.

Now this can only happen if kmalloc fails here (called from func_remove).

static inline void *allocate_probes(int count)
{
	struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
				       GFP_KERNEL);
	return p == NULL ? NULL : p->probes;
}

As probes and count together are typically much less than a page (unless
you are doing fuzz testing and adding a ton of callbacks to a single
tracepoint), that kmalloc should always succeed.
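
For reference, rough numbers (assuming the usual 64-bit layouts of
struct tp_probes and struct tracepoint_func):

/*
 * struct_size(p, probes, count)
 *	== sizeof(struct tp_probes)                (rcu_head, 16 bytes)
 *	 + count * sizeof(struct tracepoint_func)  (~24 bytes each)
 *
 * so even ~170 probes on a single tracepoint keep the allocation
 * within one 4 KiB page, i.e. a plain order-0 GFP_KERNEL request.
 */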

The failure above can only happen if allocate_probes returns NULL, which is
extremely unlikely.

My question is, how is this triggered? And this should only be triggered by
root doing stupid crap. Is it that critical to have fixed?

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-02-10 19:53 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-02 11:54 KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3 syzbot
2020-11-11 14:57 ` Dmitry Vyukov
2020-11-13  5:37   ` Matt Mullins
2020-11-13 16:08     ` Yonghong Song
2021-02-10 18:23       ` Eric Dumazet
2021-02-10 19:52         ` Steven Rostedt
2020-11-15  5:52 ` [PATCH] bpf: don't fail kmalloc while releasing raw_tp Matt Mullins
2020-11-16 17:19   ` Steven Rostedt
2020-11-16 20:37     ` Mathieu Desnoyers
2020-11-16 20:44       ` Steven Rostedt
2020-11-16 21:02         ` Steven Rostedt
2020-11-16 21:06           ` Steven Rostedt
2020-11-16 21:34           ` Mathieu Desnoyers
2020-11-16 22:10             ` Steven Rostedt
2020-11-17 23:05               ` Mathieu Desnoyers
2020-11-18  0:42                 ` Matt Mullins
2020-11-18  1:09                   ` Steven Rostedt
2020-11-18  4:57                     ` Paul E. McKenney
2020-11-16 21:21         ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).