Re: [PATCH bpf] x86/bpf: handle bpf-program-triggered exceptions properly

From: Yonghong Song <yhs@fb.com>
To: Andy Lutomirski <luto@kernel.org>, Jann Horn <jannh@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andriin@fb.com>, Martin KaFai Lau <kafai@fb.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	kernel-team <kernel-team@fb.com>, X86 ML <x86@kernel.org>,
	KP Singh <kpsingh@kernel.org>
Subject: Re: [PATCH bpf] x86/bpf: handle bpf-program-triggered exceptions properly
Date: Wed, 27 Jan 2021 17:05:54 -0800	[thread overview]
Message-ID: <92a66173-6512-f1bc-0f9a-530c6c9a1ef0@fb.com> (raw)
In-Reply-To: <CALCETrX157htkCF81zb+5BBo9C_V39YNdt7yXRcFGGw_SRs02Q@mail.gmail.com>

On 1/27/21 1:47 PM, Andy Lutomirski wrote:
> On Mon, Jan 25, 2021 at 4:12 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> When reviewing patch ([1]), which adds a script to run bpf selftest
>> through qemu at /sbin/init stage, I found the following kernel bug
>> warning:
>>
>> [  112.118892] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1351
>> [  112.119805] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 354, name: new_name
>> [  112.120512] 3 locks held by new_name/354:
>> [  112.120868]  #0: ffff88800476e0a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3a/0x3d0
>> [  112.121573]  #1: ffffffff82d69800 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x5/0x160
>> [  112.122348]  #2: ffff8880061c2088 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x1a1/0x640
>> [  112.123128] Preemption disabled at:
>> [  112.123130] [<ffffffff8108f913>] migrate_disable+0x33/0x80
>> [  112.123942] CPU: 0 PID: 354 Comm: new_name Tainted: G           O      5.11.0-rc4-00524-g6e66fbb
>> 10597-dirty #1249
>> [  112.124822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01
>> /2014
>> [  112.125614] Call Trace:
>> [  112.125835]  dump_stack+0x77/0x97
>> [  112.126137]  ___might_sleep.cold.119+0xf2/0x106
>> [  112.126537]  exc_page_fault+0x1c1/0x640
>> [  112.126888]  asm_exc_page_fault+0x1e/0x30
>> [  112.127241] RIP: 0010:bpf_prog_0a182df2d34af188_dump_bpf_prog+0xf5/0xb3c
>> [  112.127825] Code: 00 00 8b 7d f4 41 8b 76 44 48 39 f7 73 06 48 01 fb 49 89 df 4c 89 7d d8 49 8b
>> bd 20 01 00 00 48 89 7d e0 49 8b bd e0 00 00 00 <48> 8b 7f 20 48 01 d7 48 89 7d e8 48 89 e9 48 83 c
>> 1 d0 48 8b 7d c8
>> [  112.129433] RSP: 0018:ffffc9000035fdc8 EFLAGS: 00010282
>> [  112.129895] RAX: 0000000000000000 RBX: ffff888005a49458 RCX: 0000000000000024
>> [  112.130509] RDX: 00000000000002f0 RSI: 0000000000000509 RDI: 0000000000000000
>> [  112.131126] RBP: ffffc9000035fe20 R08: 0000000000000001 R09: 0000000000000000
>> [  112.131737] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000400
>> [  112.132355] R13: ffff888006085800 R14: ffff888004718540 R15: ffff888005a49458
>> [  112.132990]  ? bpf_prog_0a182df2d34af188_dump_bpf_prog+0xc8/0xb3c
>> [  112.133526]  bpf_iter_run_prog+0x75/0x160
>> [  112.133880]  __bpf_prog_seq_show+0x39/0x40
>> [  112.134258]  bpf_seq_read+0xf6/0x3d0
>> [  112.134582]  vfs_read+0xa3/0x1b0
>> [  112.134873]  ksys_read+0x4f/0xc0
>> [  112.135166]  do_syscall_64+0x2d/0x40
>> [  112.135482]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> To reproduce the issue, with patch [1] and use the following script:
>>    tools/testing/selftests/bpf/run_in_vm.sh -- cat /sys/fs/bpf/progs.debug
>>
>> The reason of the above kernel warning is due to bpf program
>> tries to dereference an address of 0 and which is not caught
>> by bpf exception handling logic.
>>
>> ...
>> SEC("iter/bpf_prog")
>> int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
>> {
>>          struct bpf_prog *prog = ctx->prog;
>>          struct bpf_prog_aux *aux;
>>          ...
>>          if (!prog)
>>                  return 0;
>>          aux = prog->aux;
>>          ...
>>          ... aux->dst_prog->aux->name ...
>>          return 0;
>> }
>>
>> If the aux->dst_prog is NULL pointer, a fault will happen when trying
>> to access aux->dst_prog->aux.
>>
> 
> Which would be a bug in the bpf verifier, no?  This is a bpf program
> that apparently passed verification, and it somehow dereferenced a
> NULL pointer.
> 
> Let's enumerate some cases.
> 
> 1. x86-like architecture, SMAP enabled.  The CPU determines that this
> is bogus, and bpf gets lucky, because the x86 code runs the exception
> handler instead of summarily OOPSing.  IMO it would be entirely
> reasonable to OOPS.
> 
> 2 x86-like architecture, SMAP disabled.  This looks like a valid user
> access, and for all bpf knows, 0 might be an actual mapped address,
> and it might have userfaultfd attached, etc.  And it's called from a
> non-sleepable context.  The warning is entirely correct.
> 
> 3. s390x-like architecture.  This is a dereference of a bad kernel
> pointer.  OOPS, unless you get lucky.
> 
> 
> This patch is totally bogus.  You can't work around this by some magic
> exception handling.  Unless I'm missing something big, this is a bpf
> bug.  The following is not a valid kernel code pattern:
> 
> label:
>    dereference might-be-NULL kernel pointer
> _EXHANDLER_...(label, ...); /* try to recover */
> 
> This appears to have been introduced by:
> 
> commit 3dec541b2e632d630fe7142ed44f0b3702ef1f8c
> Author: Alexei Starovoitov <ast@kernel.org>
> Date:   Tue Oct 15 20:25:03 2019 -0700
> 
>      bpf: Add support for BTF pointers to x86 JIT
> 
> Alexei, this looks like a very long-winded and ultimately incorrect
> way to remove the branch from:
> 
> if (ptr)
>    val = *ptr;
> 
> Wouldn't it be better to either just emit the branch directly or to
> make sure that the pointer is valid in the first place?

Let me explain the motivation for this patch.

Previously, for any kernel data structure access,
people has to use bpf_probe_read() or bpf_probe_read_kernel()
helper even most of these accesses are okay and will not
fault. For example, for
    int t = a->b->c->d
three bpf_probe_read() will be needed, e.g.,
    bpf_probe_read_kernel(&t1, sizeof(t1), &a->b);
    bpf_probe_read_kernel(&t2, sizeof(t2), &t1->c);
    bpf_probe_read_kernel(&t, sizeof(t), &t2->d);

if there is a fault, bpf_probe_read_kernel() helper will
suppress the exception and clears the dest buffer and
return.

The above usage of bpf_probe_read_kernel()
is complicated and not C like and bpf developers
does not like it.

bcc (https://github.com/iovisor/bcc/) permits
users to write "a->b->c->d" styles and then through
clang rewriter to convert it to a series of
bpf_probe_read_kernel()'s. But most users are
directly using clang to compile their programs so
they have to write bpf_probe_read_kernel()'s
by themselves.

The motivation here is to improve user experience so
user can just write
    int t = a->b->c->d
some kernel will automatically take care of exceptions
and maintain bpf_probe_read_kernel() semantics.
So what the above patch essentially did is to check if the "regs->ip"
is one of ips which try to a "bpf_probe_read_kernel()" (actually
a direct access), it will fix up exception (clear the dest register)
and returns.

For a->b->c->d, some users may add "if (ptr) check"
for some of them and if that is the
case, compiler/verifier will honor that. Some users
may not add if they are certain from code that in most
or all cases, pointer will not be null. From verifier
perspective, it will be hard to decide whether
a->b, a->b->c is null or not, adding null checks
to every kernel de-references might be excessive. Also,
not 100% sure whether with null check, the pointer
dereference will absolutely not produce fault or not.
If there is no such guarantee then bpf_probe_read_kernel()
will be still needed.

I see page fault handler specially processed page fault for
kprobe and vsyscall. maybe page faults generated by special
bpf insns (simulating bpf_probe_read_kernel()) can also
be specially processed here.

> 
> --Andy
>