Re: KASAN: use-after-free Read in ___bpf_prog_run

From: Yonghong Song <yhs@meta.com>
To: Hao Sun <sunhao.th@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>
Subject: Re: KASAN: use-after-free Read in ___bpf_prog_run
Date: Mon, 16 Jan 2023 23:50:15 -0800	[thread overview]
Message-ID: <18306a8f-e71f-b322-6c09-4a498e28c5e6@meta.com> (raw)
In-Reply-To: <F4EB9C63-8ED6-4640-9BD4-4709F01589FD@gmail.com>

On 1/16/23 4:45 AM, Hao Sun wrote:
> 
> 
>> On 12 Jan 2023, at 2:59 PM, Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 1/9/23 5:21 AM, Hao Sun wrote:
>>> Yonghong Song <yhs@meta.com> 于2022年12月18日周日 00:57写道：
>>>>
>>>>
>>>>
>>>> On 12/16/22 10:54 PM, Hao Sun wrote:
>>>>>
>>>>>
>>>>>> On 17 Dec 2022, at 1:07 PM, Yonghong Song <yhs@meta.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/14/22 11:49 PM, Hao Sun wrote:
>>>>>>> Hi,
>>>>>>> The following KASAN report can be triggered by loading and test
>>>>>>> running this simple BPF prog with a random data/ctx:
>>>>>>> 0: r0 = bpf_get_current_task_btf      ;
>>>>>>> R0_w=trusted_ptr_task_struct(off=0,imm=0)
>>>>>>> 1: r0 = *(u32 *)(r0 +8192)       ;
>>>>>>> R0_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
>>>>>>> 2: exit
>>>>>>> I've simplified the C reproducer but didn't find the root cause.
>>>>>>> JIT was disabled, and the interpreter triggered UAF when executing
>>>>>>> the load insn. A slab-out-of-bound read can also be triggered:
>>>>>>> https://pastebin.com/raw/g9zXr8jU
>>>>>>> This can be reproduced on:
>>>>>>> HEAD commit: b148c8b9b926 selftests/bpf: Add few corner cases to test
>>>>>>> padding handling of btf_dump
>>>>>>> git tree: bpf-next
>>>>>>> console log: https://pastebin.com/raw/1EUi9tJe
>>>>>>> kernel config: https://pastebin.com/raw/rgY3AJDZ
>>>>>>> C reproducer: https://pastebin.com/raw/cfVGuCBm
>>>>>>
>>>>>> I I tried with your above kernel config and C reproducer and cannot reproduce the kasan issue you reported.
>>>>>>
>>>>>> [root@arch-fb-vm1 bpf-next]# ./a.out
>>>>>> func#0 @0
>>>>>> 0: R1=ctx(off=0,imm=0) R10=fp0
>>>>>> 0: (85) call bpf_get_current_task_btf#158     ; R0_w=trusted_ptr_task_struct(off=0,imm=0)
>>>>>> 1: (61) r0 = *(u32 *)(r0 +8192)       ; R0_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
>>>>>> 2: (95) exit
>>>>>> processed 3 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
>>>>>>
>>>>>> prog fd: 3
>>>>>> [root@arch-fb-vm1 bpf-next]#
>>>>>>
>>>>>> Your config indeed has kasan on.
>>>>>
>>>>> Hi,
>>>>>
>>>>> I can still reproduce this on a latest bpf-next build: 0e43662e61f25
>>>>> (“tools/resolve_btfids: Use pkg-config to locate libelf”).
>>>>> The simplified C reproducer sometime need to be run twice to trigger
>>>>> the UAF. Also note that interpreter is required. Here is the original
>>>>> C reproducer that loads and runs the BPF prog continuously for your
>>>>> convenience:
>>>>> https://pastebin.com/raw/WSJuNnVU
>>>>>
>>>>
>>>> I still cannot reproduce with more than 10 runs. The config has jit off
>>>> so it already uses interpreter. It has kasan on as well.
>>>> # CONFIG_BPF_JIT is not set
>>>>
>>>> Since you can reproduce it, I guess it would be great if you can
>>>> continue to debug this.
>>>>
>>> The load insn ‘r0 = *(u32*) (current + 8192)’ is OOB, because sizeof(task_struct)
>>> is 7240 as shown in KASAN report. The issue is that struct task_struct is special,
>>> its runtime size is actually smaller than it static type size. In X86:
>>> task_struct->thread_struct->fpu->fpstate->union fpregs_state is
>>> /*
>>> * ...
>>> * The size of the structure is determined by the largest
>>> * member - which is the xsave area. The padding is there
>>> * to ensure that statically-allocated task_structs (just
>>> * the init_task today) have enough space.
>>> */
>>> union fpregs_state {
>>> struct fregs_state fsave;
>>> struct fxregs_state fxsave;
>>> struct swregs_state soft;
>>> struct xregs_state xsave;
>>> u8 __padding[PAGE_SIZE];
>>> };
>>> In btf_struct_access(), the resolved size for task_struct is 10496, much bigger
>>> than its runtime size, so the prog in reproducer passed the verifier and leads
>>> to the oob. This can happen to all similar types, whose runtime size is smaller
>>> than its static size.
>>> Not sure how many similar cases are there, maybe special check to task_struct
>>> is enough. Any hint on how this should be addressed?
>>
>> This should a corner case, I am not aware of other allocations like this.
>>
>> For a normal program, if the access chain looks
>> like
>> task_struct->thread_struct->fpu->fpstate->fpregs_state->{fsave,fxsave, soft, xsave},
>> we should not hit this issue. So I think we don't need to address this
>> issue in kernel. The test itself should filter this out.
> 
> Maybe I didn’t make my point clear. The issue here is that the runtime size
> of task_struct is `arch_task_struct_size`, which equals to the following,
> see fpu__init_task_struct_size():
> 
> sizeof(task_struct) - sizeof(fpregs_state) + fpu_kernel_cfg.default_size
> 
> However, the verifier validates task_struct access based on the ty info in
> btf_vmlinux, which equals to sizeof(task_struct), much bigger than `arch_
> task_struct_size` due to the `__padding[PAGE_SIZE]` in fpregs_state, this
> leads to OOB access.
> 
> We should not allow progs that access the task_struct beyond `arch_task_
> struct_size` to be loaded. So maybe we should add a fixup in `btf_parse_
> vmlinux()` to assign the correct size to each type of this chain:
> task_struct->thread_struct->fpu->fpstate->fpregs_state;
> Or maybe we should fix this during generating BTF of vmlinux?

Thanks for detailed explanation. I do understand the problem with a 
slight caveat. I mentioned if the user doing a proper access in C
code like

task_struct->thread_struct->fpu->fpstate->fpregs_state->{fsave,fxsave, 
soft, xsave}
we should not hit the problem if the bpf program is written to
retrieve *correct* and *expected* data.

But if the bpf program tries to access the data beyond the *correct*
boundary (i.e. beyond fpu__init_task_struct_size()), uninit mem access
might happen.

Since fpu__init_task_struct_size() might be different for different
arch features, we cannot change vmlinux BTF. We could add a check in
btf_struct_access such that if
   . btf_id references to struct task_struct
   . walked offset > arch_task_struct_size
reject the program with a verification failure.
But I kind of don't like this special case as it only happens with
*incorrect* program. Note that we won't have fault as the memory
read is protected by bpf exception handling.