From: Yonghong Song <yhs@fb.com>
To: Roman Gushchin <guro@fb.com>
Cc: "Jiri Olsa" <jolsa@redhat.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"YiFei Zhu" <zhuyifei@google.com>,
	"Andrii Nakryiko" <andriin@fb.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	"Martin KaFai Lau" <kafai@fb.com>,
	"Song Liu" <songliubraving@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@chromium.org>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Yauheni Kaliuta" <ykaliuta@redhat.com>,
	"Jiri Benc" <jbenc@redhat.com>, "Hangbin Liu" <haliu@redhat.com>,
	"Stanislav Fomichev" <sdf@google.com>
Subject: Re: [BUG] hitting bug when running spinlock test
Date: Tue, 9 Mar 2021 10:36:06 -0800	[thread overview]
Message-ID: <45df5090-870a-ac49-3442-24ed67266d0e@fb.com> (raw)
In-Reply-To: <YEe8mt+iJqDXD6CW@carbon.dhcp.thefacebook.com>



On 3/9/21 10:21 AM, Roman Gushchin wrote:
> On Mon, Mar 08, 2021 at 09:44:08PM -0800, Yonghong Song wrote:
>>
>>
>> On 3/5/21 1:10 PM, Yonghong Song wrote:
>>>
>>>
>>> On 3/5/21 12:38 PM, Roman Gushchin wrote:
>>>> On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
>>>>> hi,
>>>>> I'm getting the attached BUG/crash when running selftests in parallel, like:
>>>>>
>>>>>     while :; do ./test_progs -t spinlock; done
>>>>>     while :; do ./test_progs ; done
>>>>>
>>>>> it's the latest bpf-next/master, I can send the .config if needed,
>>>>> but I don't think there's anything special about it, because I saw
>>>>> the bug on other servers with different generic configs
>>>>>
>>>>> it looks like it's related to cgroup local storage, for some reason
>>>>> the storage deref returns NULL
>>>>>
>>>>> I'm a bit lost in this code, so any help would be great ;-)
>>>>
>>>> Hi!
>>>>
>>>> I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
>>>> unkillable BPF_PROG_TEST_RUN").
>>>
>>> Thanks, Roman. I did some experiments and found that the reason for the
>>> NULL storage deref is that a tracing program (most likely a kprobe) is
>>> run after bpf_cgroup_storage_set() is called but before the bpf program
>>> calls bpf_get_local_storage(). Note that the BPF_PROG_RUN_ARRAY_CHECK
>>> macro used by trace_call_bpf() does call bpf_cgroup_storage_set(); see
>>> the sketch below.
>>>
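
To make that window concrete, here is a rough sketch (simplified from
memory, not verbatim kernel code) of what the prog-array run macros
behind trace_call_bpf() do; the per-item bpf_cgroup_storage_set() call
is what overwrites whatever bpf_test_run() had just installed:

        /* simplified sketch of __BPF_PROG_RUN_ARRAY; NULL checks and
         * error handling elided, not verbatim kernel code
         */
        struct bpf_prog_array_item *item;
        struct bpf_prog *prog;
        u32 ret = 1;

        migrate_disable();
        rcu_read_lock();
        item = &rcu_dereference(array)->items[0];
        while ((prog = READ_ONCE(item->prog))) {
                /* overwrites the percpu cgroup storage pointers that
                 * bpf_test_run() set right before its BPF_PROG_RUN()
                 */
                bpf_cgroup_storage_set(item->cgroup_storage);
                ret &= BPF_PROG_RUN(prog, ctx);
                item++;
        }
        rcu_read_unlock();
        migrate_enable();
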
>>>> Prior to it, we were running the test program in the
>>>> preempt_disable() && rcu_read_lock()
>>>> section:
>>>>
>>>> preempt_disable();
>>>> rcu_read_lock();
>>>> bpf_cgroup_storage_set(storage);
>>>> ret = BPF_PROG_RUN(prog, ctx);
>>>> rcu_read_unlock();
>>>> preempt_enable();
>>>>
>>>> So, a percpu variable with a cgroup local storage pointer couldn't
>>>> go away.
>>>
>>> I think even with preempt_disable(), if the bpf program does a map
>>> lookup and there is a kprobe bpf program on htab_map_lookup_elem(), we
>>> will hit the same issue, since BPF_PROG_RUN_ARRAY_CHECK will call
>>> bpf_cgroup_storage_set() too. I need to write a test case to confirm
>>> this, though.
>>>
>>>>
>>>> After df1a2cb7c74b we can temporarily re-enable preemption, so nothing
>>>> prevents another program from calling bpf_cgroup_storage_set() on the
>>>> same cpu. I guess that's exactly what happens here.
>>>
>>> It is. I confirmed.
>>
>> Actually, the failure is not due to breaking up the region with
>> cond_resched(): even then, bpf_cgroup_storage_set() still happens
>> inside the protected region together with the program run, so that
>> part is okay. What I confirmed is that changing migrate_{disable/enable}
>> back to preempt_{disable/enable} fixes the issue.
> 
> Hm, how so? If preemption is enabled, another task/bpf program can start
> executing on the same cpu and set their cgroup storage. I guess it's harder
> to reproduce or it will result in the (bpf map) memory corruption instead
> of a panic, but I don't think it's safe.

The code has been refactored recently. To make it easier to follow, here 
is the code right before the refactoring:

         rcu_read_lock();
         migrate_disable();
         time_start = ktime_get_ns();
         for (i = 0; i < repeat; i++) {
                 bpf_cgroup_storage_set(storage);

                 if (xdp)
                         *retval = bpf_prog_run_xdp(prog, ctx);
                 else
                         *retval = BPF_PROG_RUN(prog, ctx);

                 if (signal_pending(current)) {
                         ret = -EINTR;
                         break;
                 }

                 if (need_resched()) {
                         time_spent += ktime_get_ns() - time_start;
                         migrate_enable();
                         rcu_read_unlock();

                         cond_resched();

                         rcu_read_lock();
                         migrate_disable();
                         time_start = ktime_get_ns();
                 }
         }
         time_spent += ktime_get_ns() - time_start;
         migrate_enable();
         rcu_read_unlock();

bpf_cgroup_storage_set() is called inside the migrate_disable()/enable()
pair, just as it was previously called inside preempt_disable()/enable(),
so that part should be fine.
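
To spell out why that still breaks with migrate_disable(): migration is
pinned but preemption is not, so another task (and whatever bpf programs
it triggers) can run on the same cpu in between. A rough timeline with
hypothetical tasks A and B:

        /*
         * CPU 0, with migrate_disable() only:
         *
         *   A: migrate_disable();                  // preemption still on
         *   A: bpf_cgroup_storage_set(storage_A);
         *      --- scheduler switches to task B on the same cpu ---
         *   B: bpf_cgroup_storage_set(storage_B);  // percpu ptr clobbered
         *   B: BPF_PROG_RUN(...);
         *      --- back to task A ---
         *   A: BPF_PROG_RUN(prog, ctx);            // bpf_get_local_storage()
         *                                          // now sees storage_B/NULL
         *
         * With preempt_disable() the A->B switch cannot happen between
         * bpf_cgroup_storage_set() and BPF_PROG_RUN().
         */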

> 
>>
>> So migrate_{disable/enable} is the issue, since any other process (and
>> its bpf program) can preempt the current process/bpf program and run.
> 
> Oh, I didn't know about the preempt_{disable/enable}/migration_{disable/enable}
> change. It's definitely not safe from a cgroup local storage perspective.
> 
>> Currently, for each program, we set the local storage before the
>> program runs, and each program may access multiple local storage
>> maps. So we might need something similar to sk_local_storage.
>> Considering the possibility of deeply nested migrate_{disable/enable},
>> the cgroup local storage would have to be preserved in the prog/map
>> data structure and not as a percpu variable, since it will be very
>> hard to save and restore percpu variable contents when migration can
>> happen anywhere. I don't have a concrete design yet, just throwing
>> some ideas out here.
> 
> Initially I thought about saving this pointer on the stack, but then we
> would need some sort of gcc/assembly magic to get this pointer from the
> stack outside of the current scope. At that time we didn't have sleepable
> programs, so the percpu approach looked simpler and more reliable. Maybe
> it's time to review it.

Indeed this is the time.
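
For the record, a minimal sketch of that save/restore direction, assuming
a changed helper that hands back the previous percpu pointers (both
function names below are hypothetical; today's bpf_cgroup_storage_set()
returns void):

        struct bpf_cgroup_storage *saved[MAX_BPF_CGROUP_STORAGE_TYPE];

        /* hypothetical: install new storage, copy out the old pointers */
        bpf_cgroup_storage_set_and_save(storage, saved);
        ret = BPF_PROG_RUN(prog, ctx);
        /* hypothetical: put the previous pointers back for the outer run */
        bpf_cgroup_storage_restore(saved);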

> 
>>
>> BTW, I sent a patch to prevent tracing programs from messing with
>> cgroup local storage:
>>     https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@fb.com/T/#u
>> With it, all programs accessing cgroup local storage should be in
>> process context, and we don't need to worry about kprobe-induced
>> percpu local storage access.
> 
> Thank you! My only issue is that the commit log reads like an optimization
> (as if we were calling set_cgroup_storage() for no good reason), whereas,
> if I understand it correctly, it prevents some class of problems.

Yes, it prevents real problems as well. The reason I did not say so is
that the patch does not really fix the fundamental issue. But it does
prevent some issues. Let me reword the commit message.
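
(The snippet below is only an illustration of the kind of restriction
being discussed, not the patch at the link above; the real change may
look quite different:)

        /* illustrative only: reject cgroup local storage maps for tracing
         * program types in the verifier's map/prog compatibility check
         */
        if ((map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
             map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) &&
            is_tracing_prog_type(prog->type)) {
                verbose(env, "tracing progs cannot use cgroup local storage\n");
                return -EINVAL;
        }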

> 
> Thanks!
> 
>>
>>>
>>>>
>>>> One option to fix it is to make bpf_cgroup_storage_set() return the
>>>> old value, save it in a local variable, and restore it after the
>>>> execution of the program.
>>>
>>> In this particular case, we are doing bpf_test_run; we explicitly
>>> allocate storage and call bpf_cgroup_storage_set() right before
>>> each BPF_PROG_RUN.
>>>
>>>> But I didn't follow the development of sleepable bpf programs
>>>> closely, so I could easily be missing something.
>>>
>>> Yes, sleepable bpf programs are another complication. I think we need
>>> a variable similar to bpf_prog_active, which would prevent nested bpf
>>> program execution for those bpf programs having a local_storage map;
>>> see the sketch below. Will try to craft some patch to facilitate the
>>> discussion.
>>>
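
The bpf_prog_active pattern referred to above looks roughly like this
(simplified from kernel/trace/bpf_trace.c; a cgroup-storage-specific
nesting counter would have the same shape):

        DEFINE_PER_CPU(int, bpf_prog_active);

        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
                /* a bpf program is already running on this cpu;
                 * skip instead of nesting
                 */
                ret = 0;
                goto out;
        }
        ret = BPF_PROG_RUN(prog, ctx);
out:
        __this_cpu_dec(bpf_prog_active);
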
>> [...]
