From: Yonghong Song <yhs@fb.com>
To: Roman Gushchin <guro@fb.com>
Cc: "Jiri Olsa" <jolsa@redhat.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"YiFei Zhu" <zhuyifei@google.com>,
	"Andrii Nakryiko" <andriin@fb.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	"Martin KaFai Lau" <kafai@fb.com>,
	"Song Liu" <songliubraving@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@chromium.org>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Yauheni Kaliuta" <ykaliuta@redhat.com>,
	"Jiri Benc" <jbenc@redhat.com>, "Hangbin Liu" <haliu@redhat.com>,
	"Stanislav Fomichev" <sdf@google.com>
Subject: Re: [BUG] hitting bug when running spinlock test
Date: Tue, 9 Mar 2021 10:36:06 -0800	[thread overview]
Message-ID: <45df5090-870a-ac49-3442-24ed67266d0e@fb.com> (raw)
In-Reply-To: <YEe8mt+iJqDXD6CW@carbon.dhcp.thefacebook.com>



On 3/9/21 10:21 AM, Roman Gushchin wrote:
> On Mon, Mar 08, 2021 at 09:44:08PM -0800, Yonghong Song wrote:
>>
>>
>> On 3/5/21 1:10 PM, Yonghong Song wrote:
>>>
>>>
>>> On 3/5/21 12:38 PM, Roman Gushchin wrote:
>>>> On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
>>>>> hi,
>>>>> I'm getting the attached BUG/crash when running selftests in parallel, like:
>>>>>
>>>>>     while :; do ./test_progs -t spinlock; done
>>>>>     while :; do ./test_progs ; done
>>>>>
>>>>> it's the latest bpf-next/master, I can send the .config if needed,
>>>>> but I don't think there's anything special about it, because I saw
>>>>> the bug on other servers with different generic configs
>>>>>
>>>>> it looks like it's related to cgroup local storage, for some reason
>>>>> the storage deref returns NULL
>>>>>
>>>>> I'm a bit lost in this code, so any help would be great ;-)
>>>>
>>>> Hi!
>>>>
>>>> I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
>>>> unkillable BPF_PROG_TEST_RUN").
>>>
>>> Thanks, Roman. I did some experiments and found that the reason for the
>>> NULL storage deref is that a tracing program (most likely a kprobe) is
>>> run after bpf_cgroup_storage_set() is called but before the bpf program
>>> calls bpf_get_local_storage(). Note that the BPF_PROG_RUN_ARRAY_CHECK
>>> macro used by trace_call_bpf() does call bpf_cgroup_storage_set(); see
>>> the sketch below.
>>>
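
To make that window concrete, here is a rough sketch (simplified from
memory, not verbatim kernel code) of what the prog-array run macros
behind trace_call_bpf() do; the per-item bpf_cgroup_storage_set() call
is what overwrites whatever bpf_test_run() had just installed:

        /* simplified sketch of __BPF_PROG_RUN_ARRAY; NULL checks and
         * error handling elided, not verbatim kernel code
         */
        struct bpf_prog_array_item *item;
        struct bpf_prog *prog;
        u32 ret = 1;

        migrate_disable();
        rcu_read_lock();
        item = &rcu_dereference(array)->items[0];
        while ((prog = READ_ONCE(item->prog))) {
                /* overwrites the percpu cgroup storage pointers that
                 * bpf_test_run() set right before its BPF_PROG_RUN()
                 */
                bpf_cgroup_storage_set(item->cgroup_storage);
                ret &= BPF_PROG_RUN(prog, ctx);
                item++;
        }
        rcu_read_unlock();
        migrate_enable();
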
>>>> Prior to it, we were running the test program in the
>>>> preempt_disable() && rcu_read_lock()
>>>> section:
>>>>
>>>> preempt_disable();
>>>> rcu_read_lock();
>>>> bpf_cgroup_storage_set(storage);
>>>> ret = BPF_PROG_RUN(prog, ctx);
>>>> rcu_read_unlock();
>>>> preempt_enable();
>>>>
>>>> So, a percpu variable with a cgroup local storage pointer couldn't
>>>> go away.
>>>
>>> I think even with preempt_disable(), if the bpf program does a map
>>> lookup and there is a kprobe bpf program on htab_map_lookup_elem(), we
>>> will hit the same issue, since BPF_PROG_RUN_ARRAY_CHECK will call
>>> bpf_cgroup_storage_set() too. I need to write a test case to confirm
>>> this, though.
>>>
>>>>
>>>> After df1a2cb7c74b we can temporarily re-enable preemption, so nothing
>>>> prevents another program from calling bpf_cgroup_storage_set() on the
>>>> same cpu. I guess that's exactly what happens here.
>>>
>>> It is. I confirmed.
>>
>> Actually, the failure is not due to breaking up the region with
>> cond_resched(): even then, bpf_cgroup_storage_set() still happens
>> inside the protected region together with the program run, so that
>> part is okay. What I confirmed is that changing migrate_{disable/enable}
>> back to preempt_{disable/enable} fixes the issue.
> 
> Hm, how so? If preemption is enabled, another task/bpf program can start
> executing on the same cpu and set their cgroup storage. I guess it's harder
> to reproduce or it will result in the (bpf map) memory corruption instead
> of a panic, but I don't think it's safe.

The code has been refactored recently. To make it easier to follow, here 
is the code right before the refactoring:

         rcu_read_lock();
         migrate_disable();
         time_start = ktime_get_ns();
         for (i = 0; i < repeat; i++) {
                 bpf_cgroup_storage_set(storage);

                 if (xdp)
                         *retval = bpf_prog_run_xdp(prog, ctx);
                 else
                         *retval = BPF_PROG_RUN(prog, ctx);

                 if (signal_pending(current)) {
                         ret = -EINTR;
                         break;
                 }

                 if (need_resched()) {
                         time_spent += ktime_get_ns() - time_start;
                         migrate_enable();
                         rcu_read_unlock();

                         cond_resched();

                         rcu_read_lock();
                         migrate_disable();
                         time_start = ktime_get_ns();
                 }
         }
         time_spent += ktime_get_ns() - time_start;
         migrate_enable();
         rcu_read_unlock();

bpf_cgroup_storage_set() is called inside the migrate_disable()/enable()
pair, just as it was previously called inside preempt_disable()/enable(),
so that part should be fine.
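
To spell out why that still breaks with migrate_disable(): migration is
pinned but preemption is not, so another task (and whatever bpf programs
it triggers) can run on the same cpu in between. A rough timeline with
hypothetical tasks A and B:

        /*
         * CPU 0, with migrate_disable() only:
         *
         *   A: migrate_disable();                  // preemption still on
         *   A: bpf_cgroup_storage_set(storage_A);
         *      --- scheduler switches to task B on the same cpu ---
         *   B: bpf_cgroup_storage_set(storage_B);  // percpu ptr clobbered
         *   B: BPF_PROG_RUN(...);
         *      --- back to task A ---
         *   A: BPF_PROG_RUN(prog, ctx);            // bpf_get_local_storage()
         *                                          // now sees storage_B/NULL
         *
         * With preempt_disable() the A->B switch cannot happen between
         * bpf_cgroup_storage_set() and BPF_PROG_RUN().
         */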

> 
>>
>> So migrate_{disable/enable} is the issue, since any other process (and
>> its bpf program) can preempt the current process/bpf program and run.
> 
> Oh, I didn't know about the preempt_{disable/enable}/migration_{disable/enable}
> change. It's definitely not safe from a cgroup local storage perspective.
> 
>> Currently, for each program, we set the local storage before the
>> program runs, and each program may access multiple local storage
>> maps. So we might need something similar to sk_local_storage.
>> Considering the possibility of deeply nested migrate_{disable/enable},
>> the cgroup local storage would have to be preserved in the prog/map
>> data structure and not as a percpu variable, since it will be very
>> hard to save and restore percpu variable contents when migration can
>> happen anywhere. I don't have a concrete design yet, just throwing
>> some ideas out here.
> 
> Initially I thought about saving this pointer on the stack, but then we
> would need some sort of gcc/assembly magic to get this pointer from the
> stack outside of the current scope. At that time we didn't have sleepable
> programs, so the percpu approach looked simpler and more reliable. Maybe
> it's time to review it.

Indeed this is the time.
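
For the record, a minimal sketch of that save/restore direction, assuming
a changed helper that hands back the previous percpu pointers (both
function names below are hypothetical; today's bpf_cgroup_storage_set()
returns void):

        struct bpf_cgroup_storage *saved[MAX_BPF_CGROUP_STORAGE_TYPE];

        /* hypothetical: install new storage, copy out the old pointers */
        bpf_cgroup_storage_set_and_save(storage, saved);
        ret = BPF_PROG_RUN(prog, ctx);
        /* hypothetical: put the previous pointers back for the outer run */
        bpf_cgroup_storage_restore(saved);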

> 
>>
>> BTW, I sent a patch to prevent tracing programs from messing with
>> cgroup local storage:
>>     https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@fb.com/T/#u
>> With it, all programs accessing cgroup local storage should be in
>> process context, and we don't need to worry about kprobe-induced
>> percpu local storage access.
> 
> Thank you! My only issue is that the commit log reads like an optimization
> (as if we were calling set_cgroup_storage() for no good reason), whereas,
> if I understand it correctly, it prevents some class of problems.

Yes, it prevents real problems as well. The reason I did not say so is
that the patch does not really fix the fundamental issue. But it does
prevent some issues. Let me reword the commit message.
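
(The snippet below is only an illustration of the kind of restriction
being discussed, not the patch at the link above; the real change may
look quite different:)

        /* illustrative only: reject cgroup local storage maps for tracing
         * program types in the verifier's map/prog compatibility check
         */
        if ((map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
             map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) &&
            is_tracing_prog_type(prog->type)) {
                verbose(env, "tracing progs cannot use cgroup local storage\n");
                return -EINVAL;
        }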

> 
> Thanks!
> 
>>
>>>
>>>>
>>>> One option to fix it is to make bpf_cgroup_storage_set() return the
>>>> old value, save it in a local variable, and restore it after the
>>>> execution of the program.
>>>
>>> In this particular case, we are doing bpf_test_run; we explicitly
>>> allocate storage and call bpf_cgroup_storage_set() right before
>>> each BPF_PROG_RUN.
>>>
>>>> But I didn't follow the development of sleepable bpf programs
>>>> closely, so I could easily be missing something.
>>>
>>> Yes, sleepable bpf programs are another complication. I think we need
>>> a variable similar to bpf_prog_active, which would prevent nested bpf
>>> program execution for those bpf programs having a local_storage map;
>>> see the sketch below. Will try to craft some patch to facilitate the
>>> discussion.
>>>
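
The bpf_prog_active pattern referred to above looks roughly like this
(simplified from kernel/trace/bpf_trace.c; a cgroup-storage-specific
nesting counter would have the same shape):

        DEFINE_PER_CPU(int, bpf_prog_active);

        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
                /* a bpf program is already running on this cpu;
                 * skip instead of nesting
                 */
                ret = 0;
                goto out;
        }
        ret = BPF_PROG_RUN(prog, ctx);
out:
        __this_cpu_dec(bpf_prog_active);
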
>> [...]
