From: Waiman Long <longman@redhat.com>
To: Tonghao Zhang <xiangxia.m.yue@gmail.com>, Hou Tao <houtao1@huawei.com>
Cc: Hou Tao <houtao@huaweicloud.com>, Hao Luo <haoluo@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
netdev@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
bpf <bpf@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>,
Boqun Feng <boqun.feng@gmail.com>
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock
Date: Tue, 29 Nov 2022 23:07:13 -0500
Message-ID: <fb7e9567-6452-7ccc-d2d5-697eb06ac251@redhat.com>
In-Reply-To: <CAMDZJNUdE7BKL6COF3xZD04iPn_4n5ZFmmoNB-y566QSVrct5w@mail.gmail.com>
On 11/29/22 22:32, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <longman@redhat.com> wrote:
>> On 11/29/22 21:47, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <boqun.feng@gmail.com> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait instead of giving up immediately, as
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The lockdep warning arises because bucket lock A is first used in a
>>>> non-NMI context and then the same bucket lock is used in an NMI context, so
>>> Yes, I tested with lockdep too. If the lock is used in non-NMI context,
>>> we can't also use it in NMI context (only try_lock works fine there);
>>> otherwise lockdep prints the warning.
>>> * for the dead-lock case: we can use the
>>> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
>>> 2. or hash bucket address.
>>>
>>> * for lockdep warning, we should use in_nmi check with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lock_key
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>                                    struct bucket *b, u32 hash,
>>>                                    unsigned long *pflags)
>>> {
>>>         unsigned long flags;
>>>
>>>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>
>>>         preempt_disable();
>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>                 preempt_enable();
>>>                 return -EBUSY;
>>>         }
>>>
>>>         if (in_nmi()) {
>>>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>>                         return -EBUSY;
>> That is not right. You have to do the same step as above by decrementing
>> the percpu count and enable preemption. So you may want to put all these
>> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
> Yes, thanks Waiman, I should add the busy_out label.
>>>         } else {
>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>         }
>>>
>>>         *pflags = flags;
>>>         return 0;
>>> }
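To make the busy_out suggestion concrete, here is a minimal userspace
sketch of the control flow. It is not the kernel patch. Everything named
demo_* is hypothetical: a plain int stands in for the per-CPU map_locked
counter, a C11 atomic_flag stands in for b->raw_lock, and preemption and
IRQ handling are left out entirely. The only point it illustrates is that
every -EBUSY return goes through one label that undoes the counter
increment.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct demo_bucket {
	atomic_flag lock;		/* stands in for b->raw_lock */
};

/* Stands in for one CPU's map_locked counter (single-threaded model). */
static int demo_map_locked;

static void demo_bucket_init(struct demo_bucket *b)
{
	atomic_flag_clear(&b->lock);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool demo_trylock(struct demo_bucket *b)
{
	return !atomic_flag_test_and_set_explicit(&b->lock,
						  memory_order_acquire);
}

static int demo_lock_bucket(struct demo_bucket *b, bool nmi)
{
	/* Re-entry on the same "CPU": the counter is already non-zero. */
	if (++demo_map_locked != 1)
		goto busy_out;

	if (nmi) {
		/* In the NMI-like path, never spin: give up if contended. */
		if (!demo_trylock(b))
			goto busy_out;
	} else {
		/* Outside NMI, spin until the lock becomes available. */
		while (!demo_trylock(b))
			;
	}
	return 0;

busy_out:
	/* Single exit path that undoes the increment done above. */
	demo_map_locked--;
	return -EBUSY;
}

static void demo_unlock_bucket(struct demo_bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
	demo_map_locked--;
}
```

In the real function the busy_out path would also need the matching
preempt_enable(), as noted above.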
>> BTW, with that change, I believe you can actually remove all the percpu
>> map_locked count code.
> There are some cases where it is still needed. For example, we run
> bpf_prog A and B in task context on the same cpu.
> bpf_prog A
> update map X
> htab_lock_bucket
> raw_spin_lock_irqsave()
> lookup_elem_raw()
> // bpf prog B is attached on lookup_elem_raw()
> bpf prog B
> update map X again and update the element
> htab_lock_bucket()
> // dead-lock
> raw_spin_lock_irqsave()
I see, so nested locking is possible in this case. Besides using the
percpu map_locked count, another way is to have a cpumask associated with
each bucket lock and use each bit in the cpumask to control access, using
test_and_set_bit() for each cpu. That will allow more concurrency and
you can actually find out how contended the lock is. Anyway, it is just
a thought.
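
A rough userspace sketch of that idea follows. It is only an illustration
under stated assumptions, not kernel code: mask_bucket and the mask_*
helpers are hypothetical names, a single 64-bit atomic word stands in for
the cpumask (so at most 64 "cpus"), and atomic_fetch_or() plays the role
of test_and_set_bit(). The real bucket lock is omitted; a caller would
take it only after mask_bucket_enter() returns 0.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MASK_NR_CPUS 64	/* hypothetical limit: one bit per cpu in one word */

struct mask_bucket {
	_Atomic unsigned long long cpu_busy;	/* stands in for a cpumask */
};

static void mask_bucket_init(struct mask_bucket *b)
{
	atomic_init(&b->cpu_busy, 0);
}

/* Userspace analogue of test_and_set_bit(): returns the old bit value. */
static bool mask_test_and_set_bit(unsigned int nr,
				  _Atomic unsigned long long *word)
{
	unsigned long long mask = 1ULL << nr;

	return (atomic_fetch_or(word, mask) & mask) != 0;
}

/*
 * Mark this cpu as using the bucket. A set bit means the same cpu is
 * already inside this bucket, so a nested attempt is refused instead of
 * being allowed to spin on a lock the cpu itself holds.
 */
static int mask_bucket_enter(struct mask_bucket *b, unsigned int cpu)
{
	if (cpu >= MASK_NR_CPUS)
		return -EINVAL;
	if (mask_test_and_set_bit(cpu, &b->cpu_busy))
		return -EBUSY;
	return 0;	/* caller may now take the real bucket lock */
}

/* Clear this cpu's bit after the real bucket lock has been released. */
static void mask_bucket_exit(struct mask_bucket *b, unsigned int cpu)
{
	atomic_fetch_and(&b->cpu_busy, ~(1ULL << cpu));
}

/* Number of cpus currently holding or waiting for this bucket. */
static int mask_bucket_contenders(struct mask_bucket *b)
{
	unsigned long long v = atomic_load(&b->cpu_busy);
	int n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}
```

Because the bit is per bucket and per cpu, a nested update on a different
bucket from the same cpu is not rejected in this sketch, and counting the
set bits gives a rough view of contention.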
Cheers,
Longman