From: Hao Luo <haoluo@google.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf@vger.kernel.org, Song Liu <songliubraving@fb.com>,
	Hao Sun <sunhao.th@gmail.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Andrii Nakryiko <andrii@kernel.org>, Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, KP Singh <kpsingh@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Lorenz Bauer <oss@lmb.io>,
	houtao1@huawei.com
Subject: Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
Date: Mon, 22 Aug 2022 17:56:29 -0700
Message-ID: <CA+khW7iv+zX0XzC++i-F7QZju9QGufh6+SVN3JWp9WyJe2qhMg@mail.gmail.com>
In-Reply-To: <CA+khW7jgvZR8azSE3gEJvhT_psgEeHCdU3uWAQUkkKFLgh0a4Q@mail.gmail.com>

On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > Hi, Hou Tao
> > >
> > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > >>> Hi Hou Tao,
> > >>>
> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>
> > >>>> The per-cpu htab->map_locked counter is used to prevent concurrent
> > >>>> accesses from both NMI and non-NMI contexts. But since 74d862b682f51
> > >>>> ("sched: Make migrate_disable/enable() independent of RT"),
> > >>>> migrate_disable() is also preemptible in the !PREEMPT_RT case, so
> > >>>> map_locked now also unexpectedly rejects concurrent updates from
> > >>>> normal contexts (e.g. userspace processes), as shown below:
> > >>>>
> > >>>> process A                      process B
> > >>>>
> > >>>> htab_map_update_elem()
> > >>>>   htab_lock_bucket()
> > >>>>     migrate_disable()
> > >>>>     /* return 1 */
> > >>>>     __this_cpu_inc_return()
> > >>>>     /* preempted by B */
> > >>>>
> > >>>>                                htab_map_update_elem()
> > >>>>                                  /* the same bucket as A */
> > >>>>                                  htab_lock_bucket()
> > >>>>                                    migrate_disable()
> > >>>>                                    /* return 2, so lock fails */
> > >>>>                                    __this_cpu_inc_return()
> > >>>>                                    return -EBUSY
> > >>>>
> > >>>> One seemingly feasible fix is to use in_nmi() in htab_lock_bucket() and
> > >>>> only check the value of map_locked for NMI context. But that would
> > >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> > >>>> re-entered through a non-tracing program (e.g. a fentry program).
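
(For illustration only, not part of the patch: a minimal sketch of that
rejected in_nmi()-gated alternative, built on the pre-patch
htab_lock_bucket() shown in the diff further below. The per-cpu counter
is still incremented unconditionally so htab_unlock_bucket() stays
balanced; only the -EBUSY rejection is limited to NMI context.)

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	migrate_disable();
	/* Only NMI-context re-entry is rejected here; a non-tracing
	 * (e.g. fentry) program re-entering on the same bucket would
	 * pass this check and then dead-lock on the bucket lock below.
	 */
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1) &&
	    in_nmi()) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		migrate_enable();
		return -EBUSY;
	}

	if (htab_use_raw_lock(htab))
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}
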
> > >>>>
> > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > >>>> the bucket lock is a sleepable spin-lock that must not be taken with
> > >>>> preemption disabled, so keep using migrate_disable() for the spin-lock case.
> > >>>>
> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >>>> ---
> > >>> IIUC, this patch enlarges the scope of preemption disable to cover the
> > >>> increment of map_locked. But I don't think the change is meaningful.
> > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> > >> RT"), preemption was disabled before increasing map_locked in the !PREEMPT_RT
> > >> case, so I don't think the change is meaningless.
> > >>> This patch only affects the case where the raw lock is used. In that
> > >>> case, irqs are disabled for the b->raw_lock-protected critical section.
> > >>> A raw spin lock itself doesn't sleep on either RT or non-RT. So my
> > >>> understanding of this patch is that it just makes sure preemption
> > >>> doesn't happen across the __this_cpu_inc_return. But the window is
> > >>> so small that it should be really unlikely to hit.
> > >> No, it can be easily reproduced by running multiple htab update processes on the
> > >> same CPU. I will add a selftest to demonstrate that.
> > > Can you clarify what you demonstrate?
> > First, please enable CONFIG_PREEMPT for the running kernel, then run the
> > test program shown below.
> >
>
> Ah, a fully preemptive kernel. It's worth mentioning that in the commit
> message. Then it seems promoting migrate_disable to preempt_disable
> may be the best way to solve the problem you described.
>
> > # sudo taskset -c 2 ./update.bin
> > thread nr 2
> > wait for error
> > update error -16
> > all threads exit
> >
> > If there is no "update error -16", you can try creating more map update
> > threads. For example, running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static volatile bool stop;
> > static int fd;
> >
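> > /* Repeatedly update key 0 until an update fails (e.g. with -EBUSY),
> >  * then signal the other threads to stop.
> >  */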
> > static void *update_fn(void *arg)
> > {
> >         while (!stop) {
> >                 unsigned int key = 0, value = 1;
> >                 int err;
> >
> >                 err = bpf_map_update_elem(fd, &key, &value, 0);
> >                 if (err) {
> >                         printf("update error %d\n", err);
> >                         stop = true;
> >                         break;
> >                 }
> >         }
> >
> >         return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> >         LIBBPF_OPTS(bpf_map_create_opts, opts);
> >         unsigned int i, nr;
> >         pthread_t *tids;
> >
> >         nr = 2;
> >         if (argc > 1)
> >                 nr = atoi(argv[1]);
> >         printf("thread nr %u\n", nr);
> >
> >         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
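> >         /* 1-entry hash map: every update contends on the same bucket */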
> >         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> >         if (fd < 0) {
> >                 printf("create map error %d\n", fd);
> >                 return 1;
> >         }
> >
> >         tids = malloc(nr * sizeof(*tids));
> >         for (i = 0; i < nr; i++)
> >                 pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> >         printf("wait for error\n");
> >         for (i = 0; i < nr; i++)
> >                 pthread_join(tids[i], NULL);
> >
> >         printf("all threads exit\n");
> >
> >         free(tids);
> >         close(fd);
> >
> >         return 0;
> > }
> >
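
(Not from the original mail: assuming the libbpf development headers
and library are installed, update.bin can typically be built with
something along these lines, where /path/to/libbpf is a placeholder
for your libbpf checkout or install; linking against a static libbpf
may additionally need -lelf -lz.)

  gcc -o update.bin update.c -I/path/to/libbpf/include/bpf \
      -L/path/to/libbpf/lib -lbpf -lpthread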

Tao, thanks very much for the test. I played with it a bit and I can
confirm that map_update failures are seen under CONFIG_PREEMPT. The
failures are not present under CONFIG_PREEMPT_NONE or
CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
thinking of, but they didn't work. It looks like Hou Tao's idea,
promoting migrate_disable to preempt_disable, is probably the best we
can do for the non-RT case. So

Reviewed-by: Hao Luo <haoluo@google.com>

But I am not sure whether we want to get rid of these preemption-caused
map update failures on preemptive kernels, or whether there are better
solutions to address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.

> > >
> > > Here is my theory, but please correct me if I'm wrong; I haven't
> > > tested it yet. In non-RT, I doubt preemption is likely to happen
> > > after migrate_disable. That is because very soon after
> > > migrate_disable, we enter the critical section of b->raw_lock with
> > > irqs disabled. In RT, preemption can certainly happen while acquiring
> > > b->lock, but that is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
> > >
> > >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> > >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> > >>>>
> > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > >>>> index 6c530a5e560a..ad09da139589 100644
> > >>>> --- a/kernel/bpf/hashtab.c
> > >>>> +++ b/kernel/bpf/hashtab.c
> > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > >>>>                                    unsigned long *pflags)
> > >>>>  {
> > >>>>         unsigned long flags;
> > >>>> +       bool use_raw_lock;
> > >>>>
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>>
> > >>>> -       migrate_disable();
> > >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_disable();
> > >>>> +       else
> > >>>> +               migrate_disable();
> > >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -               migrate_enable();
> > >>>> +               if (use_raw_lock)
> > >>>> +                       preempt_enable();
> > >>>> +               else
> > >>>> +                       migrate_enable();
> > >>>>                 return -EBUSY;
> > >>>>         }
> > >>>>
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_lock_irqsave(&b->lock, flags);
> > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> > >>>>                                       struct bucket *b, u32 hash,
> > >>>>                                       unsigned long flags)
> > >>>>  {
> > >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> > >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -       migrate_enable();
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_enable();
> > >>>> +       else
> > >>>> +               migrate_enable();
> > >>>>  }
> > >>>>
> > >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> > >>>> --
> > >>>> 2.29.2
> > >>>>
> > >>> .
> > > .
> >

