From: Hao Luo <haoluo@google.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf@vger.kernel.org, Song Liu <songliubraving@fb.com>,
	Hao Sun <sunhao.th@gmail.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Andrii Nakryiko <andrii@kernel.org>, Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, KP Singh <kpsingh@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Lorenz Bauer <oss@lmb.io>,
	houtao1@huawei.com
Subject: Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
Date: Mon, 22 Aug 2022 17:56:29 -0700
Message-ID: <CA+khW7iv+zX0XzC++i-F7QZju9QGufh6+SVN3JWp9WyJe2qhMg@mail.gmail.com>
In-Reply-To: <CA+khW7jgvZR8azSE3gEJvhT_psgEeHCdU3uWAQUkkKFLgh0a4Q@mail.gmail.com>

On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > Hi, Hou Tao
> > >
> > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > >>> Hi Hou Tao,
> > >>>
> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>
> > >>>> The per-cpu htab->map_locked counter is used to prevent concurrent
> > >>>> accesses from both NMI and non-NMI contexts. But since 74d862b682f51
> > >>>> ("sched: Make migrate_disable/enable() independent of RT"),
> > >>>> migrate_disable() is also preemptible in the !PREEMPT_RT case, so
> > >>>> map_locked now also unexpectedly rejects concurrent updates from
> > >>>> normal contexts (e.g. userspace processes), as shown below:
> > >>>>
> > >>>> process A                      process B
> > >>>>
> > >>>> htab_map_update_elem()
> > >>>>   htab_lock_bucket()
> > >>>>     migrate_disable()
> > >>>>     /* return 1 */
> > >>>>     __this_cpu_inc_return()
> > >>>>     /* preempted by B */
> > >>>>
> > >>>>                                htab_map_update_elem()
> > >>>>                                  /* the same bucket as A */
> > >>>>                                  htab_lock_bucket()
> > >>>>                                    migrate_disable()
> > >>>>                                    /* return 2, so lock fails */
> > >>>>                                    __this_cpu_inc_return()
> > >>>>                                    return -EBUSY
> > >>>>
> > >>>> One seemingly feasible fix is to use in_nmi() in htab_lock_bucket() and
> > >>>> only check the value of map_locked for NMI context. But that would
> > >>>> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> > >>>> re-entered through a non-tracing program (e.g. a fentry program).
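
(For illustration only, not part of the patch: a minimal sketch of that
rejected in_nmi()-gated alternative, built on the pre-patch
htab_lock_bucket() shown in the diff further below. The per-cpu counter
is still incremented unconditionally so htab_unlock_bucket() stays
balanced; only the -EBUSY rejection is limited to NMI context.)

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	migrate_disable();
	/* Only NMI-context re-entry is rejected here; a non-tracing
	 * (e.g. fentry) program re-entering on the same bucket would
	 * pass this check and then dead-lock on the bucket lock below.
	 */
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1) &&
	    in_nmi()) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		migrate_enable();
		return -EBUSY;
	}

	if (htab_use_raw_lock(htab))
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}
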
> > >>>>
> > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > >>>> the bucket lock is a sleepable spin-lock that must not be taken with
> > >>>> preemption disabled, so keep using migrate_disable() for the spin-lock case.
> > >>>>
> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >>>> ---
> > >>> IIUC, this patch enlarges the scope of preemption disable to cover the
> > >>> increment of map_locked. But I don't think the change is meaningful.
> > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> > >> RT"), preemption was disabled before increasing map_locked in the !PREEMPT_RT
> > >> case, so I don't think the change is meaningless.
> > >>> This patch only affects the case where the raw lock is used. In that
> > >>> case, irqs are disabled for the b->raw_lock-protected critical section.
> > >>> A raw spin lock itself doesn't sleep on either RT or non-RT. So my
> > >>> understanding of this patch is that it just makes sure preemption
> > >>> doesn't happen across the __this_cpu_inc_return. But the window is
> > >>> so small that it should be really unlikely to hit.
> > >> No, it can be easily reproduced by running multiple htab update processes on the
> > >> same CPU. I will add a selftest to demonstrate that.
> > > Can you clarify what you demonstrate?
> > First, please enable CONFIG_PREEMPT for the running kernel, then run the
> > test program shown below.
> >
>
> Ah, a fully preemptive kernel. It's worth mentioning that in the commit
> message. Then it seems promoting migrate_disable to preempt_disable
> may be the best way to solve the problem you described.
>
> > # sudo taskset -c 2 ./update.bin
> > thread nr 2
> > wait for error
> > update error -16
> > all threads exit
> >
> > If there is no "update error -16", you can try creating more map update
> > threads. For example, running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static volatile bool stop;
> > static int fd;
> >
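> > /* Repeatedly update key 0 until an update fails (e.g. with -EBUSY),
> >  * then signal the other threads to stop.
> >  */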
> > static void *update_fn(void *arg)
> > {
> >         while (!stop) {
> >                 unsigned int key = 0, value = 1;
> >                 int err;
> >
> >                 err = bpf_map_update_elem(fd, &key, &value, 0);
> >                 if (err) {
> >                         printf("update error %d\n", err);
> >                         stop = true;
> >                         break;
> >                 }
> >         }
> >
> >         return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> >         LIBBPF_OPTS(bpf_map_create_opts, opts);
> >         unsigned int i, nr;
> >         pthread_t *tids;
> >
> >         nr = 2;
> >         if (argc > 1)
> >                 nr = atoi(argv[1]);
> >         printf("thread nr %u\n", nr);
> >
> >         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
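> >         /* 1-entry hash map: every update contends on the same bucket */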
> >         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> >         if (fd < 0) {
> >                 printf("create map error %d\n", fd);
> >                 return 1;
> >         }
> >
> >         tids = malloc(nr * sizeof(*tids));
> >         for (i = 0; i < nr; i++)
> >                 pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> >         printf("wait for error\n");
> >         for (i = 0; i < nr; i++)
> >                 pthread_join(tids[i], NULL);
> >
> >         printf("all threads exit\n");
> >
> >         free(tids);
> >         close(fd);
> >
> >         return 0;
> > }
> >
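
(Not from the original mail: assuming the libbpf development headers
and library are installed, update.bin can typically be built with
something along these lines, where /path/to/libbpf is a placeholder
for your libbpf checkout or install; linking against a static libbpf
may additionally need -lelf -lz.)

  gcc -o update.bin update.c -I/path/to/libbpf/include/bpf \
      -L/path/to/libbpf/lib -lbpf -lpthread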

Tao, thanks very much for the test. I played with it a bit and I can
confirm that map_update failures are seen under CONFIG_PREEMPT. The
failures are not present under CONFIG_PREEMPT_NONE or
CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
thinking of, but they didn't work. It looks like Hou Tao's idea,
promoting migrate_disable to preempt_disable, is probably the best we
can do for the non-RT case. So

Reviewed-by: Hao Luo <haoluo@google.com>

But I am not sure whether we want to get rid of these preemption-caused
map update failures on preemptive kernels, or whether there are better
solutions to address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.

> > >
> > > Here is my theory, but please correct me if I'm wrong; I haven't
> > > tested it yet. In non-RT, I doubt preemption is likely to happen
> > > after migrate_disable. That is because very soon after
> > > migrate_disable, we enter the critical section of b->raw_lock with
> > > irqs disabled. In RT, preemption can certainly happen while acquiring
> > > b->lock, but that is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
> > >
> > >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> > >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> > >>>>
> > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > >>>> index 6c530a5e560a..ad09da139589 100644
> > >>>> --- a/kernel/bpf/hashtab.c
> > >>>> +++ b/kernel/bpf/hashtab.c
> > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > >>>>                                    unsigned long *pflags)
> > >>>>  {
> > >>>>         unsigned long flags;
> > >>>> +       bool use_raw_lock;
> > >>>>
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>>
> > >>>> -       migrate_disable();
> > >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_disable();
> > >>>> +       else
> > >>>> +               migrate_disable();
> > >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -               migrate_enable();
> > >>>> +               if (use_raw_lock)
> > >>>> +                       preempt_enable();
> > >>>> +               else
> > >>>> +                       migrate_enable();
> > >>>>                 return -EBUSY;
> > >>>>         }
> > >>>>
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_lock_irqsave(&b->lock, flags);
> > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> > >>>>                                       struct bucket *b, u32 hash,
> > >>>>                                       unsigned long flags)
> > >>>>  {
> > >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> > >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -       migrate_enable();
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_enable();
> > >>>> +       else
> > >>>> +               migrate_enable();
> > >>>>  }
> > >>>>
> > >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> > >>>> --
> > >>>> 2.29.2
> > >>>>
> > >>> .
> > > .
> >

