* [PATCH 0/3] fixes for concurrent htab updates
@ 2022-08-21  3:32 Hou Tao
  2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-21  3:32 UTC (permalink / raw)
  To: bpf, Song Liu
  Cc: Hao Sun, Sebastian Andrzej Siewior, Andrii Nakryiko,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset aims to fix the issues found while investigating the
syzkaller problem reported in [0]. It turns out that concurrent updates
of the same hash table, which should normally be allowed, are rejected
unexpectedly, as described in patch 1.

Patch 1 uses preempt_disable() to fix the problem for the !PREEMPT_RT case.

Patch 2 introduces an extra bpf_map_busy bit in task_struct to
detect re-entrancy of htab_lock_bucket() while still allowing map
updates that merely run concurrently due to preemption in the
PREEMPT_RT case. It is coarse-grained compared with map_locked in the
!PREEMPT_RT case, because the re-entrancy is still rejected even when
two different maps are manipulated. But considering that Alexei is
working on a "BPF specific memory allocator" [1], and that the
!htab_use_raw_lock() case can be removed once that patchset lands, I
think it may be fine, and I hope to get some more feedback about the
proposed fix in patch 2.
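
The check added by patch 2 boils down to the following (a simplified
sketch of the patch below, with the #ifdef guards omitted). Since it is
a single per-task bit rather than a per-map counter, re-entrancy on a
different map is rejected as well:

  /* sketch of the re-entrancy check added by patch 2 (PREEMPT_RT only);
   * bpf_map_busy is the new bit in task_struct
   */
  static inline int bpf_test_and_set_map_busy(void)
  {
          if (current->bpf_map_busy)
                  return 1;       /* htab_lock_bucket() re-entered */
          current->bpf_map_busy = 1;
          return 0;
  }

  /* !htab_use_raw_lock() path of htab_lock_bucket() */
  if (bpf_test_and_set_map_busy())
          return -EBUSY;
  spin_lock_irqsave(&b->lock, flags);
  /* the bit is cleared in htab_unlock_bucket() after the unlock */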

Patch 3 just fixes the out-of-bounds memory read problem reported in [0].
Once patch 1 and patch 2 are merged, htab_lock_bucket() will always
succeed for userspace processes, but it is better to handle the failure
gracefully anyway.

Selftests will be added after getting more feedback about the patchset.
Comments are always welcome.

Regards,
Tao

[0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/

Hou Tao (3):
  bpf: Disable preemption when increasing per-cpu map_locked
  bpf: Allow normally concurrent map updates for !htab_use_raw_lock()
    case
  bpf: Propagate error from htab_lock_bucket() to userspace

 include/linux/sched.h |  3 +++
 kernel/bpf/hashtab.c  | 59 ++++++++++++++++++++++++++++++-------------
 2 files changed, 44 insertions(+), 18 deletions(-)

-- 
2.29.2



* [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-21  3:32 [PATCH 0/3] fixes for concurrent htab updates Hou Tao
@ 2022-08-21  3:32 ` Hou Tao
  2022-08-21 16:42   ` Hao Luo
  2022-08-22  8:13   ` Sebastian Andrzej Siewior
  2022-08-21  3:32 ` [PATCH 2/3] bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case Hou Tao
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-21  3:32 UTC (permalink / raw)
  To: bpf, Song Liu
  Cc: Hao Sun, Sebastian Andrzej Siewior, Andrii Nakryiko,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

From: Hou Tao <houtao1@huawei.com>

The per-cpu htab->map_locked is used to prohibit concurrent accesses
from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
Make migrate_disable/enable() independent of RT"), a task running
under migrate_disable() is preemptible in the !PREEMPT_RT case as
well, so now map_locked also unexpectedly disallows concurrent updates
from normal contexts (e.g. userspace processes) as shown below:

process A                      process B

htab_map_update_elem()
  htab_lock_bucket()
    migrate_disable()
    /* return 1 */
    __this_cpu_inc_return()
    /* preempted by B */

                               htab_map_update_elem()
                                 /* the same bucket as A */
                                 htab_lock_bucket()
                                   migrate_disable()
                                   /* return 2, so lock fails */
                                   __this_cpu_inc_return()
                                   return -EBUSY

A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
only checking the value of map_locked in NMI context. But it would
re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
re-entered through a non-tracing program (e.g. an fentry program).

So fix it by using preempt_disable() instead of migrate_disable() when
increasing htab->map_locked. However, when htab_use_raw_lock() is false,
the bucket lock will be a sleepable spin-lock which must not be acquired
with preemption disabled, so still use migrate_disable() for the
spin-lock case.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6c530a5e560a..ad09da139589 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 				   unsigned long *pflags)
 {
 	unsigned long flags;
+	bool use_raw_lock;
 
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
 
-	migrate_disable();
+	use_raw_lock = htab_use_raw_lock(htab);
+	if (use_raw_lock)
+		preempt_disable();
+	else
+		migrate_disable();
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
 		__this_cpu_dec(*(htab->map_locked[hash]));
-		migrate_enable();
+		if (use_raw_lock)
+			preempt_enable();
+		else
+			migrate_enable();
 		return -EBUSY;
 	}
 
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_lock_irqsave(&b->raw_lock, flags);
 	else
 		spin_lock_irqsave(&b->lock, flags);
@@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      struct bucket *b, u32 hash,
 				      unsigned long flags)
 {
+	bool use_raw_lock = htab_use_raw_lock(htab);
+
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
 	else
 		spin_unlock_irqrestore(&b->lock, flags);
 	__this_cpu_dec(*(htab->map_locked[hash]));
-	migrate_enable();
+	if (use_raw_lock)
+		preempt_enable();
+	else
+		migrate_enable();
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
-- 
2.29.2



* [PATCH 2/3] bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case
  2022-08-21  3:32 [PATCH 0/3] fixes for concurrent htab updates Hou Tao
  2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
@ 2022-08-21  3:32 ` Hou Tao
  2022-08-21  3:32 ` [PATCH 3/3] bpf: Propagate error from htab_lock_bucket() to userspace Hou Tao
  2022-08-22  1:21 ` [PATCH 0/3] fixes for concurrent htab updates Hou Tao
  3 siblings, 0 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-21  3:32 UTC (permalink / raw)
  To: bpf, Song Liu
  Cc: Hao Sun, Sebastian Andrzej Siewior, Andrii Nakryiko,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

From: Hou Tao <houtao1@huawei.com>

For the htab_use_raw_lock() == true case, concurrent map updates from
normal contexts are now allowed by using preempt_disable() instead of
migrate_disable() before increasing htab->map_locked. However, the
false case cannot use preempt_disable(), because a sleepable spin-lock
is acquired afterwards.

So introduce a bpf_map_busy bit in task_struct. Set it before acquiring
the bucket lock and clear it after releasing the lock, so that if
htab_lock_bucket() is re-entered the re-entrancy is rejected, while
updaters that merely preempt each other can still run concurrently.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/sched.h |  3 +++
 kernel/bpf/hashtab.c  | 61 ++++++++++++++++++++++++-------------------
 2 files changed, 37 insertions(+), 27 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 51dc1e89d43f..55667f46e459 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -944,6 +944,9 @@ struct task_struct {
 #ifdef	CONFIG_CPU_SUP_INTEL
 	unsigned			reported_split_lock:1;
 #endif
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_BPF_SYSCALL)
+	unsigned			bpf_map_busy:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index ad09da139589..3ef7a853c737 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -138,6 +138,23 @@ static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
 	return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab));
 }
 
+static inline void bpf_clear_map_busy(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+	current->bpf_map_busy = 0;
+#endif
+}
+
+static inline int bpf_test_and_set_map_busy(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+	if (current->bpf_map_busy)
+		return 1;
+	current->bpf_map_busy = 1;
+#endif
+	return 0;
+}
+
 static void htab_init_buckets(struct bpf_htab *htab)
 {
 	unsigned int i;
@@ -162,28 +179,21 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 				   unsigned long *pflags)
 {
 	unsigned long flags;
-	bool use_raw_lock;
 
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
-
-	use_raw_lock = htab_use_raw_lock(htab);
-	if (use_raw_lock)
+	if (htab_use_raw_lock(htab)) {
+		hash = hash & HASHTAB_MAP_LOCK_MASK;
 		preempt_disable();
-	else
-		migrate_disable();
-	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
-		__this_cpu_dec(*(htab->map_locked[hash]));
-		if (use_raw_lock)
+		if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
+			__this_cpu_dec(*(htab->map_locked[hash]));
 			preempt_enable();
-		else
-			migrate_enable();
-		return -EBUSY;
-	}
-
-	if (use_raw_lock)
+			return -EBUSY;
+		}
 		raw_spin_lock_irqsave(&b->raw_lock, flags);
-	else
+	} else {
+		if (bpf_test_and_set_map_busy())
+			return -EBUSY;
 		spin_lock_irqsave(&b->lock, flags);
+	}
 	*pflags = flags;
 
 	return 0;
@@ -193,18 +203,15 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      struct bucket *b, u32 hash,
 				      unsigned long flags)
 {
-	bool use_raw_lock = htab_use_raw_lock(htab);
-
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
-	if (use_raw_lock)
+	if (htab_use_raw_lock(htab)) {
+		hash = hash & HASHTAB_MAP_LOCK_MASK;
 		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
-	else
-		spin_unlock_irqrestore(&b->lock, flags);
-	__this_cpu_dec(*(htab->map_locked[hash]));
-	if (use_raw_lock)
+		__this_cpu_dec(*(htab->map_locked[hash]));
 		preempt_enable();
-	else
-		migrate_enable();
+	} else {
+		spin_unlock_irqrestore(&b->lock, flags);
+		bpf_clear_map_busy();
+	}
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
-- 
2.29.2



* [PATCH 3/3] bpf: Propagate error from htab_lock_bucket() to userspace
  2022-08-21  3:32 [PATCH 0/3] fixes for concurrent htab updates Hou Tao
  2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
  2022-08-21  3:32 ` [PATCH 2/3] bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case Hou Tao
@ 2022-08-21  3:32 ` Hou Tao
  2022-08-22  1:21 ` [PATCH 0/3] fixes for concurrent htab updates Hou Tao
  3 siblings, 0 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-21  3:32 UTC (permalink / raw)
  To: bpf, Song Liu
  Cc: Hao Sun, Sebastian Andrzej Siewior, Andrii Nakryiko,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

From: Hou Tao <houtao1@huawei.com>

In __htab_map_lookup_and_delete_batch(), if htab_lock_bucket() returns
-EBUSY, it will go to the next bucket. Going to the next bucket may not
only silently skip the elements in the current bucket, but may also
incur an out-of-bounds memory access or expose kernel memory to
userspace if the current bucket_cnt is greater than bucket_size or
non-zero.

Fix it by stopping the batch operation and returning -EBUSY when
htab_lock_bucket() fails; the application can then retry or skip the
busy batch as needed.
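
For example, a userspace user of the batch API could react to the new
failure roughly as follows (illustrative only, not part of this patch;
the buffer names and BATCH_SIZE are made up, and depending on the libbpf
version the failure shows up either as a -EBUSY return value or as -1
with errno set to EBUSY):

  __u32 count = BATCH_SIZE;
  int err;

  err = bpf_map_lookup_and_delete_batch(map_fd, &in_batch, &out_batch,
                                        keys, values, &count, NULL);
  if (err && errno == EBUSY) {
          /* The bucket was busy: retry this batch later or give up,
           * depending on what the application needs.
           */
  }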

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3ef7a853c737..ffd39048e6da 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1711,8 +1711,11 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	/* do not grab the lock unless need it (bucket_cnt > 0). */
 	if (locked) {
 		ret = htab_lock_bucket(htab, b, batch, &flags);
-		if (ret)
-			goto next_batch;
+		if (ret) {
+			rcu_read_unlock();
+			bpf_enable_instrumentation();
+			goto after_loop;
+		}
 	}
 
 	bucket_cnt = 0;
-- 
2.29.2



* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
@ 2022-08-21 16:42   ` Hao Luo
  2022-08-22  1:27     ` Hou Tao
  2022-08-22  8:13   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-08-21 16:42 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

Hi Hou Tao,

On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> Make migrate_disable/enable() independent of RT"), migrations_disable()
> is also preemptible under !PREEMPT_RT case, so now map_locked also
> disallows concurrent updates from normal contexts (e.g. userspace
> processes) unexpectedly as shown below:
>
> process A                      process B
>
> htab_map_update_elem()
>   htab_lock_bucket()
>     migrate_disable()
>     /* return 1 */
>     __this_cpu_inc_return()
>     /* preempted by B */
>
>                                htab_map_update_elem()
>                                  /* the same bucket as A */
>                                  htab_lock_bucket()
>                                    migrate_disable()
>                                    /* return 2, so lock fails */
>                                    __this_cpu_inc_return()
>                                    return -EBUSY
>
> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> only checking the value of map_locked for nmi context. But it will
> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> through non-tracing program (e.g. fentry program).
>
> So fixing it by using disable_preempt() instead of migrate_disable() when
> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> so still use migrate_disable() for spin-lock case.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---

IIUC, this patch enlarges the scope of preemption disable to cover inc
map_locked. But I don't think the change is meaningful.

This patch only affects the case when raw lock is used. In the case of
raw lock, irq is disabled for b->raw_lock protected critical section.
A raw spin lock itself doesn't block in both RT and non-RT. So, my
understanding about this patch is, it just makes sure preemption
doesn't happen on the exact __this_cpu_inc_return. But the window is
so small that it should be really unlikely to happen.

>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6c530a5e560a..ad09da139589 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    unsigned long *pflags)
>  {
>         unsigned long flags;
> +       bool use_raw_lock;
>
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>
> -       migrate_disable();
> +       use_raw_lock = htab_use_raw_lock(htab);
> +       if (use_raw_lock)
> +               preempt_disable();
> +       else
> +               migrate_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
> -               migrate_enable();
> +               if (use_raw_lock)
> +                       preempt_enable();
> +               else
> +                       migrate_enable();
>                 return -EBUSY;
>         }
>
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         else
>                 spin_lock_irqsave(&b->lock, flags);
> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>                                       struct bucket *b, u32 hash,
>                                       unsigned long flags)
>  {
> +       bool use_raw_lock = htab_use_raw_lock(htab);
> +
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>         else
>                 spin_unlock_irqrestore(&b->lock, flags);
>         __this_cpu_dec(*(htab->map_locked[hash]));
> -       migrate_enable();
> +       if (use_raw_lock)
> +               preempt_enable();
> +       else
> +               migrate_enable();
>  }
>
>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> --
> 2.29.2
>


* Re: [PATCH 0/3] fixes for concurrent htab updates
  2022-08-21  3:32 [PATCH 0/3] fixes for concurrent htab updates Hou Tao
                   ` (2 preceding siblings ...)
  2022-08-21  3:32 ` [PATCH 3/3] bpf: Propagate error from htab_lock_bucket() to userspace Hou Tao
@ 2022-08-22  1:21 ` Hou Tao
  3 siblings, 0 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-22  1:21 UTC (permalink / raw)
  To: Hou Tao, bpf, Song Liu
  Cc: Hao Sun, Sebastian Andrzej Siewior, Andrii Nakryiko,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, KP Singh, David S . Miller, Jakub Kicinski,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, John Fastabend,
	Lorenz Bauer

Oops, I forgot to add the bpf prefix for the patchset.

On 8/21/2022 11:32 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The patchset aims to fix the issues found during investigating the
> syzkaller problem reported in [0]. It seems that the normally concurrent
> updates in the same hash-table are disallowed as shown in patch 1.
>
> Patch 1 uses preempt_disable() to fix the problem to !PREEMPT_RT case.
>
> Patch 2 introduces an extra bpf_map_busy bit in task_struct to
> detect the re-entrancy of htab_lock_bucket() and allow concurrent map
> updates due to preemption in PREEMPT_RT case. It is coarse-grained
> compared with map_locked in !PREEMPT_RT case, because if two different
> maps are manipulated the re-entrancy is still be rejected. But
> considering Alexei is working on "BPF specific memory allocator" [1],
> and the !htab_use_raw_lock() case can be removed after the patchset is
> landed, so I think may be it is fine and hope to get some more feedback
> about the proposed fix in patch 2.
>
> Patch 3 just fixes the out-of-bound memory read problem reported in [0].
> Once patch 1 & patch 2 are merged, htab_lock_bucket() will always
> succeed for userspace process, but it is better to handle it gracefully.
>
> Selftests will be added after getting more feedback about the patchset
> and comments are always welcome.
>
> Regards,
> Tao
>
> [0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/
> [1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/
>
> Hou Tao (3):
>   bpf: Disable preemption when increasing per-cpu map_locked
>   bpf: Allow normally concurrent map updates for !htab_use_raw_lock()
>     case
>   bpf: Propagate error from htab_lock_bucket() to userspace
>
>  include/linux/sched.h |  3 +++
>  kernel/bpf/hashtab.c  | 59 ++++++++++++++++++++++++++++++-------------
>  2 files changed, 44 insertions(+), 18 deletions(-)
>



* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-21 16:42   ` Hao Luo
@ 2022-08-22  1:27     ` Hou Tao
  2022-08-22  3:21       ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Hou Tao @ 2022-08-22  1:27 UTC (permalink / raw)
  To: Hao Luo
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

Hi,

On 8/22/2022 12:42 AM, Hao Luo wrote:
> Hi Hou Tao,
>
> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>> Make migrate_disable/enable() independent of RT"), migrations_disable()
>> is also preemptible under !PREEMPT_RT case, so now map_locked also
>> disallows concurrent updates from normal contexts (e.g. userspace
>> processes) unexpectedly as shown below:
>>
>> process A                      process B
>>
>> htab_map_update_elem()
>>   htab_lock_bucket()
>>     migrate_disable()
>>     /* return 1 */
>>     __this_cpu_inc_return()
>>     /* preempted by B */
>>
>>                                htab_map_update_elem()
>>                                  /* the same bucket as A */
>>                                  htab_lock_bucket()
>>                                    migrate_disable()
>>                                    /* return 2, so lock fails */
>>                                    __this_cpu_inc_return()
>>                                    return -EBUSY
>>
>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>> only checking the value of map_locked for nmi context. But it will
>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>> through non-tracing program (e.g. fentry program).
>>
>> So fixing it by using disable_preempt() instead of migrate_disable() when
>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>> so still use migrate_disable() for spin-lock case.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
> IIUC, this patch enlarges the scope of preemption disable to cover inc
> map_locked. But I don't think the change is meaningful.
Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
RT"), preemption was disabled before increasing map_locked in the !PREEMPT_RT
case, so I don't think that the change is meaningless.
>
> This patch only affects the case when raw lock is used. In the case of
> raw lock, irq is disabled for b->raw_lock protected critical section.
> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> understanding about this patch is, it just makes sure preemption
> doesn't happen on the exact __this_cpu_inc_return. But the window is
> so small that it should be really unlikely to happen.
No, it can be easily reproduced by running multiple htab update processes on the
same CPU. I will add a selftest to demonstrate that.
>
>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 6c530a5e560a..ad09da139589 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>                                    unsigned long *pflags)
>>  {
>>         unsigned long flags;
>> +       bool use_raw_lock;
>>
>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>
>> -       migrate_disable();
>> +       use_raw_lock = htab_use_raw_lock(htab);
>> +       if (use_raw_lock)
>> +               preempt_disable();
>> +       else
>> +               migrate_disable();
>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>> -               migrate_enable();
>> +               if (use_raw_lock)
>> +                       preempt_enable();
>> +               else
>> +                       migrate_enable();
>>                 return -EBUSY;
>>         }
>>
>> -       if (htab_use_raw_lock(htab))
>> +       if (use_raw_lock)
>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>         else
>>                 spin_lock_irqsave(&b->lock, flags);
>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>                                       struct bucket *b, u32 hash,
>>                                       unsigned long flags)
>>  {
>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>> +
>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>> -       if (htab_use_raw_lock(htab))
>> +       if (use_raw_lock)
>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>         else
>>                 spin_unlock_irqrestore(&b->lock, flags);
>>         __this_cpu_dec(*(htab->map_locked[hash]));
>> -       migrate_enable();
>> +       if (use_raw_lock)
>> +               preempt_enable();
>> +       else
>> +               migrate_enable();
>>  }
>>
>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>> --
>> 2.29.2
>>
> .



* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22  1:27     ` Hou Tao
@ 2022-08-22  3:21       ` Hao Luo
  2022-08-22 12:07         ` Hou Tao
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-08-22  3:21 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

Hi, Hou Tao

On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > Hi Hou Tao,
> >
> > On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> >> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >> Make migrate_disable/enable() independent of RT"), migrations_disable()
> >> is also preemptible under !PREEMPT_RT case, so now map_locked also
> >> disallows concurrent updates from normal contexts (e.g. userspace
> >> processes) unexpectedly as shown below:
> >>
> >> process A                      process B
> >>
> >> htab_map_update_elem()
> >>   htab_lock_bucket()
> >>     migrate_disable()
> >>     /* return 1 */
> >>     __this_cpu_inc_return()
> >>     /* preempted by B */
> >>
> >>                                htab_map_update_elem()
> >>                                  /* the same bucket as A */
> >>                                  htab_lock_bucket()
> >>                                    migrate_disable()
> >>                                    /* return 2, so lock fails */
> >>                                    __this_cpu_inc_return()
> >>                                    return -EBUSY
> >>
> >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >> only checking the value of map_locked for nmi context. But it will
> >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> >> through non-tracing program (e.g. fentry program).
> >>
> >> So fixing it by using disable_preempt() instead of migrate_disable() when
> >> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> >> so still use migrate_disable() for spin-lock case.
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> > IIUC, this patch enlarges the scope of preemption disable to cover inc
> > map_locked. But I don't think the change is meaningful.
> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> case, so I don't think that the change is meaningless.
> >
> > This patch only affects the case when raw lock is used. In the case of
> > raw lock, irq is disabled for b->raw_lock protected critical section.
> > A raw spin lock itself doesn't block in both RT and non-RT. So, my
> > understanding about this patch is, it just makes sure preemption
> > doesn't happen on the exact __this_cpu_inc_return. But the window is
> > so small that it should be really unlikely to happen.
> No, it can be easily reproduced by running multiple htab update processes in the
> same CPU. Will add selftest to demonstrate that.

Can you clarify what you demonstrate?

Here is my theory, but please correct me if I'm wrong, I haven't
tested yet. In non-RT, I doubt preemptions are likely to happen after
migrate_disable. That is because very soon after migrate_disable, we
enter the critical section of b->raw_lock with irq disabled. In RT,
preemptions can happen on acquiring b->lock, that is certainly
possible, but this is the !use_raw_lock path, which isn't side-stepped
by this patch.

> >
> >>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>  1 file changed, 18 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >> index 6c530a5e560a..ad09da139589 100644
> >> --- a/kernel/bpf/hashtab.c
> >> +++ b/kernel/bpf/hashtab.c
> >> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>                                    unsigned long *pflags)
> >>  {
> >>         unsigned long flags;
> >> +       bool use_raw_lock;
> >>
> >>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>
> >> -       migrate_disable();
> >> +       use_raw_lock = htab_use_raw_lock(htab);
> >> +       if (use_raw_lock)
> >> +               preempt_disable();
> >> +       else
> >> +               migrate_disable();
> >>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >> -               migrate_enable();
> >> +               if (use_raw_lock)
> >> +                       preempt_enable();
> >> +               else
> >> +                       migrate_enable();
> >>                 return -EBUSY;
> >>         }
> >>
> >> -       if (htab_use_raw_lock(htab))
> >> +       if (use_raw_lock)
> >>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>         else
> >>                 spin_lock_irqsave(&b->lock, flags);
> >> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>                                       struct bucket *b, u32 hash,
> >>                                       unsigned long flags)
> >>  {
> >> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >> +
> >>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >> -       if (htab_use_raw_lock(htab))
> >> +       if (use_raw_lock)
> >>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>         else
> >>                 spin_unlock_irqrestore(&b->lock, flags);
> >>         __this_cpu_dec(*(htab->map_locked[hash]));
> >> -       migrate_enable();
> >> +       if (use_raw_lock)
> >> +               preempt_enable();
> >> +       else
> >> +               migrate_enable();
> >>  }
> >>
> >>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >> --
> >> 2.29.2
> >>
> > .
>


* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
  2022-08-21 16:42   ` Hao Luo
@ 2022-08-22  8:13   ` Sebastian Andrzej Siewior
  2022-08-22 12:09     ` Hou Tao
  1 sibling, 1 reply; 19+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-08-22  8:13 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Andrii Nakryiko, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Lorenz Bauer, houtao1

On 2022-08-21 11:32:21 [+0800], Hou Tao wrote:
> process A                      process B
> 
> htab_map_update_elem()
>   htab_lock_bucket()
>     migrate_disable()
>     /* return 1 */
>     __this_cpu_inc_return()
>     /* preempted by B */
> 
>                                htab_map_update_elem()
>                                  /* the same bucket as A */
>                                  htab_lock_bucket()
>                                    migrate_disable()
>                                    /* return 2, so lock fails */
>                                    __this_cpu_inc_return()
>                                    return -EBUSY
> 
> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> only checking the value of map_locked for nmi context. But it will
> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> through non-tracing program (e.g. fentry program).
> 
> So fixing it by using disable_preempt() instead of migrate_disable() when
> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> so still use migrate_disable() for spin-lock case.

But isn't the RT case still affected by the very same problem?

> Signed-off-by: Hou Tao <houtao1@huawei.com>

Sebastian


* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22  3:21       ` Hao Luo
@ 2022-08-22 12:07         ` Hou Tao
  2022-08-22 18:01           ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Hou Tao @ 2022-08-22 12:07 UTC (permalink / raw)
  To: Hao Luo
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

Hi,

On 8/22/2022 11:21 AM, Hao Luo wrote:
> Hi, Hou Tao
>
> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>> Hi Hou Tao,
>>>
>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
>>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>> processes) unexpectedly as shown below:
>>>>
>>>> process A                      process B
>>>>
>>>> htab_map_update_elem()
>>>>   htab_lock_bucket()
>>>>     migrate_disable()
>>>>     /* return 1 */
>>>>     __this_cpu_inc_return()
>>>>     /* preempted by B */
>>>>
>>>>                                htab_map_update_elem()
>>>>                                  /* the same bucket as A */
>>>>                                  htab_lock_bucket()
>>>>                                    migrate_disable()
>>>>                                    /* return 2, so lock fails */
>>>>                                    __this_cpu_inc_return()
>>>>                                    return -EBUSY
>>>>
>>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>>>> only checking the value of map_locked for nmi context. But it will
>>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>>>> through non-tracing program (e.g. fentry program).
>>>>
>>>> So fixing it by using disable_preempt() instead of migrate_disable() when
>>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>>>> so still use migrate_disable() for spin-lock case.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>> ---
>>> IIUC, this patch enlarges the scope of preemption disable to cover inc
>>> map_locked. But I don't think the change is meaningful.
>> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
>> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
>> case, so I don't think that the change is meaningless.
>>> This patch only affects the case when raw lock is used. In the case of
>>> raw lock, irq is disabled for b->raw_lock protected critical section.
>>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
>>> understanding about this patch is, it just makes sure preemption
>>> doesn't happen on the exact __this_cpu_inc_return. But the window is
>>> so small that it should be really unlikely to happen.
>> No, it can be easily reproduced by running multiple htab update processes in the
>> same CPU. Will add selftest to demonstrate that.
> Can you clarify what you demonstrate?
First, please enable CONFIG_PREEMPT for the running kernel and then run the
test program as shown below.

# sudo taskset -c 2 ./update.bin
thread nr 2
wait for error
update error -16
all threads exit

If there is no "update error -16", you can try to create more map update
threads. For example running 16 update threads:

# sudo taskset -c 2 ./update.bin 16
thread nr 16
wait for error
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
all threads exit

The following is the source code for update.bin:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

#include "bpf.h"
#include "libbpf.h"

static bool stop;
static int fd;

static void *update_fn(void *arg)
{
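        /* All threads update the same single key, so with every thread
         * pinned to one CPU a preempted updater makes the next one see
         * map_locked != 1 and fail with -EBUSY.
         */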
        while (!stop) {
                unsigned int key = 0, value = 1;
                int err;

                err = bpf_map_update_elem(fd, &key, &value, 0);
                if (err) {
                        printf("update error %d\n", err);
                        stop = true;
                        break;
                }
        }

        return NULL;
}

int main(int argc, char **argv)
{
        LIBBPF_OPTS(bpf_map_create_opts, opts);
        unsigned int i, nr;
        pthread_t *tids;

        nr = 2;
        if (argc > 1)
                nr = atoi(argv[1]);
        printf("thread nr %u\n", nr);

        libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
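        /* Single-entry hash map (key/value size 4, max_entries 1), so
         * every update targets the same bucket.
         */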
        fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
        if (fd < 0) {
                printf("create map error %d\n", fd);
                return 1;
        }

        tids = malloc(nr * sizeof(*tids));
        for (i = 0; i < nr; i++)
                pthread_create(&tids[i], NULL, update_fn, NULL);

        printf("wait for error\n");
        for (i = 0; i < nr; i++)
                pthread_join(tids[i], NULL);

        printf("all threads exit\n");

        free(tids);
        close(fd);

        return 0;
}

>
> Here is my theory, but please correct me if I'm wrong, I haven't
> tested yet. In non-RT, I doubt preemptions are likely to happen after
> migrate_disable. That is because very soon after migrate_disable, we
> enter the critical section of b->raw_lock with irq disabled. In RT,
> preemptions can happen on acquiring b->lock, that is certainly
> possible, but this is the !use_raw_lock path, which isn't side-stepped
> by this patch.
>
>>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index 6c530a5e560a..ad09da139589 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>                                    unsigned long *pflags)
>>>>  {
>>>>         unsigned long flags;
>>>> +       bool use_raw_lock;
>>>>
>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>
>>>> -       migrate_disable();
>>>> +       use_raw_lock = htab_use_raw_lock(htab);
>>>> +       if (use_raw_lock)
>>>> +               preempt_disable();
>>>> +       else
>>>> +               migrate_disable();
>>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>> -               migrate_enable();
>>>> +               if (use_raw_lock)
>>>> +                       preempt_enable();
>>>> +               else
>>>> +                       migrate_enable();
>>>>                 return -EBUSY;
>>>>         }
>>>>
>>>> -       if (htab_use_raw_lock(htab))
>>>> +       if (use_raw_lock)
>>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>         else
>>>>                 spin_lock_irqsave(&b->lock, flags);
>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>                                       struct bucket *b, u32 hash,
>>>>                                       unsigned long flags)
>>>>  {
>>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>>>> +
>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>> -       if (htab_use_raw_lock(htab))
>>>> +       if (use_raw_lock)
>>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>         else
>>>>                 spin_unlock_irqrestore(&b->lock, flags);
>>>>         __this_cpu_dec(*(htab->map_locked[hash]));
>>>> -       migrate_enable();
>>>> +       if (use_raw_lock)
>>>> +               preempt_enable();
>>>> +       else
>>>> +               migrate_enable();
>>>>  }
>>>>
>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>> --
>>>> 2.29.2
>>>>
>>> .
> .



* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22  8:13   ` Sebastian Andrzej Siewior
@ 2022-08-22 12:09     ` Hou Tao
  2022-08-22 15:30       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 19+ messages in thread
From: Hou Tao @ 2022-08-22 12:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Song Liu, Hao Sun, Andrii Nakryiko, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Lorenz Bauer, houtao1

Hi,

On 8/22/2022 4:13 PM, Sebastian Andrzej Siewior wrote:
> On 2022-08-21 11:32:21 [+0800], Hou Tao wrote:
>> process A                      process B
>>
>> htab_map_update_elem()
>>   htab_lock_bucket()
>>     migrate_disable()
>>     /* return 1 */
>>     __this_cpu_inc_return()
>>     /* preempted by B */
>>
>>                                htab_map_update_elem()
>>                                  /* the same bucket as A */
>>                                  htab_lock_bucket()
>>                                    migrate_disable()
>>                                    /* return 2, so lock fails */
>>                                    __this_cpu_inc_return()
>>                                    return -EBUSY
>>
>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>> only checking the value of map_locked for nmi context. But it will
>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>> through non-tracing program (e.g. fentry program).
>>
>> So fixing it by using disable_preempt() instead of migrate_disable() when
>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>> so still use migrate_disable() for spin-lock case.
> But isn't the RT case still affected by the very same problem?
As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in
patch 2.
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> Sebastian
> .



* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22 12:09     ` Hou Tao
@ 2022-08-22 15:30       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 19+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-08-22 15:30 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Andrii Nakryiko, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, KP Singh,
	David S . Miller, Jakub Kicinski, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, John Fastabend, Lorenz Bauer, houtao1

On 2022-08-22 20:09:47 [+0800], Hou Tao wrote:
> Hi,
Hi,

> > But isn't the RT case still affected by the very same problem?
> As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in
> patch 2.

ups, sorry.

Sebastian


* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22 12:07         ` Hou Tao
@ 2022-08-22 18:01           ` Hao Luo
  2022-08-23  0:56             ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-08-22 18:01 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2022 11:21 AM, Hao Luo wrote:
> > Hi, Hou Tao
> >
> > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> >>> Hi Hou Tao,
> >>>
> >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> From: Hou Tao <houtao1@huawei.com>
> >>>>
> >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
> >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
> >>>> disallows concurrent updates from normal contexts (e.g. userspace
> >>>> processes) unexpectedly as shown below:
> >>>>
> >>>> process A                      process B
> >>>>
> >>>> htab_map_update_elem()
> >>>>   htab_lock_bucket()
> >>>>     migrate_disable()
> >>>>     /* return 1 */
> >>>>     __this_cpu_inc_return()
> >>>>     /* preempted by B */
> >>>>
> >>>>                                htab_map_update_elem()
> >>>>                                  /* the same bucket as A */
> >>>>                                  htab_lock_bucket()
> >>>>                                    migrate_disable()
> >>>>                                    /* return 2, so lock fails */
> >>>>                                    __this_cpu_inc_return()
> >>>>                                    return -EBUSY
> >>>>
> >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >>>> only checking the value of map_locked for nmi context. But it will
> >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> >>>> through non-tracing program (e.g. fentry program).
> >>>>
> >>>> So fixing it by using disable_preempt() instead of migrate_disable() when
> >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> >>>> so still use migrate_disable() for spin-lock case.
> >>>>
> >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >>>> ---
> >>> IIUC, this patch enlarges the scope of preemption disable to cover inc
> >>> map_locked. But I don't think the change is meaningful.
> >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> >> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> >> case, so I don't think that the change is meaningless.
> >>> This patch only affects the case when raw lock is used. In the case of
> >>> raw lock, irq is disabled for b->raw_lock protected critical section.
> >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> >>> understanding about this patch is, it just makes sure preemption
> >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> >>> so small that it should be really unlikely to happen.
> >> No, it can be easily reproduced by running multiple htab update processes in the
> >> same CPU. Will add selftest to demonstrate that.
> > Can you clarify what you demonstrate?
> First please enable CONFIG_PREEMPT for the running kernel and then run the
> following test program as shown below.
>

Ah, fully preemptive kernel. It's worth mentioning that in the commit
message. Then it seems promoting migrate_disable to preempt_disable
may be the best way to solve the problem you described.

> # sudo taskset -c 2 ./update.bin
> thread nr 2
> wait for error
> update error -16
> all threads exit
>
> If there is no "update error -16", you can try to create more map update
> threads. For example running 16 update threads:
>
> # sudo taskset -c 2 ./update.bin 16
> thread nr 16
> wait for error
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> all threads exit
>
> The following is the source code for update.bin:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <errno.h>
> #include <pthread.h>
>
> #include "bpf.h"
> #include "libbpf.h"
>
> static bool stop;
> static int fd;
>
> static void *update_fn(void *arg)
> {
>         while (!stop) {
>                 unsigned int key = 0, value = 1;
>                 int err;
>
>                 err = bpf_map_update_elem(fd, &key, &value, 0);
>                 if (err) {
>                         printf("update error %d\n", err);
>                         stop = true;
>                         break;
>                 }
>         }
>
>         return NULL;
> }
>
> int main(int argc, char **argv)
> {
>         LIBBPF_OPTS(bpf_map_create_opts, opts);
>         unsigned int i, nr;
>         pthread_t *tids;
>
>         nr = 2;
>         if (argc > 1)
>                 nr = atoi(argv[1]);
>         printf("thread nr %u\n", nr);
>
>         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
>         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
>         if (fd < 0) {
>                 printf("create map error %d\n", fd);
>                 return 1;
>         }
>
>         tids = malloc(nr * sizeof(*tids));
>         for (i = 0; i < nr; i++)
>                 pthread_create(&tids[i], NULL, update_fn, NULL);
>
>         printf("wait for error\n");
>         for (i = 0; i < nr; i++)
>                 pthread_join(tids[i], NULL);
>
>         printf("all threads exit\n");
>
>         free(tids);
>         close(fd);
>
>         return 0;
> }
>
> >
> > Here is my theory, but please correct me if I'm wrong, I haven't
> > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > migrate_disable. That is because very soon after migrate_disable, we
> > enter the critical section of b->raw_lock with irq disabled. In RT,
> > preemptions can happen on acquiring b->lock, that is certainly
> > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > by this patch.
> >
> >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>>> index 6c530a5e560a..ad09da139589 100644
> >>>> --- a/kernel/bpf/hashtab.c
> >>>> +++ b/kernel/bpf/hashtab.c
> >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>>>                                    unsigned long *pflags)
> >>>>  {
> >>>>         unsigned long flags;
> >>>> +       bool use_raw_lock;
> >>>>
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>>
> >>>> -       migrate_disable();
> >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_disable();
> >>>> +       else
> >>>> +               migrate_disable();
> >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -               migrate_enable();
> >>>> +               if (use_raw_lock)
> >>>> +                       preempt_enable();
> >>>> +               else
> >>>> +                       migrate_enable();
> >>>>                 return -EBUSY;
> >>>>         }
> >>>>
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_lock_irqsave(&b->lock, flags);
> >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>>>                                       struct bucket *b, u32 hash,
> >>>>                                       unsigned long flags)
> >>>>  {
> >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >>>> +
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -       migrate_enable();
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_enable();
> >>>> +       else
> >>>> +               migrate_enable();
> >>>>  }
> >>>>
> >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >>>> --
> >>>> 2.29.2
> >>>>
> >>> .
> > .
>


* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-22 18:01           ` Hao Luo
@ 2022-08-23  0:56             ` Hao Luo
  2022-08-23  1:29               ` Alexei Starovoitov
  2022-08-23  2:54               ` Hou Tao
  0 siblings, 2 replies; 19+ messages in thread
From: Hao Luo @ 2022-08-23  0:56 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > Hi, Hou Tao
> > >
> > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > >>> Hi Hou Tao,
> > >>>
> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>
> > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> > >>>> is also preemptible under the !PREEMPT_RT case, so now map_locked also
> > >>>> disallows concurrent updates from normal contexts (e.g. userspace
> > >>>> processes) unexpectedly as shown below:
> > >>>>
> > >>>> process A                      process B
> > >>>>
> > >>>> htab_map_update_elem()
> > >>>>   htab_lock_bucket()
> > >>>>     migrate_disable()
> > >>>>     /* return 1 */
> > >>>>     __this_cpu_inc_return()
> > >>>>     /* preempted by B */
> > >>>>
> > >>>>                                htab_map_update_elem()
> > >>>>                                  /* the same bucket as A */
> > >>>>                                  htab_lock_bucket()
> > >>>>                                    migrate_disable()
> > >>>>                                    /* return 2, so lock fails */
> > >>>>                                    __this_cpu_inc_return()
> > >>>>                                    return -EBUSY
> > >>>>
> > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > >>>> only checking the value of map_locked for nmi context. But it will
> > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> > >>>> through non-tracing program (e.g. fentry program).
> > >>>>
> > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > >>>> the bucket lock will be a sleepable spin-lock and it breaks preempt_disable(),
> > >>>> so still use migrate_disable() for the spin-lock case.
> > >>>>
> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >>>> ---
> > >>> IIUC, this patch enlarges the scope of preemption disable to cover inc
> > >>> map_locked. But I don't think the change is meaningful.
> > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> > >> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> > >> case, so I don't think that the change is meaningless.
> > >>> This patch only affects the case when raw lock is used. In the case of
> > >>> raw lock, irq is disabled for b->raw_lock protected critical section.
> > >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> > >>> understanding about this patch is, it just makes sure preemption
> > >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> > >>> so small that it should be really unlikely to happen.
> > >> No, it can be easily reproduced by running multiple htab update processes on the
> > >> same CPU. Will add a selftest to demonstrate that.
> > > Can you clarify what you demonstrate?
> > First please enable CONFIG_PREEMPT for the running kernel and then run the
> > following test program as shown below.
> >
>
> Ah, fully preemptive kernel. It's worth mentioning that in the commit
> message. Then it seems promoting migrate_disable to preempt_disable
> may be the best way to solve the problem you described.
>
> > # sudo taskset -c 2 ./update.bin
> > thread nr 2
> > wait for error
> > update error -16
> > all threads exit
> >
> > If there is no "update error -16", you can try to create more map update
> > threads. For example running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static bool stop;
> > static int fd;
> >
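> > /* Each thread updates the same key of a single-entry hash map in a
> >  * tight loop.  With every thread pinned to one CPU (see the taskset
> >  * invocation above), a thread can be preempted in the kernel right
> >  * after htab_lock_bucket() bumps the per-CPU map_locked counter, so
> >  * the next thread's update sees the counter != 1 and fails with
> >  * -EBUSY (-16). */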
> > static void *update_fn(void *arg)
> > {
> >         while (!stop) {
> >                 unsigned int key = 0, value = 1;
> >                 int err;
> >
> >                 err = bpf_map_update_elem(fd, &key, &value, 0);
> >                 if (err) {
> >                         printf("update error %d\n", err);
> >                         stop = true;
> >                         break;
> >                 }
> >         }
> >
> >         return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> >         LIBBPF_OPTS(bpf_map_create_opts, opts);
> >         unsigned int i, nr;
> >         pthread_t *tids;
> >
> >         nr = 2;
> >         if (argc > 1)
> >                 nr = atoi(argv[1]);
> >         printf("thread nr %u\n", nr);
> >
> >         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> >         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> >         if (fd < 0) {
> >                 printf("create map error %d\n", fd);
> >                 return 1;
> >         }
> >
> >         tids = malloc(nr * sizeof(*tids));
> >         for (i = 0; i < nr; i++)
> >                 pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> >         printf("wait for error\n");
> >         for (i = 0; i < nr; i++)
> >                 pthread_join(tids[i], NULL);
> >
> >         printf("all threads exit\n");
> >
> >         free(tids);
> >         close(fd);
> >
> >         return 0;
> > }
> >

Tao, thanks very much for the test. I played it a bit and I can
confirm that map_update failures are seen under CONFIG_PREEMPT. The
failures are not present under CONFIG_PREEMPT_NONE or
CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
thinking of and they didn't work. It looks like Hou Tao's idea,
promoting migrate_disable to preempt_disable, is probably the best we
can do for the non-RT case. So

Reviewed-by: Hao Luo <haoluo@google.com>

But I am not sure whether we want to get rid of the preemption-caused
batch map update failures on preemptive kernels, or whether there are
better solutions to address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.
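
To make the failure mode easier to follow, here is a minimal userspace
model of the per-CPU map_locked re-entrancy guard (an illustration only,
not kernel code; lock_bucket()/unlock_bucket() are made-up stand-ins for
htab_lock_bucket()/htab_unlock_bucket()). It shows why a second update on
the same CPU sees the counter at 2 and bails out with -EBUSY:

#include <errno.h>
#include <stdio.h>

static int map_locked;		/* stands in for this CPU's map_locked[hash] */

static int lock_bucket(void)
{
	if (++map_locked != 1) {	/* models __this_cpu_inc_return() != 1 */
		--map_locked;
		return -EBUSY;
	}
	/* the bucket spinlock would be taken here */
	return 0;
}

static void unlock_bucket(void)
{
	--map_locked;
}

int main(void)
{
	int err_a, err_b;

	err_a = lock_bucket();	/* process A enters htab_lock_bucket() ... */
	/* ... and is preempted before unlocking; B runs on the same CPU */
	err_b = lock_bucket();
	printf("A: %d, B: %d\n", err_a, err_b);	/* prints "A: 0, B: -16" */
	if (!err_a)
		unlock_bucket();
	return 0;
}

Patch 1 closes exactly this window for the raw-lock case by replacing
migrate_disable() with preempt_disable() around the increment.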

> > >
> > > Here is my theory, but please correct me if I'm wrong, I haven't
> > > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > > migrate_disable. That is because very soon after migrate_disable, we
> > > enter the critical section of b->raw_lock with irq disabled. In RT,
> > > preemptions can happen on acquiring b->lock, that is certainly
> > > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
> > >
> > >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> > >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> > >>>>
> > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > >>>> index 6c530a5e560a..ad09da139589 100644
> > >>>> --- a/kernel/bpf/hashtab.c
> > >>>> +++ b/kernel/bpf/hashtab.c
> > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > >>>>                                    unsigned long *pflags)
> > >>>>  {
> > >>>>         unsigned long flags;
> > >>>> +       bool use_raw_lock;
> > >>>>
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>>
> > >>>> -       migrate_disable();
> > >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_disable();
> > >>>> +       else
> > >>>> +               migrate_disable();
> > >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -               migrate_enable();
> > >>>> +               if (use_raw_lock)
> > >>>> +                       preempt_enable();
> > >>>> +               else
> > >>>> +                       migrate_enable();
> > >>>>                 return -EBUSY;
> > >>>>         }
> > >>>>
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_lock_irqsave(&b->lock, flags);
> > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> > >>>>                                       struct bucket *b, u32 hash,
> > >>>>                                       unsigned long flags)
> > >>>>  {
> > >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> > >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -       migrate_enable();
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_enable();
> > >>>> +       else
> > >>>> +               migrate_enable();
> > >>>>  }
> > >>>>
> > >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> > >>>> --
> > >>>> 2.29.2
> > >>>>
> > >>> .
> > > .
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-23  0:56             ` Hao Luo
@ 2022-08-23  1:29               ` Alexei Starovoitov
  2022-08-23  2:57                 ` Hou Tao
  2022-08-23  2:54               ` Hou Tao
  1 sibling, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-23  1:29 UTC (permalink / raw)
  To: Hao Luo
  Cc: Hou Tao, bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Hou Tao

On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > > Hi, Hou Tao
> > > >
> > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > > >>> Hi Hou Tao,
> > > >>>
> > > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>> From: Hou Tao <houtao1@huawei.com>
> > > >>>>
> > > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> > > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > > >>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
> > > >>>> is also preemptible under the !PREEMPT_RT case, so now map_locked also
> > > >>>> disallows concurrent updates from normal contexts (e.g. userspace
> > > >>>> processes) unexpectedly as shown below:
> > > >>>>
> > > >>>> process A                      process B
> > > >>>>
> > > >>>> htab_map_update_elem()
> > > >>>>   htab_lock_bucket()
> > > >>>>     migrate_disable()
> > > >>>>     /* return 1 */
> > > >>>>     __this_cpu_inc_return()
> > > >>>>     /* preempted by B */
> > > >>>>
> > > >>>>                                htab_map_update_elem()
> > > >>>>                                  /* the same bucket as A */
> > > >>>>                                  htab_lock_bucket()
> > > >>>>                                    migrate_disable()
> > > >>>>                                    /* return 2, so lock fails */
> > > >>>>                                    __this_cpu_inc_return()
> > > >>>>                                    return -EBUSY
> > > >>>>
> > > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > > >>>> only checking the value of map_locked for nmi context. But it will
> > > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> > > >>>> through non-tracing program (e.g. fentry program).
> > > >>>>
> > > >>>> So fix it by using preempt_disable() instead of migrate_disable() when
> > > >>>> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> > > >>>> the bucket lock will be a sleepable spin-lock and it breaks preempt_disable(),
> > > >>>> so still use migrate_disable() for the spin-lock case.
> > > >>>>
> > > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> Tao, thanks very much for the test. I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So

preempt_disable is also faster than migrate_disable,
so patch 1 will not only fix this issue, but will improve performance.

Patch 2 is too hacky though.
I think it's better to wait until my bpf_mem_alloc patches land.
RT case won't be special anymore. We will be able to remove
htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
With bpf_mem_alloc there is no inline memory allocation anymore.

So please address Hao's comments, add a test and
resubmit patches 1 and 3.
Also please use [PATCH bpf-next] in the subject to help BPF CI
and patchwork scripts.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-23  0:56             ` Hao Luo
  2022-08-23  1:29               ` Alexei Starovoitov
@ 2022-08-23  2:54               ` Hou Tao
  1 sibling, 0 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-23  2:54 UTC (permalink / raw)
  To: Hao Luo
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, houtao1

Hi,

On 8/23/2022 8:56 AM, Hao Luo wrote:
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 8/22/2022 11:21 AM, Hao Luo wrote:
>>>> Hi, Hou Tao
>>>>
>>>> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>>>>> Hi Hou Tao,
>>>>>>
>>>>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>
>>>>>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>>>>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>>>>> Make migrate_disable/enable() independent of RT"), migrate_disable()
>>>>>>> is also preemptible under the !PREEMPT_RT case, so now map_locked also
>>>>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>>>>> processes) unexpectedly as shown below:
>>>>>>>
>>>>>>> process A                      process B
>>>>>>>
SNIP
>>>>>>> First please enable CONFIG_PREEMPT for the running kernel and then run the
>>>>>>> following test program as shown below.
>>>>>>>
>> Ah, fully preemptive kernel. It's worth mentioning that in the commit
>> message. Then it seems promoting migrate_disable to preempt_disable
>> may be the best way to solve the problem you described.
Sorry, I missed that. Will add it in v2.
>>
>>> # sudo taskset -c 2 ./update.bin
>>> thread nr 2
>>> wait for error
>>> update error -16
>>> all threads exit
>>>
>>> If there is no "update error -16", you can try to create more map update
>>> threads. For example running 16 update threads:
>>>
>>> # sudo taskset -c 2 ./update.bin 16
>>> thread nr 16
>>> wait for error
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> all threads exit
SNIP
> Tao, thanks very much for the test. I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So
>
> Reviewed-by: Hao Luo <haoluo@google.com>
>
> But I am not sure whether we want to get rid of the preemption-caused
> batch map update failures on preemptive kernels, or whether there are
> better solutions to address [0].
Thanks for your review.
Under a preemptive kernel, if preemption is disabled during batch map lookup
or update, htab_lock_bucket() from userspace will not fail, so I think it is OK
for now.
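
For reference, the userspace side of the batch update path mentioned here
goes through libbpf's batch API; a minimal sketch (assuming a recent
libbpf, and reusing the reproducer's 4-byte key/value, single-entry map):

#include "bpf.h"
#include "libbpf.h"

/* Sketch only: one batch update of the reproducer's map. */
static int batch_update_once(int map_fd)
{
	LIBBPF_OPTS(bpf_map_batch_opts, opts);
	unsigned int keys[1] = { 0 };
	unsigned int values[1] = { 1 };
	__u32 count = 1;

	return bpf_map_update_batch(map_fd, keys, values, &count, &opts);
}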
>
> Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.
Will do.
>>>> Here is my theory, but please correct me if I'm wrong, I haven't
>>>> tested yet. In non-RT, I doubt preemptions are likely to happen after
>>>> migrate_disable. That is because very soon after migrate_disable, we
>>>> enter the critical section of b->raw_lock with irq disabled. In RT,
>>>> preemptions can happen on acquiring b->lock, that is certainly
>>>> possible, but this is the !use_raw_lock path, which isn't side-stepped
>>>> by this patch.
>>>>
>>>>>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>>>>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>>> index 6c530a5e560a..ad09da139589 100644
>>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>>>>                                    unsigned long *pflags)
>>>>>>>  {
>>>>>>>         unsigned long flags;
>>>>>>> +       bool use_raw_lock;
>>>>>>>
>>>>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>>>>
>>>>>>> -       migrate_disable();
>>>>>>> +       use_raw_lock = htab_use_raw_lock(htab);
>>>>>>> +       if (use_raw_lock)
>>>>>>> +               preempt_disable();
>>>>>>> +       else
>>>>>>> +               migrate_disable();
>>>>>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>>>>> -               migrate_enable();
>>>>>>> +               if (use_raw_lock)
>>>>>>> +                       preempt_enable();
>>>>>>> +               else
>>>>>>> +                       migrate_enable();
>>>>>>>                 return -EBUSY;
>>>>>>>         }
>>>>>>>
>>>>>>> -       if (htab_use_raw_lock(htab))
>>>>>>> +       if (use_raw_lock)
>>>>>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>>>>         else
>>>>>>>                 spin_lock_irqsave(&b->lock, flags);
>>>>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>>>>                                       struct bucket *b, u32 hash,
>>>>>>>                                       unsigned long flags)
>>>>>>>  {
>>>>>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>>>>>>> +
>>>>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>>>> -       if (htab_use_raw_lock(htab))
>>>>>>> +       if (use_raw_lock)
>>>>>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>>>>         else
>>>>>>>                 spin_unlock_irqrestore(&b->lock, flags);
>>>>>>>         __this_cpu_dec(*(htab->map_locked[hash]));
>>>>>>> -       migrate_enable();
>>>>>>> +       if (use_raw_lock)
>>>>>>> +               preempt_enable();
>>>>>>> +       else
>>>>>>> +               migrate_enable();
>>>>>>>  }
>>>>>>>
>>>>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>>>>> --
>>>>>>> 2.29.2
>>>>>>>
>>>>>> .
>>>> .


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-23  1:29               ` Alexei Starovoitov
@ 2022-08-23  2:57                 ` Hou Tao
  2022-08-23  4:50                   ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Hou Tao @ 2022-08-23  2:57 UTC (permalink / raw)
  To: Alexei Starovoitov, Hao Luo
  Cc: bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Hou Tao

Hi,

On 8/23/2022 9:29 AM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>>
SNIP
>> Tao, thanks very much for the test. I played it a bit and I can
>> confirm that map_update failures are seen under CONFIG_PREEMPT. The
>> failures are not present under CONFIG_PREEMPT_NONE or
>> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
>> thinking of and they didn't work. It looks like Hou Tao's idea,
>> promoting migrate_disable to preempt_disable, is probably the best we
>> can do for the non-RT case. So
> preempt_disable is also faster than migrate_disable,
> so patch 1 will not only fix this issue, but will improve performance.
>
> Patch 2 is too hacky though.
> I think it's better to wait until my bpf_mem_alloc patches land.
> RT case won't be special anymore. We will be able to remove
> htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
> With bpf_mem_alloc there is no inline memory allocation anymore.
OK. Looking forward to the landing of the BPF specific memory allocator.
>
> So please address Hao's comments, add a test and
> resubmit patches 1 and 3.
> Also please use [PATCH bpf-next] in the subject to help BPF CI
> and patchwork scripts.
Will do. And send it to bpf-next instead of bpf?
> .


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-23  2:57                 ` Hou Tao
@ 2022-08-23  4:50                   ` Alexei Starovoitov
  2022-08-23  6:41                     ` Hou Tao
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-08-23  4:50 UTC (permalink / raw)
  To: Hou Tao
  Cc: Hao Luo, bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Hou Tao

On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/23/2022 9:29 AM, Alexei Starovoitov wrote:
> > On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
> >>
> SNIP
> >> Tao, thanks very much for the test. I played it a bit and I can
> >> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> >> failures are not present under CONFIG_PREEMPT_NONE or
> >> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> >> thinking of and they didn't work. It looks like Hou Tao's idea,
> >> promoting migrate_disable to preempt_disable, is probably the best we
> >> can do for the non-RT case. So
> > preempt_disable is also faster than migrate_disable,
> > so patch 1 will not only fix this issue, but will improve performance.
> >
> > Patch 2 is too hacky though.
> > I think it's better to wait until my bpf_mem_alloc patches land.
> > RT case won't be special anymore. We will be able to remove
> > htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
> > With bpf_mem_alloc there is no inline memory allocation anymore.
> OK. Look forwards to the landing of BPF specific memory allocator.
> >
> > So please address Hao's comments, add a test and
> > resubmit patches 1 and 3.
> > Also please use [PATCH bpf-next] in the subject to help BPF CI
> > and patchwork scripts.
> Will do. And send it to bpf-next instead of bpf?

bpf-next is almost always preferred for fixes for corner
cases that have been around for some time.
The bpf tree is for security and high-priority fixes.
bpf-next gives time for fixes to prove themselves.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked
  2022-08-23  4:50                   ` Alexei Starovoitov
@ 2022-08-23  6:41                     ` Hou Tao
  0 siblings, 0 replies; 19+ messages in thread
From: Hou Tao @ 2022-08-23  6:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hao Luo, bpf, Song Liu, Hao Sun, Sebastian Andrzej Siewior,
	Andrii Nakryiko, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, KP Singh, David S . Miller,
	Jakub Kicinski, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Lorenz Bauer, Hou Tao

Hi,

On 8/23/2022 12:50 PM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
SNIP
>>> So please address Hao's comments, add a test and
>>> resubmit patches 1 and 3.
>>> Also please use [PATCH bpf-next] in the subject to help BPF CI
>>> and patchwork scripts.
>> Will do. And to bpf-next instead of bpf ?
> bpf-next is almost always preferred for fixes for corner
> cases that have been around for some time.
> The bpf tree is for security and high-priority fixes.
> bpf-next gives time for fixes to prove themselves.
> .
Understood. Thanks for the explanation.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-08-23  6:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-21  3:32 [PATCH 0/3] fixes for concurrent htab updates Hou Tao
2022-08-21  3:32 ` [PATCH 1/3] bpf: Disable preemption when increasing per-cpu map_locked Hou Tao
2022-08-21 16:42   ` Hao Luo
2022-08-22  1:27     ` Hou Tao
2022-08-22  3:21       ` Hao Luo
2022-08-22 12:07         ` Hou Tao
2022-08-22 18:01           ` Hao Luo
2022-08-23  0:56             ` Hao Luo
2022-08-23  1:29               ` Alexei Starovoitov
2022-08-23  2:57                 ` Hou Tao
2022-08-23  4:50                   ` Alexei Starovoitov
2022-08-23  6:41                     ` Hou Tao
2022-08-23  2:54               ` Hou Tao
2022-08-22  8:13   ` Sebastian Andrzej Siewior
2022-08-22 12:09     ` Hou Tao
2022-08-22 15:30       ` Sebastian Andrzej Siewior
2022-08-21  3:32 ` [PATCH 2/3] bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case Hou Tao
2022-08-21  3:32 ` [PATCH 3/3] bpf: Propagate error from htab_lock_bucket() to userspace Hou Tao
2022-08-22  1:21 ` [PATCH 0/3] fixes for concurrent htab updates Hou Tao
