* [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking.
@ 2024-05-03 18:25 Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
                   ` (15 more replies)
  0 siblings, 16 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) will be blocked by long-running lower-priority
tasks (or threaded interrupts), and this includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
- Jakub complained about touching networking drivers to make the
  additional locking work. Alexei complained about the additional
  locking within the XDP/eBPF case.
  This led to a change in how the per-CPU variables are accessed for the
  XDP/eBPF case. On PREEMPT_RT the variables are now stored on the stack
  and a pointer to the structure is saved in the task_struct, while
  keeping everything for !RT unchanged. This was proposed as an RFC in
  	v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/

  and then updated

        v2: https://lore.kernel.org/all/20240229183109.646865-1-bigeasy@linutronix.de/
          - Renamed the container struct from xdp_storage to bpf_net_context.
            Suggested by Toke Høiland-Jørgensen.
          - Use the container struct also on !PREEMPT_RT builds. Store the
            pointer to the on-stack struct in a per-CPU variable. Suggested by
            Toke Høiland-Jørgensen.

  This reduces the initial queue from 24 to 15 patches.

- There were complaints about scoped_guard(), which shifts the whole
  block and makes it harder to review because the entire block gets
  removed and added again in the diff. The usage has been replaced with
  local_lock_nested_bh() and its unlock counterpart; see the short sketch
  below.
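
  For illustration (simplified, made-up names): the scoped form moves the
  protected code into a new block,

	scoped_guard(local_lock_nested_bh, &foo.bh_lock) {
		/* protected code, re-indented by one level */
		foo_update();
	}

  while the explicit form leaves the existing lines untouched in the diff:

	local_lock_nested_bh(&foo.bh_lock);
	/* protected code, indentation unchanged */
	foo_update();
	local_unlock_nested_bh(&foo.bh_lock);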

Sebastian


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH net-next 01/15] locking/local_lock: Introduce guard definition for local_lock.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

Introduce lock guard definition for local_lock_t. There are no users
yet.
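
A minimal usage sketch with made-up names (there are no in-tree users
yet), showing how the guards are meant to be used once callers appear:

	struct my_data {
		local_lock_t	lock;
		unsigned int	counter;
	};

	static DEFINE_PER_CPU(struct my_data, my_data) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void my_update(void)
	{
		/* local_lock() on entry, local_unlock() when the scope ends */
		guard(local_lock)(&my_data.lock);
		this_cpu_inc(my_data.counter);
	}

	static void my_update_irqsave(void)
	{
		/* irqsave variant; the saved flags live in the guard object */
		guard(local_lock_irqsave)(&my_data.lock);
		this_cpu_inc(my_data.counter);
	}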

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/local_lock.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa73296..82366a37f4474 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,15 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+DEFINE_GUARD(local_lock, local_lock_t __percpu*,
+	     local_lock(_T),
+	     local_unlock(_T))
+DEFINE_GUARD(local_lock_irq, local_lock_t __percpu*,
+	     local_lock_irq(_T),
+	     local_unlock_irq(_T))
+DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t __percpu,
+		    local_lock_irqsave(_T->lock, _T->flags),
+		    local_unlock_irqrestore(_T->lock, _T->flags),
+		    unsigned long flags)
+
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

Add local_lock_nested_bh() locking. It is based on local_lock_t and the
naming follows the preempt_disable_nested() example.

For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
assumptions based on local_bh_disable(). The macro is optimized away
during compilation.
For !PREEMPT_RT + LOCKDEP the local_lock_nested_bh() is reduced to
the usual lock-acquire plus lockdep_assert_in_softirq() - ensuring that
BH is disabled.

For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
lock. It does not disable CPU migration because it relies on
local_bh_disable() disabling CPU migration.
With LOCKDEP it performs the usual lockdep checks as with !PREEMPT_RT.
Due to include hell the softirq check has been moved to spinlock.c.

The intention is to use this locking in places where locking of a per-CPU
variable relies on BH being disabled. Instead of treating disabled
bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
the locking scope to what actually needs protecting.
A side effect is that it also documents the protection scope of the
per-CPU variables.
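
A rough sketch of the intended call pattern with made-up names (the
networking patches later in the series follow this scheme): the per-CPU
data gains a local_lock_t member and the lock is taken inside the
BH-disabled region:

	struct my_cache {
		local_lock_t	bh_lock;
		unsigned int	count;
	};

	static DEFINE_PER_CPU(struct my_cache, my_cache) = {
		.bh_lock = INIT_LOCAL_LOCK(bh_lock),
	};

	static void my_cache_add(void)
	{
		struct my_cache *mc;

		local_bh_disable();
		local_lock_nested_bh(&my_cache.bh_lock);
		mc = this_cpu_ptr(&my_cache);
		mc->count++;
		local_unlock_nested_bh(&my_cache.bh_lock);
		local_bh_enable();
	}

In softirq context, where BH is already disabled, only the
local_lock_nested_bh()/local_unlock_nested_bh() pair is needed; the guard
form guard(local_lock_nested_bh)(&my_cache.bh_lock) works as well.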

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/local_lock.h          | 10 ++++++++++
 include/linux/local_lock_internal.h | 31 +++++++++++++++++++++++++++++
 include/linux/lockdep.h             |  3 +++
 kernel/locking/spinlock.c           |  8 ++++++++
 4 files changed, 52 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 82366a37f4474..091dc0b6bdfb9 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -62,4 +62,14 @@ DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t __percpu,
 		    local_unlock_irqrestore(_T->lock, _T->flags),
 		    unsigned long flags)
 
+#define local_lock_nested_bh(_lock)				\
+	__local_lock_nested_bh(_lock)
+
+#define local_unlock_nested_bh(_lock)				\
+	__local_unlock_nested_bh(_lock)
+
+DEFINE_GUARD(local_lock_nested_bh, local_lock_t __percpu*,
+	     local_lock_nested_bh(_T),
+	     local_unlock_nested_bh(_T))
+
 #endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a77..8dd71fbbb6d2b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -62,6 +62,17 @@ do {								\
 	local_lock_debug_init(lock);				\
 } while (0)
 
+#define __spinlock_nested_bh_init(lock)				\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));\
+	lockdep_init_map_type(&(lock)->dep_map, #lock, &__key,  \
+			      0, LD_WAIT_CONFIG, LD_WAIT_INV,	\
+			      LD_LOCK_NORMAL);			\
+	local_lock_debug_init(lock);				\
+} while (0)
+
 #define __local_lock(lock)					\
 	do {							\
 		preempt_disable();				\
@@ -98,6 +109,15 @@ do {								\
 		local_irq_restore(flags);			\
 	} while (0)
 
+#define __local_lock_nested_bh(lock)				\
+	do {							\
+		lockdep_assert_in_softirq();			\
+		local_lock_acquire(this_cpu_ptr(lock));	\
+	} while (0)
+
+#define __local_unlock_nested_bh(lock)				\
+	local_lock_release(this_cpu_ptr(lock))
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -138,4 +158,15 @@ typedef spinlock_t local_lock_t;
 
 #define __local_unlock_irqrestore(lock, flags)	__local_unlock(lock)
 
+#define __local_lock_nested_bh(lock)				\
+do {								\
+	lockdep_assert_in_softirq_func();			\
+	spin_lock(this_cpu_ptr(lock));				\
+} while (0)
+
+#define __local_unlock_nested_bh(lock)				\
+do {								\
+	spin_unlock(this_cpu_ptr((lock)));			\
+} while (0)
+
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 08b0d1d9d78b7..3f5a551579cc9 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -600,6 +600,8 @@ do {									\
 		     (!in_softirq() || in_irq() || in_nmi()));		\
 } while (0)
 
+extern void lockdep_assert_in_softirq_func(void);
+
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
@@ -613,6 +615,7 @@ do {									\
 # define lockdep_assert_preemption_enabled() do { } while (0)
 # define lockdep_assert_preemption_disabled() do { } while (0)
 # define lockdep_assert_in_softirq() do { } while (0)
+# define lockdep_assert_in_softirq_func() do { } while (0)
 #endif
 
 #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 8475a0794f8c5..438c6086d540e 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -413,3 +413,11 @@ notrace int in_lock_functions(unsigned long addr)
 	&& addr < (unsigned long)__lock_text_end;
 }
 EXPORT_SYMBOL(in_lock_functions);
+
+#if defined(CONFIG_PROVE_LOCKING) && defined(CONFIG_PREEMPT_RT)
+void notrace lockdep_assert_in_softirq_func(void)
+{
+	lockdep_assert_in_softirq();
+}
+EXPORT_SYMBOL(lockdep_assert_in_softirq_func);
+#endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 04/15] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

The else branch within __netdev_alloc_frag_align() is an open-coded
__napi_alloc_frag_align().

Use __napi_alloc_frag_align() instead of open coding it.
Move fragsz assignment before page_frag_alloc_align() invocation because
__napi_alloc_frag_align() also contains this statement.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/skbuff.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 28cd640a6ea97..c4479d5721a2a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,19 +318,15 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 {
 	void *data;
 
-	fragsz = SKB_DATA_ALIGN(fragsz);
 	if (in_hardirq() || irqs_disabled()) {
 		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
 
+		fragsz = SKB_DATA_ALIGN(fragsz);
 		data = __page_frag_alloc_align(nc, fragsz, GFP_ATOMIC,
 					       align_mask);
 	} else {
-		struct napi_alloc_cache *nc;
-
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
-					       align_mask);
+		data = __napi_alloc_frag_align(fragsz, align_mask);
 		local_bh_enable();
 	}
 	return data;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 04/15] net: Use nested-BH locking for napi_alloc_cache.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

napi_alloc_cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/skbuff.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c4479d5721a2a..c8b40e6237057 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -277,6 +277,7 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 #endif
 
 struct napi_alloc_cache {
+	local_lock_t bh_lock;
 	struct page_frag_cache page;
 	struct page_frag_1k page_small;
 	unsigned int skb_count;
@@ -284,7 +285,9 @@ struct napi_alloc_cache {
 };
 
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 /* Double check that napi_get_frags() allocates skbs with
  * skb->head being backed by slab, not a page fragment.
@@ -308,6 +311,7 @@ void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 
 	return __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
 				       align_mask);
@@ -338,6 +342,7 @@ static struct sk_buff *napi_skb_cache_get(void)
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	struct sk_buff *skb;
 
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 	if (unlikely(!nc->skb_count)) {
 		nc->skb_count = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
 						      GFP_ATOMIC,
@@ -740,9 +745,13 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 		pfmemalloc = nc->pfmemalloc;
 	} else {
 		local_bh_disable();
+		local_lock_nested_bh(&napi_alloc_cache.bh_lock);
+
 		nc = this_cpu_ptr(&napi_alloc_cache.page);
 		data = page_frag_alloc(nc, len, gfp_mask);
 		pfmemalloc = nc->pfmemalloc;
+
+		local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
 		local_bh_enable();
 	}
 
@@ -806,11 +815,11 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
 		goto skb_success;
 	}
 
-	nc = this_cpu_ptr(&napi_alloc_cache);
-
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
+	local_lock_nested_bh(&napi_alloc_cache.bh_lock);
+	nc = this_cpu_ptr(&napi_alloc_cache);
 	if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
 		/* we are artificially inflating the allocation size, but
 		 * that is not as bad as it may look like, as:
@@ -832,6 +841,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
 		data = page_frag_alloc(&nc->page, len, gfp_mask);
 		pfmemalloc = nc->page.pfmemalloc;
 	}
+	local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
 
 	if (unlikely(!data))
 		return NULL;
@@ -1393,6 +1403,7 @@ static void napi_skb_cache_put(struct sk_buff *skb)
 	if (!kasan_mempool_poison_object(skb))
 		return;
 
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 	nc->skb_cache[nc->skb_count++] = skb;
 
 	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 04/15] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, David Ahern

sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Create a struct with a pad member (the original sigpool_scratch pointer)
and a local_lock_t, and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/ipv4/tcp_sigpool.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_sigpool.c b/net/ipv4/tcp_sigpool.c
index 8512cb09ebc09..d8a4f192873a2 100644
--- a/net/ipv4/tcp_sigpool.c
+++ b/net/ipv4/tcp_sigpool.c
@@ -10,7 +10,14 @@
 #include <net/tcp.h>
 
 static size_t __scratch_size;
-static DEFINE_PER_CPU(void __rcu *, sigpool_scratch);
+struct sigpool_scratch {
+	local_lock_t bh_lock;
+	void __rcu *pad;
+};
+
+static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 struct sigpool_entry {
 	struct crypto_ahash	*hash;
@@ -72,7 +79,7 @@ static int sigpool_reserve_scratch(size_t size)
 			break;
 		}
 
-		old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+		old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
 					scratch, lockdep_is_held(&cpool_mutex));
 		if (!cpu_online(cpu) || !old_scratch) {
 			kfree(old_scratch);
@@ -93,7 +100,7 @@ static void sigpool_scratch_free(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+		kfree(rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
 					  NULL, lockdep_is_held(&cpool_mutex)));
 	__scratch_size = 0;
 }
@@ -277,7 +284,8 @@ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RC
 	/* Pairs with tcp_sigpool_reserve_scratch(), scratch area is
 	 * valid (allocated) until tcp_sigpool_end().
 	 */
-	c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch));
+	local_lock_nested_bh(&sigpool_scratch.bh_lock);
+	c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad));
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_sigpool_start);
@@ -286,6 +294,7 @@ void tcp_sigpool_end(struct tcp_sigpool *c) __releases(RCU_BH)
 {
 	struct crypto_ahash *hash = crypto_ahash_reqtfm(c->req);
 
+	local_unlock_nested_bh(&sigpool_scratch.bh_lock);
 	rcu_read_unlock_bh();
 	ahash_request_free(c->req);
 	crypto_free_ahash(hash);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, David Ahern

ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Create a struct with a sock member (the original ipv4_tcp_sk socket) and
a local_lock_t, and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/sock.h  |  5 +++++
 net/ipv4/tcp_ipv4.c | 15 +++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0450494a1766a..8380898d71267 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -544,6 +544,11 @@ struct sock {
 	netns_tracker		ns_tracker;
 };
 
+struct sock_bh_locked {
+	struct sock *sock;
+	local_lock_t bh_lock;
+};
+
 enum sk_pacing {
 	SK_PACING_NONE		= 0,
 	SK_PACING_NEEDED	= 1,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0427deca3e0eb..eefeff2d2f2b1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -93,7 +93,9 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 struct inet_hashinfo tcp_hashinfo;
 EXPORT_SYMBOL(tcp_hashinfo);
 
-static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);
+static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
@@ -879,7 +881,9 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
 	arg.tos = ip_hdr(skb)->tos;
 	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ctl_sk = this_cpu_read(ipv4_tcp_sk);
+	local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+	ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
+
 	sock_net_set(ctl_sk, net);
 	if (sk) {
 		ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
@@ -904,6 +908,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
 	sock_net_set(ctl_sk, &init_net);
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
 	__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
+	local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
 	local_bh_enable();
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -999,7 +1004,8 @@ static void tcp_v4_send_ack(const struct sock *sk,
 	arg.tos = tos;
 	arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ctl_sk = this_cpu_read(ipv4_tcp_sk);
+	local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+	ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
 	sock_net_set(ctl_sk, net);
 	ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
 			   inet_twsk(sk)->tw_mark : READ_ONCE(sk->sk_mark);
@@ -1014,6 +1020,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
 
 	sock_net_set(ctl_sk, &init_net);
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
+	local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
 	local_bh_enable();
 }
 
@@ -3620,7 +3627,7 @@ void __init tcp_v4_init(void)
 		 */
 		inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
 
-		per_cpu(ipv4_tcp_sk, cpu) = sk;
+		per_cpu(ipv4_tcp_sk.sock, cpu) = sk;
 	}
 	if (register_pernet_subsys(&tcp_sk_ops))
 		panic("Failed to create the TCP control socket.\n");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 08/15] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Florian Westphal, Jozsef Kadlecsik,
	Nikolay Aleksandrov, Pablo Neira Ayuso, Roopa Prabhu, bridge,
	coreteam, netfilter-devel

brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: bridge@lists.linux.dev
Cc: coreteam@netfilter.org
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/bridge/br_netfilter_hooks.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 7948a9e7542c4..baacd80716046 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -137,6 +137,7 @@ static inline bool is_pppoe_ipv6(const struct sk_buff *skb,
 #define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
 
 struct brnf_frag_data {
+	local_lock_t bh_lock;
 	char mac[NF_BRIDGE_MAX_MAC_HEADER_LENGTH];
 	u8 encap_size;
 	u8 size;
@@ -144,7 +145,9 @@ struct brnf_frag_data {
 	__be16 vlan_proto;
 };
 
-static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
+static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 static void nf_bridge_info_free(struct sk_buff *skb)
 {
@@ -882,6 +885,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 
 		IPCB(skb)->frag_max_size = nf_bridge->frag_max_size;
 
+		guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
 		data = this_cpu_ptr(&brnf_frag_data_storage);
 
 		if (skb_vlan_tag_present(skb)) {
@@ -909,6 +913,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 
 		IP6CB(skb)->frag_max_size = nf_bridge->frag_max_size;
 
+		guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
 		data = this_cpu_ptr(&brnf_frag_data_storage);
 		data->encap_size = nf_bridge_encap_header_len(skb);
 		data->size = ETH_HLEN + data->encap_size;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 08/15] net: softnet_data: Make xmit.recursion per task.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (6 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*() Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Juri Lelli,
	Mel Gorman, Steven Rostedt, Valentin Schneider, Vincent Guittot

Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
recursion counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.

Instead of adding a lock to protect the per-CPU variable it is simpler
to make the counter per-task. Sending and receiving skbs always happens
in thread context anyway.

Having a lock to protect the per-CPU counter would needlessly block or
serialize two sending threads. It would also require a recursive lock to
ensure that the owner can increment the counter further.

Make the recursion counter a task_struct member on PREEMPT_RT.
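
For illustration, a rough sketch of how the recursion helpers are
typically used around a transmit path (simplified, not an exact in-tree
call site):

	static int my_tx(struct sk_buff *skb)
	{
		int ret;

		if (dev_xmit_recursion()) {
			/* Nested too deep, drop instead of recursing further. */
			kfree_skb(skb);
			return -ENETDOWN;
		}

		dev_xmit_recursion_inc();
		ret = dev_queue_xmit(skb);
		dev_xmit_recursion_dec();

		return ret;
	}

On PREEMPT_RT the counter lives in current->net_xmit_recursion, so a
preempting sender on the same CPU no longer disturbs the count.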

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h | 11 +++++++++++
 include/linux/sched.h     |  4 +++-
 net/core/dev.h            | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 41853424b41d7..c551ec235f9af 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3222,7 +3222,9 @@ struct softnet_data {
 #endif
 	/* written and read only by owning cpu: */
 	struct {
+#ifndef CONFIG_PREEMPT_RT
 		u16 recursion;
+#endif
 		u8  more;
 #ifdef CONFIG_NET_EGRESS
 		u8  skip_txqueue;
@@ -3255,10 +3257,19 @@ struct softnet_data {
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
+#ifdef CONFIG_PREEMPT_RT
+static inline int dev_recursion_level(void)
+{
+	return current->net_xmit_recursion;
+}
+
+#else
+
 static inline int dev_recursion_level(void)
 {
 	return this_cpu_read(softnet_data.xmit.recursion);
 }
+#endif
 
 void __netif_schedule(struct Qdisc *q);
 void netif_schedule_queue(struct netdev_queue *txq);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b49..6779d3b8f2578 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -969,7 +969,9 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
-
+#ifdef CONFIG_PREEMPT_RT
+	u8				net_xmit_recursion;
+#endif
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
 	struct restart_block		restart_block;
diff --git a/net/core/dev.h b/net/core/dev.h
index b7b518bc2be55..2f96d63053ad0 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -150,6 +150,25 @@ struct napi_struct *napi_by_id(unsigned int napi_id);
 void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu);
 
 #define XMIT_RECURSION_LIMIT	8
+
+#ifdef CONFIG_PREEMPT_RT
+static inline bool dev_xmit_recursion(void)
+{
+	return unlikely(current->net_xmit_recursion > XMIT_RECURSION_LIMIT);
+}
+
+static inline void dev_xmit_recursion_inc(void)
+{
+	current->net_xmit_recursion++;
+}
+
+static inline void dev_xmit_recursion_dec(void)
+{
+	current->net_xmit_recursion--;
+}
+
+#else
+
 static inline bool dev_xmit_recursion(void)
 {
 	return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
@@ -165,5 +184,6 @@ static inline void dev_xmit_recursion_dec(void)
 {
 	__this_cpu_dec(softnet_data.xmit.recursion);
 }
+#endif
 
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*().
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (7 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 08/15] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled then locking
was achieved by disabling interrupts, except on PREEMPT_RT. PREEMPT_RT
was excluded because the needed synchronisation was already provided by
local_bh_disable().

Since the introduction of backlog NAPI and making it mandatory for
PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
removed.

Remove the ifdefs in backlog_lock.*().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e02d2363347e2..cf7b452ce0d74 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -230,7 +230,7 @@ static inline void backlog_lock_irq_save(struct softnet_data *sd,
 {
 	if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
 		spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_save(*flags);
 }
 
@@ -238,7 +238,7 @@ static inline void backlog_lock_irq_disable(struct softnet_data *sd)
 {
 	if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
 		spin_lock_irq(&sd->input_pkt_queue.lock);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_disable();
 }
 
@@ -247,7 +247,7 @@ static inline void backlog_unlock_irq_restore(struct softnet_data *sd,
 {
 	if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
 		spin_unlock_irqrestore(&sd->input_pkt_queue.lock, *flags);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_restore(*flags);
 }
 
@@ -255,7 +255,7 @@ static inline void backlog_unlock_irq_enable(struct softnet_data *sd)
 {
 	if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
 		spin_unlock_irq(&sd->input_pkt_queue.lock);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_enable();
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (8 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*() Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

softnet_data::input_queue_head can be updated locklessly. This is fine
because this value is only updated CPU-locally by the local backlog_napi
thread.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 12 +++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c551ec235f9af..9d19e6ace7cb7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3199,6 +3199,7 @@ static inline bool dev_has_header(const struct net_device *dev)
 struct softnet_data {
 	struct list_head	poll_list;
 	struct sk_buff_head	process_queue;
+	local_lock_t		process_queue_bh_lock;
 
 	/* stats */
 	unsigned int		processed;
diff --git a/net/core/dev.c b/net/core/dev.c
index cf7b452ce0d74..1503883ce15a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -450,7 +450,9 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data) = {
+	.process_queue_bh_lock = INIT_LOCAL_LOCK(process_queue_bh_lock),
+};
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 /* Page_pool has a lockless array/stack to alloc/recycle pages.
@@ -5935,6 +5937,7 @@ static void flush_backlog(struct work_struct *work)
 	}
 	backlog_unlock_irq_enable(sd);
 
+	local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5942,6 +5945,7 @@ static void flush_backlog(struct work_struct *work)
 			rps_input_queue_head_incr(sd);
 		}
 	}
+	local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 	local_bh_enable();
 }
 
@@ -6063,7 +6067,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	while (again) {
 		struct sk_buff *skb;
 
+		local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 		while ((skb = __skb_dequeue(&sd->process_queue))) {
+			local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 			rcu_read_lock();
 			__netif_receive_skb(skb);
 			rcu_read_unlock();
@@ -6072,7 +6078,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 				return work;
 			}
 
+			local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 		}
+		local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 
 		backlog_lock_irq_disable(sd);
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
@@ -6087,8 +6095,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			napi->state &= NAPIF_STATE_THREADED;
 			again = false;
 		} else {
+			local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
 						   &sd->process_queue);
+			local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 		}
 		backlog_unlock_irq_enable(sd);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 11/15] lwt: Don't disable migration prio invoking BPF.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (9 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, bpf

There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/lwt_bpf.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 {
 	int ret;
 
-	/* Migration disable and BH disable are needed to protect per-cpu
-	 * redirect_info between BPF prog and skb_do_redirect().
+	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+	 * BPF prog and skb_do_redirect().
 	 */
-	migrate_disable();
 	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	}
 
 	local_bh_enable();
-	migrate_enable();
 
 	return ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (10 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()) that
accesses seg6_bpf_srh_states should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh,
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If this is not always invoked from softirq then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. Add lockdep_assert_held() to ensure the lock is held while
the per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/seg6_local.h |  1 +
 net/core/filter.c        |  3 +++
 net/ipv6/seg6_local.c    | 22 ++++++++++++++--------
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..888c1ce6f5272 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -19,6 +19,7 @@ extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
 
 struct seg6_bpf_srh_state {
+	local_lock_t bh_lock;
 	struct ipv6_sr_hdr *srh;
 	u16 hdrlen;
 	bool valid;
diff --git a/net/core/filter.c b/net/core/filter.c
index 2510464692af0..cfe8ea59fd9db 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6450,6 +6450,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
 	void *srh_tlvs, *srh_end, *ptr;
 	int srhoff = 0;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (srh == NULL)
 		return -EINVAL;
 
@@ -6506,6 +6507,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
 	int hdroff = 0;
 	int err;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	switch (action) {
 	case SEG6_LOCAL_ACTION_END_X:
 		if (!seg6_bpf_has_valid_srh(skb))
@@ -6582,6 +6584,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
 	int srhoff = 0;
 	int ret;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return -EINVAL;
 
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..c4828c6620f07 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 {
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 		this_cpu_ptr(&seg6_bpf_srh_states);
 	struct ipv6_sr_hdr *srh = srh_state->srh;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return false;
 
@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 static int input_action_end_bpf(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
 {
-	struct seg6_bpf_srh_state *srh_state =
-		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state *srh_state;
 	struct ipv6_sr_hdr *srh;
 	int ret;
 
@@ -1420,10 +1422,14 @@ static int input_action_end_bpf(struct sk_buff *skb,
 	}
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
-	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	/* The access to the per-CPU buffer srh_state is protected by running
+	 * always in softirq context (with disabled BH). On PREEMPT_RT the
+	 * required locking is provided by the following local_lock_nested_bh()
+	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+	 * bpf_prog_run_save_cb().
 	 */
-	preempt_disable();
+	local_lock_nested_bh(&seg6_bpf_srh_states.bh_lock);
+	srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
 	srh_state->srh = srh;
 	srh_state->hdrlen = srh->hdrlen << 3;
 	srh_state->valid = true;
@@ -1446,15 +1452,15 @@ static int input_action_end_bpf(struct sk_buff *skb,
 
 	if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
 		goto drop;
+	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);
 
-	preempt_enable();
 	if (ret != BPF_REDIRECT)
 		seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
 drop:
-	preempt_enable();
+	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 13/15] net: Use nested-BH locking for bpf_scratchpad.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (11 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-03 18:25 ` [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/filter.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index cfe8ea59fd9db..e95b235a1e4f4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1658,9 +1658,12 @@ struct bpf_scratchpad {
 		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
 		u8     buff[MAX_BPF_STACK];
 	};
+	local_lock_t	bh_lock;
 };
 
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
@@ -2029,6 +2032,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 		     diff_size > sizeof(sp->diff)))
 		return -EINVAL;
 
+	guard(local_lock_nested_bh)(&bpf_sp.bh_lock);
 	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
 		sp->diff[j] = ~from[i];
 	for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (12 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-06 19:41   ` Toke Høiland-Jørgensen
  2024-05-03 18:25 ` [PATCH net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
  2024-05-06  8:43 ` [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Paolo Abeni
  15 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Toke Høiland-Jørgensen,
	Yonghong Song, bpf

The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users in preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on the stack and use bpf_net_ctx_set() to save a
pointer to it. Use the __free() annotation to automatically reset the
pointer once the function returns.
bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and
from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
variable (which is always NODE-local memory). Always using the
bpf_net_context approach has the advantage that there are almost zero
differences between PREEMPT_RT and non-PREEMPT_RT builds.
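
A rough sketch of the resulting caller pattern (simplified, illustrative
function name; the actual conversions are in the diff below):

	static void my_napi_poll(void)
	{
		struct bpf_net_context __bpf_net_ctx;
		struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) =
					bpf_net_ctx_set(&__bpf_net_ctx);

		/* Only the outermost caller installs the context; a nested
		 * bpf_net_ctx_set() returns NULL and leaves it in place.
		 * bpf_prog_run_xdp(), xdp_do_redirect() and xdp_do_flush()
		 * reach the state via bpf_net_ctx_get_ri().
		 */

		/* ... NAPI poll / XDP work ... */
	}

bpf_net_ctx_clear() runs automatically via the __free() annotation when
bpf_net_ctx goes out of scope; the NULL check in DEFINE_FREE() keeps
nested callers from clearing a context they did not install.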

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 96 ++++++++++++++++++++++++++++++++++++++---
 include/linux/sched.h  |  5 +++
 kernel/bpf/cpumap.c    |  3 ++
 kernel/fork.c          |  3 ++
 net/bpf/test_run.c     | 11 ++++-
 net/core/dev.c         | 19 +++++++-
 net/core/filter.c      | 98 ++++++++++++++++++++++++++----------------
 net/core/lwt_bpf.c     |  3 ++
 8 files changed, 191 insertions(+), 47 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d5fea03cb6e61..bdd69bd81df45 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,83 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+#ifndef CONFIG_PREEMPT_RT
+DECLARE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != NULL)
+		return NULL;
+	this_cpu_write(bpf_net_context, bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != bpf_net_ctx)
+		return;
+	this_cpu_write(bpf_net_context, NULL);
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#else
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != bpf_net_ctx)
+		return;
+	tsk->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = current->bpf_net_context;
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#endif
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -1021,23 +1097,27 @@ void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
+	if (!ri)
+		return false;
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
@@ -1591,9 +1671,11 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
+	if (!ri)
+		return XDP_ABORTED;
 	/* Lower bits of the flags are used as return code on lookup failure */
 	if (unlikely(flags & ~(action_mask | flag_mask)))
 		return XDP_ABORTED;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6779d3b8f2578..26324fb0e532d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1504,6 +1505,10 @@ struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
+#endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 				int xdp_n, struct xdp_cpumap_stats *stats,
 				struct list_head *list)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
 
 	if (!rcpu->prog)
 		return xdp_n;
 
 	rcu_read_lock_bh();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
 
@@ -255,6 +257,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (unlikely(!list_empty(list)))
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
diff --git a/kernel/fork.c b/kernel/fork.c
index aebb3e6c96dc6..82c16c22d960c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,9 @@ __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	p->bpf_net_context =  NULL;
+#endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
+	struct bpf_redirect_info *ri;
 	struct xdp_page_head *head;
 	struct xdp_frame *frm;
 	bool redirect = false;
@@ -295,6 +296,8 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 	xdp_clear_return_frame_no_direct();
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
 }
@@ -394,6 +398,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		if (xdp)
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
+
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 1503883ce15a4..26e524544942d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4031,11 +4031,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4086,13 +4090,17 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6357,13 +6365,15 @@ static void __napi_busy_loop(unsigned int napi_id,
 		      bool (*loop_end)(void *, unsigned long),
 		      void *loop_end_arg, unsigned flags, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
-
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
 
@@ -6834,6 +6844,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 
 static void napi_threaded_poll_loop(struct napi_struct *napi)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
 	unsigned long last_qs = jiffies;
 
@@ -6842,6 +6853,8 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		void *have;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		sd = this_cpu_ptr(&softnet_data);
 		sd->in_napi_threaded_poll = true;
 
@@ -6857,6 +6870,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 			net_rps_action_and_irq_enable(sd);
 		}
 		skb_defer_free_flush(sd);
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!repoll)
@@ -6879,13 +6893,16 @@ static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear);
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
 	int budget = READ_ONCE(net_hotdata.netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index e95b235a1e4f4..90afa393d0648 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2475,8 +2475,10 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
+#ifndef CONFIG_PREEMPT_RT
+DEFINE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_net_context);
+#endif
 
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
@@ -2490,11 +2492,15 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net *net = dev_net(skb->dev);
+	struct bpf_redirect_info *ri;
 	struct net_device *dev;
-	u32 flags = ri->flags;
+	u32 flags;
 
+	ri = bpf_net_ctx_get_ri();
+	if (!ri)
+		goto out_drop;
+	flags = ri->flags;
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2523,9 +2529,9 @@ int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = flags;
@@ -2544,9 +2550,9 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_PEER;
@@ -2566,9 +2572,9 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) || flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
@@ -4294,19 +4300,17 @@ void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
+	 * program/ during NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protocted and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exist.  dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4315,11 +4319,14 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
+		ri = bpf_net_ctx_get_ri();
+		if (!ri)
+			return XDP_ABORTED;
 		/* The target device is different from the receiving device, so
 		 * redirect it to the new device.
 		 * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
@@ -4432,10 +4439,12 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
@@ -4446,10 +4455,12 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
@@ -4463,10 +4474,13 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id,
 				       u32 flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
+	if (!ri)
+		return -EINVAL;
+
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
@@ -4517,13 +4531,21 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
-	void *fwd = ri->tgt_value;
-	u32 map_id = ri->map_id;
-	u32 flags = ri->flags;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
+	enum bpf_map_type map_type;
+	u32 map_id;
+	void *fwd;
+	u32 flags;
 	int err;
 
+	if (!ri)
+		return -EINVAL;
+
+	map_type = ri->map_type;
+	fwd = ri->tgt_value;
+	map_id = ri->map_id;
+	flags = ri->flags;
+
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
 	ri->flags = 0;
 	ri->map_type = BPF_MAP_TYPE_UNSPEC;
@@ -4553,9 +4575,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return XDP_ABORTED;
 
 	/* NB! Map type UNSPEC and map_id == INT_MAX (never generated
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +78,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 
 	return ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (13 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-05-03 18:25 ` Sebastian Andrzej Siewior
  2024-05-06  8:43 ` [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Paolo Abeni
  15 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-03 18:25 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Björn Töpel,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Toke Høiland-Jørgensen, Yonghong Song, bpf

The per-CPU flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are subject to the same problem as
struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for
accessing them.
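
As a rough sketch of the resulting scope (the function below is purely
illustrative; only the bpf_net_ctx_*() helpers and xdp_do_flush() are real):

	static void example_rx_scope(void)
	{
		struct bpf_net_context __ctx, *ctx;

		local_bh_disable();
		ctx = bpf_net_ctx_set(&__ctx);	/* initialises the three flush lists */

		/* XDP programs enqueue into the lists returned by
		 * bpf_net_ctx_get_{cpu_map,dev,xskmap}_flush_list().
		 */

		xdp_do_flush();			/* walks the lists of this context */
		bpf_net_ctx_clear(ctx);
		local_bh_enable();
	}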

Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c    | 24 ++++++++----------------
 kernel/bpf/devmap.c    | 16 ++++++++--------
 net/xdp/xsk.c          | 19 +++++++++++--------
 4 files changed, 65 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index bdd69bd81df45..68401d84e2050 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -746,6 +746,9 @@ struct bpf_redirect_info {
 
 struct bpf_net_context {
 	struct bpf_redirect_info ri;
+	struct list_head cpu_map_flush_list;
+	struct list_head dev_map_flush_list;
+	struct list_head xskmap_map_flush_list;
 };
 
 #ifndef CONFIG_PREEMPT_RT
@@ -758,6 +761,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 	ctx = this_cpu_read(bpf_net_context);
 	if (ctx != NULL)
 		return NULL;
+	INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
 	this_cpu_write(bpf_net_context, bpf_net_ctx);
 	return bpf_net_ctx;
 }
@@ -788,6 +795,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 
 	if (tsk->bpf_net_context != NULL)
 		return NULL;
+	INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
 	tsk->bpf_net_context = bpf_net_ctx;
 	return bpf_net_ctx;
 }
@@ -820,6 +831,33 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
 	return &bpf_net_ctx->ri;
 }
 
+static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->cpu_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->dev_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->xskmap_map_flush_list;
+}
+
 DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 66974bd027109..0d18ffc93dcab 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -79,8 +79,6 @@ struct bpf_cpu_map {
 	struct bpf_cpu_map_entry __rcu **cpu_map;
 };
 
-static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
-
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
@@ -709,7 +707,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -761,9 +759,12 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 
 void __cpu_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_flush_to_queue(bq);
 
@@ -775,20 +776,11 @@ void __cpu_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool cpu_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
+
+	if (!flush_list || list_empty(bpf_net_ctx_get_cpu_map_flush_list()))
 		return false;
 	__cpu_map_flush();
 	return true;
 }
 #endif
-
-static int __init cpu_map_init(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
-	return 0;
-}
-
-subsys_initcall(cpu_map_init);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 4e2cdbb5629f2..03533e45399a0 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,6 @@ struct bpf_dtab {
 	u32 n_buckets;
 };
 
-static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
@@ -408,9 +407,12 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
  */
 void __dev_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_xmit_all(bq, XDP_XMIT_FLUSH);
 		bq->dev_rx = NULL;
@@ -422,7 +424,9 @@ void __dev_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool dev_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&dev_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
+
+	if (!flush_list || list_empty(bpf_net_ctx_get_dev_flush_list()))
 		return false;
 	__dev_flush();
 	return true;
@@ -453,7 +457,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1156,15 +1160,11 @@ static struct notifier_block dev_map_notifier = {
 
 static int __init dev_map_init(void)
 {
-	int cpu;
-
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
 		     offsetof(struct _bpf_dtab_netdev, dev));
 	register_netdevice_notifier(&dev_map_notifier);
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
 	return 0;
 }
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 727aa20be4bde..0ac5c80eef6bf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,6 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
 
-static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -375,9 +373,12 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	int err;
 
+	if (!flush_list)
+		return -EINVAL;
+
 	err = xsk_rcv(xs, xdp);
 	if (err)
 		return err;
@@ -390,9 +391,11 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 void __xsk_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	struct xdp_sock *xs, *tmp;
 
+	if (!flush_list)
+		return;
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
 		xsk_flush(xs);
 		__list_del_clearprev(&xs->flush_node);
@@ -402,7 +405,9 @@ void __xsk_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool xsk_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
+
+	if (!flush_list || list_empty(flush_list))
 		return false;
 	__xsk_map_flush();
 	return true;
@@ -1775,7 +1780,7 @@ static struct pernet_operations xsk_net_ops = {
 
 static int __init xsk_init(void)
 {
-	int err, cpu;
+	int err;
 
 	err = proto_register(&xsk_proto, 0 /* no slab */);
 	if (err)
@@ -1793,8 +1798,6 @@ static int __init xsk_init(void)
 	if (err)
 		goto out_pernet;
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
 	return 0;
 
 out_pernet:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking.
  2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (14 preceding siblings ...)
  2024-05-03 18:25 ` [PATCH net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
@ 2024-05-06  8:43 ` Paolo Abeni
  2024-05-06  9:38   ` Sebastian Andrzej Siewior
  15 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2024-05-06  8:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
> local_bh_disable() section remains preemtible. As a result high prior
> tasks (or threaded interrupts) will be blocked by lower-prio task (or
> threaded interrupts) which are long running which includes softirq
> sections.
> 
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

Let me rephrase to check I understood the plan correctly.

The idea is to pair 'bare' local_bh_{disable,enable} with local lock
and later make local_bh_{disable,enable} no-ops (on RT).

With 'bare' I mean not followed by a spin_lock() - which is enough to
ensure mutual exclusion vs BH on RT build - am I correct?
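
Roughly, the two cases I have in mind (the functions and data below are
purely illustrative):

	static DEFINE_SPINLOCK(example_lock);
	static DEFINE_PER_CPU(int, example_counter);

	/* 'paired': the spinlock already provides mutual exclusion vs BH on RT */
	static void paired_case(void)
	{
		local_bh_disable();
		spin_lock(&example_lock);
		/* ... protected work ... */
		spin_unlock(&example_lock);
		local_bh_enable();
	}

	/* 'bare': only BH exclusion protects the per-CPU data */
	static void bare_case(void)
	{
		local_bh_disable();
		this_cpu_inc(example_counter);
		local_bh_enable();
	}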

> The series introduces the infrastructure and converts large parts of
> networking which is largest stake holder here. Once this done the
> per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
the above description and not addressed here. Is this series supposed
to cover 'net' fully?

Could you please include the diffstat for the whole series? I
think/hope it will help catching the full picture more easily.

Note that some callers use local_bh_disable(), no additional lock, and
there is no specific struct to protect, but enforce explicit
serialization vs bh to a bunch of operations, e.g. the
local_bh_disable() in inet_twsk_purge().

I guess such call site should be handled, too?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking.
  2024-05-06  8:43 ` [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Paolo Abeni
@ 2024-05-06  9:38   ` Sebastian Andrzej Siewior
  2024-05-06 14:12     ` Paolo Abeni
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-06  9:38 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
> > local_bh_disable() section remains preemtible. As a result high prior
> > tasks (or threaded interrupts) will be blocked by lower-prio task (or
> > threaded interrupts) which are long running which includes softirq
> > sections.
> > 
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> 
> Let me rephrase to check I understood the plan correctly.
> 
> The idea is to pair 'bare' local_bh_{disable,enable} with local lock
> and later make local_bh_{disable,enable} no-ops (on RT).
> 
> With 'bare' I mean not followed by a spin_lock() - which is enough to
> ensure mutual exclusion vs BH on RT build - am I correct?

I might have misunderstood your rephrase. But to make it clear:
| $ git grep -p local_lock\( kernel/softirq.c
| kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
| kernel/softirq.c:                       local_lock(&softirq_ctrl.lock);

this is what I want to remove. This is upstream RT only (not RT queue
only). !RT builds are not affected by this change.

> > The series introduces the infrastructure and converts large parts of
> > networking which is largest stake holder here. Once this done the
> > per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
> 
> AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
> the above description and not addressed here. Is this series supposed
> to cover 'net' fully?

The net subsystem has not been fully audited but the major parts have
been. I checked global per-CPU variables but there might be dynamic
ones. Also new ones might have appeared in the meantime. There are
two things which are not fixed yet that I am aware of:
- tw_timer timer
  https://lore.kernel.org/all/20240415113436.3261042-1-vschneid@redhat.com/T/#u

- can gw
  https://lore.kernel.org/linux-can/20231031112349.y0aLoBrz@linutronix.de/
  https://lore.kernel.org/all/20231221123703.8170-1-socketcan@hartkopp.net/T/#u

That means those two need to be fixed first before the local_lock() can
disappear from local_bh_disable()/enable(). Also the whole tree
should be checked.

> Could you please include the diffstat for the whole series? I
> think/hope it will help catching the full picture more easily.

total over the series:

| include/linux/filter.h              | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
| include/linux/local_lock.h          |  21 +++++++++++++++++++++
| include/linux/local_lock_internal.h |  31 +++++++++++++++++++++++++++++++
| include/linux/lockdep.h             |   3 +++
| include/linux/netdevice.h           |  12 ++++++++++++
| include/linux/sched.h               |   9 ++++++++-
| include/net/seg6_local.h            |   1 +
| include/net/sock.h                  |   5 +++++
| kernel/bpf/cpumap.c                 |  27 +++++++++++----------------
| kernel/bpf/devmap.c                 |  16 ++++++++--------
| kernel/fork.c                       |   3 +++
| kernel/locking/spinlock.c           |   8 ++++++++
| net/bpf/test_run.c                  |  11 ++++++++++-
| net/bridge/br_netfilter_hooks.c     |   7 ++++++-
| net/core/dev.c                      |  39 +++++++++++++++++++++++++++++++++------
| net/core/dev.h                      |  20 ++++++++++++++++++++
| net/core/filter.c                   | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
| net/core/lwt_bpf.c                  |   9 +++++----
| net/core/skbuff.c                   |  25 ++++++++++++++++---------
| net/ipv4/tcp_ipv4.c                 |  15 +++++++++++----
| net/ipv4/tcp_sigpool.c              |  17 +++++++++++++----
| net/ipv6/seg6_local.c               |  22 ++++++++++++++--------
| net/xdp/xsk.c                       |  19 +++++++++++--------
| 23 files changed, 445 insertions(+), 116 deletions(-)

> Note that some callers use local_bh_disable(), no additional lock, and
> there is no specific struct to protect, but enforce explicit
> serialization vs bh to a bunch of operations, e.g. the
> local_bh_disable() in inet_twsk_purge().
> 
> I guess such call site should be handled, too?

Yes but I didn't find much. inet_twsk_purge() is the first item from my
list. On RT spin_lock() vs spin_lock_bh() usage does not deadlock and
could be mixed.

The only resources that can be protected by disabling BH are per-CPU
resources. Either explicitly defined (such as napi_alloc_cache) or
implicit by other means of per-CPU usage such as a CPU-bound timer,
worker, …. Protecting global variables by disabling BH is broken on SMP
(see the CAN gw example) so I am not too worried about those.
Unless you are aware of a category I did not think of.
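
For the explicitly defined case the conversion pattern of the series looks
roughly like this (the cache struct below is made up; only the nested-BH
lock helpers are real):

	struct example_cache {
		void		*slot[8];
		local_lock_t	bh_lock;	/* only a real lock on PREEMPT_RT */
	};
	static DEFINE_PER_CPU(struct example_cache, example_cache) = {
		.bh_lock = INIT_LOCAL_LOCK(bh_lock),
	};

	static void example_use(void)
	{
		local_bh_disable();
		/* explicit per-CPU scope instead of relying on BH exclusion */
		local_lock_nested_bh(&example_cache.bh_lock);
		/* ... use this_cpu_ptr(&example_cache) ... */
		local_unlock_nested_bh(&example_cache.bh_lock);
		local_bh_enable();
	}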

> Thanks!
> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking.
  2024-05-06  9:38   ` Sebastian Andrzej Siewior
@ 2024-05-06 14:12     ` Paolo Abeni
  2024-05-06 14:43       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2024-05-06 14:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On Mon, 2024-05-06 at 11:38 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> > On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > > Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
> > > local_bh_disable() section remains preemtible. As a result high prior
> > > tasks (or threaded interrupts) will be blocked by lower-prio task (or
> > > threaded interrupts) which are long running which includes softirq
> > > sections.
> > > 
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> > 
> > Let me rephrase to check I understood the plan correctly.
> > 
> > The idea is to pair 'bare' local_bh_{disable,enable} with local lock
> > and later make local_bh_{disable,enable} no-ops (on RT).
> > 
> > With 'bare' I mean not followed by a spin_lock() - which is enough to
> > ensure mutual exclusion vs BH on RT build - am I correct?
> 
> I might have misunderstood your rephrase. But to make it clear:
> > $ git grep -p local_lock\( kernel/softirq.c
> > kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> > kernel/softirq.c:                       local_lock(&softirq_ctrl.lock);
> 
> this is what I want to remove. This is upstream RT only (not RT queue
> only). !RT builds are not affected by this change.

I was trying to describe the places that need the additional
local_lock(), but I think we are on the same page WRT the overall
semantic.

> > 
> > Note that some callers use local_bh_disable(), no additional lock, and
> > there is no specific struct to protect, but enforce explicit
> > serialization vs bh to a bunch of operations, e.g. the
> > local_bh_disable() in inet_twsk_purge().
> > 
> > I guess such call site should be handled, too?
> 
> Yes but I didn't find much. inet_twsk_purge() is the first item from my
> list. On RT spin_lock() vs spin_lock_bh() usage does not deadlock and
> could be mixed.
> 
> The only resources that can be protected by disabling BH are per-CPU
> resources. Either explicitly defined (such as napi_alloc_cache) or
> implicit by other means of per-CPU usage such as a CPU-bound timer,
> worker, …. Protecting global variables by disabling BH is broken on SMP
> (see the CAN gw example) so I am not too worried about those.
> Unless you are aware of a category I did not think of.

I think sometimes the stack could call local_bh_enable() after a while
WRT the paired spin lock release, to enforce some serialization - alike
what inet_twsk_purge() is doing - but I can't point to any specific
line on top of my head.

A possible side-effect you should/could observe in the final tree is
more pressure on the process scheduler, as something alike:

local_bh_disable()

<spinlock lock unlock>

<again spinlock lock unlock>

local_bh_enable()

could result in more invocations of the scheduler, right?

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking.
  2024-05-06 14:12     ` Paolo Abeni
@ 2024-05-06 14:43       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-06 14:43 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2024-05-06 16:12:00 [+0200], Paolo Abeni wrote:
> 
> I think sometimes the stack could call local_bh_enable() after a while
> WRT the paired spin lock release, to enforce some serialization - alike
> what inet_twsk_purge() is doing - but I can't point to any specific
> line on top of my head.

I *think* the inet_twsk_purge() is special because the timer is pinned
and that bh_disable call ensures that the timer does not fire.

> A possible side-effect you should/could observe in the final tree is
> more pressure on the process scheduler, as something alike:
> 
> local_bh_disable()
> 
> <spinlock lock unlock>
> 
> <again spinlock lock unlock>
> 
> local_bh_enable()
> 
> could result in more invocations of the scheduler, right?

Yes, to some degree.
On PREEMPT_RT "spinlock lock" does not disable preemption so the section
remains preemptible. A task with elevated priority (SCHED_RR/FIFO/DL)
remains on the CPU unless preempted by a task with higher priority.
Regardless of the locks.

A SCHED_OTHER task can be preempted by another SCHED_OTHER task even
with an acquired spinlock_t. This can be bad performance wise if this
other SCHED_OTHER task preempts the lock owner and blocks on the same
lock. To cope with this we had something called PREEMPT_LAZY (now
PREEMPT_AUTO) in the RT-queue to avoid preemption within SCHED_OTHER
tasks as long as a spinlock_t (or other lock that spins on !RT) is
acquired.
By removing the lock from local_bh_disable() we lose that "please don't
preempt me" feature from your scenario above across the BH disabled
section for SCHED_OTHER tasks. Nothing changes for tasks with elevated
priority.

> Cheers,
> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-03 18:25 ` [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-05-06 19:41   ` Toke Høiland-Jørgensen
  2024-05-06 23:09     ` Alexei Starovoitov
  2024-05-07 10:57     ` [PATCH net-next 14/15] " Sebastian Andrzej Siewior
  0 siblings, 2 replies; 38+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-06 19:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protection reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> the function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocation
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
> task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
> variable (which is always NODE-local memory). Always using the
> bpf_net_context approach has the advantage that there are almost no
> differences between PREEMPT_RT and non-PREEMPT_RT builds.

Did you ever manage to get any performance data to see if this has an
impact?

[...]

> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> +{
> +	struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
> +
> +	WARN_ON_ONCE(!bpf_net_ctx);

If we have this WARN...

> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> +{
> +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> +
> +	if (!bpf_net_ctx)
> +		return NULL;

... do we really need all the NULL checks?

(not just here, but in the code below as well).

I'm a little concerned that we are introducing a bunch of new branches
in the XDP hot path. Which is also why I'm asking for benchmarks :)

[...]

> +	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
> +	 * program/ during NAPI callback. It is used during
> +	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
> +	 * redirect callback afterwards. ri->map is cleared after usage.
> +	 * The path has no explicit RCU read section but the local_bh_disable()
> +	 * is also a RCU read section which makes the complete softirq callback
> +	 * RCU protected. This in turn makes ri->map RCU protocted and it is

s/protocted/protected/

> +	 * sufficient to wait a grace period to ensure that no "ri->map == map"
> +	 * exist.  dev_map_free() removes the map from the list and then

s/exist/exists/


-Toke


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-06 19:41   ` Toke Høiland-Jørgensen
@ 2024-05-06 23:09     ` Alexei Starovoitov
  2024-05-07 12:36       ` Sebastian Andrzej Siewior
  2024-05-07 10:57     ` [PATCH net-next 14/15] " Sebastian Andrzej Siewior
  1 sibling, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2024-05-06 23:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sebastian Andrzej Siewior, LKML, Network Development,
	David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On Mon, May 6, 2024 at 12:41 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>
> > The XDP redirect process is two staged:
> > - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
> >   packet and makes decisions. While doing that, the per-CPU variable
> >   bpf_redirect_info is used.
> >
> > - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
> >   and it may also access other per-CPU variables like xskmap_flush_list.
> >
> > At the very end of the NAPI callback, xdp_do_flush() is invoked which
> > does not access bpf_redirect_info but will touch the individual per-CPU
> > lists.
> >
> > The per-CPU variables are only used in the NAPI callback hence disabling
> > bottom halves is the only protection mechanism. Users from preemptible
> > context (like cpu_map_kthread_run()) explicitly disable bottom halves
> > for protection reasons.
> > Without locking in local_bh_disable() on PREEMPT_RT this data structure
> > requires explicit locking.
> >
> > PREEMPT_RT has forced-threaded interrupts enabled and every
> > NAPI-callback runs in a thread. If each thread has its own data
> > structure then locking can be avoided.
> >
> > Create a struct bpf_net_context which contains struct bpf_redirect_info.
> > Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> > it. Use the __free() annotation to automatically reset the pointer once
> > the function returns.
> > The bpf_net_ctx_set() may nest. For instance a function can be used from
> > within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> > NET_TX_SOFTIRQ which does not. Therefore only the first invocation
> > updates the pointer.
> > Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> > bpf_redirect_info.
> >
> > On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
> > task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
> > variable (which is always NODE-local memory). Always using the
> > bpf_net_context approach has the advantage that there are almost no
> > differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Did you ever manage to get any performance data to see if this has an
> impact?
>
> [...]
>
> > +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> > +{
> > +     struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
> > +
> > +     WARN_ON_ONCE(!bpf_net_ctx);
>
> If we have this WARN...
>
> > +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> > +{
> > +     struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> > +
> > +     if (!bpf_net_ctx)
> > +             return NULL;
>
> ... do we really need all the NULL checks?

Indeed.
Let's drop all NULL checks, since they definitely add overhead.
I'd also remove ifdef CONFIG_PREEMPT_RT and converge on a single implementation:
static inline struct bpf_net_context * bpf_net_ctx_get(void)
{
 return current->bpf_net_context;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-06 19:41   ` Toke Høiland-Jørgensen
  2024-05-06 23:09     ` Alexei Starovoitov
@ 2024-05-07 10:57     ` Sebastian Andrzej Siewior
  2024-05-07 13:50       ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-07 10:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

On 2024-05-06 21:41:38 [+0200], Toke Høiland-Jørgensen wrote:
> > On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
> > task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
> > variable (which is always NODE-local memory). Always using the
> > bpf_net_context approach has the advantage that there are almost no
> > differences between PREEMPT_RT and non-PREEMPT_RT builds.
> 
> Did you ever manage to get any performance data to see if this has an
> impact?

Not really. I would expect far-away memory to be more expensive.

I have just a 10G setup and after disabling IOMMU I got the "expected"
packet rate. Since the CPU usage was not 100% I always got that packet
rate. Lowering the CPU clock speed resulted in three (I think) rate
ranges depending on the invocation and I didn't figure out why. Since it
is always a range, I didn't see here if my changes had any impact since
the numbers were roughly the same.

With the IOMMU enabled, its overhead was major so again I didn't see any
impact from my changes.

> [...]
> 
> > +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> > +{
> > +	struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
> > +
> > +	WARN_ON_ONCE(!bpf_net_ctx);
> 
> If we have this WARN...
> 
> > +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> > +{
> > +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> > +
> > +	if (!bpf_net_ctx)
> > +		return NULL;
> 
> ... do we really need all the NULL checks?
> 
> (not just here, but in the code below as well).
> 
> I'm a little concerned that we are introducing a bunch of new branches
> in the XDP hot path. Which is also why I'm asking for benchmarks :)

We could hide the WARN behind CONFIG_DEBUG_NET. The only purpose is to
see the backtrace where the context is missing. Having just an error
somewhere will make it difficult to track.

The NULL check is to avoid a crash if the context is missing. You could
argue that this should be noticed in development and never hit
production. If so, then we get the backtrace from NULL-pointer
dereference and don't need the checks and WARN.
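
A sketch of the CONFIG_DEBUG_NET variant, just to illustrate what I mean
(not what is queued):

	static inline struct bpf_net_context *bpf_net_ctx_get(void)
	{
		struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);

		/* backtrace only in debug builds; otherwise a missing context
		 * shows up as a NULL-pointer dereference in the caller
		 */
		if (IS_ENABLED(CONFIG_DEBUG_NET))
			WARN_ON_ONCE(!bpf_net_ctx);
		return bpf_net_ctx;
	}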

> [...]
> 
> > +	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
> > +	 * program/ during NAPI callback. It is used during
> > +	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
> > +	 * redirect callback afterwards. ri->map is cleared after usage.
> > +	 * The path has no explicit RCU read section but the local_bh_disable()
> > +	 * is also a RCU read section which makes the complete softirq callback
> > +	 * RCU protected. This in turn makes ri->map RCU protocted and it is
> 
> s/protocted/protected/
> 
> > +	 * sufficient to wait a grace period to ensure that no "ri->map == map"
> > +	 * exist.  dev_map_free() removes the map from the list and then
> 
> s/exist/exists/

Thank you.

> 
> -Toke

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-06 23:09     ` Alexei Starovoitov
@ 2024-05-07 12:36       ` Sebastian Andrzej Siewior
  2024-05-07 13:27         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-07 12:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, LKML, Network Development,
	David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On 2024-05-06 16:09:47 [-0700], Alexei Starovoitov wrote:
> > > On PREEMPT_RT the pointer to bpf_net_context is saved task's
> > > task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> > > variable (which is always NODE-local memory). Using always the
> > > bpf_net_context approach has the advantage that there is almost zero
> > > differences between PREEMPT_RT and non-PREEMPT_RT builds.
> >
> > Did you ever manage to get any performance data to see if this has an
> > impact?
> >
> > [...]
> >
> > > +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> > > +{
> > > +     struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
> > > +
> > > +     WARN_ON_ONCE(!bpf_net_ctx);
> >
> > If we have this WARN...
> >
> > > +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> > > +{
> > > +     struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> > > +
> > > +     if (!bpf_net_ctx)
> > > +             return NULL;
> >
> > ... do we really need all the NULL checks?
> 
> Indeed.
> Let's drop all NULL checks, since they definitely add overhead.
> I'd also remove ifdef CONFIG_PREEMPT_RT and converge on single implementation:
> static inline struct bpf_net_context * bpf_net_ctx_get(void)
> {
>  return current->bpf_net_context;
> }

Okay, let me do that then.

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-07 12:36       ` Sebastian Andrzej Siewior
@ 2024-05-07 13:27         ` Jesper Dangaard Brouer
  2024-05-10 16:21           ` [PATCH net-next 14/15 v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2024-05-07 13:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, LKML, Network Development,
	David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf



On 07/05/2024 14.36, Sebastian Andrzej Siewior wrote:
> On 2024-05-06 16:09:47 [-0700], Alexei Starovoitov wrote:
>>>> On PREEMPT_RT the pointer to bpf_net_context is saved task's
>>>> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
>>>> variable (which is always NODE-local memory). Using always the
>>>> bpf_net_context approach has the advantage that there is almost zero
>>>> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>>>
>>> Did you ever manage to get any performance data to see if this has an
>>> impact?
>>>
>>> [...]
>>>
>>>> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
>>>> +{
>>>> +     struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
>>>> +
>>>> +     WARN_ON_ONCE(!bpf_net_ctx);
>>>
>>> If we have this WARN...
>>>

Since changes are being requested anyhow...

XDP redirect is an extreme fast-path.

Adding a WARN macro adds an 'ud2' instruction, which causes the CPU
instruction cache to stop pre-fetching.

For this reason we have #define XDP_WARN in include/net/xdp.h and the
function xdp_warn(), which lives in net/core/xdp.c.
See how it is used in xdp_update_frame_from_buff().

Described in commit 34cc0b338a61 ("xdp: Xdp_frame add member frame_sz
and handle in convert_to_xdp_frame"):
https://git.kernel.org/torvalds/c/34cc0b338a61
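
Roughly, the pattern looks like this (paraphrased from include/net/xdp.h
and net/core/xdp.c, details may differ slightly):

/* include/net/xdp.h: avoid inlining the WARN in the fast-path */
void xdp_warn(const char *msg, const char *func, const int line);
#define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)

/* net/core/xdp.c: out of line, so the hot path only sees a plain call */
void xdp_warn(const char *msg, const char *func, const int line)
{
	pr_warn("%s(line:%d): %s\n", func, line, msg);
}
EXPORT_SYMBOL_GPL(xdp_warn);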



>>>> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
>>>> +{
>>>> +     struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>>>> +
>>>> +     if (!bpf_net_ctx)
>>>> +             return NULL;
>>>
>>> ... do we really need all the NULL checks?
>>
>> Indeed.
>> Let's drop all NULL checks, since they definitely add overhead.
>> I'd also remove ifdef CONFIG_PREEMPT_RT and converge on single implementation:
>> static inline struct bpf_net_context * bpf_net_ctx_get(void)
>> {
>>   return current->bpf_net_context;
>> }
> 
> Okay, let me do that then.

I need/want to echo Toke's request to benchmark these changes.

--Jesper


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-07 10:57     ` [PATCH net-next 14/15] " Sebastian Andrzej Siewior
@ 2024-05-07 13:50       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 38+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-07 13:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

>> > +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
>> > +{
>> > +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>> > +
>> > +	if (!bpf_net_ctx)
>> > +		return NULL;
>> 
>> ... do we really need all the NULL checks?
>> 
>> (not just here, but in the code below as well).
>> 
>> I'm a little concerned that we are introducing a bunch of new branches
>> in the XDP hot path. Which is also why I'm asking for benchmarks :)
>
> We could hide the WARN behind CONFIG_DEBUG_NET. The only purpose is to
> see the backtrace where the context is missing. Having just an error
> somewhere will make it difficult to track.
>
> The NULL check is to avoid a crash if the context is missing. You could
> argue that this should be noticed in development and never hit
> production. If so, then we get the backtrace from NULL-pointer
> dereference and don't need the checks and WARN.

Yup, this (relying on the NULL deref) SGTM :)

-Toke


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-07 13:27         ` Jesper Dangaard Brouer
@ 2024-05-10 16:21           ` Sebastian Andrzej Siewior
  2024-05-10 16:22             ` Sebastian Andrzej Siewior
  2024-05-14 11:54             ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-10 16:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without the locking in local_bh_disable() on PREEMPT_RT, this data
structure requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on the stack and use bpf_net_ctx_set() to save a
pointer to it. Use the __free() annotation to automatically reset the
pointer once the function returns.
bpf_net_ctx_set() may nest. For instance, a function can be used from
within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and
from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
variable (which is always NODE-local memory). Always using the
bpf_net_context approach has the advantage that there are almost zero
differences between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 42 ++++++++++++++++++++++++++++++++-----
 include/linux/sched.h  |  3 +++
 kernel/bpf/cpumap.c    |  3 +++
 kernel/fork.c          |  1 +
 net/bpf/test_run.c     | 11 +++++++++-
 net/core/dev.c         | 19 ++++++++++++++++-
 net/core/filter.c      | 47 +++++++++++++++++++-----------------------
 net/core/lwt_bpf.c     |  3 +++
 8 files changed, 96 insertions(+), 33 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d5fea03cb6e61..6db5a68db6ee1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,39 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	if (bpf_net_ctx)
+		current->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	return current->bpf_net_context;
+}
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -1021,21 +1053,21 @@ void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
@@ -1591,7 +1623,7 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
 	/* Lower bits of the flags are used as return code on lookup failure */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6779d3b8f2578..cc9be45de6606 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1504,6 +1505,8 @@ struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 				int xdp_n, struct xdp_cpumap_stats *stats,
 				struct list_head *list)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
 
 	if (!rcpu->prog)
 		return xdp_n;
 
 	rcu_read_lock_bh();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
 
@@ -255,6 +257,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (unlikely(!list_empty(list)))
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
diff --git a/kernel/fork.c b/kernel/fork.c
index aebb3e6c96dc6..aa61c9bca62be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,7 @@ __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+	p->bpf_net_context =  NULL;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
+	struct bpf_redirect_info *ri;
 	struct xdp_page_head *head;
 	struct xdp_frame *frm;
 	bool redirect = false;
@@ -295,6 +296,8 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 	xdp_clear_return_frame_no_direct();
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
 }
@@ -394,6 +398,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		if (xdp)
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
+
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 1503883ce15a4..42c05ab53ee86 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4031,11 +4031,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4086,13 +4090,17 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6357,13 +6365,15 @@ static void __napi_busy_loop(unsigned int napi_id,
 		      bool (*loop_end)(void *, unsigned long),
 		      void *loop_end_arg, unsigned flags, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
-
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
 
@@ -6834,6 +6844,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 
 static void napi_threaded_poll_loop(struct napi_struct *napi)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
 	unsigned long last_qs = jiffies;
 
@@ -6842,6 +6853,8 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		void *have;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		sd = this_cpu_ptr(&softnet_data);
 		sd->in_napi_threaded_poll = true;
 
@@ -6857,6 +6870,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 			net_rps_action_and_irq_enable(sd);
 		}
 		skb_defer_free_flush(sd);
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!repoll)
@@ -6879,13 +6893,16 @@ static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
 	int budget = READ_ONCE(net_hotdata.netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index e95b235a1e4f4..25ec2a76afb42 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2475,9 +2475,6 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
-
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -2490,7 +2487,7 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net *net = dev_net(skb->dev);
 	struct net_device *dev;
 	u32 flags = ri->flags;
@@ -2523,7 +2520,7 @@ int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
@@ -2544,7 +2541,7 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
@@ -2566,7 +2563,7 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
@@ -4294,19 +4291,17 @@ void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
+	 * program/ during NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protected and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exists. dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4314,8 +4309,8 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
@@ -4432,7 +4427,7 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4446,7 +4441,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4463,7 +4458,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id,
 				       u32 flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
@@ -4517,7 +4512,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
@@ -4553,7 +4548,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +78,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 
 	return ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-10 16:21           ` [PATCH net-next 14/15 v2] " Sebastian Andrzej Siewior
@ 2024-05-10 16:22             ` Sebastian Andrzej Siewior
  2024-05-14  5:07               ` Jesper Dangaard Brouer
  2024-05-14 11:54             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-10 16:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
> The XDP redirect process is two staged:
…
On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
> 
> I need/want to echo Toke's request to benchmark these changes.

I have:
boxA: ixgbe
boxB: i40e

Both are bigger NUMA boxes. I have to patch ixgbe to ignore the 64-CPU
limit and I boot the boxes with only 64 CPUs. The IOMMU has been disabled
on both boxes, as have the CPU mitigations. The link is 10G.

The base for testing is commit a17ef9e6c2c1c ("net_sched: sch_sfq:
annotate data-races around q->perturb_period"), on top of which I rebased
my series.

pktgen_sample03_burst_single_flow.sh has been used to send packets and
"xdp-bench drop $nic -e" to receive them.

baseline
~~~~~~~~
boxB -> boxA | gov performance
-t2 (to pktgen)
| receive total 14,854,233 pkt/s        14,854,233 drop/s                0 error/s      

-t1 (to pktgen)
| receive total 10,642,895 pkt/s        10,642,895 drop/s                0 error/s      


boxB -> boxA | gov powersave
-t2 (to pktgen)
  receive total 10,196,085 pkt/s        10,196,085 drop/s                0 error/s      
  receive total 10,187,254 pkt/s        10,187,254 drop/s                0 error/s      
  receive total 10,553,298 pkt/s        10,553,298 drop/s                0 error/s

-t1
  receive total 10,427,732 pkt/s        10,427,732 drop/s                0 error/s      

======
boxA -> boxB (-t1) gov performance
performance:
  receive total 13,171,962 pkt/s        13,171,962 drop/s                0 error/s      
  receive total 13,368,344 pkt/s        13,368,344 drop/s                0 error/s

powersave:
  receive total 13,343,136 pkt/s        13,343,136 drop/s                0 error/s      
  receive total 13,220,326 pkt/s        13,220,326 drop/s                0 error/s      

(The CPU governor had no impact, just noise.)

The series applied (with updated 14/15)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
boxB -> boxA | gov performance
-t2:
  receive total  14,880,199 pkt/s        14,880,199 drop/s                0 error/s

-t1:
  receive total  10,769,082 pkt/s        10,769,082 drop/s                0 error/s      

boxB -> boxA | gov powersave
-t2:
 receive total   11,163,323 pkt/s        11,163,323 drop/s                0 error/s      

-t1:
 receive total   10,756,515 pkt/s        10,756,515 drop/s                0 error/s      

boxA -> boxB | gov performance

 receive total  13,395,919 pkt/s        13,395,919 drop/s                0 error/s      

boxA -> boxB | gov performance
 receive total  13,290,527 pkt/s        13,290,527 drop/s                0 error/s


Based on my numbers, there is just noise. BoxA hit the CPU limit during
receive while lowering the CPU frequency. BoxB seems to be unaffected by
lowering the CPU frequency during receive.

I can't comment on anything >10G due to HW limits.

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-10 16:22             ` Sebastian Andrzej Siewior
@ 2024-05-14  5:07               ` Jesper Dangaard Brouer
  2024-05-14  5:43                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2024-05-14  5:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf



On 10/05/2024 18.22, Sebastian Andrzej Siewior wrote:
> On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
>> The XDP redirect process is two staged:
> …
> On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
>>
>> I need/want to echo Toke's request to benchmark these changes.
> 
> I have:
> boxA: ixgbe
> boxB: i40e
> 
> Both are bigger NUMA boxes. I have to patch ixgbe to ignore the 64CPU
> limit and I boot box with only 64CPUs. The IOMMU has been disabled on
> both box as well as CPU mitigations. The link is 10G.
> 
> The base for testing I have is commit a17ef9e6c2c1c ("net_sched:
> sch_sfq: annotate data-races around q->perturb_period") which I used to
> rebase my series on top of.
> 
> pktgen_sample03_burst_single_flow.sh has been used to send packets and
> "xdp-bench drop $nic -e" to receive them.
> 

Sorry, but an XDP_DROP test will not activate the code you are modifying.
Thus, this test is invalid and doesn't tell us anything about your code
changes.

Your changes modify the XDP_REDIRECT handling system. Thus, the
benchmark test needs to exercise this code.


> baseline
> ~~~~~~~~
> boxB -> boxA | gov performance
> -t2 (to pktgen)
> | receive total 14,854,233 pkt/s        14,854,233 drop/s                0 error/s
> 
> -t1 (to pktgen)
> | receive total 10,642,895 pkt/s        10,642,895 drop/s                0 error/s
> 
> 
> boxB -> boxA | gov powersave
> -t2 (to pktgen)
>    receive total 10,196,085 pkt/s        10,196,085 drop/s                0 error/s
>    receive total 10,187,254 pkt/s        10,187,254 drop/s                0 error/s
>    receive total 10,553,298 pkt/s        10,553,298 drop/s                0 error/s
> 
> -t1
>    receive total 10,427,732 pkt/s        10,427,732 drop/s                0 error/s
> 
> ======
> boxA -> boxB (-t1) gov performance
> performace:
>    receive total 13,171,962 pkt/s        13,171,962 drop/s                0 error/s
>    receive total 13,368,344 pkt/s        13,368,344 drop/s                0 error/s
> 
> powersave:
>    receive total 13,343,136 pkt/s        13,343,136 drop/s                0 error/s
>    receive total 13,220,326 pkt/s        13,220,326 drop/s                0 error/s
> 
> (I the CPU governor had no impact, just noise)
> 
> The series applied (with updated 14/15)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> boxB -> boxA | gov performance
> -t2:
>    receive total  14,880,199 pkt/s        14,880,199 drop/s                0 error/s
> 
> -t1:
>    receive total  10,769,082 pkt/s        10,769,082 drop/s                0 error/s
> 
> boxB -> boxA | gov powersave
> -t2:
>   receive total   11,163,323 pkt/s        11,163,323 drop/s                0 error/s
> 
> -t1:
>   receive total   10,756,515 pkt/s        10,756,515 drop/s                0 error/s
> 
> boxA -> boxB | gov perfomance
> 
>   receive total  13,395,919 pkt/s        13,395,919 drop/s                0 error/s
> 
> boxA -> boxB | gov perfomance
>   receive total  13,290,527 pkt/s        13,290,527 drop/s                0 error/s
> 
> 
> Based on my numbers, there is just noise.  BoxA hit the CPU limit during
> receive while lowering the CPU-freq. BoxB seems to be unaffected by
> lowing CPU frequency during receive.
> 
> I can't comment on anything >10G due to HW limits.
> 
> Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-14  5:07               ` Jesper Dangaard Brouer
@ 2024-05-14  5:43                 ` Sebastian Andrzej Siewior
  2024-05-14 12:20                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-14  5:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
> > pktgen_sample03_burst_single_flow.sh has been used to send packets and
> > "xdp-bench drop $nic -e" to receive them.
> > 
> 
> Sorry, but a XDP_DROP test will not activate the code you are modifying.
> Thus, this test is invalid and doesn't tell us anything about your code
> changes.
> 
> The code is modifying the XDP_REDIRECT handling system. Thus, the
> benchmark test needs to activate this code.

This was a misunderstanding on my side then. What do you suggest
instead? Same setup but "redirect" on the same interface instead of
"drop"?

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-10 16:21           ` [PATCH net-next 14/15 v2] " Sebastian Andrzej Siewior
  2024-05-10 16:22             ` Sebastian Andrzej Siewior
@ 2024-05-14 11:54             ` Toke Høiland-Jørgensen
  2024-05-15 13:43               ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 38+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-14 11:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Eduard Zingerman, Hao Luo, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protections reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved task's
> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> variable (which is always NODE-local memory). Using always the
> bpf_net_context approach has the advantage that there is almost zero
> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Eduard Zingerman <eddyz87@gmail.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/filter.h | 42 ++++++++++++++++++++++++++++++++-----
>  include/linux/sched.h  |  3 +++
>  kernel/bpf/cpumap.c    |  3 +++
>  kernel/fork.c          |  1 +
>  net/bpf/test_run.c     | 11 +++++++++-
>  net/core/dev.c         | 19 ++++++++++++++++-
>  net/core/filter.c      | 47 +++++++++++++++++++-----------------------
>  net/core/lwt_bpf.c     |  3 +++
>  8 files changed, 96 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d5fea03cb6e61..6db5a68db6ee1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,7 +744,39 @@ struct bpf_redirect_info {
>  	struct bpf_nh_params nh;
>  };
>  
> -DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> +struct bpf_net_context {
> +	struct bpf_redirect_info ri;
> +};
> +
> +static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
> +{
> +	struct task_struct *tsk = current;
> +
> +	if (tsk->bpf_net_context != NULL)
> +		return NULL;
> +	tsk->bpf_net_context = bpf_net_ctx;
> +	return bpf_net_ctx;
> +}
> +
> +static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
> +{
> +	if (bpf_net_ctx)
> +		current->bpf_net_context = NULL;
> +}
> +
> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> +{
> +	return current->bpf_net_context;
> +}
> +
> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> +{
> +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> +
> +	return &bpf_net_ctx->ri;
> +}
> +
> +DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
>  
>  /* flags for bpf_redirect_info kern_flags */
>  #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
> @@ -1021,21 +1053,21 @@ void bpf_clear_redirect_map(struct bpf_map *map);
>  
>  static inline bool xdp_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
>  }
>  
>  static inline void xdp_set_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
>  }
>  
>  static inline void xdp_clear_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
>  }
> @@ -1591,7 +1623,7 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
>  						   u64 flags, const u64 flag_mask,
>  						   void *lookup_elem(struct bpf_map *map, u32 key))
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
>  
>  	/* Lower bits of the flags are used as return code on lookup failure */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6779d3b8f2578..cc9be45de6606 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -53,6 +53,7 @@ struct bio_list;
>  struct blk_plug;
>  struct bpf_local_storage;
>  struct bpf_run_ctx;
> +struct bpf_net_context;
>  struct capture_control;
>  struct cfs_rq;
>  struct fs_struct;
> @@ -1504,6 +1505,8 @@ struct task_struct {
>  	/* Used for BPF run context */
>  	struct bpf_run_ctx		*bpf_ctx;
>  #endif
> +	/* Used by BPF for per-TASK xdp storage */
> +	struct bpf_net_context		*bpf_net_context;

Okay, so if we are going the route of always putting this in 'current',
why not just embed the whole struct bpf_net_context inside task_struct,
instead of mucking about with the stack-allocated structures and
setting/clearing of pointers?

-Toke


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-14  5:43                 ` Sebastian Andrzej Siewior
@ 2024-05-14 12:20                   ` Jesper Dangaard Brouer
  2024-05-17 16:15                     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2024-05-14 12:20 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf



On 14/05/2024 07.43, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
>>> pktgen_sample03_burst_single_flow.sh has been used to send packets and
>>> "xdp-bench drop $nic -e" to receive them.
>>>
>>
>> Sorry, but a XDP_DROP test will not activate the code you are modifying.
>> Thus, this test is invalid and doesn't tell us anything about your code
>> changes.
>>
>> The code is modifying the XDP_REDIRECT handling system. Thus, the
>> benchmark test needs to activate this code.
> 
> This was a misunderstanding on my side then. What do you suggest
> instead? Same setup but "redirect" on the same interface instead of
> "drop"?
> 

Redirect is more flexible. Redirecting back out the same interface is one
option, but I've often seen this cause issues, because it will overload
the traffic generator (without people realizing it), leading to false
results. Thus, verify that the packet generator is sending faster than
the rate you are collecting. (I use this tool[2] on the generator machine,
in another terminal, to see if something funky is happening with the
ethtool stats.)

To work around this issue, I've previously redirected to the device 'lo'
(localhost), which is obviously invalid so the packet gets dropped, but I
can see that when we converted from the kernel samples/bpf/ to this tool,
this trick/hack is no longer supported.

The xdp-bench[1] tool also provides a number of redirect sub-commands,
e.g. redirect / redirect-cpu / redirect-map / redirect-multi.
Given that you also modify the CPU-map code, I would say we also need a
'redirect-cpu' test case.

Trick for CPU-map to do early drop on remote CPU:

  # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1

I recommend pressing Ctrl+\ while it is running to show more info like the
CPUs being used and what the kthread consumes. This helps catch issues,
e.g. if you are CPU-redirecting to the same CPU that RX happens to run on.

--Jesper

[1] https://github.com/xdp-project/xdp-tools/tree/master/xdp-bench
[2] 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-14 11:54             ` Toke Høiland-Jørgensen
@ 2024-05-15 13:43               ` Sebastian Andrzej Siewior
  2024-05-21  1:52                 ` Alexei Starovoitov
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-15 13:43 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On 2024-05-14 13:54:43 [+0200], Toke Høiland-Jørgensen wrote:
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1504,6 +1505,8 @@ struct task_struct {
> >  	/* Used for BPF run context */
> >  	struct bpf_run_ctx		*bpf_ctx;
> >  #endif
> > +	/* Used by BPF for per-TASK xdp storage */
> > +	struct bpf_net_context		*bpf_net_context;
> 
> Okay, so if we are going the route of always putting this in 'current',
> why not just embed the whole struct bpf_net_context inside task_struct,
> instead of mucking about with the stack-allocated structures and
> setting/clearing of pointers?

The whole struct bpf_net_context is 112 bytes. task_struct is 12352
bytes in my debug config or 7296 bytes with defconfig on x86-64. Adding
it unconditionally would grow task_struct by ~1%, but it would make
things way easier: the NULL case goes away, the assignment and cleanup
go away, and the INIT_LIST_HEAD can be moved to fork(). If the size
increase is not an issue then why not. Let me prepare…
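
Roughly what I have in mind (a sketch of the idea only, not the final
patch):

/* in struct task_struct: embed the context instead of a pointer */
struct bpf_net_context		bpf_net_context;

/* the accessor then needs no NULL handling and no set/clear */
static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
{
	return &current->bpf_net_context.ri;
}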

> -Toke

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-14 12:20                   ` Jesper Dangaard Brouer
@ 2024-05-17 16:15                     ` Sebastian Andrzej Siewior
  2024-05-22  7:09                       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-17 16:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
> Trick for CPU-map to do early drop on remote CPU:
> 
>  # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
> 
> I recommend using Ctrl+\ while running to show more info like CPUs being
> used and what kthread consumes.  To catch issues e.g. if you are CPU
> redirecting to same CPU as RX happen to run on.

Okay. So I reworked the last two patches to make the struct part of
task_struct and then did as you suggested:

Unpatched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat:    952102520 (    952,102,520) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat:    446045897 (    446,045,897) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat:      7434098 (      7,434,098) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat:    446556042 (    446,556,042) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat:      7442601 (      7,442,601) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat:    892592523 (    892,592,523) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14876542 (     14,876,542) <= tx_packets /sec
|Ethtool(eno2np1 ) stat:            2 (              2) <= tx_restart /sec
|Ethtool(eno2np1 ) stat:            2 (              2) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat:     14876622 (     14,876,622) <= tx_unicast /sec
|
|Receive:
|eth1->?                 8,732,508 rx/s                  0 err,drop/s
|  receive total         8,732,508 pkt/s                 0 drop/s                0 error/s
|    cpu:10              8,732,508 pkt/s                 0 drop/s                0 error/s
|  enqueue to cpu 3      8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
|    cpu:10->3           8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
|  kthread total         8,732,506 pkt/s                 0 drop/s          205,650 sched
|    cpu:3               8,732,506 pkt/s                 0 drop/s          205,650 sched
|    xdp_stats                   0 pass/s        8,732,506 drop/s                0 redir/s
|      cpu:3                     0 pass/s        8,732,506 drop/s                0 redir/s
|  redirect_err                  0 error/s
|  xdp_exception                 0 hit/s

I verified that the "drop only" case hits 14M packets/s while this
redirect part reports 8M packets/s.

Patched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat:    952635404 (    952,635,404) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14884934 (     14,884,934) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat:     14884928 (     14,884,928) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat:    446496117 (    446,496,117) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat:      7441602 (      7,441,602) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat:    446603461 (    446,603,461) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat:      7443391 (      7,443,391) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat:    893086506 (    893,086,506) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14884775 (     14,884,775) <= tx_packets /sec
|Ethtool(eno2np1 ) stat:           14 (             14) <= tx_restart /sec
|Ethtool(eno2np1 ) stat:           14 (             14) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat:     14884937 (     14,884,937) <= tx_unicast /sec
|
|Receive:
|eth1->?                 8,735,198 rx/s                  0 err,drop/s
|  receive total         8,735,198 pkt/s                 0 drop/s                0 error/s
|    cpu:6               8,735,198 pkt/s                 0 drop/s                0 error/s
|  enqueue to cpu 3      8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
|    cpu:6->3            8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
|  kthread total         8,735,191 pkt/s                 0 drop/s          208,054 sched
|    cpu:3               8,735,191 pkt/s                 0 drop/s          208,054 sched
|    xdp_stats                   0 pass/s        8,735,191 drop/s                0 redir/s
|      cpu:3                     0 pass/s        8,735,191 drop/s                0 redir/s
|  redirect_err                  0 error/s
|  xdp_exception                 0 hit/s

This looks to be in the same range / noise level. In top I have
ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed limit
on a 10G link. perf top shows
|   18.37%  bpf_prog_4f0ffbb35139c187_cpumap_l4_hash         [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   13.15%  [kernel]                                         [k] cpu_map_kthread_run
|   12.96%  [kernel]                                         [k] ixgbe_poll
|    6.78%  [kernel]                                         [k] page_frag_free
|    5.62%  [kernel]                                         [k] xdp_do_redirect

for the top 5. Is this something that looks reasonable?

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-15 13:43               ` Sebastian Andrzej Siewior
@ 2024-05-21  1:52                 ` Alexei Starovoitov
  0 siblings, 0 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2024-05-21  1:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

On Wed, May 15, 2024 at 6:43 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-05-14 13:54:43 [+0200], Toke Høiland-Jørgensen wrote:
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1504,6 +1505,8 @@ struct task_struct {
> > >     /* Used for BPF run context */
> > >     struct bpf_run_ctx              *bpf_ctx;
> > >  #endif
> > > +   /* Used by BPF for per-TASK xdp storage */
> > > +   struct bpf_net_context          *bpf_net_context;
> >
> > Okay, so if we are going the route of always putting this in 'current',
> > why not just embed the whole struct bpf_net_context inside task_struct,
> > instead of mucking about with the stack-allocated structures and
> > setting/clearing of pointers?
>
> The whole struct bpf_net_context has 112 bytes. task_struct has 12352
> bytes in my debug-config or 7296 bytes with defconfig on x86-64. Adding
> it unconditionally would grow task_struct by ~1% but it would make
> things way easier: The NULL case goes away, the assignment and cleanup
> goes away, the INIT_LIST_HEAD can be moved to fork(). If the size
> increase is not an issue then why not. Let me prepare…

I think 112 bytes, or whatever the size of bpf_net_context is, is a bit
too much to consume in task_struct.
Yes, it's big, but there are systems with 1M threads, and 112 Mbyte is
not that small.
bpf_net_ctx_set/get are not in the critical path and get_ri will be
inlined without any conditionals, so performance should be the same.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-17 16:15                     ` Sebastian Andrzej Siewior
@ 2024-05-22  7:09                       ` Jesper Dangaard Brouer
  2024-05-24  7:54                         ` Sebastian Andrzej Siewior
  2024-05-24 13:59                         ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2024-05-22  7:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf, Arnaldo Carvalho de Melo



On 17/05/2024 18.15, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
>> Trick for CPU-map to do early drop on remote CPU:
>>
>>   # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
>>
>> I recommend using Ctrl+\ while running to show more info like which CPUs are
>> being used and what the kthread consumes.  This catches issues, e.g. if you
>> are CPU-redirecting to the same CPU that RX happens to run on.
> 
> Okay. So I reworked the last two patches to make the struct part of
> task_struct and then did as you suggested:
> 
> Unpatched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat:    952102520 (    952,102,520) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat:    446045897 (    446,045,897) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat:      7434098 (      7,434,098) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat:    446556042 (    446,556,042) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat:      7442601 (      7,442,601) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat:    892592523 (    892,592,523) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14876542 (     14,876,542) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat:            2 (              2) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat:            2 (              2) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat:     14876622 (     14,876,622) <= tx_unicast /sec
> |
> |Receive:
> |eth1->?                 8,732,508 rx/s                  0 err,drop/s
> |  receive total         8,732,508 pkt/s                 0 drop/s                0 error/s
> |    cpu:10              8,732,508 pkt/s                 0 drop/s                0 error/s
> |  enqueue to cpu 3      8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
> |    cpu:10->3           8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
> |  kthread total         8,732,506 pkt/s                 0 drop/s          205,650 sched
> |    cpu:3               8,732,506 pkt/s                 0 drop/s          205,650 sched
> |    xdp_stats                   0 pass/s        8,732,506 drop/s                0 redir/s
> |      cpu:3                     0 pass/s        8,732,506 drop/s                0 redir/s
> |  redirect_err                  0 error/s
> |  xdp_exception                 0 hit/s
> 
> I verified that the "drop only" case hits 14M packets/s while this
> redirect part reports 8M packets/s.
>

Great, this is a good test.

The transmit speed of 14.88 Mpps is 10G wirespeed at the smallest Ethernet
packet size (84 bytes on the wire including overhead + interframe gap,
10*10^9/(84*8) = 14,880,952).
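
Spelled out, in case anyone wants to re-derive that number (a trivial
sketch; the 84 bytes are the usual minimum on the wire: 64 byte frame +
7 byte preamble + 1 byte SFD + 12 byte interframe gap):

/* Wirespeed sanity check: 10 Gbit/s at minimum Ethernet frame size. */
#include <stdio.h>

int main(void)
{
	const double link_bps = 10e9;			/* 10 Gbit/s */
	const double wire_bytes = 64 + 7 + 1 + 12;	/* = 84 bytes per packet */

	printf("%.0f pkt/s\n", link_bps / (wire_bytes * 8.0));	/* ~14,880,952 */
	return 0;
}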


> Patched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat:    952635404 (    952,635,404) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14884934 (     14,884,934) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat:     14884928 (     14,884,928) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat:    446496117 (    446,496,117) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat:      7441602 (      7,441,602) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat:    446603461 (    446,603,461) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat:      7443391 (      7,443,391) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat:    893086506 (    893,086,506) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14884775 (     14,884,775) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat:           14 (             14) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat:           14 (             14) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat:     14884937 (     14,884,937) <= tx_unicast /sec
> |
> |Receive:
> |eth1->?                 8,735,198 rx/s                  0 err,drop/s
> |  receive total         8,735,198 pkt/s                 0 drop/s                0 error/s
> |    cpu:6               8,735,198 pkt/s                 0 drop/s                0 error/s
> |  enqueue to cpu 3      8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
> |    cpu:6->3            8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
> |  kthread total         8,735,191 pkt/s                 0 drop/s          208,054 sched
> |    cpu:3               8,735,191 pkt/s                 0 drop/s          208,054 sched
> |    xdp_stats                   0 pass/s        8,735,191 drop/s                0 redir/s
> |      cpu:3                     0 pass/s        8,735,191 drop/s                0 redir/s
> |  redirect_err                  0 error/s
> |  xdp_exception                 0 hit/s
> 

Great, basically zero overhead. Awesome that you verified this!


> This looks to be in the same range / noise level. In top I have
> ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed limit
> on a 10G link.

For our purpose of testing the XDP_REDIRECT code that you are modifying,
this is what we want: the RX CPU/NAPI is the bottleneck, while the remote
cpumap CPU has idle cycles (also indicated by the 208,054 sched stats).

> perf top shows

I appreciate getting this perf data.

As we are explicitly dealing with splitting the workload across CPUs, it
is worth mentioning that perf supports displaying and filtering on CPUs.

This perf command includes the CPU number (zero indexed):
  # perf report --sort cpu,comm,dso,symbol --no-children

For this benchmark, to focus, I would reduce this to:
   # perf report --sort cpu,symbol --no-children

The perf tool can also use -C to filter on some CPUs like:

  # perf report --sort cpu,symbol --no-children -C 3,6


> |   18.37%  bpf_prog_4f0ffbb35139c187_cpumap_l4_hash         [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash

This bpf_prog_4f0ffbb35139c187_cpumap_l4_hash is running on the RX CPU,
doing the load-balancing.

> |   13.15%  [kernel]                                         [k] cpu_map_kthread_run

This runs on the remote cpumap CPU (in this case CPU 3).

> |   12.96%  [kernel]                                         [k] ixgbe_poll
> |    6.78%  [kernel]                                         [k] page_frag_free

The page_frag_free call might run on the remote cpumap CPU.

> |    5.62%  [kernel]                                         [k] xdp_do_redirect
> 
> for the top 5. Is this something that looks reasonable?

Yes, except I had to guess how the workload was split between CPUs ;-)

Thanks for doing these benchmarks! :-)
--Jesper



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-22  7:09                       ` Jesper Dangaard Brouer
@ 2024-05-24  7:54                         ` Sebastian Andrzej Siewior
  2024-05-24 13:59                         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-24  7:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf, Arnaldo Carvalho de Melo

On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
> 
> Yes, except I had to guess how the workload was split between CPUs ;-)
> 
> Thanks for doing these benchmarks! :-)

Thank you for all the explanations and walking me through it.
I'm going to update the patches as discussed and redo the numbers as
suggested here.
Thanks.

> --Jesper

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-22  7:09                       ` Jesper Dangaard Brouer
  2024-05-24  7:54                         ` Sebastian Andrzej Siewior
@ 2024-05-24 13:59                         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-24 13:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf, Arnaldo Carvalho de Melo

On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
> For this benchmark, to focus, I would reduce this to:
>   # perf report --sort cpu,symbol --no-children

Keeping the bpf_net_ctx_set()/clear(), removing the NULL checks (to align
with Alexei's last email).
Perf-numbers-wise, I'm using
	xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

unpatched:

| eth1->?                 9,427,705 rx/s                  0 err,drop/s
|   receive total         9,427,705 pkt/s                 0 drop/s                0 error/s
|     cpu:17              9,427,705 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,427,708 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:17->3           9,427,708 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total         9,427,710 pkt/s                 0 drop/s          147,276 sched
|     cpu:3               9,427,710 pkt/s                 0 drop/s          147,276 sched
|     xdp_stats                   0 pass/s        9,427,710 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,427,710 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

Patched:
| eth1->?                 9,557,170 rx/s                  0 err,drop/s
|   receive total         9,557,170 pkt/s                 0 drop/s                0 error/s
|     cpu:9               9,557,170 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,557,170 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:9->3            9,557,170 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total         9,557,195 pkt/s                 0 drop/s          126,164 sched
|     cpu:3               9,557,195 pkt/s                 0 drop/s          126,164 sched
|     xdp_stats                   0 pass/s        9,557,195 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,557,195 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

I think this is noise. The perf output, as suggested (perf report --sort
cpu,symbol --no-children):

unpatched:
|  19.05%  017  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|  11.40%  017  [k] ixgbe_poll
|  10.68%  003  [k] cpu_map_kthread_run
|   7.62%  003  [k] intel_idle
|   6.11%  017  [k] xdp_do_redirect
|   6.01%  003  [k] page_frag_free
|   4.72%  017  [k] bq_flush_to_queue
|   3.74%  017  [k] cpu_map_redirect
|   2.35%  003  [k] xdp_return_frame
|   1.55%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
|   1.49%  017  [k] dma_sync_single_for_device
|   1.41%  017  [k] ixgbe_alloc_rx_buffers
|   1.26%  017  [k] cpu_map_enqueue
|   1.24%  017  [k] dma_sync_single_for_cpu
|   1.12%  003  [k] __xdp_return
|   0.83%  017  [k] bpf_trace_run4
|   0.77%  003  [k] __switch_to

patched:
|  18.20%  009  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|  11.64%  009  [k] ixgbe_poll
|   7.74%  003  [k] page_frag_free
|   6.69%  003  [k] cpu_map_bpf_prog_run_xdp
|   6.02%  003  [k] intel_idle
|   5.96%  009  [k] xdp_do_redirect
|   4.45%  003  [k] cpu_map_kthread_run
|   3.71%  009  [k] cpu_map_redirect
|   3.23%  009  [k] bq_flush_to_queue
|   2.55%  003  [k] xdp_return_frame
|   1.67%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
|   1.60%  009  [k] _raw_spin_lock
|   1.57%  009  [k] bpf_prog_d7eca17ddc334d36_tp_xdp_cpumap_enqueue
|   1.48%  009  [k] dma_sync_single_for_device
|   1.47%  009  [k] ixgbe_alloc_rx_buffers
|   1.39%  009  [k] dma_sync_single_for_cpu
|   1.33%  009  [k] cpu_map_enqueue
|   1.19%  003  [k] __xdp_return
|   0.66%  003  [k] __switch_to

I'm going to repost the series once the merge window closes unless there
is something you want me to do.

> --Jesper

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-05-24 14:00 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-03 18:25 [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 04/15] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 08/15] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*() Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2024-05-03 18:25 ` [PATCH net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
2024-05-06 19:41   ` Toke Høiland-Jørgensen
2024-05-06 23:09     ` Alexei Starovoitov
2024-05-07 12:36       ` Sebastian Andrzej Siewior
2024-05-07 13:27         ` Jesper Dangaard Brouer
2024-05-10 16:21           ` [PATCH net-next 14/15 v2] " Sebastian Andrzej Siewior
2024-05-10 16:22             ` Sebastian Andrzej Siewior
2024-05-14  5:07               ` Jesper Dangaard Brouer
2024-05-14  5:43                 ` Sebastian Andrzej Siewior
2024-05-14 12:20                   ` Jesper Dangaard Brouer
2024-05-17 16:15                     ` Sebastian Andrzej Siewior
2024-05-22  7:09                       ` Jesper Dangaard Brouer
2024-05-24  7:54                         ` Sebastian Andrzej Siewior
2024-05-24 13:59                         ` Sebastian Andrzej Siewior
2024-05-14 11:54             ` Toke Høiland-Jørgensen
2024-05-15 13:43               ` Sebastian Andrzej Siewior
2024-05-21  1:52                 ` Alexei Starovoitov
2024-05-07 10:57     ` [PATCH net-next 14/15] " Sebastian Andrzej Siewior
2024-05-07 13:50       ` Toke Høiland-Jørgensen
2024-05-03 18:25 ` [PATCH net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
2024-05-06  8:43 ` [PATCH v2 net-next 00/15] locking: Introduce nested-BH locking Paolo Abeni
2024-05-06  9:38   ` Sebastian Andrzej Siewior
2024-05-06 14:12     ` Paolo Abeni
2024-05-06 14:43       ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).