netdev.vger.kernel.org archive mirror
* [PATCH net-next 00/24] locking: Introduce nested-BH locking.
@ 2023-12-15 17:07 Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
                   ` (24 more replies)
  0 siblings, 25 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

Hi,

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) can be blocked by long-running lower-priority
tasks (or threaded interrupts), which includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
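
To illustrate the overall conversion pattern (a minimal sketch only; the
struct, variable and function names below are made up and not part of the
series): a per-CPU data structure gains an embedded local_lock_t which is
acquired, in addition to local_bh_disable(), around the accesses:

	struct foo_pcpu {
		local_lock_t	bh_lock;	/* compiled away on !PREEMPT_RT without lockdep */
		unsigned int	count;
	};

	static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
		.bh_lock = INIT_LOCAL_LOCK(bh_lock),
	};

	static void foo_update(void)
	{
		local_bh_disable();
		local_lock_nested_bh(&foo_pcpu.bh_lock);
		this_cpu_inc(foo_pcpu.count);
		local_unlock_nested_bh(&foo_pcpu.bh_lock);
		local_bh_enable();
	}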

Sebastian


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-18  8:16   ` Paolo Abeni
  2023-12-15 17:07 ` [PATCH net-next 02/24] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

Introduce lock guard definition for local_lock_t. There are no users
yet.
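
For illustration, a guard-based user would look roughly like this (a sketch
only; the lock and function names are made up since, as noted, there are no
users yet):

	static DEFINE_PER_CPU(local_lock_t, foo_lock) = INIT_LOCAL_LOCK(foo_lock);

	static void foo(void)
	{
		guard(local_lock)(&foo_lock);
		/* foo_lock is held here and released automatically at scope exit. */
	}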

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/local_lock.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa73296..706c4b65d9449 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,15 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+DEFINE_LOCK_GUARD_1(local_lock, local_lock_t,
+		    local_lock(_T->lock),
+		    local_unlock(_T->lock))
+DEFINE_LOCK_GUARD_1(local_lock_irq, local_lock_t,
+		    local_lock_irq(_T->lock),
+		    local_unlock_irq(_T->lock))
+DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t,
+		    local_lock_irqsave(_T->lock, _T->flags),
+		    local_unlock_irqrestore(_T->lock, _T->flags),
+		    unsigned long flags)
+
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 02/24] locking/local_lock: Add local nested BH locking infrastructure.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

Add local_lock_nested_bh() locking. It is based on local_lock_t and the
naming follows the preempt_disable_nested() example.

For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
assumptions based on local_bh_disable(). The macro is optimized away
during compilation.
For !PREEMPT_RT + LOCKDEP, local_lock_nested_bh() is reduced to the usual
lockdep lock-acquire annotation plus lockdep_assert_in_softirq(), ensuring
that BH is disabled.

For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
lock. It does not disable CPU migration because it relies on
local_bh_disable() disabling CPU migration.
With LOCKDEP it performs the usual lockdep checks as with !PREEMPT_RT.
Due to include hell the softirq check has been moved to spinlock.c.

The intention is to use this locking in places where locking of a per-CPU
variable relies on BH being disabled. Instead of treating disabled
bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
the locking scope to what actually needs protecting.
A side effect is that it also documents the protection scope of the
per-CPU variables.
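
A usage sketch in guard notation (illustrative names only, not taken from
the later patches in this series):

	struct foo_cache {
		local_lock_t	bh_lock;
		unsigned int	count;
	};

	static DEFINE_PER_CPU(struct foo_cache, foo_cache) = {
		.bh_lock = INIT_LOCAL_LOCK(bh_lock),
	};

	/* Caller runs with BH disabled (e.g. NAPI). On !PREEMPT_RT without
	 * LOCKDEP this compiles away; with LOCKDEP it is the lock annotation
	 * plus the softirq assert. On PREEMPT_RT it acquires the per-CPU
	 * spinlock embedded in foo_cache.
	 */
	static void foo_cache_count(void)
	{
		guard(local_lock_nested_bh)(&foo_cache.bh_lock);
		this_cpu_inc(foo_cache.count);
	}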

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/local_lock.h          | 10 ++++++++++
 include/linux/local_lock_internal.h | 31 +++++++++++++++++++++++++++++
 include/linux/lockdep.h             |  3 +++
 kernel/locking/spinlock.c           |  8 ++++++++
 4 files changed, 52 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 706c4b65d9449..303caac7fbbbc 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -62,4 +62,14 @@ DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t,
 		    local_unlock_irqrestore(_T->lock, _T->flags),
 		    unsigned long flags)
 
+#define local_lock_nested_bh(_lock)				\
+	__local_lock_nested_bh(_lock)
+
+#define local_unlock_nested_bh(_lock)				\
+	__local_unlock_nested_bh(_lock)
+
+DEFINE_LOCK_GUARD_1(local_lock_nested_bh, local_lock_t,
+		    local_lock_nested_bh(_T->lock),
+		    local_unlock_nested_bh(_T->lock))
+
 #endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a77..8dd71fbbb6d2b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -62,6 +62,17 @@ do {								\
 	local_lock_debug_init(lock);				\
 } while (0)
 
+#define __spinlock_nested_bh_init(lock)				\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));\
+	lockdep_init_map_type(&(lock)->dep_map, #lock, &__key,  \
+			      0, LD_WAIT_CONFIG, LD_WAIT_INV,	\
+			      LD_LOCK_NORMAL);			\
+	local_lock_debug_init(lock);				\
+} while (0)
+
 #define __local_lock(lock)					\
 	do {							\
 		preempt_disable();				\
@@ -98,6 +109,15 @@ do {								\
 		local_irq_restore(flags);			\
 	} while (0)
 
+#define __local_lock_nested_bh(lock)				\
+	do {							\
+		lockdep_assert_in_softirq();			\
+		local_lock_acquire(this_cpu_ptr(lock));	\
+	} while (0)
+
+#define __local_unlock_nested_bh(lock)				\
+	local_lock_release(this_cpu_ptr(lock))
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -138,4 +158,15 @@ typedef spinlock_t local_lock_t;
 
 #define __local_unlock_irqrestore(lock, flags)	__local_unlock(lock)
 
+#define __local_lock_nested_bh(lock)				\
+do {								\
+	lockdep_assert_in_softirq_func();			\
+	spin_lock(this_cpu_ptr(lock));				\
+} while (0)
+
+#define __local_unlock_nested_bh(lock)				\
+do {								\
+	spin_unlock(this_cpu_ptr((lock)));			\
+} while (0)
+
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index dc2844b071c2c..881732fa0df8c 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -657,6 +657,8 @@ do {									\
 		     (!in_softirq() || in_irq() || in_nmi()));		\
 } while (0)
 
+extern void lockdep_assert_in_softirq_func(void);
+
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
@@ -670,6 +672,7 @@ do {									\
 # define lockdep_assert_preemption_enabled() do { } while (0)
 # define lockdep_assert_preemption_disabled() do { } while (0)
 # define lockdep_assert_in_softirq() do { } while (0)
+# define lockdep_assert_in_softirq_func() do { } while (0)
 #endif
 
 #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 8475a0794f8c5..438c6086d540e 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -413,3 +413,11 @@ notrace int in_lock_functions(unsigned long addr)
 	&& addr < (unsigned long)__lock_text_end;
 }
 EXPORT_SYMBOL(in_lock_functions);
+
+#if defined(CONFIG_PROVE_LOCKING) && defined(CONFIG_PREEMPT_RT)
+void notrace lockdep_assert_in_softirq_func(void)
+{
+	lockdep_assert_in_softirq();
+}
+EXPORT_SYMBOL(lockdep_assert_in_softirq_func);
+#endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 02/24] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-18  7:48   ` Paolo Abeni
  2023-12-15 17:07 ` [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

The else condition within __netdev_alloc_frag_align() is an open coded
__napi_alloc_frag_align().

Use __napi_alloc_frag_align() instead of open coding it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/skbuff.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b157efea5dea8..de9397e45718a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -311,11 +311,8 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 
 		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
 	} else {
-		struct napi_alloc_cache *nc;
-
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+		data = __napi_alloc_frag_align(fragsz, align_mask);
 		local_bh_enable();
 	}
 	return data;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  4:43   ` kernel test robot
  2023-12-15 17:07 ` [PATCH net-next 05/24] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

napi_alloc_cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/skbuff.c | 57 +++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index de9397e45718a..9c3073dcc80f1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -265,6 +265,7 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 #endif
 
 struct napi_alloc_cache {
+	local_lock_t bh_lock;
 	struct page_frag_cache page;
 	struct page_frag_1k page_small;
 	unsigned int skb_count;
@@ -272,7 +273,9 @@ struct napi_alloc_cache {
 };
 
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 /* Double check that napi_get_frags() allocates skbs with
  * skb->head being backed by slab, not a page fragment.
@@ -296,6 +299,7 @@ void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 
 	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
 }
@@ -324,6 +328,7 @@ static struct sk_buff *napi_skb_cache_get(void)
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	struct sk_buff *skb;
 
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 	if (unlikely(!nc->skb_count)) {
 		nc->skb_count = kmem_cache_alloc_bulk(skbuff_cache,
 						      GFP_ATOMIC,
@@ -726,9 +731,11 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 		pfmemalloc = nc->pfmemalloc;
 	} else {
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache.page);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		scoped_guard(local_lock_nested_bh, &napi_alloc_cache.bh_lock) {
+			nc = this_cpu_ptr(&napi_alloc_cache.page);
+			data = page_frag_alloc(nc, len, gfp_mask);
+			pfmemalloc = nc->pfmemalloc;
+		}
 		local_bh_enable();
 	}
 
@@ -793,31 +800,32 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 		goto skb_success;
 	}
 
-	nc = this_cpu_ptr(&napi_alloc_cache);
-
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
-		/* we are artificially inflating the allocation size, but
-		 * that is not as bad as it may look like, as:
-		 * - 'len' less than GRO_MAX_HEAD makes little sense
-		 * - On most systems, larger 'len' values lead to fragment
-		 *   size above 512 bytes
-		 * - kmalloc would use the kmalloc-1k slab for such values
-		 * - Builds with smaller GRO_MAX_HEAD will very likely do
-		 *   little networking, as that implies no WiFi and no
-		 *   tunnels support, and 32 bits arches.
-		 */
-		len = SZ_1K;
+	scoped_guard(local_lock_nested_bh, &napi_alloc_cache.bh_lock) {
+		nc = this_cpu_ptr(&napi_alloc_cache);
+		if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
+			/* we are artificially inflating the allocation size, but
+			 * that is not as bad as it may look like, as:
+			 * - 'len' less than GRO_MAX_HEAD makes little sense
+			 * - On most systems, larger 'len' values lead to fragment
+			 *   size above 512 bytes
+			 * - kmalloc would use the kmalloc-1k slab for such values
+			 * - Builds with smaller GRO_MAX_HEAD will very likely do
+			 *   little networking, as that implies no WiFi and no
+			 *   tunnels support, and 32 bits arches.
+			 */
+			len = SZ_1K;
 
-		data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
-		pfmemalloc = NAPI_SMALL_PAGE_PFMEMALLOC(nc->page_small);
-	} else {
-		len = SKB_HEAD_ALIGN(len);
+			data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+			pfmemalloc = NAPI_SMALL_PAGE_PFMEMALLOC(nc->page_small);
+		} else {
+			len = SKB_HEAD_ALIGN(len);
 
-		data = page_frag_alloc(&nc->page, len, gfp_mask);
-		pfmemalloc = nc->page.pfmemalloc;
+			data = page_frag_alloc(&nc->page, len, gfp_mask);
+			pfmemalloc = nc->page.pfmemalloc;
+		}
 	}
 
 	if (unlikely(!data))
@@ -1306,6 +1314,7 @@ static void napi_skb_cache_put(struct sk_buff *skb)
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	u32 i;
 
+	guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
 	kasan_poison_object_data(skbuff_cache, skb);
 	nc->skb_cache[nc->skb_count++] = skb;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 05/24] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 06/24] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, David Ahern

sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a pad member (original sigpool_scratch) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/ipv4/tcp_sigpool.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_sigpool.c b/net/ipv4/tcp_sigpool.c
index 55b310a722c7d..0046d1a0dd796 100644
--- a/net/ipv4/tcp_sigpool.c
+++ b/net/ipv4/tcp_sigpool.c
@@ -10,7 +10,14 @@
 #include <net/tcp.h>
 
 static size_t __scratch_size;
-static DEFINE_PER_CPU(void __rcu *, sigpool_scratch);
+struct sigpool_scratch {
+	local_lock_t bh_lock;
+	void __rcu *pad;
+};
+
+static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 struct sigpool_entry {
 	struct crypto_ahash	*hash;
@@ -72,7 +79,7 @@ static int sigpool_reserve_scratch(size_t size)
 			break;
 		}
 
-		old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+		old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
 					scratch, lockdep_is_held(&cpool_mutex));
 		if (!cpu_online(cpu) || !old_scratch) {
 			kfree(old_scratch);
@@ -93,7 +100,7 @@ static void sigpool_scratch_free(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+		kfree(rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
 					  NULL, lockdep_is_held(&cpool_mutex)));
 	__scratch_size = 0;
 }
@@ -278,7 +285,8 @@ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RC
 	/* Pairs with tcp_sigpool_reserve_scratch(), scratch area is
 	 * valid (allocated) until tcp_sigpool_end().
 	 */
-	c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch));
+	local_lock_nested_bh(&sigpool_scratch.bh_lock);
+	c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad));
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_sigpool_start);
@@ -287,6 +295,7 @@ void tcp_sigpool_end(struct tcp_sigpool *c) __releases(RCU_BH)
 {
 	struct crypto_ahash *hash = crypto_ahash_reqtfm(c->req);
 
+	local_unlock_nested_bh(&sigpool_scratch.bh_lock);
 	rcu_read_unlock_bh();
 	ahash_request_free(c->req);
 	crypto_free_ahash(hash);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 06/24] net/ipv4: Use nested-BH locking for ipv4_tcp_sk.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 05/24] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 07/24] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, David Ahern

ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/sock.h  |  5 +++++
 net/ipv4/tcp_ipv4.c | 15 +++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3c..cb407b7ae39f8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -547,6 +547,11 @@ struct sock {
 	struct hlist_node	sk_bind2_node;
 };
 
+struct sock_bh_locked {
+	struct sock *sock;
+	local_lock_t bh_lock;
+};
+
 enum sk_pacing {
 	SK_PACING_NONE		= 0,
 	SK_PACING_NEEDED	= 1,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0c50c5a32b84a..e69dd19f39e02 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -92,7 +92,9 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 struct inet_hashinfo tcp_hashinfo;
 EXPORT_SYMBOL(tcp_hashinfo);
 
-static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);
+static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
@@ -878,7 +880,9 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	arg.tos = ip_hdr(skb)->tos;
 	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ctl_sk = this_cpu_read(ipv4_tcp_sk);
+	local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+	ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
+
 	sock_net_set(ctl_sk, net);
 	if (sk) {
 		ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
@@ -903,6 +907,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	sock_net_set(ctl_sk, &init_net);
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
 	__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
+	local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
 	local_bh_enable();
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -998,7 +1003,8 @@ static void tcp_v4_send_ack(const struct sock *sk,
 	arg.tos = tos;
 	arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ctl_sk = this_cpu_read(ipv4_tcp_sk);
+	local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+	ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
 	sock_net_set(ctl_sk, net);
 	ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
 			   inet_twsk(sk)->tw_mark : READ_ONCE(sk->sk_mark);
@@ -1013,6 +1019,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
 
 	sock_net_set(ctl_sk, &init_net);
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
+	local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
 	local_bh_enable();
 }
 
@@ -3604,7 +3611,7 @@ void __init tcp_v4_init(void)
 		 */
 		inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
 
-		per_cpu(ipv4_tcp_sk, cpu) = sk;
+		per_cpu(ipv4_tcp_sk.sock, cpu) = sk;
 	}
 	if (register_pernet_subsys(&tcp_sk_ops))
 		panic("Failed to create the TCP control socket.\n");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 07/24] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 06/24] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 08/24] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Florian Westphal, Jozsef Kadlecsik,
	Nikolay Aleksandrov, Pablo Neira Ayuso, Roopa Prabhu, bridge,
	coreteam, netfilter-devel

brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: bridge@lists.linux.dev
Cc: coreteam@netfilter.org
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/bridge/br_netfilter_hooks.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 6adcb45bca75d..c06f498c5f1b3 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -133,6 +133,7 @@ static inline bool is_pppoe_ipv6(const struct sk_buff *skb,
 #define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
 
 struct brnf_frag_data {
+	local_lock_t bh_lock;
 	char mac[NF_BRIDGE_MAX_MAC_HEADER_LENGTH];
 	u8 encap_size;
 	u8 size;
@@ -140,7 +141,9 @@ struct brnf_frag_data {
 	__be16 vlan_proto;
 };
 
-static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
+static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage) = {
+	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
 
 static void nf_bridge_info_free(struct sk_buff *skb)
 {
@@ -768,6 +771,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 
 		IPCB(skb)->frag_max_size = nf_bridge->frag_max_size;
 
+		guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
 		data = this_cpu_ptr(&brnf_frag_data_storage);
 
 		if (skb_vlan_tag_present(skb)) {
@@ -795,6 +799,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 
 		IP6CB(skb)->frag_max_size = nf_bridge->frag_max_size;
 
+		guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
 		data = this_cpu_ptr(&brnf_frag_data_storage);
 		data->encap_size = nf_bridge_encap_header_len(skb);
 		data->size = ETH_HLEN + data->encap_size;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 08/24] net: softnet_data: Make xmit.recursion per task.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (6 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 07/24] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 09/24] dev: Use the RPS lock for softnet_data::input_pkt_queue on PREEMPT_RT Sebastian Andrzej Siewior
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Juri Lelli,
	Mel Gorman, Steven Rostedt, Valentin Schneider, Vincent Guittot

Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
recursion counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.

Instead of adding a lock to protect the per-CPU variable it is simpler
to make the counter per-task. Sending and receiving skbs always happens
in thread context anyway.

Having a lock to protect the per-CPU counter would needlessly block/
serialize two sending threads. It would also require a recursive lock to
allow the owner to increment the counter further.

Make the recursion counter a task_struct member on PREEMPT_RT.
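
For reference, the counter is consulted around nested transmits roughly
like this (a simplified sketch of the existing dev_xmit_recursion() usage
pattern with made-up names, not new code added by this patch):

	static int foo_redirect_xmit(struct sk_buff *skb)
	{
		int ret;

		if (dev_xmit_recursion()) {
			/* Recursion limit hit: drop rather than recurse deeper. */
			kfree_skb(skb);
			return -ENETDOWN;
		}

		dev_xmit_recursion_inc();
		ret = dev_queue_xmit(skb);
		dev_xmit_recursion_dec();

		return ret;
	}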

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h | 30 +++++++++++++++++++++++++++++-
 include/linux/sched.h     |  4 +++-
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2564e209465ea..06436695c3679 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3260,7 +3260,9 @@ struct softnet_data {
 #endif
 	/* written and read only by owning cpu: */
 	struct {
+#ifndef CONFIG_PREEMPT_RT
 		u16 recursion;
+#endif
 		u8  more;
 #ifdef CONFIG_NET_EGRESS
 		u8  skip_txqueue;
@@ -3308,12 +3310,36 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd,
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
+#define XMIT_RECURSION_LIMIT	8
+#ifdef CONFIG_PREEMPT_RT
+
+static inline int dev_recursion_level(void)
+{
+	return current->net_xmit_recursion;
+}
+
+static inline bool dev_xmit_recursion(void)
+{
+	return unlikely(current->net_xmit_recursion > XMIT_RECURSION_LIMIT);
+}
+
+static inline void dev_xmit_recursion_inc(void)
+{
+	current->net_xmit_recursion++;
+}
+
+static inline void dev_xmit_recursion_dec(void)
+{
+	current->net_xmit_recursion--;
+}
+
+#else
+
 static inline int dev_recursion_level(void)
 {
 	return this_cpu_read(softnet_data.xmit.recursion);
 }
 
-#define XMIT_RECURSION_LIMIT	8
 static inline bool dev_xmit_recursion(void)
 {
 	return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
@@ -3330,6 +3356,8 @@ static inline void dev_xmit_recursion_dec(void)
 	__this_cpu_dec(softnet_data.xmit.recursion);
 }
 
+#endif
+
 void __netif_schedule(struct Qdisc *q);
 void netif_schedule_queue(struct netdev_queue *txq);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c316972485..26b1b7d674b5d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -964,7 +964,9 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
-
+#ifdef CONFIG_PREEMPT_RT
+	u8				net_xmit_recursion;
+#endif
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
 	struct restart_block		restart_block;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 09/24] dev: Use the RPS lock for softnet_data::input_pkt_queue on PREEMPT_RT.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (7 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 08/24] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 10/24] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

There is no additional locking in rps_lock_*() on PREEMPT_RT in the
!CONFIG_RPS case. The reasoning is that local_bh_disable() provides the
required synchronization (everywhere).
All interrupts are threaded, so the required CPU-local synchronization
is provided by the per-CPU lock in local_bh_disable(), which is always
performed for the force-threaded interrupts.
Needless to say, the softirq users (NAPI) have BH disabled while being
invoked as softirq callbacks.

Without locking in local_bh_disable() on PREEMPT_RT the
softnet_data::input_pkt_queue data structure requires explicit locking
in the !CONFIG_RPS case.

Disabling interrupts for CPU-local locking is undesired because it imposes
restrictions within the locked sections, such as not being able to free
an skb.

Utilise the locking which is present in the CONFIG_RPS case for
serialisation on PREEMPT_RT even without CONFIG_RPS enabled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c879246be48d8..09232080843ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -220,34 +220,34 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 static inline void rps_lock_irqsave(struct softnet_data *sd,
 				    unsigned long *flags)
 {
-	if (IS_ENABLED(CONFIG_RPS))
+	if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
 		spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_save(*flags);
 }
 
 static inline void rps_lock_irq_disable(struct softnet_data *sd)
 {
-	if (IS_ENABLED(CONFIG_RPS))
+	if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
 		spin_lock_irq(&sd->input_pkt_queue.lock);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_disable();
 }
 
 static inline void rps_unlock_irq_restore(struct softnet_data *sd,
 					  unsigned long *flags)
 {
-	if (IS_ENABLED(CONFIG_RPS))
+	if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
 		spin_unlock_irqrestore(&sd->input_pkt_queue.lock, *flags);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_restore(*flags);
 }
 
 static inline void rps_unlock_irq_enable(struct softnet_data *sd)
 {
-	if (IS_ENABLED(CONFIG_RPS))
+	if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
 		spin_unlock_irq(&sd->input_pkt_queue.lock);
-	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+	else
 		local_irq_enable();
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 10/24] dev: Use nested-BH locking for softnet_data.process_queue.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (8 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 09/24] dev: Use the RPS lock for softnet_data::input_pkt_queue on PREEMPT_RT Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prior to invoking BPF Sebastian Andrzej Siewior
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 06436695c3679..88e27d3c39da2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3238,6 +3238,7 @@ static inline bool dev_has_header(const struct net_device *dev)
 struct softnet_data {
 	struct list_head	poll_list;
 	struct sk_buff_head	process_queue;
+	local_lock_t		process_queue_bh_lock;
 
 	/* stats */
 	unsigned int		processed;
diff --git a/net/core/dev.c b/net/core/dev.c
index 09232080843ee..5a0f6da7b3ae5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -440,7 +440,9 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data) = {
+	.process_queue_bh_lock = INIT_LOCAL_LOCK(process_queue_bh_lock),
+};
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 #ifdef CONFIG_LOCKDEP
@@ -5838,6 +5840,7 @@ static void flush_backlog(struct work_struct *work)
 	}
 	rps_unlock_irq_enable(sd);
 
+	local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5845,6 +5848,7 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+	local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 	local_bh_enable();
 }
 
@@ -5966,15 +5970,22 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	while (again) {
 		struct sk_buff *skb;
 
+		local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 		while ((skb = __skb_dequeue(&sd->process_queue))) {
+			local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 			rcu_read_lock();
 			__netif_receive_skb(skb);
 			rcu_read_unlock();
+
+			local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 			input_queue_head_incr(sd);
-			if (++work >= quota)
+			if (++work >= quota) {
+				local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 				return work;
+			}
 
 		}
+		local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 
 		rps_lock_irq_disable(sd);
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
@@ -5989,8 +6000,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			napi->state = 0;
 			again = false;
 		} else {
+			local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
 						   &sd->process_queue);
+			local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
 		}
 		rps_unlock_irq_enable(sd);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 11/24] lwt: Don't disable migration prior to invoking BPF.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (9 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 10/24] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, bpf

There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/lwt_bpf.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 {
 	int ret;
 
-	/* Migration disable and BH disable are needed to protect per-cpu
-	 * redirect_info between BPF prog and skb_do_redirect().
+	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+	 * BPF prog and skb_do_redirect().
 	 */
-	migrate_disable();
 	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	}
 
 	local_bh_enable();
-	migrate_enable();
 
 	return ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (10 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prior to invoking BPF Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  3:39   ` kernel test robot
  2023-12-18  8:33   ` Paolo Abeni
  2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
                   ` (12 subsequent siblings)
  24 siblings, 2 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()) that
accesses seg6_bpf_srh_states should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh, input_action_end_bpf()
is always invoked from softirq context. If that is always the case, the
preempt_disable() statement is superfluous. If it is not always invoked
from softirq, then disabling only preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh() in
guard notation for locking. Add lockdep_assert_held() to ensure the lock
is held while the per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/seg6_local.h |  1 +
 net/core/filter.c        |  3 ++
 net/ipv6/seg6_local.c    | 59 ++++++++++++++++++++++------------------
 3 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..0f22771359f4c 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
 
 struct seg6_bpf_srh_state {
 	struct ipv6_sr_hdr *srh;
+	local_lock_t bh_lock;
 	u16 hdrlen;
 	bool valid;
 };
diff --git a/net/core/filter.c b/net/core/filter.c
index 1737884be52f8..c8013f762524b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
 	void *srh_tlvs, *srh_end, *ptr;
 	int srhoff = 0;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (srh == NULL)
 		return -EINVAL;
 
@@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
 	int hdroff = 0;
 	int err;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	switch (action) {
 	case SEG6_LOCAL_ACTION_END_X:
 		if (!seg6_bpf_has_valid_srh(skb))
@@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
 	int srhoff = 0;
 	int ret;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return -EINVAL;
 
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..ed7278af321a2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 {
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 		this_cpu_ptr(&seg6_bpf_srh_states);
 	struct ipv6_sr_hdr *srh = srh_state->srh;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return false;
 
@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 static int input_action_end_bpf(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
 {
-	struct seg6_bpf_srh_state *srh_state =
-		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state *srh_state;
 	struct ipv6_sr_hdr *srh;
 	int ret;
 
@@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
 	}
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
-	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	/* The access to the per-CPU buffer srh_state is protected by running
+	 * always in softirq context (with disabled BH). On PREEMPT_RT the
+	 * required locking is provided by the following local_lock_nested_bh()
+	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+	 * bpf_prog_run_save_cb().
 	 */
-	preempt_disable();
-	srh_state->srh = srh;
-	srh_state->hdrlen = srh->hdrlen << 3;
-	srh_state->valid = true;
+	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
+		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
+		srh_state->srh = srh;
+		srh_state->hdrlen = srh->hdrlen << 3;
+		srh_state->valid = true;
 
-	rcu_read_lock();
-	bpf_compute_data_pointers(skb);
-	ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
-	rcu_read_unlock();
+		rcu_read_lock();
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+		rcu_read_unlock();
 
-	switch (ret) {
-	case BPF_OK:
-	case BPF_REDIRECT:
-		break;
-	case BPF_DROP:
-		goto drop;
-	default:
-		pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
-		goto drop;
+		switch (ret) {
+		case BPF_OK:
+		case BPF_REDIRECT:
+			break;
+		case BPF_DROP:
+			goto drop;
+		default:
+			pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+			goto drop;
+		}
+
+		if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
+			goto drop;
 	}
 
-	if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
-		goto drop;
-
-	preempt_enable();
 	if (ret != BPF_REDIRECT)
 		seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
 drop:
-	preempt_enable();
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (11 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/filter.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c8013f762524b..896aa3fa699f9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1652,9 +1652,12 @@ struct bpf_scratchpad {
 		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
 		u8     buff[MAX_BPF_STACK];
 	};
+	local_lock_t	lock;
 };
 
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+	.lock	= INIT_LOCAL_LOCK(lock),
+};
 
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
@@ -2023,6 +2026,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 		     diff_size > sizeof(sp->diff)))
 		return -EINVAL;
 
+	guard(local_lock_nested_bh)(&bpf_sp.lock);
 	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
 		sp->diff[j] = ~from[i];
 	for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 14/24] net: Add a lock which held during the redirect process.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (12 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

Introduce redirect_lock as a lock to be acquired when these per-CPU
variables are accessed. Usually the lock is part of the per-CPU variable
which is about to be protected, but since there are a few different
per-CPU variables which need to be protected at the same time (and some
of them depend on a CONFIG setting), a new per-CPU data structure with
the variable bpf_run_lock is used for this.

The lock is a nested-BH lock meaning that on non-PREEMPT_RT kernels this
simply results in a lockdep check and ensuring that bottom halves are
disabled. On PREEMPT_RT kernels this will provide the needed
synchronisation once local_bh_disable() does not act as per-CPU lock.

This patch introduces the bpf_run_lock.redirect_lock lock. It will be
used by drivers in the following patches.

A follow-up step could be to keep bpf_prog_run_xdp() and the
XDP_REDIRECT switch case (with xdp_do_redirect()) close together. That
would allow a single scoped_guard() macro to cover just the two instances
that require locking instead of the whole switch case.
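
A driver-side sketch of that narrower scope (illustrative only; the
function and variable names here are made up, the conversion of the real
drivers follows in the next patches):

	static u32 foo_run_xdp(struct net_device *dev, struct bpf_prog *prog,
			       struct xdp_buff *xdp)
	{
		u32 act;

		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
			act = bpf_prog_run_xdp(prog, xdp);
			if (act == XDP_REDIRECT && xdp_do_redirect(dev, xdp, prog) < 0)
				act = XDP_ABORTED;
		}
		/* Error recovery, counters, xdp_do_flush() etc. happen outside
		 * of the locked section.
		 */
		return act;
	}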

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/bpf.h | 6 ++++++
 net/core/filter.c   | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cff5bb08820ec..6912b85209b12 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -291,6 +291,12 @@ struct bpf_map {
 	s64 __percpu *elem_count;
 };
 
+struct bpf_run_lock {
+	local_lock_t redirect_lock;
+};
+
+DECLARE_PER_CPU(struct bpf_run_lock, bpf_run_lock);
+
 static inline const char *btf_field_type_name(enum btf_field_type type)
 {
 	switch (type) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 896aa3fa699f9..7c9653734fb60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -89,6 +89,11 @@
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
 
+DEFINE_PER_CPU(struct bpf_run_lock, bpf_run_lock) = {
+	.redirect_lock = INIT_LOCAL_LOCK(redirect_lock),
+};
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_run_lock);
+
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len)
 {
 	if (in_compat_syscall()) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (13 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  4:12   ` kernel test robot
  2023-12-20  0:25   ` Alexei Starovoitov
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh,
	Martin KaFai Lau, Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
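
Where the locked region does not map onto a single block, the explicit
lock/unlock calls are used instead of the scope-based guard(), as in
net/bpf/test_run.c and net/core/lwt_bpf.c below. A simplified outline
of that form (act, prog and xdp stand in for the local variables of the
respective function, not taken verbatim from the changed files):

          local_bh_disable();
          local_lock_nested_bh(&bpf_run_lock.redirect_lock);
          act = bpf_prog_run_xdp(prog, &xdp);
          /* XDP_REDIRECT handling which accesses bpf_redirect_info */
          local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
          local_bh_enable();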

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Ronak Doshi <doshir@vmware.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/vmxnet3/vmxnet3_xdp.c |  1 +
 kernel/bpf/cpumap.c               |  2 ++
 net/bpf/test_run.c                | 11 ++++++++---
 net/core/dev.c                    |  3 +++
 net/core/filter.c                 |  1 +
 net/core/lwt_bpf.c                |  2 ++
 net/sched/cls_api.c               |  2 ++
 7 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
index 80ddaff759d47..18bce98fd2e31 100644
--- a/drivers/net/vmxnet3/vmxnet3_xdp.c
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -257,6 +257,7 @@ vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp,
 	u32 act;
 
 	rq->stats.xdp_packets++;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	page = virt_to_page(xdp->data_hard_start);
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c26d49bb78679 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -144,6 +144,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
 	int err;
 
 	list_for_each_entry_safe(skb, tmp, listp, list) {
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
 		switch (act) {
 		case XDP_PASS:
@@ -182,6 +183,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce10..db8f7eb35c6ca 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -348,6 +349,9 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 out:
+	xdp_clear_return_frame_no_direct();
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
 	if (redirect)
 		xdp_do_flush();
 	if (nframes) {
@@ -356,7 +360,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			err = ret;
 	}
 
-	xdp_clear_return_frame_no_direct();
 	local_bh_enable();
 	return err;
 }
@@ -417,10 +420,12 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
-		if (xdp)
+		if (xdp) {
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			*retval = bpf_prog_run_xdp(prog, ctx);
-		else
+		} else {
 			*retval = bpf_prog_run(prog, ctx);
+		}
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a0f6da7b3ae5..5ba7509e88752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		*pt_prev = NULL;
 	}
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
 	tcx_set_ingress(skb, true);
 
@@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	if (!entry)
 		return skb;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c9653734fb60..72a7812f933a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
  */
 void xdp_do_flush(void)
 {
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	__dev_flush();
 	__cpu_map_flush();
 	__xsk_map_flush();
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..74b88e897a7e3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
 	local_bh_enable();
 
 	return ret;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd1639863..da61b99bc558f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -23,6 +23,7 @@
 #include <linux/jhash.h>
 #include <linux/rculist.h>
 #include <linux/rhashtable.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 
 	fl = rcu_dereference_bh(qe->filter_chain);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
 	case TC_ACT_SHOT:
 		qdisc_qstats_drop(sch);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (14 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16 19:28   ` kernel test robot
  2023-12-18  8:52   ` Daniel Borkmann
  2023-12-15 17:07 ` [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: " Sebastian Andrzej Siewior
                   ` (8 subsequent siblings)
  24 siblings, 2 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, K. Y. Srinivasan, Michael S. Tsirkin,
	Alexei Starovoitov, Andrii Nakryiko, Dexuan Cui, Haiyang Zhang,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	Juergen Gross, KP Singh, Martin KaFai Lau, Nikolay Aleksandrov,
	Song Liu, Stanislav Fomichev, Stefano Stabellini, Wei Liu,
	Willem de Bruijn, Xuan Zhuo, Yonghong Song, bpf, virtualization,
	xen-devel

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Cc: virtualization@lists.linux.dev
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/hyperv/netvsc_bpf.c |  1 +
 drivers/net/netkit.c            | 13 +++++++----
 drivers/net/tun.c               | 28 +++++++++++++----------
 drivers/net/veth.c              | 40 ++++++++++++++++++++-------------
 drivers/net/virtio_net.c        |  1 +
 drivers/net/xen-netfront.c      |  1 +
 6 files changed, 52 insertions(+), 32 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 4a9522689fa4f..55f8ca92ca199 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -58,6 +58,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
 
 	memcpy(xdp->data, data, len);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	switch (act) {
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
 	skb->dev = peer;
 	entry = rcu_dereference(nk->active);
-	if (entry)
-		ret = netkit_run(entry, skb, ret);
+	if (entry) {
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			ret = netkit_run(entry, skb, ret);
+			if (ret == NETKIT_REDIRECT) {
+				dev_sw_netstats_tx_add(dev, 1, len);
+				skb_do_redirect(skb);
+			}
+		}
+	}
 	switch (ret) {
 	case NETKIT_NEXT:
 	case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		break;
 	case NETKIT_REDIRECT:
-		dev_sw_netstats_tx_add(dev, 1, len);
-		skb_do_redirect(skb);
 		break;
 	case NETKIT_DROP:
 	default:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afa5497f7c35c..fe0d31f11e4b6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1708,16 +1708,18 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp_init_buff(&xdp, buflen, &tfile->xdp_rxq);
 		xdp_prepare_buff(&xdp, buf, pad, len, false);
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
-		if (act == XDP_REDIRECT || act == XDP_TX) {
-			get_page(alloc_frag->page);
-			alloc_frag->offset += buflen;
-		}
-		err = tun_xdp_act(tun, xdp_prog, &xdp, act);
-		if (err < 0) {
-			if (act == XDP_REDIRECT || act == XDP_TX)
-				put_page(alloc_frag->page);
-			goto out;
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			if (act == XDP_REDIRECT || act == XDP_TX) {
+				get_page(alloc_frag->page);
+				alloc_frag->offset += buflen;
+			}
+			err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+			if (err < 0) {
+				if (act == XDP_REDIRECT || act == XDP_TX)
+					put_page(alloc_frag->page);
+				goto out;
+			}
 		}
 
 		if (err == XDP_REDIRECT)
@@ -2460,8 +2462,10 @@ static int tun_xdp_one(struct tun_struct *tun,
 		xdp_init_buff(xdp, buflen, &tfile->xdp_rxq);
 		xdp_set_data_meta_invalid(xdp);
 
-		act = bpf_prog_run_xdp(xdp_prog, xdp);
-		ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, xdp);
+			ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+		}
 		if (ret < 0) {
 			put_page(virt_to_head_page(xdp->data));
 			return ret;
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 977861c46b1fe..c69e5ff9f8795 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -624,7 +624,18 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 		xdp->rxq = &rq->xdp_rxq;
 		vxbuf.skb = NULL;
 
-		act = bpf_prog_run_xdp(xdp_prog, xdp);
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, xdp);
+			if (act == XDP_REDIRECT) {
+				orig_frame = *frame;
+				xdp->rxq->mem = frame->mem;
+				if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+					frame = &orig_frame;
+					stats->xdp_drops++;
+					goto err_xdp;
+				}
+			}
+		}
 
 		switch (act) {
 		case XDP_PASS:
@@ -644,13 +655,6 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			rcu_read_unlock();
 			goto xdp_xmit;
 		case XDP_REDIRECT:
-			orig_frame = *frame;
-			xdp->rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-				frame = &orig_frame;
-				stats->rx_drops++;
-				goto err_xdp;
-			}
 			stats->xdp_redirect++;
 			rcu_read_unlock();
 			goto xdp_xmit;
@@ -857,7 +861,18 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
+		if (act == XDP_REDIRECT) {
+			veth_xdp_get(xdp);
+			consume_skb(skb);
+			xdp->rxq->mem = rq->xdp_mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+				stats->rx_drops++;
+				goto err_xdp;
+			}
+		}
+	}
 
 	switch (act) {
 	case XDP_PASS:
@@ -875,13 +890,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-			stats->rx_drops++;
-			goto err_xdp;
-		}
 		stats->xdp_redirect++;
 		rcu_read_unlock();
 		goto xdp_xmit;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061f..5e362c4604239 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1010,6 +1010,7 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	u64_stats_inc(&stats->xdp_packets);
 
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index ad29f370034e4..e3daa8cdeb84e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -978,6 +978,7 @@ static u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
 	xdp_prepare_buff(xdp, page_address(pdata), XDP_PACKET_HEADROOM,
 			 len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_TX:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (15 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16 22:09   ` Kiyanovski, Arthur
  2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Arthur Kiyanovski,
	David Arinzon, Igor Russkikh, Jesper Dangaard Brouer,
	John Fastabend, Michael Chan, Noam Dagan, Saeed Bishara,
	Shay Agroskin, Sunil Goutham

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arthur Kiyanovski <akiyano@amazon.com>
Cc: David Arinzon <darinzon@amazon.com>
Cc: Igor Russkikh <irusskikh@marvell.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Noam Dagan <ndagan@amazon.com>
Cc: Saeed Bishara <saeedb@amazon.com>
Cc: Shay Agroskin <shayagr@amazon.com>
Cc: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  1 +
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 26 ++++++++++++-------
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  1 +
 .../net/ethernet/cavium/thunder/nicvf_main.c  |  3 ++-
 drivers/net/ethernet/engleder/tsnep_main.c    | 17 +++++++-----
 5 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index b5bca48148309..cf075bc5e2b13 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -385,6 +385,7 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		goto out;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	verdict = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	switch (verdict) {
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 694daeaf3e615..5d33d478d5109 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -458,7 +458,24 @@ static struct sk_buff *aq_xdp_run_prog(struct aq_nic_s *aq_nic,
 	if (xdp_buff_has_frags(xdp) && !prog->aux->xdp_has_frags)
 		goto out_aborted;
 
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
+	if (act == XDP_REDIRECT) {
+		if (xdp_do_redirect(aq_nic->ndev, xdp, prog) < 0) {
+			local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+			goto out_aborted;
+		}
+		local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
+		xdp_do_flush();
+		u64_stats_update_begin(&rx_ring->stats.rx.syncp);
+		++rx_ring->stats.rx.xdp_redirect;
+		u64_stats_update_end(&rx_ring->stats.rx.syncp);
+		aq_get_rxpages_xdp(buff, xdp);
+	} else {
+		local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+	}
+
 	switch (act) {
 	case XDP_PASS:
 		skb = aq_xdp_build_skb(xdp, aq_nic->ndev, buff);
@@ -481,15 +498,6 @@ static struct sk_buff *aq_xdp_run_prog(struct aq_nic_s *aq_nic,
 		u64_stats_update_end(&rx_ring->stats.rx.syncp);
 		aq_get_rxpages_xdp(buff, xdp);
 		break;
-	case XDP_REDIRECT:
-		if (xdp_do_redirect(aq_nic->ndev, xdp, prog) < 0)
-			goto out_aborted;
-		xdp_do_flush();
-		u64_stats_update_begin(&rx_ring->stats.rx.syncp);
-		++rx_ring->stats.rx.xdp_redirect;
-		u64_stats_update_end(&rx_ring->stats.rx.syncp);
-		aq_get_rxpages_xdp(buff, xdp);
-		break;
 	default:
 		fallthrough;
 	case XDP_ABORTED:
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 96f5ca778c67d..c4d989da7fade 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -253,6 +253,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 	/* BNXT_RX_PAGE_MODE(bp) when XDP enabled */
 	orig_data = xdp.data;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	tx_avail = bnxt_tx_avail(bp, txr);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index eff350e0bc2a8..8e1406101f71b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -554,7 +554,8 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false);
 	orig_data = xdp.data;
 
-	action = bpf_prog_run_xdp(prog, &xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+		action = bpf_prog_run_xdp(prog, &xdp);
 
 	len = xdp.data_end - xdp.data;
 	/* Check if XDP program has changed headers */
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index df40c720e7b23..acda3502d274f 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -1268,6 +1268,7 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
 
 	length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -1309,14 +1310,16 @@ static bool tsnep_xdp_run_prog_zc(struct tsnep_rx *rx, struct bpf_prog *prog,
 {
 	u32 act;
 
-	act = bpf_prog_run_xdp(prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(prog, xdp);
 
-	/* XDP_REDIRECT is the main action for zero-copy */
-	if (likely(act == XDP_REDIRECT)) {
-		if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
-			goto out_failure;
-		*status |= TSNEP_XDP_REDIRECT;
-		return true;
+		/* XDP_REDIRECT is the main action for zero-copy */
+		if (likely(act == XDP_REDIRECT)) {
+			if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
+				goto out_failure;
+			*status |= TSNEP_XDP_REDIRECT;
+			return true;
+		}
 	}
 
 	switch (act) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 18/24] net: Freescale: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (16 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Clark Wang,
	Claudiu Manoil, Ioana Ciornei, Jesper Dangaard Brouer,
	John Fastabend, Madalin Bucur, NXP Linux Team, Shenwei Wang,
	Vladimir Oltean, Wei Fang, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Clark Wang <xiaoning.wang@nxp.com>
Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Madalin Bucur <madalin.bucur@nxp.com>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Shenwei Wang <shenwei.wang@nxp.com>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: Wei Fang <wei.fang@nxp.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 .../net/ethernet/freescale/dpaa/dpaa_eth.c    |  1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-xsk.c  | 30 ++++++++++---------
 drivers/net/ethernet/freescale/enetc/enetc.c  |  1 +
 drivers/net/ethernet/freescale/fec_main.c     |  1 +
 5 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index dcbc598b11c6c..8adc766282fde 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2597,6 +2597,7 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	}
 #endif
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	/* Update the length and the offset of the FD */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 888509cf1f210..08be35a3e3de7 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -442,6 +442,7 @@ static u32 dpaa2_eth_run_xdp(struct dpaa2_eth_priv *priv,
 	xdp_prepare_buff(&xdp, vaddr + offset, XDP_PACKET_HEADROOM,
 			 dpaa2_fd_get_len(fd), false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	/* xdp.data pointer may have changed */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
index 051748b997f3f..e3ae9de6b0a34 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
@@ -56,23 +56,25 @@ static u32 dpaa2_xsk_run_xdp(struct dpaa2_eth_priv *priv,
 	xdp_buff->rxq = &ch->xdp_rxq;
 
 	xsk_buff_dma_sync_for_cpu(xdp_buff, ch->xsk_pool);
-	xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
 
-	/* xdp.data pointer may have changed */
-	dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
-	dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
+		/* xdp.data pointer may have changed */
+		dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
+		dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
 
-	if (likely(xdp_act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
-		if (unlikely(err)) {
-			ch->stats.xdp_drop++;
-			dpaa2_eth_recycle_buf(priv, ch, addr);
-		} else {
-			ch->buf_count--;
-			ch->stats.xdp_redirect++;
+		if (likely(xdp_act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
+			if (unlikely(err)) {
+				ch->stats.xdp_drop++;
+				dpaa2_eth_recycle_buf(priv, ch, addr);
+			} else {
+				ch->buf_count--;
+				ch->stats.xdp_redirect++;
+			}
+
+			goto xdp_redir;
 		}
-
-		goto xdp_redir;
 	}
 
 	switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index cffbf27c4656b..d516b28815af4 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -1578,6 +1578,7 @@ static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
 			rx_byte_cnt += VLAN_HLEN;
 		rx_byte_cnt += xdp_get_buff_len(&xdp_buff);
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		xdp_act = bpf_prog_run_xdp(prog, &xdp_buff);
 
 		switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c3b7694a74851..335b1e307d468 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1587,6 +1587,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail and xdp_adjust_head: DMA sync for_device cover
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (17 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, K. Y. Srinivasan, Alexei Starovoitov,
	AngeloGioacchino Del Regno, Dexuan Cui, Dimitris Michailidis,
	Felix Fietkau, Haiyang Zhang, Horatiu Vultur, Jeroen de Borst,
	Jesper Dangaard Brouer, John Crispin, John Fastabend,
	Lorenzo Bianconi, Mark Lee, Matthias Brugger,
	Praveen Kaligineedi, Sean Wang, Shailend Chand, UNGLinuxDriver,
	Wei Liu, bpf, linux-hyperv, linux-mediatek

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Dimitris Michailidis <dmichail@fungible.com>
Cc: Felix Fietkau <nbd@nbd.name>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Crispin <john@phrozen.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: Mark Lee <Mark-MC.Lee@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Sean Wang <sean.wang@mediatek.com>
Cc: Shailend Chand <shailend@google.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Wei Liu <wei.liu@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-mediatek@lists.infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/fungible/funeth/funeth_rx.c     |  1 +
 drivers/net/ethernet/google/gve/gve_rx.c             | 12 +++++++-----
 drivers/net/ethernet/mediatek/mtk_eth_soc.c          |  1 +
 drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c |  1 +
 drivers/net/ethernet/microsoft/mana/mana_bpf.c       |  1 +
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/fungible/funeth/funeth_rx.c b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
index 7e2584895de39..e7b1382545908 100644
--- a/drivers/net/ethernet/fungible/funeth/funeth_rx.c
+++ b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
@@ -152,6 +152,7 @@ static void *fun_run_xdp(struct funeth_rxq *q, skb_frag_t *frags, void *buf_va,
 	xdp_prepare_buff(&xdp, buf_va, FUN_XDP_HEADROOM, skb_frag_size(frags) -
 			 (FUN_RX_TAILROOM + FUN_XDP_HEADROOM), false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_prog = READ_ONCE(q->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 73655347902d2..504c8ef761a33 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -779,11 +779,13 @@ static void gve_rx(struct gve_rx_ring *rx, netdev_features_t feat,
 				 page_info->page_offset, GVE_RX_PAD,
 				 len, false);
 		old_data = xdp.data;
-		xdp_act = bpf_prog_run_xdp(xprog, &xdp);
-		if (xdp_act != XDP_PASS) {
-			gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
-			ctx->total_size += frag_size;
-			goto finish_ok_pkt;
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			xdp_act = bpf_prog_run_xdp(xprog, &xdp);
+			if (xdp_act != XDP_PASS) {
+				gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
+				ctx->total_size += frag_size;
+				goto finish_ok_pkt;
+			}
 		}
 
 		page_info->pad += xdp.data - old_data;
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 3cf6589cfdacf..477a74ee18c0a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1946,6 +1946,7 @@ static u32 mtk_xdp_run(struct mtk_eth *eth, struct mtk_rx_ring *ring,
 	if (!prog)
 		goto out;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
index 9ee61db8690b4..026311af07f9e 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
@@ -84,6 +84,7 @@ int lan966x_xdp_run(struct lan966x_port *port, struct page *page, u32 data_len)
 	xdp_prepare_buff(&xdp, page_address(page),
 			 IFH_LEN_BYTES + XDP_PACKET_HEADROOM,
 			 data_len - IFH_LEN_BYTES, false);
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 23b1521c0df96..d465b1dd9fca0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -93,6 +93,7 @@ u32 mana_run_xdp(struct net_device *ndev, struct mana_rxq *rxq,
 	xdp_init_buff(xdp, PAGE_SIZE, &rxq->xdp_rxq);
 	xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	rx_stats = &rxq->stats;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (18 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  4:53   ` kernel test robot
  2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: bpf@vger.kernel.org (open list:XDP
Cc: intel-wired-lan@lists.osuosl.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 22 +++++++++--------
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  1 +
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 21 ++++++++--------
 drivers/net/ethernet/intel/igb/igb_main.c     |  1 +
 drivers/net/ethernet/intel/igc/igc_main.c     |  5 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 24 ++++++++++---------
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 ++-
 9 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index dd410b15000f7..76e069ae2183a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2326,6 +2326,7 @@ static int i40e_run_xdp(struct i40e_ring *rx_ring, struct xdp_buff *xdp, struct
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e99fa854d17f1..2b0c0c1f3ddc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -201,17 +201,19 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp,
 	struct i40e_ring *xdp_ring;
 	u32 act;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return I40E_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = I40E_XDP_EXIT;
-		else
-			result = I40E_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return I40E_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = I40E_XDP_EXIT;
+			else
+				result = I40E_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 9e97ea8630686..5d4cfa3455b37 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -571,6 +571,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	if (!xdp_prog)
 		goto exit;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 99954508184f9..02f89c22d19e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -762,17 +762,18 @@ ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	int err, result = ICE_XDP_PASS;
 	u32 act;
 
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return ICE_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = ICE_XDP_EXIT;
-		else
-			result = ICE_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return ICE_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = ICE_XDP_EXIT;
+			else
+				result = ICE_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b2295caa2f0ab..e01be809d030e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8621,6 +8621,7 @@ static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index e9bb403bbacf9..8321419b3a307 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2485,7 +2485,10 @@ static int __igc_xdp_run_prog(struct igc_adapter *adapter,
 			      struct bpf_prog *prog,
 			      struct xdp_buff *xdp)
 {
-	u32 act = bpf_prog_run_xdp(prog, xdp);
+	u32 act;
+
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
+	act = bpf_prog_run_xdp(prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 94bde2cad0f47..de564e8b83be2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2203,6 +2203,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 59798bc33298f..b988f758aad49 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -104,18 +104,20 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 	struct xdp_frame *xdpf;
 	u32 act;
 
-	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return IXGBE_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = IXGBE_XDP_EXIT;
-		else
-			result = IXGBE_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return IXGBE_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = IXGBE_XDP_EXIT;
+			else
+				result = IXGBE_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a44e4bd561421..1c58c08aa15ff 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1059,7 +1059,8 @@ static struct sk_buff *ixgbevf_run_xdp(struct ixgbevf_adapter *adapter,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 21/24] net: marvell: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (19 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Geetha sowjanya,
	Jesper Dangaard Brouer, John Fastabend, Marcin Wojtas,
	Russell King, Subbaraya Sundeep, Sunil Goutham, Thomas Petazzoni,
	hariprasad, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers which use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Geetha sowjanya <gakula@marvell.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Marcin Wojtas <mw@semihalf.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Subbaraya Sundeep <sbhatta@marvell.com>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: hariprasad <hkelam@marvell.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/marvell/mvneta.c                  | 2 ++
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c        | 1 +
 drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c | 1 +
 3 files changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 29aac327574d6..9c7aacd73b590 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2263,6 +2263,8 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 	len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
 	data_len = xdp->data_end - xdp->data;
+
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 93137606869e2..3a5524ffaba68 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3793,6 +3793,7 @@ mvpp2_run_xdp(struct mvpp2_port *port, struct bpf_prog *prog,
 	u32 ret, act;
 
 	len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
index 4d519ea833b2c..e48e84d6159bc 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
@@ -1422,6 +1422,7 @@ static bool otx2_xdp_rcv_pkt_handler(struct otx2_nic *pfvf,
 	xdp_prepare_buff(&xdp, hard_start, data - hard_start,
 			 cqe->sg.seg_size, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, &xdp);
 
 	switch (act) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 22/24] net: mellanox, nfp, sfc: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (20 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Edward Cree,
	Jesper Dangaard Brouer, John Fastabend, Leon Romanovsky,
	Louis Peens, Martin Habets, Saeed Mahameed, Tariq Toukan, bpf,
	linux-net-drivers, linux-rdma, oss-drivers

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the implicit locking of local_bh_disable() on PREEMPT_RT, these
data structures require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect(), and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
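
As an illustration only (not part of this patch), that ideal scope
could look roughly like the hypothetical RX helper below, using the
scoped_guard() helper introduced earlier in the series; the function
name and surroundings are made up:

/* Hypothetical example, not actual driver code. */
static u32 example_rx_xdp(struct net_device *dev, struct bpf_prog *prog,
			  struct xdp_buff *xdp)
{
	u32 act;

	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
		act = bpf_prog_run_xdp(prog, xdp);
		if (act == XDP_REDIRECT &&
		    xdp_do_redirect(dev, xdp, prog))
			act = XDP_ABORTED;
	}
	/* Error recovery, DMA unmapping and freeing/allocating memory
	 * would happen here, outside of the locked section.
	 */
	return act;
}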

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Edward Cree <ecree.xilinx@gmail.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Martin Habets <habetsm.xilinx@gmail.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: bpf@vger.kernel.org
Cc: linux-net-drivers@amd.com
Cc: linux-rdma@vger.kernel.org
Cc: oss-drivers@corigine.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c       | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
 drivers/net/ethernet/netronome/nfp/nfd3/dp.c     | 3 ++-
 drivers/net/ethernet/netronome/nfp/nfd3/xsk.c    | 1 +
 drivers/net/ethernet/netronome/nfp/nfdk/dp.c     | 3 ++-
 drivers/net/ethernet/sfc/rx.c                    | 1 +
 drivers/net/ethernet/sfc/siena/rx.c              | 1 +
 7 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a09b6e05337d9..c0a3ac3405bc5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -833,6 +833,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			mxbuf.ring = ring;
 			mxbuf.dev = dev;
 
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
 			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 7decc81ed33a9..b4e3c6a5a6da6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -269,6 +269,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	u32 act;
 	int err;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
index 17381bfc15d72..a041b55514aa3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
@@ -1011,7 +1011,8 @@ static int nfp_nfd3_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 					 pkt_off - NFP_NET_RX_BUF_HEADROOM,
 					 pkt_len, true);
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+				act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 			pkt_len = xdp.data_end - xdp.data;
 			pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
index 45be6954d5aae..38f2d4c2b5b7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
@@ -216,6 +216,7 @@ nfp_nfd3_xsk_rx(struct nfp_net_rx_ring *rx_ring, int budget,
 			}
 		}
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_xdp(xdp_prog, xrxbuf->xdp);
 
 		pkt_len = xrxbuf->xdp->data_end - xrxbuf->xdp->data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
index 8d78c6faefa8a..af0a36c4fb018 100644
--- a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
@@ -1130,7 +1130,8 @@ static int nfp_nfdk_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 					 pkt_off - NFP_NET_RX_BUF_HEADROOM,
 					 pkt_len, true);
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+				act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 			pkt_len = xdp.data_end - xdp.data;
 			pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index f77a2d3ef37ec..3712d29150af5 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
 			 rx_buf->len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	offset = (u8 *)xdp.data - *ehp;
diff --git a/drivers/net/ethernet/sfc/siena/rx.c b/drivers/net/ethernet/sfc/siena/rx.c
index 98d3c0743c0f5..6bfc4cd1c83e0 100644
--- a/drivers/net/ethernet/sfc/siena/rx.c
+++ b/drivers/net/ethernet/sfc/siena/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
 			 rx_buf->len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	offset = (u8 *)xdp.data - *ehp;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (21 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
  2023-12-15 22:50 ` [PATCH net-next 00/24] locking: Introduce nested-BH locking Jakub Kicinski
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexandre Torgue, Alexei Starovoitov,
	Ariel Elior, Ilias Apalodimas, Jassi Brar,
	Jesper Dangaard Brouer, John Fastabend, Jose Abreu,
	Manish Chopra, Maxime Coquelin, Ravi Gunasekaran, Roger Quadros,
	Siddharth Vadapalli, bpf, linux-omap, linux-stm32

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without the implicit locking of local_bh_disable() on PREEMPT_RT, these
data structures require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect(), and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ariel Elior <aelior@marvell.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jassi Brar <jaswinder.singh@linaro.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jose Abreu <joabreu@synopsys.com>
Cc: Manish Chopra <manishc@marvell.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Ravi Gunasekaran <r-gunasekaran@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: bpf@vger.kernel.org
Cc: linux-omap@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c        |  1 +
 drivers/net/ethernet/socionext/netsec.c           |  1 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  1 +
 drivers/net/ethernet/ti/cpsw_priv.c               | 15 +++++++++------
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index cb1746bc0e0c5..ce5af094fb817 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1091,6 +1091,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 	xdp_prepare_buff(&xdp, page_address(bd->data), *data_offset,
 			 *len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, &xdp);
 
 	/* Recalculate, as XDP might have changed the headers */
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 0891e9e49ecb5..47e314338f3f3 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -905,6 +905,7 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 37e64283f9107..9e92affc8c22c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4893,6 +4893,7 @@ static int __stmmac_xdp_run_prog(struct stmmac_priv *priv,
 	u32 act;
 	int res;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c
index 764ed298b5708..f38c49f9fab35 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -1335,9 +1335,15 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
 	if (!prog)
 		return CPSW_XDP_PASS;
 
-	act = bpf_prog_run_xdp(prog, xdp);
-	/* XDP prog might have changed packet data and boundaries */
-	*len = xdp->data_end - xdp->data;
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(prog, xdp);
+		/* XDP prog might have changed packet data and boundaries */
+		*len = xdp->data_end - xdp->data;
+		if (act == XDP_REDIRECT) {
+			if (xdp_do_redirect(ndev, xdp, prog))
+				goto drop;
+		}
+	}
 
 	switch (act) {
 	case XDP_PASS:
@@ -1352,9 +1358,6 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
 			xdp_return_frame_rx_napi(xdpf);
 		break;
 	case XDP_REDIRECT:
-		if (xdp_do_redirect(ndev, xdp, prog))
-			goto drop;
-
 		/*  Have to flush here, per packet, instead of doing it in bulk
 		 *  at the end of the napi handler. The RX devices on this
 		 *  particular hardware is sharing a common queue, so the
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (22 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 22:50 ` [PATCH net-next 00/24] locking: Introduce nested-BH locking Jakub Kicinski
  24 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

The users of bpf_redirect_info should serialize their access by
acquiring the nested BH-lock bpf_run_lock.redirect_lock. This lock
should be acquired before the first usage (bpf_prog_run_xdp()) and
dropped after the last user in the context (xdp_do_redirect()).

Current users in the tree have been audited and updated.

Add lockdep annotations to ensure new users acquire the lock.
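
For illustration only (not part of this patch), a driver-side caller
that satisfies the new assertions would look roughly like this; the
function name is made up:

/* Hypothetical caller, sketch only. */
static int example_run_and_redirect(struct net_device *dev,
				    struct bpf_prog *prog,
				    struct xdp_buff *xdp)
{
	u32 act;

	/* Acquiring the nested-BH lock here satisfies the
	 * lockdep_assert_held() checks in bpf_prog_run_xdp() and
	 * xdp_do_redirect().
	 */
	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);

	act = bpf_prog_run_xdp(prog, xdp);
	if (act == XDP_REDIRECT)
		return xdp_do_redirect(dev, xdp, prog);
	return 0;
}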

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/xdp.h |  1 +
 net/core/filter.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 349c36fb5fd8f..cdeab175abf18 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -493,6 +493,7 @@ static inline void xdp_clear_features_flag(struct net_device *dev)
 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	/* Driver XDP hooks are invoked within a single NAPI poll cycle and thus
 	 * under local_bh_disable(), which provides the needed RCU protection
 	 * for accessing map entries.
diff --git a/net/core/filter.c b/net/core/filter.c
index 72a7812f933a1..a2f97503ed578 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2495,6 +2495,7 @@ int skb_do_redirect(struct sk_buff *skb)
 	struct net_device *dev;
 	u32 flags = ri->flags;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2525,6 +2526,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
 
@@ -2546,6 +2549,8 @@ BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
 
@@ -2568,6 +2573,8 @@ BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
 
@@ -4287,6 +4294,8 @@ u32 xdp_master_redirect(struct xdp_buff *xdp)
 	struct net_device *master, *slave;
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
@@ -4394,6 +4403,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	enum bpf_map_type map_type = ri->map_type;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
@@ -4408,6 +4418,7 @@ int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	enum bpf_map_type map_type = ri->map_type;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 00/24] locking: Introduce nested-BH locking.
  2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
                   ` (23 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
@ 2023-12-15 22:50 ` Jakub Kicinski
  2023-12-18 17:23   ` Sebastian Andrzej Siewior
  24 siblings, 1 reply; 60+ messages in thread
From: Jakub Kicinski @ 2023-12-15 22:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

As I said at LPC, complicating drivers with odd locking constructs
is a no go for me.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2023-12-16  3:39   ` kernel test robot
  2023-12-18  8:33   ` Paolo Abeni
  1 sibling, 0 replies; 60+ messages in thread
From: kernel test robot @ 2023-12-16  3:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-13-bigeasy%40linutronix.de
patch subject: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
config: x86_64-randconfig-r131-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161151.k1MBvXUD-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161151.k1MBvXUD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161151.k1MBvXUD-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/ipv6/seg6_local.c:1431:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/ipv6/seg6_local.c:1431:9: sparse:     expected struct local_lock_t [usertype] *l
   net/ipv6/seg6_local.c:1431:9: sparse:     got struct local_lock_t [noderef] __percpu *

vim +1431 net/ipv6/seg6_local.c

  1410	
  1411	static int input_action_end_bpf(struct sk_buff *skb,
  1412					struct seg6_local_lwt *slwt)
  1413	{
  1414		struct seg6_bpf_srh_state *srh_state;
  1415		struct ipv6_sr_hdr *srh;
  1416		int ret;
  1417	
  1418		srh = get_and_validate_srh(skb);
  1419		if (!srh) {
  1420			kfree_skb(skb);
  1421			return -EINVAL;
  1422		}
  1423		advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
  1424	
  1425		/* The access to the per-CPU buffer srh_state is protected by running
  1426		 * always in softirq context (with disabled BH). On PREEMPT_RT the
  1427		 * required locking is provided by the following local_lock_nested_bh()
  1428		 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
  1429		 * bpf_prog_run_save_cb().
  1430		 */
> 1431		scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
  1432			srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
  1433			srh_state->srh = srh;
  1434			srh_state->hdrlen = srh->hdrlen << 3;
  1435			srh_state->valid = true;
  1436	
  1437			rcu_read_lock();
  1438			bpf_compute_data_pointers(skb);
  1439			ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
  1440			rcu_read_unlock();
  1441	
  1442			switch (ret) {
  1443			case BPF_OK:
  1444			case BPF_REDIRECT:
  1445				break;
  1446			case BPF_DROP:
  1447				goto drop;
  1448			default:
  1449				pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
  1450				goto drop;
  1451			}
  1452	
  1453			if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
  1454				goto drop;
  1455		}
  1456	
  1457		if (ret != BPF_REDIRECT)
  1458			seg6_lookup_nexthop(skb, NULL, 0);
  1459	
  1460		return dst_input(skb);
  1461	
  1462	drop:
  1463		kfree_skb(skb);
  1464		return -EINVAL;
  1465	}
  1466	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
@ 2023-12-16  4:12   ` kernel test robot
  2023-12-20  0:25   ` Alexei Starovoitov
  1 sibling, 0 replies; 60+ messages in thread
From: kernel test robot @ 2023-12-16  4:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh,
	Martin KaFai Lau, Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song

Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-16-bigeasy%40linutronix.de
patch subject: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
config: x86_64-randconfig-122-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161103.iKeGoacH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161103.iKeGoacH-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   net/sched/cls_api.c:391:22: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __be16 [usertype] protocol @@     got unsigned int [usertype] protocol @@
   net/sched/cls_api.c:391:22: sparse:     expected restricted __be16 [usertype] protocol
   net/sched/cls_api.c:391:22: sparse:     got unsigned int [usertype] protocol
   net/sched/cls_api.c:1862:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1862:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1962:20: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1924:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:1924:25: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto *
   net/sched/cls_api.c:1944:16: sparse:    struct tcf_proto [noderef] __rcu *
   net/sched/cls_api.c:2010:25: sparse: sparse: restricted __be16 degrades to integer
   net/sched/cls_api.c:2697:50: sparse: sparse: restricted __be16 degrades to integer
>> net/sched/cls_api.c:3933:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/sched/cls_api.c:3933:38: sparse:     expected struct local_lock_t [usertype] *l
   net/sched/cls_api.c:3933:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/sched/cls_api.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
   include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected void const [noderef] __percpu *__vpp_verify @@     got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *

vim +3933 net/sched/cls_api.c

  3921	
  3922	struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
  3923					  struct sk_buff **to_free, int *ret)
  3924	{
  3925		struct tcf_result cl_res;
  3926		struct tcf_proto *fl;
  3927	
  3928		if (!qe->info.block_index)
  3929			return skb;
  3930	
  3931		fl = rcu_dereference_bh(qe->filter_chain);
  3932	
> 3933		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
  3934		switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
  3935		case TC_ACT_SHOT:
  3936			qdisc_qstats_drop(sch);
  3937			__qdisc_drop(skb, to_free);
  3938			*ret = __NET_XMIT_BYPASS;
  3939			return NULL;
  3940		case TC_ACT_STOLEN:
  3941		case TC_ACT_QUEUED:
  3942		case TC_ACT_TRAP:
  3943			__qdisc_drop(skb, to_free);
  3944			*ret = __NET_XMIT_STOLEN;
  3945			return NULL;
  3946		case TC_ACT_REDIRECT:
  3947			skb_do_redirect(skb);
  3948			*ret = __NET_XMIT_STOLEN;
  3949			return NULL;
  3950		}
  3951	
  3952		return skb;
  3953	}
  3954	EXPORT_SYMBOL(tcf_qevent_handle);
  3955	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache.
  2023-12-15 17:07 ` [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
@ 2023-12-16  4:43   ` kernel test robot
  2024-01-12 10:58     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: kernel test robot @ 2023-12-16  4:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior

Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-5-bigeasy%40linutronix.de
patch subject: [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache.
config: x86_64-randconfig-121-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161210.q8xdLxsl-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161210.q8xdLxsl-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161210.q8xdLxsl-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/core/skbuff.c:302:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/core/skbuff.c:302:38: sparse:     expected struct local_lock_t [usertype] *l
   net/core/skbuff.c:302:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/core/skbuff.c:331:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/core/skbuff.c:331:38: sparse:     expected struct local_lock_t [usertype] *l
   net/core/skbuff.c:331:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/core/skbuff.c:734:17: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/core/skbuff.c:734:17: sparse:     expected struct local_lock_t [usertype] *l
   net/core/skbuff.c:734:17: sparse:     got struct local_lock_t [noderef] __percpu *
   net/core/skbuff.c:806:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/core/skbuff.c:806:9: sparse:     expected struct local_lock_t [usertype] *l
   net/core/skbuff.c:806:9: sparse:     got struct local_lock_t [noderef] __percpu *
   net/core/skbuff.c:1317:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/core/skbuff.c:1317:38: sparse:     expected struct local_lock_t [usertype] *l
   net/core/skbuff.c:1317:38: sparse:     got struct local_lock_t [noderef] __percpu *
   net/core/skbuff.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
   include/linux/page-flags.h:242:46: sparse: sparse: self-comparison always evaluates to false
   net/core/skbuff.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
>> include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected void const [noderef] __percpu *__vpp_verify @@     got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *
>> include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected void const [noderef] __percpu *__vpp_verify @@     got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *
>> include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected void const [noderef] __percpu *__vpp_verify @@     got struct local_lock_t [usertype] * @@
   include/linux/local_lock.h:71:1: sparse:     expected void const [noderef] __percpu *__vpp_verify
   include/linux/local_lock.h:71:1: sparse:     got struct local_lock_t [usertype] *
   net/core/skbuff.c: note: in included file (through include/net/net_namespace.h, include/linux/inet.h):
   include/linux/skbuff.h:2715:28: sparse: sparse: self-comparison always evaluates to false
   net/core/skbuff.c: note: in included file (through include/linux/skbuff.h, include/net/net_namespace.h, include/linux/inet.h):
   include/net/checksum.h:33:39: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __wsum [usertype] sum @@     got unsigned int @@
   include/net/checksum.h:33:39: sparse:     expected restricted __wsum [usertype] sum
   include/net/checksum.h:33:39: sparse:     got unsigned int

vim +302 net/core/skbuff.c

   296	
   297	void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
   298	{
   299		struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
   300	
   301		fragsz = SKB_DATA_ALIGN(fragsz);
 > 302		guard(local_lock_nested_bh)(&napi_alloc_cache.bh_lock);
   303	
   304		return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
   305	}
   306	EXPORT_SYMBOL(__napi_alloc_frag_align);
   307	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
@ 2023-12-16  4:53   ` kernel test robot
  2023-12-19  0:01     ` Nathan Chancellor
  0 siblings, 1 reply; 60+ messages in thread
From: kernel test robot @ 2023-12-16  4:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: llvm, oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

Hi Sebastian,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
                   goto xdp_out;
                   ^
   drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
           guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
           ^
   include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
           CLASS(_name, __UNIQUE_ID(guard))
                        ^
   include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
                               ^
   include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
   #define __PASTE(a,b) ___PASTE(a,b)
                        ^
   include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
   #define ___PASTE(a,b) a##b
                         ^
   <scratch space>:52:1: note: expanded from here
   __UNIQUE_ID_guard753
   ^
   1 error generated.
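
The error boils down to a forward goto that jumps over the
initialization of the variable which guard() creates via
__attribute__((cleanup)). A minimal sketch of the rejected pattern and
one possible way around it (illustrative only, not necessarily the fix
that will be used):

/* Rejected: the goto bypasses the guard()'s cleanup variable. */
	if (!xdp_prog)
		goto xdp_out;
	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
	act = bpf_prog_run_xdp(xdp_prog, xdp);
	/* remaining handling */
xdp_out:
	return ERR_PTR(-result);

/* Accepted: the cleanup variable lives only inside the scoped block,
 * so the goto no longer crosses its initialization.
 */
	if (!xdp_prog)
		goto xdp_out;
	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
		act = bpf_prog_run_xdp(xdp_prog, xdp);
		/* remaining handling */
	}
xdp_out:
	return ERR_PTR(-result);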


vim +8620 drivers/net/ethernet/intel/igb/igb_main.c

b1bb2eb0a0deb0 Alexander Duyck           2017-02-06  8608  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8609  static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8610  				   struct igb_ring *rx_ring,
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8611  				   struct xdp_buff *xdp)
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8612  {
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8613  	int err, result = IGB_XDP_PASS;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8614  	struct bpf_prog *xdp_prog;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8615  	u32 act;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8616  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8617  	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8618  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8619  	if (!xdp_prog)
9cbc948b5a20c9 Sven Auhagen              2020-09-02 @8620  		goto xdp_out;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8621  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8622  	prefetchw(xdp->data_hard_start); /* xdp_frame write */
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8623  
d568b111738dbb Sebastian Andrzej Siewior 2023-12-15  8624  	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8625  	act = bpf_prog_run_xdp(xdp_prog, xdp);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8626  	switch (act) {
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8627  	case XDP_PASS:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8628  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8629  	case XDP_TX:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8630  		result = igb_xdp_xmit_back(adapter, xdp);
74431c40b9c5fa Magnus Karlsson           2021-05-10  8631  		if (result == IGB_XDP_CONSUMED)
74431c40b9c5fa Magnus Karlsson           2021-05-10  8632  			goto out_failure;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8633  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8634  	case XDP_REDIRECT:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8635  		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
74431c40b9c5fa Magnus Karlsson           2021-05-10  8636  		if (err)
74431c40b9c5fa Magnus Karlsson           2021-05-10  8637  			goto out_failure;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8638  		result = IGB_XDP_REDIR;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8639  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8640  	default:
c8064e5b4adac5 Paolo Abeni               2021-11-30  8641  		bpf_warn_invalid_xdp_action(adapter->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8642  		fallthrough;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8643  	case XDP_ABORTED:
74431c40b9c5fa Magnus Karlsson           2021-05-10  8644  out_failure:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8645  		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8646  		fallthrough;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8647  	case XDP_DROP:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8648  		result = IGB_XDP_CONSUMED;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8649  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8650  	}
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8651  xdp_out:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8652  	return ERR_PTR(-result);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8653  }
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8654  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
@ 2023-12-16 19:28   ` kernel test robot
  2023-12-18  8:52   ` Daniel Borkmann
  1 sibling, 0 replies; 60+ messages in thread
From: kernel test robot @ 2023-12-16 19:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: llvm, oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, K. Y. Srinivasan, Michael S. Tsirkin,
	Alexei Starovoitov, Andrii Nakryiko, Dexuan Cui, Haiyang Zhang,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	Juergen Gross, KP Singh, Martin KaFai Lau, Nikolay Aleksandrov,
	Song Liu, Stanislav Fomichev

Hi Sebastian,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-17-bigeasy%40linutronix.de
patch subject: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
config: x86_64-rhel-8.3-bpf (https://download.01.org/0day-ci/archive/20231217/202312170350.n7ssgNDP-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231217/202312170350.n7ssgNDP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312170350.n7ssgNDP-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/net/hyperv/netvsc_bpf.c:53:3: error: cannot jump from this goto statement to its label
                   goto out;
                   ^
   drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
           guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
           ^
   include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
           CLASS(_name, __UNIQUE_ID(guard))
                        ^
   include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
                               ^
   include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
   #define __PASTE(a,b) ___PASTE(a,b)
                        ^
   include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
   #define ___PASTE(a,b) a##b
                         ^
   <scratch space>:81:1: note: expanded from here
   __UNIQUE_ID_guard635
   ^
   drivers/net/hyperv/netvsc_bpf.c:46:3: error: cannot jump from this goto statement to its label
                   goto out;
                   ^
   drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
           guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
           ^
   include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
           CLASS(_name, __UNIQUE_ID(guard))
                        ^
   include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
                               ^
   include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
   #define __PASTE(a,b) ___PASTE(a,b)
                        ^
   include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
   #define ___PASTE(a,b) a##b
                         ^
   <scratch space>:81:1: note: expanded from here
   __UNIQUE_ID_guard635
   ^
   drivers/net/hyperv/netvsc_bpf.c:41:3: error: cannot jump from this goto statement to its label
                   goto out;
                   ^
   drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
           guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
           ^
   include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
           CLASS(_name, __UNIQUE_ID(guard))
                        ^
   include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
                               ^
   include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
   #define __PASTE(a,b) ___PASTE(a,b)
                        ^
   include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
   #define ___PASTE(a,b) a##b
                         ^
   <scratch space>:81:1: note: expanded from here
   __UNIQUE_ID_guard635
   ^
   3 errors generated.


vim +53 drivers/net/hyperv/netvsc_bpf.c

351e1581395fcc Haiyang Zhang             2020-01-23   23  
351e1581395fcc Haiyang Zhang             2020-01-23   24  u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
351e1581395fcc Haiyang Zhang             2020-01-23   25  		   struct xdp_buff *xdp)
351e1581395fcc Haiyang Zhang             2020-01-23   26  {
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   27  	struct netvsc_stats_rx *rx_stats = &nvchan->rx_stats;
351e1581395fcc Haiyang Zhang             2020-01-23   28  	void *data = nvchan->rsc.data[0];
351e1581395fcc Haiyang Zhang             2020-01-23   29  	u32 len = nvchan->rsc.len[0];
351e1581395fcc Haiyang Zhang             2020-01-23   30  	struct page *page = NULL;
351e1581395fcc Haiyang Zhang             2020-01-23   31  	struct bpf_prog *prog;
351e1581395fcc Haiyang Zhang             2020-01-23   32  	u32 act = XDP_PASS;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   33  	bool drop = true;
351e1581395fcc Haiyang Zhang             2020-01-23   34  
351e1581395fcc Haiyang Zhang             2020-01-23   35  	xdp->data_hard_start = NULL;
351e1581395fcc Haiyang Zhang             2020-01-23   36  
351e1581395fcc Haiyang Zhang             2020-01-23   37  	rcu_read_lock();
351e1581395fcc Haiyang Zhang             2020-01-23   38  	prog = rcu_dereference(nvchan->bpf_prog);
351e1581395fcc Haiyang Zhang             2020-01-23   39  
351e1581395fcc Haiyang Zhang             2020-01-23   40  	if (!prog)
351e1581395fcc Haiyang Zhang             2020-01-23   41  		goto out;
351e1581395fcc Haiyang Zhang             2020-01-23   42  
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   43) 	/* Ensure that the below memcpy() won't overflow the page buffer. */
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   44) 	if (len > ndev->mtu + ETH_HLEN) {
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   45) 		act = XDP_DROP;
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   46) 		goto out;
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   47) 	}
505e3f00c3f364 Andrea Parri (Microsoft   2021-01-14   48) 
351e1581395fcc Haiyang Zhang             2020-01-23   49  	/* allocate page buffer for data */
351e1581395fcc Haiyang Zhang             2020-01-23   50  	page = alloc_page(GFP_ATOMIC);
351e1581395fcc Haiyang Zhang             2020-01-23   51  	if (!page) {
351e1581395fcc Haiyang Zhang             2020-01-23   52  		act = XDP_DROP;
351e1581395fcc Haiyang Zhang             2020-01-23  @53  		goto out;
351e1581395fcc Haiyang Zhang             2020-01-23   54  	}
351e1581395fcc Haiyang Zhang             2020-01-23   55  
43b5169d8355cc Lorenzo Bianconi          2020-12-22   56  	xdp_init_buff(xdp, PAGE_SIZE, &nvchan->xdp_rxq);
be9df4aff65f18 Lorenzo Bianconi          2020-12-22   57  	xdp_prepare_buff(xdp, page_address(page), NETVSC_XDP_HDRM, len, false);
351e1581395fcc Haiyang Zhang             2020-01-23   58  
351e1581395fcc Haiyang Zhang             2020-01-23   59  	memcpy(xdp->data, data, len);
351e1581395fcc Haiyang Zhang             2020-01-23   60  
31dbfc0f055c7d Sebastian Andrzej Siewior 2023-12-15   61  	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
351e1581395fcc Haiyang Zhang             2020-01-23   62  	act = bpf_prog_run_xdp(prog, xdp);
351e1581395fcc Haiyang Zhang             2020-01-23   63  
351e1581395fcc Haiyang Zhang             2020-01-23   64  	switch (act) {
351e1581395fcc Haiyang Zhang             2020-01-23   65  	case XDP_PASS:
351e1581395fcc Haiyang Zhang             2020-01-23   66  	case XDP_TX:
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   67  		drop = false;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   68  		break;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   69  
351e1581395fcc Haiyang Zhang             2020-01-23   70  	case XDP_DROP:
351e1581395fcc Haiyang Zhang             2020-01-23   71  		break;
351e1581395fcc Haiyang Zhang             2020-01-23   72  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   73  	case XDP_REDIRECT:
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   74  		if (!xdp_do_redirect(ndev, xdp, prog)) {
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   75  			nvchan->xdp_flush = true;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   76  			drop = false;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   77  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   78  			u64_stats_update_begin(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   79  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   80  			rx_stats->xdp_redirect++;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   81  			rx_stats->packets++;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   82  			rx_stats->bytes += nvchan->rsc.pktlen;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   83  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   84  			u64_stats_update_end(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   85  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   86  			break;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   87  		} else {
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   88  			u64_stats_update_begin(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   89  			rx_stats->xdp_drop++;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   90  			u64_stats_update_end(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   91  		}
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   92  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   93  		fallthrough;
1cb9d3b6185b2a Haiyang Zhang             2022-04-07   94  
351e1581395fcc Haiyang Zhang             2020-01-23   95  	case XDP_ABORTED:
351e1581395fcc Haiyang Zhang             2020-01-23   96  		trace_xdp_exception(ndev, prog, act);
351e1581395fcc Haiyang Zhang             2020-01-23   97  		break;
351e1581395fcc Haiyang Zhang             2020-01-23   98  
351e1581395fcc Haiyang Zhang             2020-01-23   99  	default:
c8064e5b4adac5 Paolo Abeni               2021-11-30  100  		bpf_warn_invalid_xdp_action(ndev, prog, act);
351e1581395fcc Haiyang Zhang             2020-01-23  101  	}
351e1581395fcc Haiyang Zhang             2020-01-23  102  
351e1581395fcc Haiyang Zhang             2020-01-23  103  out:
351e1581395fcc Haiyang Zhang             2020-01-23  104  	rcu_read_unlock();
351e1581395fcc Haiyang Zhang             2020-01-23  105  
1cb9d3b6185b2a Haiyang Zhang             2022-04-07  106  	if (page && drop) {
351e1581395fcc Haiyang Zhang             2020-01-23  107  		__free_page(page);
351e1581395fcc Haiyang Zhang             2020-01-23  108  		xdp->data_hard_start = NULL;
351e1581395fcc Haiyang Zhang             2020-01-23  109  	}
351e1581395fcc Haiyang Zhang             2020-01-23  110  
351e1581395fcc Haiyang Zhang             2020-01-23  111  	return act;
351e1581395fcc Haiyang Zhang             2020-01-23  112  }
351e1581395fcc Haiyang Zhang             2020-01-23  113  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: " Sebastian Andrzej Siewior
@ 2023-12-16 22:09   ` Kiyanovski, Arthur
  2024-01-12 17:53     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Kiyanovski, Arthur @ 2023-12-16 22:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Arinzon, David, Igor Russkikh,
	Jesper Dangaard Brouer, John Fastabend, Michael Chan, Dagan,
	Noam, Bshara, Saeed, Agroskin, Shay, Sunil Goutham

> The per-CPU variables used during bpf_prog_run_xdp() invocation and later
> during xdp_do_redirect() rely on disabled BH for their protection.
> Without locking in local_bh_disable() on PREEMPT_RT these data structure
> require explicit locking.
> 
> This is a follow-up on the previous change which introduced
> bpf_run_lock.redirect_lock and uses it now within drivers.
> 
> The simple way is to acquire the lock before bpf_prog_run_xdp() is invoked and
> hold it until the end of function.
> This does not always work because some drivers (cpsw, atlantic) invoke
> xdp_do_flush() in the same context.
> Acquiring the lock in bpf_prog_run_xdp() and dropping in
> xdp_do_redirect() (without touching drivers) does not work because not all
> driver, which use bpf_prog_run_xdp(), do support XDP_REDIRECT (and invoke
> xdp_do_redirect()).
> 
> Ideally the minimal locking scope would be bpf_prog_run_xdp() +
> xdp_do_redirect() and everything else (error recovery, DMA unmapping, free/
> alloc of memory, …) would happen outside of the locked section.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Hi Sebastian,

I would like to make sure I understand correctly the difference in this patch
between ena and atlantic drivers.

In the atlantic driver the change you've made seems like the best option
in terms of making the critical section as small as possible.

You could have done exactly the same thing with ena, but you chose instead
to let ena release the lock at the end of the function, which in the case
of XDP_TX may make the critical section considerably longer than in the
atlantic solution.

If I understand correctly (quoting your commit message: "This does not
always work because some drivers (cpsw, atlantic) invoke xdp_do_flush()
in the same context"), in the case of atlantic you had to go for the more
code-altering change because simply using guard() would have included
xdp_do_flush() in the critical section, whereas in the case of ena
xdp_do_flush() is called after the function returns, so guard() is good enough.

Questions:
1. Did I understand correctly the difference in solution choice between atlantic
and ena?
2. As far as I can see the guard() solution looks good for ena except for (maybe?)
XDP_TX, where the critical section becomes a bit long. Can you please explain,
why you think it is still  good enough for ena to use the guard() solution instead
of doing the more  code-altering atlantic solution?

Thanks!
Arthur



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it.
  2023-12-15 17:07 ` [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
@ 2023-12-18  7:48   ` Paolo Abeni
  2024-01-12  9:01     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Abeni @ 2023-12-18  7:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On Fri, 2023-12-15 at 18:07 +0100, Sebastian Andrzej Siewior wrote:
> The else condition within __netdev_alloc_frag_align() is an open coded
> __napi_alloc_frag_align().
> 
> Use __napi_alloc_frag_align() instead of open coding it.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/skbuff.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b157efea5dea8..de9397e45718a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -311,11 +311,8 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
>  
>  		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
>  	} else {
> -		struct napi_alloc_cache *nc;
> -
>  		local_bh_disable();
> -		nc = this_cpu_ptr(&napi_alloc_cache);
> -		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);

There is a:

	fragsz = SKB_DATA_ALIGN(fragsz);

statement just before the enclosing 'if'. I would consider moving such
assignment inside the 'then' branch - since __napi_alloc_frag_align()
already include that.
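
I.e. something like this (untested sketch; assuming the surrounding
in_hardirq()/irqs_disabled() check and the netdev_alloc_cache branch stay
as in the current tree):

	if (in_hardirq() || irqs_disabled()) {
		nc = this_cpu_ptr(&netdev_alloc_cache);
		/* only this branch still needs the explicit alignment */
		fragsz = SKB_DATA_ALIGN(fragsz);
		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
	} else {
		/* __napi_alloc_frag_align() already aligns fragsz internally */
		local_bh_disable();
		data = __napi_alloc_frag_align(fragsz, align_mask);
		local_bh_enable();
	}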

/P


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock.
  2023-12-15 17:07 ` [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
@ 2023-12-18  8:16   ` Paolo Abeni
  2024-01-11 16:19     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Abeni @ 2023-12-18  8:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On Fri, 2023-12-15 at 18:07 +0100, Sebastian Andrzej Siewior wrote:
> Introduce lock guard definition for local_lock_t. There are no users
> yet.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/local_lock.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index e55010fa73296..706c4b65d9449 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -51,4 +51,15 @@
>  #define local_unlock_irqrestore(lock, flags)			\
>  	__local_unlock_irqrestore(lock, flags)
>  
> +DEFINE_LOCK_GUARD_1(local_lock, local_lock_t,
> +		    local_lock(_T->lock),
> +		    local_unlock(_T->lock))
> +DEFINE_LOCK_GUARD_1(local_lock_irq, local_lock_t,
> +		    local_lock_irq(_T->lock),
> +		    local_unlock_irq(_T->lock))

DEFINE_GUARD should fit for the 2 above, right?
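
I.e. something like (untested):

DEFINE_GUARD(local_lock, local_lock_t __percpu*,
	     local_lock(_T),
	     local_unlock(_T))
DEFINE_GUARD(local_lock_irq, local_lock_t __percpu*,
	     local_lock_irq(_T),
	     local_unlock_irq(_T))

(The __percpu annotation is a guess on my side, whatever keeps sparse
happy.)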

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
  2023-12-16  3:39   ` kernel test robot
@ 2023-12-18  8:33   ` Paolo Abeni
  2024-01-12 11:23     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 60+ messages in thread
From: Paolo Abeni @ 2023-12-18  8:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, David Ahern, Hao Luo, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

On Fri, 2023-12-15 at 18:07 +0100, Sebastian Andrzej Siewior wrote:
> The access to seg6_bpf_srh_states is protected by disabling preemption.
> Based on the code, the entry point is input_action_end_bpf() and
> every other function (the bpf helper functions bpf_lwt_seg6_*()), that
> is accessing seg6_bpf_srh_states, should be called from within
> input_action_end_bpf().
> 
> input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
> the function and then disables preemption. This looks wrong because if
> preemption needs to be disabled as part of the locking mechanism then
> the variable shouldn't be accessed beforehand.
> 
> Looking at how it is used via test_lwt_seg6local.sh then
> input_action_end_bpf() is always invoked from softirq context. If this
> is always the case then the preempt_disable() statement is superfluous.
> If this is not always invoked from softirq then disabling only
> preemption is not sufficient.
> 
> Replace the preempt_disable() statement with nested-BH locking. This is
> not an equivalent replacement as it assumes that the invocation of
> input_action_end_bpf() always occurs in softirq context and thus the
> preempt_disable() is superfluous.
> Add a local_lock_t the data structure and use local_lock_nested_bh() in
> guard notation for locking. Add lockdep_assert_held() to ensure the lock
> is held while the per-CPU variable is referenced in the helper functions.
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/net/seg6_local.h |  1 +
>  net/core/filter.c        |  3 ++
>  net/ipv6/seg6_local.c    | 59 ++++++++++++++++++++++------------------
>  3 files changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
> index 3fab9dec2ec45..0f22771359f4c 100644
> --- a/include/net/seg6_local.h
> +++ b/include/net/seg6_local.h
> @@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
>  
>  struct seg6_bpf_srh_state {
>  	struct ipv6_sr_hdr *srh;
> +	local_lock_t bh_lock;
>  	u16 hdrlen;
>  	bool valid;
>  };
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1737884be52f8..c8013f762524b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
>  	void *srh_tlvs, *srh_end, *ptr;
>  	int srhoff = 0;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (srh == NULL)
>  		return -EINVAL;
>  
> @@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
>  	int hdroff = 0;
>  	int err;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	switch (action) {
>  	case SEG6_LOCAL_ACTION_END_X:
>  		if (!seg6_bpf_has_valid_srh(skb))
> @@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
>  	int srhoff = 0;
>  	int ret;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (unlikely(srh == NULL))
>  		return -EINVAL;
>  
> diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> index 24e2b4b494cb0..ed7278af321a2 100644
> --- a/net/ipv6/seg6_local.c
> +++ b/net/ipv6/seg6_local.c
> @@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
>  	return err;
>  }
>  
> -DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
> +DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
> +	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
> +};
>  
>  bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  {
> @@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  		this_cpu_ptr(&seg6_bpf_srh_states);
>  	struct ipv6_sr_hdr *srh = srh_state->srh;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (unlikely(srh == NULL))
>  		return false;
>  
> @@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  static int input_action_end_bpf(struct sk_buff *skb,
>  				struct seg6_local_lwt *slwt)
>  {
> -	struct seg6_bpf_srh_state *srh_state =
> -		this_cpu_ptr(&seg6_bpf_srh_states);
> +	struct seg6_bpf_srh_state *srh_state;
>  	struct ipv6_sr_hdr *srh;
>  	int ret;
>  
> @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
>  	}
>  	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
>  
> -	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
> -	 * which is also accessed by the bpf_lwt_seg6_* helpers
> +	/* The access to the per-CPU buffer srh_state is protected by running
> +	 * always in softirq context (with disabled BH). On PREEMPT_RT the
> +	 * required locking is provided by the following local_lock_nested_bh()
> +	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> +	 * bpf_prog_run_save_cb().
>  	 */
> -	preempt_disable();
> -	srh_state->srh = srh;
> -	srh_state->hdrlen = srh->hdrlen << 3;
> -	srh_state->valid = true;
> +	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> +		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> +		srh_state->srh = srh;
> +		srh_state->hdrlen = srh->hdrlen << 3;
> +		srh_state->valid = true;

Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
the added indentation. What about using directly
local_lock_nested_bh()/local_unlock_nested_bh() ?
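
I.e. roughly (sketch, same names as in the patch):

	local_lock_nested_bh(&seg6_bpf_srh_states.bh_lock);
	srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
	srh_state->srh = srh;
	srh_state->hdrlen = srh->hdrlen << 3;
	srh_state->valid = true;

	/* ... run the bpf program as before ... */

	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);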

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
  2023-12-16 19:28   ` kernel test robot
@ 2023-12-18  8:52   ` Daniel Borkmann
  2024-01-12 15:37     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 60+ messages in thread
From: Daniel Borkmann @ 2023-12-18  8:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, K. Y. Srinivasan,
	Michael S. Tsirkin, Alexei Starovoitov, Andrii Nakryiko,
	Dexuan Cui, Haiyang Zhang, Hao Luo, Jesper Dangaard Brouer,
	Jiri Olsa, John Fastabend, Juergen Gross, KP Singh,
	Martin KaFai Lau, Nikolay Aleksandrov, Song Liu,
	Stanislav Fomichev, Stefano Stabellini, Wei Liu,
	Willem de Bruijn, Xuan Zhuo, Yonghong Song, bpf, virtualization,
	xen-devel

Hi Sebastian,

On 12/15/23 6:07 PM, Sebastian Andrzej Siewior wrote:
> The per-CPU variables used during bpf_prog_run_xdp() invocation and
> later during xdp_do_redirect() rely on disabled BH for their protection.
> Without locking in local_bh_disable() on PREEMPT_RT these data structure
> require explicit locking.
> 
> This is a follow-up on the previous change which introduced
> bpf_run_lock.redirect_lock and uses it now within drivers.
> 
> The simple way is to acquire the lock before bpf_prog_run_xdp() is
> invoked and hold it until the end of function.
> This does not always work because some drivers (cpsw, atlantic) invoke
> xdp_do_flush() in the same context.
> Acquiring the lock in bpf_prog_run_xdp() and dropping in
> xdp_do_redirect() (without touching drivers) does not work because not
> all driver, which use bpf_prog_run_xdp(), do support XDP_REDIRECT (and
> invoke xdp_do_redirect()).
> 
> Ideally the minimal locking scope would be bpf_prog_run_xdp() +
> xdp_do_redirect() and everything else (error recovery, DMA unmapping,
> free/ alloc of memory, …) would happen outside of the locked section.
[...]

>   drivers/net/hyperv/netvsc_bpf.c |  1 +
>   drivers/net/netkit.c            | 13 +++++++----
>   drivers/net/tun.c               | 28 +++++++++++++----------
>   drivers/net/veth.c              | 40 ++++++++++++++++++++-------------
>   drivers/net/virtio_net.c        |  1 +
>   drivers/net/xen-netfront.c      |  1 +
>   6 files changed, 52 insertions(+), 32 deletions(-)
[...]

Please exclude netkit from this set given it does not support XDP, but
instead only accepts tc BPF typed programs.

Thanks,
Daniel

> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 39171380ccf29..fbcf78477bda8 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>   	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
>   	skb->dev = peer;
>   	entry = rcu_dereference(nk->active);
> -	if (entry)
> -		ret = netkit_run(entry, skb, ret);
> +	if (entry) {
> +		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
> +			ret = netkit_run(entry, skb, ret);
> +			if (ret == NETKIT_REDIRECT) {
> +				dev_sw_netstats_tx_add(dev, 1, len);
> +				skb_do_redirect(skb);
> +			}
> +		}
> +	}
>   	switch (ret) {
>   	case NETKIT_NEXT:
>   	case NETKIT_PASS:
> @@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>   		}
>   		break;
>   	case NETKIT_REDIRECT:
> -		dev_sw_netstats_tx_add(dev, 1, len);
> -		skb_do_redirect(skb);
>   		break;
>   	case NETKIT_DROP:
>   	default:

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 00/24] locking: Introduce nested-BH locking.
  2023-12-15 22:50 ` [PATCH net-next 00/24] locking: Introduce nested-BH locking Jakub Kicinski
@ 2023-12-18 17:23   ` Sebastian Andrzej Siewior
  2023-12-19  0:41     ` Jakub Kicinski
  0 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-18 17:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> 
> As I said at LPC, complicating drivers with odd locking constructs
> is a no go for me.

I misunderstood it then: I assumed you wanted to ease the work while I
went through every driver, after (hopefully) understanding what is
possible/needed and what is not. We are talking about 15++ drivers here?

Now. The pattern is usually
|	act = bpf_prog_run_xdp(xdp_prog, &xdp);
|	switch (act) {
|	case XDP_REDIRECT:
|		ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
|		if (ret)
|			goto XDP_ABORTED;
|		xdp_redir++ or so;

so we might be able to turn this into something that covers both and
returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
into

| u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev, const struct
| 				  bpf_prog *prog, struct xdp_buff *xdp)
| {
| 	u32 act;
| 	int ret;
| 
| 	act = bpf_prog_run_xdp(prog, xdp);
| 	if (act == XDP_REDIRECT) {
| 		ret = xdp_do_redirect(dev, xdp, prog);
| 		if (ret < 0)
| 			act = XDP_ABORTED;
| 	}
| 	return act;
| }

so the lock can be put inside the function and all drivers use this
function.
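
On PREEMPT_RT the lock could then live only inside this helper, roughly
(sketch; lock name and guard form as used elsewhere in this series):

| 	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
| 		act = bpf_prog_run_xdp(prog, xdp);
| 		if (act == XDP_REDIRECT) {
| 			ret = xdp_do_redirect(dev, xdp, prog);
| 			if (ret < 0)
| 				act = XDP_ABORTED;
| 		}
| 	}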

From looking through drivers/net/ethernet/, this should work for most
drivers:
- amazon/ena
- aquantia/atlantic
- engleder/tsnep
- freescale/enetc
- freescale/fec
- intel/igb
- intel/igc
- marvell/mvneta
- marvell/mvpp2
- marvell/octeontx2
- mediatek/mtk
- mellanox/mlx5
- microchip/lan966x
- microsoft/mana
- netronome/nfp (two call paths with no support XDP_REDIRECT)
- sfc/rx
- sfc/siena (that offset pointer can be moved)
- socionext/netsec
- stmicro/stmmac

A few do something custom/ additionally between bpf_prog_run_xdp() and
  xdp_do_redirect():

- broadcom/bnxt
  calculates length, offset, data pointer. DMA unmaps + memory
  allocations before redirect.

- freescale/dpaa2
- freescale/dpaa
  sets xdp.data_hard_start + frame_sz, unmaps DMA.

- fungible/funeth
  conditional redirect.

- google/gve
  Allocates a new packet for redirect.

- intel/ixgbe
- intel/i40e
- intel/ice
  Failure in the ZC case is different from XDP_ABORTED, depends on the
  error from xdp_do_redirect())

- mellanox/mlx4/
  calculates page_offset.

- qlogic/qede
  DMA unmap and buffer alloc.

- ti/cpsw_priv
  recalculates length (pointer).

and a few more don't support XDP_REDIRECT:

- cavium/thunder
  does not support XDP_REDIRECT, calculates length, offset.

- intel/ixgbevf
  does not support XDP_REDIRECT

I don't understand why some drivers need to recalculate data_hard_start,
length and so on and others don't. This might be only needed for the
XDP_TX case or not needed…
Also I'm not sure about the dma unmaps and skb allocations. The new skb
allocation can be probably handled before running the bpf prog but then
in the XDP_PASS case it is a waste…
And the DMA unmaps. Only a few seem to need it. Maybe it can be done
before running the BPF program. After all the bpf may look into the skb.


If that is no go, then the only thing that comes to mind is (as you
mentioned on LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
in xdp_do_redirect(). This would require that every driver invokes
xdp_do_redirect() even if it does not support it (by setting netdev
to NULL or so).

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-16  4:53   ` kernel test robot
@ 2023-12-19  0:01     ` Nathan Chancellor
  2023-12-19 16:55       ` Nick Desaulniers
  0 siblings, 1 reply; 60+ messages in thread
From: Nathan Chancellor @ 2023-12-19  0:01 UTC (permalink / raw)
  To: kernel test robot
  Cc: Sebastian Andrzej Siewior, linux-kernel, netdev, llvm,
	oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Jesper Dangaard Brouer, Jesse Brandeburg,
	John Fastabend, Tony Nguyen, bpf, intel-wired-lan

On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> Hi Sebastian,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on net-next/main]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> base:   net-next/main
> patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
> compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
> >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
>                    goto xdp_out;
>                    ^
>    drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
>            guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>            ^
>    include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
>            CLASS(_name, __UNIQUE_ID(guard))
>                         ^
>    include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
>    #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
>                                ^
>    include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
>    #define __PASTE(a,b) ___PASTE(a,b)
>                         ^
>    include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
>    #define ___PASTE(a,b) a##b
>                          ^
>    <scratch space>:52:1: note: expanded from here
>    __UNIQUE_ID_guard753
>    ^
>    1 error generated.

I initially thought that this may have been
https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
involved here.

This error occurs because jumping over the initialization of a variable
declared with __attribute__((__cleanup__(...))) does not prevent the
clean up function from running as one may expect it to, but could
instead result in the clean up function getting run on uninitialized
memory. A contrived example (see the bottom of the "Output" tabs for the
execution output):

https://godbolt.org/z/9bvGboxvc
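
The shape of the problem is roughly this (a contrived sketch along the
same lines, not the exact godbolt example):

#include <stdio.h>

static void cleanup_fn(int *p)
{
	/* on the goto path gcc may still run this, with *p uninitialized */
	printf("cleanup: %d\n", *p);
}

static int foo(int skip)
{
	if (skip)
		goto out;	/* jumps over the initialization below */

	int x __attribute__((cleanup(cleanup_fn))) = 42;
	printf("body: %d\n", x);
out:
	return 0;	/* x leaves scope here, so cleanup_fn(&x) runs */
}

int main(void)
{
	/* gcc may only warn and then run the cleanup on garbage; clang
	 * rejects the jump outright, as in the kernel build above.
	 */
	return foo(1);
}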

While there is a warning from GCC in that example, I don't see one in
the kernel's case. I see there is an open GCC issue around this problem:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951

While it is possible that there may not actually be a problem with how
the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
clang's behavior is reasonable given the potential footguns that this
construct has.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 00/24] locking: Introduce nested-BH locking.
  2023-12-18 17:23   ` Sebastian Andrzej Siewior
@ 2023-12-19  0:41     ` Jakub Kicinski
  2023-12-21 20:46       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Jakub Kicinski @ 2023-12-19  0:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Chan
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On Mon, 18 Dec 2023 18:23:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> > On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:  
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.  
> > 
> > As I said at LPC, complicating drivers with odd locking constructs
> > is a no go for me.  
> 
> I misunderstood it then: I assumed you wanted to ease the work while I
> went through every driver, after (hopefully) understanding what is
> possible/needed and what is not. We are talking about 15++ drivers here?

My main concern is that it takes the complexity of writing network
device drivers to the next level. It's already hard enough to implement
XDP correctly. "local lock" and "guard"? Too complicated :(
Or "unmaintainable" as in "too much maintainer's time will be spent
reviewing code that gets this wrong".

> Now. The pattern is usually
> |	act = bpf_prog_run_xdp(xdp_prog, &xdp);
> |	switch (act) {
> |	case XDP_REDIRECT:
> |		ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
> |		if (ret)
> |			goto XDP_ABORTED;
> |		xdp_redir++ or so;
> 
> so we might be able to turn this into something that covers both and
> returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
> into
> 
> | u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev, const struct
> | 				  bpf_prog *prog, struct xdp_buff *xdp)
> | {
> | 	u32 act;
> | 	int ret;
> | 
> | 	act = bpf_prog_run_xdp(prog, xdp);
> | 	if (act == XDP_REDIRECT) {
> | 		ret = xdp_do_redirect(dev, xdp, prog);
> | 		if (ret < 0)
> | 			act = XDP_ABORTED;
> | 	}
> | 	return act;
> | }

If we could fold the DROP case into this -- even better!

> so the lock can be put inside the function and all drivers use this
> function.
> 
> From looking through drivers/net/ethernet/, this should work for most
> drivers:
> - amazon/ena
> - aquantia/atlantic
> - engleder/tsnep
> - freescale/enetc
> - freescale/fec
> - intel/igb
> - intel/igc
> - marvell/mvneta
> - marvell/mvpp2
> - marvell/octeontx2
> - mediatek/mtk
> - mellanox/mlx5
> - microchip/lan966x
> - microsoft/mana
> - netronome/nfp (two call paths with no support XDP_REDIRECT)
> - sfc/rx
> - sfc/siena (that offset pointer can be moved)
> - socionext/netsec
> - stmicro/stmmac
> 
> A few do something custom/ additionally between bpf_prog_run_xdp() and
>   xdp_do_redirect():
> 
> - broadcom/bnxt
>   calculates length, offset, data pointer. DMA unmaps + memory
>   allocations before redirect.

Just looked at this one. The recalculation is probably for the PASS /
TX cases, REDIRECT / DROP shouldn't care. The DMA unmap looks like 
a bug (hi, Michael!)

> - freescale/dpaa2
> - freescale/dpaa
>   sets xdp.data_hard_start + frame_sz, unmaps DMA.
> 
> - fungible/funeth
>   conditional redirect.
> 
> - google/gve
>   Allocates a new packet for redirect.
> 
> - intel/ixgbe
> - intel/i40e
> - intel/ice
>   Failure in the ZC case is different from XDP_ABORTED, depends on the
>   error from xdp_do_redirect())
> 
> - mellanox/mlx4/
>   calculates page_offset.
> 
> - qlogic/qede
>   DMA unmap and buffer alloc.
> 
> - ti/cpsw_priv
>   recalculates length (pointer).
> 
> and a few more don't support XDP_REDIRECT:
> 
> - cavium/thunder
>   does not support XDP_REDIRECT, calculates length, offset.
> 
> - intel/ixgbevf
>   does not support XDP_REDIRECT
> 
> I don't understand why some drivers need to recalculate data_hard_start,
> length and so on and others don't. This might be only needed for the
> XDP_TX case or not needed…
> Also I'm not sure about the dma unmaps and skb allocations. The new skb
> allocation can be probably handled before running the bpf prog but then
> in the XDP_PASS case it is a waste…
> And the DMA unmaps. Only a few seem to need it. Maybe it can be done
> before running the BPF program. After all the bpf may look into the skb.
> 
> 
> If that is no go, then the only thing that comes to mind is (as you
> mentioned on LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
> in xdp_do_redirect(). This would require that every driver invokes
> xdp_do_redirect() even if it does not support it (by setting netdev
> to NULL or so).

To make progress on other parts of the stack we could also take 
the local lock around all of napi->poll() for now..
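
Roughly (sketch only; assuming the n->poll() invocation in __napi_poll()
is the spot):

	/* hold the nested-BH lock for the duration of the poll */
	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
		work = n->poll(n, weight);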

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-19  0:01     ` Nathan Chancellor
@ 2023-12-19 16:55       ` Nick Desaulniers
  0 siblings, 0 replies; 60+ messages in thread
From: Nick Desaulniers @ 2023-12-19 16:55 UTC (permalink / raw)
  To: Nathan Chancellor, Sebastian Andrzej Siewior
  Cc: kernel test robot, linux-kernel, netdev, llvm, oe-kbuild-all,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

On Mon, Dec 18, 2023 at 4:01 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> > Hi Sebastian,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on net-next/main]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> > base:   net-next/main
> > patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> > patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> > config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
> > compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/
> >
> > All errors (new ones prefixed by >>):
> >
> > >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
> >                    goto xdp_out;
> >                    ^

^ The problematic goto should be replaced with an early return. (and
perhaps a comment that you can't jump over __cleanup variable
initialization).

Otherwise the compiler cannot put the cleanup in the destination basic
block; it would have to split the edges and have all the happy paths
go to a synthesized basic block that runs the cleanup, then jumps to
the original destination.
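
E.g. something along these lines (a rough sketch going by the diagnostic,
not the exact igb code; names are from memory):

	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
	if (!xdp_prog)
		return ERR_PTR(-IGB_XDP_PASS);	/* was: goto xdp_out; */

	/* nothing jumps over the guard's initialization anymore */
	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);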

> >    drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
> >            guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >            ^
> >    include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
> >            CLASS(_name, __UNIQUE_ID(guard))
> >                         ^
> >    include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
> >    #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> >                                ^
> >    include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
> >    #define __PASTE(a,b) ___PASTE(a,b)
> >                         ^
> >    include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
> >    #define ___PASTE(a,b) a##b
> >                          ^
> >    <scratch space>:52:1: note: expanded from here
> >    __UNIQUE_ID_guard753
> >    ^
> >    1 error generated.
>
> I initially thought that this may have been
> https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
> involved here.
>
> This error occurs because jumping over the initialization of a variable
> declared with __attribute__((__cleanup__(...))) does not prevent the
> clean up function from running as one may expect it to, but could
> instead result in the clean up function getting run on uninitialized
> memory. A contrived example (see the bottom of the "Output" tabs for the
> execution output):
>
> https://godbolt.org/z/9bvGboxvc
>
> While there is a warning from GCC in that example, I don't see one in
> the kernel's case. I see there is an open GCC issue around this problem:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951
>
> While it is possible that there may not actually be a problem with how
> the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
> clang's behavior is reasonable given the potential footguns that this
> construct has.
>
> Cheers,
> Nathan
>


-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
  2023-12-16  4:12   ` kernel test robot
@ 2023-12-20  0:25   ` Alexei Starovoitov
  2024-01-04 19:29     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2023-12-20  0:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh,
	Martin KaFai Lau, Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a0f6da7b3ae5..5ba7509e88752 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>                 *pt_prev = NULL;
>         }
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>         tcx_set_ingress(skb, true);
>
> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>         if (!entry)
>                 return skb;
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>          * already set by the caller.
>          */
> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>                 u32 act;
>                 int err;
>
> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>                 if (act != XDP_PASS) {
>                         switch (act) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7c9653734fb60..72a7812f933a1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>   */
>  void xdp_do_flush(void)
>  {
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         __dev_flush();
>         __cpu_map_flush();
>         __xsk_map_flush();
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index a94943681e5aa..74b88e897a7e3 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>          * BPF prog and skb_do_redirect().
>          */
>         local_bh_disable();
> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>         bpf_compute_data_pointers(skb);
>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>
> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>                 break;
>         }
>
> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>         local_bh_enable();
>
>         return ret;
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 1976bd1639863..da61b99bc558f 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -23,6 +23,7 @@
>  #include <linux/jhash.h>
>  #include <linux/rculist.h>
>  #include <linux/rhashtable.h>
> +#include <linux/bpf.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  #include <net/netlink.h>
> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>
>         fl = rcu_dereference_bh(qe->filter_chain);
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>         case TC_ACT_SHOT:
>                 qdisc_qstats_drop(sch);

Here and in all other places this patch adds locks that
will kill performance of XDP, tcx and everything else in networking.

I'm surprised Jesper and other folks are not jumping in with nacks.
We measure performance in nanoseconds here.
Extra lock is no go.
Please find a different way without ruining performance.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 00/24] locking: Introduce nested-BH locking.
  2023-12-19  0:41     ` Jakub Kicinski
@ 2023-12-21 20:46       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-21 20:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael Chan, linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-12-18 16:41:42 [-0800], Jakub Kicinski wrote:
> 
> To make progress on other parts of the stack we could also take 
> the local lock around all of napi->poll() for now..

Okay. Thank you. I will look into this again next year.

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-20  0:25   ` Alexei Starovoitov
@ 2024-01-04 19:29     ` Toke Høiland-Jørgensen
  2024-01-12 17:41       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-04 19:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Sebastian Andrzej Siewior
  Cc: LKML, Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh,
	Martin KaFai Lau, Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 5a0f6da7b3ae5..5ba7509e88752 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>>                 *pt_prev = NULL;
>>         }
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>>         tcx_set_ingress(skb, true);
>>
>> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>>         if (!entry)
>>                 return skb;
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>>          * already set by the caller.
>>          */
>> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>>                 u32 act;
>>                 int err;
>>
>> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>>                 if (act != XDP_PASS) {
>>                         switch (act) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 7c9653734fb60..72a7812f933a1 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>   */
>>  void xdp_do_flush(void)
>>  {
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         __dev_flush();
>>         __cpu_map_flush();
>>         __xsk_map_flush();
>> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
>> index a94943681e5aa..74b88e897a7e3 100644
>> --- a/net/core/lwt_bpf.c
>> +++ b/net/core/lwt_bpf.c
>> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>          * BPF prog and skb_do_redirect().
>>          */
>>         local_bh_disable();
>> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>>         bpf_compute_data_pointers(skb);
>>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>>
>> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>                 break;
>>         }
>>
>> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>>         local_bh_enable();
>>
>>         return ret;
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 1976bd1639863..da61b99bc558f 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/jhash.h>
>>  #include <linux/rculist.h>
>>  #include <linux/rhashtable.h>
>> +#include <linux/bpf.h>
>>  #include <net/net_namespace.h>
>>  #include <net/sock.h>
>>  #include <net/netlink.h>
>> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>>
>>         fl = rcu_dereference_bh(qe->filter_chain);
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>>         case TC_ACT_SHOT:
>>                 qdisc_qstats_drop(sch);
>
> Here and in all other places this patch adds locks that
> will kill performance of XDP, tcx and everything else in networking.
>
> I'm surprised Jesper and other folks are not jumping in with nacks.
> We measure performance in nanoseconds here.
> Extra lock is no go.
> Please find a different way without ruining performance.

I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
believe there are people who are using XDP on PREEMPT_RT kernels and
still expect decent performance. And to achieve that it is absolutely
imperative that we can amortise expensive operations (such as locking)
over multiple packets.

I realise there's a fundamental trade-off between the amount of
amortisation and the latency hit that we take from holding locks for
longer, but tuning the batch size (while still keeping some amount of
batching) may be a way forward? I suppose Jakub's suggestion in the
other part of the thread, of putting the locks around napi->poll(), is a
step towards something like this.

-Toke


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock.
  2023-12-18  8:16   ` Paolo Abeni
@ 2024-01-11 16:19     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-11 16:19 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-12-18 09:16:49 [+0100], Paolo Abeni wrote:
> > --- a/include/linux/local_lock.h
> > +++ b/include/linux/local_lock.h
> > @@ -51,4 +51,15 @@
> >  #define local_unlock_irqrestore(lock, flags)			\
> >  	__local_unlock_irqrestore(lock, flags)
> >  
> > +DEFINE_LOCK_GUARD_1(local_lock, local_lock_t,
> > +		    local_lock(_T->lock),
> > +		    local_unlock(_T->lock))
> > +DEFINE_LOCK_GUARD_1(local_lock_irq, local_lock_t,
> > +		    local_lock_irq(_T->lock),
> > +		    local_unlock_irq(_T->lock))
> 
> DEFINE_GUARD should fit for the 2 above, right?

Yes, that will work. Let me update it.

> Cheers,
> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it.
  2023-12-18  7:48   ` Paolo Abeni
@ 2024-01-12  9:01     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12  9:01 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-12-18 08:48:56 [+0100], Paolo Abeni wrote:
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index b157efea5dea8..de9397e45718a 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -311,11 +311,8 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
> >  
> >  		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
> >  	} else {
> > -		struct napi_alloc_cache *nc;
> > -
> >  		local_bh_disable();
> > -		nc = this_cpu_ptr(&napi_alloc_cache);
> > -		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
> 
> There is a:
> 
> 	fragsz = SKB_DATA_ALIGN(fragsz);
> 
> statement just before the enclosing 'if'. I would consider moving such
> assignment inside the 'then' branch - since __napi_alloc_frag_align()
> already include that.

Okay, moved.

> /P
Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache.
  2023-12-16  4:43   ` kernel test robot
@ 2024-01-12 10:58     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 10:58 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-kernel, netdev, oe-kbuild-all, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Jakub Kicinski,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-12-16 12:43:23 [+0800], kernel test robot wrote:
> Hi Sebastian,
Hi,

> >> net/core/skbuff.c:302:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@

updated the guard definition for that.
Thanks for the report.

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-18  8:33   ` Paolo Abeni
@ 2024-01-12 11:23     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 11:23 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, David Ahern,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

On 2023-12-18 09:33:39 [+0100], Paolo Abeni wrote:
> > --- a/net/ipv6/seg6_local.c
> > +++ b/net/ipv6/seg6_local.c
> > @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
> >  	}
> >  	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
> >  
> > -	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
> > -	 * which is also accessed by the bpf_lwt_seg6_* helpers
> > +	/* The access to the per-CPU buffer srh_state is protected by running
> > +	 * always in softirq context (with disabled BH). On PREEMPT_RT the
> > +	 * required locking is provided by the following local_lock_nested_bh()
> > +	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> > +	 * bpf_prog_run_save_cb().
> >  	 */
> > -	preempt_disable();
> > -	srh_state->srh = srh;
> > -	srh_state->hdrlen = srh->hdrlen << 3;
> > -	srh_state->valid = true;
> > +	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> > +		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> > +		srh_state->srh = srh;
> > +		srh_state->hdrlen = srh->hdrlen << 3;
> > +		srh_state->valid = true;
> 
> Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
> the added indentation. What about using directly
> local_lock_nested_bh()/local_unlock_nested_bh() ?

If this is preferred, sure.

> Cheers,
> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-18  8:52   ` Daniel Borkmann
@ 2024-01-12 15:37     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 15:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	K. Y. Srinivasan, Michael S. Tsirkin, Alexei Starovoitov,
	Andrii Nakryiko, Dexuan Cui, Haiyang Zhang, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Juergen Gross,
	KP Singh, Martin KaFai Lau, Nikolay Aleksandrov, Song Liu,
	Stanislav Fomichev, Stefano Stabellini, Wei Liu,
	Willem de Bruijn, Xuan Zhuo, Yonghong Song, bpf, virtualization,
	xen-devel

On 2023-12-18 09:52:05 [+0100], Daniel Borkmann wrote:
> Hi Sebastian,
Hi Daniel,

> Please exclude netkit from this set given it does not support XDP, but
> instead only accepts tc BPF typed programs.

okay, thank you.

> Thanks,
> Daniel

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-04 19:29     ` Toke Høiland-Jørgensen
@ 2024-01-12 17:41       ` Sebastian Andrzej Siewior
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 17:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >>
> >>         fl = rcu_dereference_bh(qe->filter_chain);
> >>
> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> >>         case TC_ACT_SHOT:
> >>                 qdisc_qstats_drop(sch);
> >
> > Here and in all other places this patch adds locks that
> > will kill performance of XDP, tcx and everything else in networking.
> >
> > I'm surprised Jesper and other folks are not jumping in with nacks.
> > We measure performance in nanoseconds here.
> > Extra lock is no go.
> > Please find a different way without ruining performance.
> 
> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
> believe there are people who are using XDP on PREEMPT_RT kernels and
> still expect decent performance. And to achieve that it is absolutely
> imperative that we can amortise expensive operations (such as locking)
> over multiple packets.
> 
> I realise there's a fundamental trade-off between the amount of
> amortisation and the latency hit that we take from holding locks for
> longer, but tuning the batch size (while still keeping some amount of
> batching) may be a way forward? I suppose Jakub's suggestion in the
> other part of the thread, of putting the locks around napi->poll(), is a
> step towards something like this.

The RT requirements are usually different. Networking as in CAN might be
important but Ethernet could only be used for remote communication and so
"not" important. People complained that they need to wait for Ethernet
to be done until the CAN packet can be injected into the stack.
With that expectation you would like to pause Ethernet immediately and
switch over to the CAN interrupt thread.

But if someone managed to setup XDP then it is likely to be important.
With RT traffic it is usually not the throughput that matters but the
latency. You are likely in the position to receive a packet, say every
1ms, and need to respond immediately. XDP would be used to inspect the
packet and either hand it over to the stack or process it.

I expected the lock operation (under RT) to always succeed and not
cause any delay because it should not be contended. It should only
block if something with higher priority preempted the current interrupt
thread _and_ that also happens to use XDP on the same CPU. In that case
(XDP is needed) it would flush the current user out of the locked
section before the higher-prio thread could take over. Doing bulk
processing and allowing the low-priority thread to complete would delay
the high-priority thread. Maybe I am too pessimistic here and having
two XDP programs on one CPU is unlikely to happen.

Adding the lock on a per-NAPI basis would allow batching packets.
Acquiring the lock only if XDP is supported would not block the CAN
drivers since they don't support XDP. But it sounds like a hack.

Daniel said netkit doesn't need this locking because it does not
support this redirect, and it made me think. Would it work to make
the redirect structures part of the bpf_prog structure instead of
per-CPU? My understanding is that an eBPF program's data structures
are part of it and contain locking, allowing one eBPF program to
preempt another one.
Having the redirect structures as part of the program would make the
locking obsolete. Am I missing anything?

> -Toke

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: RE: [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: Use nested-BH locking for XDP redirect.
  2023-12-16 22:09   ` Kiyanovski, Arthur
@ 2024-01-12 17:53     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 17:53 UTC (permalink / raw)
  To: Kiyanovski, Arthur
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Arinzon, David,
	Igor Russkikh, Jesper Dangaard Brouer, John Fastabend,
	Michael Chan, Dagan, Noam, Bshara, Saeed, Agroskin, Shay,
	Sunil Goutham

On 2023-12-16 22:09:07 [+0000], Kiyanovski, Arthur wrote:
> Hi Sebastian,
 Arthur,

> I would like to make sure I understand correctly the difference in this patch
> between ena and atlantic drivers.
> 
> In the atlantic driver the change you've made seems like the best change
> in terms of making the critical section as small as possible.
> 
> You could have done exactly the same thing with ena, but you chose instead
> to let ena release the lock at the end of the function, which in case of an XDP_TX
> may make the critical section considerably longer than in the atlantic solution.
> 
> If I understand correctly (quote from your commit message "This does not
> always work because some drivers (cpsw, atlantic) invoke xdp_do_flush()
> in the same context"), in the case of atlantic you had to go for the more
> code-altering change, because if you simply used guard() you would include
> the xdp_do_flush() in the critical section, but in the case of ena xdp_do_flush()
> is called after the function ends so guard is good enough.
> 
> Questions:
> 1. Did I understand correctly the difference in solution choice between atlantic
> and ena?

Yes. I could have moved the "XDP_REDIRECT" case right after
bpf_prog_run_xdp() and used a scoped guard to make it slim in the ena
driver as well. I just did it "this" way because it was simpler; I did
not want to spend cycles on it unnecessarily, especially if I have to
maintain it for a few releases.
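
Something like this would have kept the locked section minimal (rough,
hypothetical sketch, not the patch I posted; names are from memory):

	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
		verdict = bpf_prog_run_xdp(xdp_prog, xdp);
		if (verdict == XDP_REDIRECT &&
		    xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog))
			verdict = XDP_ABORTED;
	}
	/* XDP_TX / XDP_PASS / XDP_DROP handled outside the locked section */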

> 2. As far as I can see the guard() solution looks good for ena except for (maybe?)
> XDP_TX, where the critical section becomes a bit long. Can you please explain,
> why you think it is still  good enough for ena to use the guard() solution instead
> of doing the more  code-altering atlantic solution?

Well, it was simpler/quicker. If this approach had been accepted and
this long section turned out to be a problem, it could have been
shortened afterwards. Maybe another function/method could be introduced
since this pattern fits ~90% of all drivers (see the sketch below).
However, it looks like touching all drivers is not what we want, so
avoiding spending a lot of cycles on it in the first place wasn't that
bad. (Also, it took three iterations until I got all the details right.)
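
Something along these lines is what I was thinking of for the common
pattern (a hypothetical helper, not part of this series):

	/*
	 * Run the XDP program and, for XDP_REDIRECT, take the nested-BH
	 * lock only around the redirect itself so the critical section
	 * stays as small as possible.
	 */
	static inline u32 xdp_run_prog_and_redirect(struct bpf_prog *prog,
						    struct xdp_buff *xdp,
						    struct net_device *dev)
	{
		u32 act = bpf_prog_run_xdp(prog, xdp);

		if (act == XDP_REDIRECT) {
			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
			if (xdp_do_redirect(dev, xdp, prog))
				act = XDP_ABORTED;
		}
		return act;
	}

Drivers would then only have to handle XDP_TX/XDP_PASS/XDP_DROP
themselves; where xdp_do_flush() ends up relative to the lock is the
part that still differs per driver (see atlantic/cpsw above).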

> Thanks!
> Arthur
> 
Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-12 17:41       ` Sebastian Andrzej Siewior
@ 2024-01-17 16:37         ` Toke Høiland-Jørgensen
  2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 60+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-17 16:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >>         fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >>         case TC_ACT_SHOT:
>> >>                 qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>> 
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>> 
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important but Ethernet might only be used for remote communication and
> so is "not" important. People complained that they need to wait for
> Ethernet to be done until the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over to the CAN interrupt thread.
>
> But if someone managed to setup XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.

I am not contesting that latency is important, but it's a pretty
fundamental trade-off and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; and there
are definitely people who want to use XDP on an RT kernel and still
achieve high PPS rates.

(Whether those people really strictly speaking need to be running an RT
kernel is maybe debatable, but it does happen).

> I expected the lock operation (under RT) to always succeed and not
> cause any delay because it should not be contended.

A lock does cause delay even when it's not contended. Bear in mind that
at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
packet (for 64-byte packets). So just the atomic op to figure out
whether there's any contention (around 10ns on the Intel processors I
usually test on) will blow a huge chunk of the total processing budget.
We can't actually do the full processing needed in those 64 nanoseconds
(not to mention the 6.4 nanoseconds we have available at 100Gbps), which
is why it's essential to amortise as much as we can over multiple
packets.
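
(For reference, the budget figure is plain line-rate arithmetic: with
preamble and inter-frame gap a minimum-size frame occupies 84 bytes on
the wire, so 10 Gbps works out to roughly 14.88 million packets per
second, i.e. a few tens of nanoseconds per packet, and 100 Gbps to ten
times that rate.)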

This is all back-of-the-envelope calculations, of course. Having some
actual numbers to look at would be great; I don't suppose you have a
setup where you can run xdp-bench and see how your patches affect the
throughput?

> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happens to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk processing and allowing the low-priority thread to complete
> would delay the high-priority thread. Maybe I am too pessimistic here
> and having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on a per-NAPI basis would allow batching packets.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers since they don't support XDP. But that sounds like a hack.

I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing in two stages, each
with their individual batching. So whereas right now we're doing
something like:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      run_netstack(pkt)  // this is the expensive bit
  bh_enable()

We could instead do:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  bh_enable()
  // sched point
  bh_disable()
  for pkt in to_stack_list:
    run_netstack(pkt)
  bh_enable()


This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
being that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So not really
sure if this approach is feasible, TBH.

> Daniel said netkit doesn't need this locking because it does not
> support this redirect, and it made me think. Would it work to make
> the redirect structures part of the bpf_prog structure instead of
> per-CPU? My understanding is that an eBPF program's data structures are
> part of it and contain locking, allowing one eBPF program to preempt
> another one.
> Having the redirect structures be part of the program would make the
> locking obsolete. Am I missing anything?

This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why this is all being kept in per-CPU variables today.

-Toke


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
@ 2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 60+ messages in thread
From: Jakub Kicinski @ 2024-01-18  2:04 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
> I am not contesting that latency is important, but it's a pretty
> fundamental trade-off and we don't want to kill throughput entirely
> either. Especially since this is global to the whole kernel; and there
> are definitely people who want to use XDP on an RT kernel and still
> achieve high PPS rates.
> 
> (Whether those people really strictly speaking need to be running an RT
> kernel is maybe debatable, but it does happen).
> 
> > I expected the lock operation (under RT) to always succeed and not
> > cause any delay because it should not be contended.
> 
> A lock does cause delay even when it's not contended. Bear in mind that
> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
> packet (for 64-byte packets). So just the atomic op to figure out
> whether there's any contention (around 10ns on the Intel processors I
> usually test on) will blow a huge chunk of the total processing budget.
> We can't actually do the full processing needed in those 64 nanoseconds
> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
> is why it's essential to amortise as much as we can over multiple
> packets.
> 
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

A potentially stupid idea which I have been turning over in my head is
how we could get away from having the driver handle details of NAPI
budgeting. It's a source of bugs and endless review comments.

All drivers end up maintaining a counter of "how many packets have
I processed" and comparing that against the budget. Would it be crazy
if we put that inside napi_struct? Add a "budget" member inside
napi_struct as well, and:

struct napi_struct {
...
	// poll state
	unsigned int budget;
	unsigned int rx_used;
...
}

static inline bool napi_rx_has_budget(napi)
{
	return napi->budget > napi->rx_used;
}

poll(napi) // no budget
{
	while (napi_rx_has_budget(napi)) {
		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
		// maybe add explicit napi_rx_count() if
		// driver did something funny with the frame.
	}
}

We can also create napi_tx_has_budget() so that people stop being
confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
so that people stop completing XDP in hard irq context (budget==0)...

And we can pass napi into napi_consume_skb(), instead of, presumably
inexplicably to a newcomer, passing in budget.
And napi_complete_done() can lose the work_done argument, too.

Oh, and I'm bringing it up here, because CONFIG_RT can throw
in "need_resched()" into the napi_rx_has_budget(), obviously.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
  2024-01-18  2:04           ` Jakub Kicinski
@ 2024-01-18  7:35           ` Sebastian Andrzej Siewior
  2024-01-18 11:58             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18  7:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

No but I probably could set it up.

> I chatted with Jesper about this, and he had an idea not too far from
> this: split up the XDP and regular stack processing in two stages, each
> with their individual batching. So whereas right now we're doing
> something like:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       run_netstack(pkt)  // this is the expensive bit
>   bh_enable()
> 
> We could instead do:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       add_to_list(pkt, to_stack_list)
>   bh_enable()
>   // sched point
>   bh_disable()
>   for pkt in to_stack_list:
>     run_netstack(pkt)
>   bh_enable()
> 
> 
> This would limit the batching that blocks everything to only the XDP
> processing itself, which should limit the maximum time spent in the
> blocking state significantly compared to what we have today. The caveat
> being that rearranging things like this is potentially a pretty major
> refactoring task that needs to touch all the drivers (even if some of
> the logic can be moved into the core code in the process). So not really
> sure if this approach is feasible, TBH.

This does not work because bh_disable() does not disable scheduling.
Scheduling may happen. bh_disable() acquires a lock which is currently
the only synchronisation point between, say, two network drivers doing
NAPI. And this is what I want to get rid of.
Regarding the expensive bit as in XDP_PASS: this doesn't need locking as
per the proposal, just the REDIRECT piece.

> > Daniel said netkit doesn't need this locking because it is not
> > supporting this redirect and it made me think. Would it work to make
> > the redirect structures part of the bpf_prog-structure instead of
> > per-CPU? My understanding is that eBPF's programs data structures are
> > part of it and contain locking allowing one eBPF program preempt
> > another one.
> > Having the redirect structures part of the program would obsolete
> > locking. Do I miss anything?
> 
> This won't work, unfortunately: the same XDP program can be attached to
> multiple interfaces simultaneously, and for hardware with multiple
> receive queues (which is most of the hardware that supports XDP), it can
> even run simultaneously on multiple CPUs on the same interface. This is
> the reason why this is all being kept in per-CPU variables today.

So I started hacking on this yesterday and noticed that you can run
multiple bpf programs. This is how I learned that it won't work.
My plan B is now to move it into task_struct.
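
Very roughly (only a sketch of the idea, the member is made up):

	struct task_struct {
		...
	#ifdef CONFIG_PREEMPT_RT
		/* hypothetical: redirect state follows the task, not the CPU */
		struct bpf_redirect_info	bpf_redirect_info;
	#endif
		...
	};

The redirect state would then travel with whoever is running the BPF
program, so a higher-priority task preempting on the same CPU could no
longer clobber it, which is the problem the per-CPU variant has on
PREEMPT_RT.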

> -Toke
Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  2:04           ` Jakub Kicinski
@ 2024-01-18  8:27             ` Sebastian Andrzej Siewior
  2024-01-18 16:38               ` Jakub Kicinski
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18  8:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> Oh, and I'm bringing it up here, because CONFIG_RT can throw
> in "need_resched()" into the napi_rx_has_budget(), obviously.

need_resched() does not work on PREEMPT_RT the way you think. This
context (the NAPI poll callback) is preemptible and (by default) runs at
SCHED_FIFO 50 (within a threaded IRQ), so a context switch to a task
with higher priority can happen at any time.
If threadA gets preempted and owns a lock that threadB, with higher
priority, wants, then threadA will get back on the CPU, inherit the
priority of threadB and continue to run until it releases the lock.

If this is the per-CPU BH lock (which I want to remove) then it will
continue until all softirqs complete.

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
@ 2024-01-18 11:51             ` Toke Høiland-Jørgensen
  2024-01-18 16:37               ` Jakub Kicinski
  1 sibling, 1 reply; 60+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-18 11:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>> 
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>> 
>> > I expected the lock operation (under RT) to always succeed and not
>> > cause any delay because it should not be contended.
>> 
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>> 
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning over in my head is
> how we could get away from having the driver handle details of NAPI
> budgeting. It's a source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> ...
> 	// poll state
> 	unsigned int budget;
> 	unsigned int rx_used;
> ...
> }
>
> static inline bool napi_rx_has_budget(napi)
> {
> 	return napi->budget > napi->rx_used;
> }
>
> poll(napi) // no budget
> {
> 	while (napi_rx_has_budget(napi)) {
> 		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> 		// maybe add explicit napi_rx_count() if
> 		// driver did something funny with the frame.
> 	}
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
> so that people stop completing XDP in hard irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb(), instead of, presumably
> inexplicably to a newcomer, passing in budget.
> And napi_complete_done() can lose the work_done argument, too.

I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll
API?

-Toke


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
@ 2024-01-18 11:58             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 60+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-18 11:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> No but I probably could set it up.

That would be great! Feel free to ping me if you need any pointers to
how we usually do the perf measurements :)

>> I chatted with Jesper about this, and he had an idea not too far from
>> this: split up the XDP and regular stack processing in two stages, each
>> with their individual batching. So whereas right now we're doing
>> something like:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       run_netstack(pkt)  // this is the expensive bit
>>   bh_enable()
>> 
>> We could instead do:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       add_to_list(pkt, to_stack_list)
>>   bh_enable()
>>   // sched point
>>   bh_disable()
>>   for pkt in to_stack_list:
>>     run_netstack(pkt)
>>   bh_enable()
>> 
>> 
>> This would limit the batching that blocks everything to only the XDP
>> processing itself, which should limit the maximum time spent in the
>> blocking state significantly compared to what we have today. The caveat
>> being that rearranging things like this is potentially a pretty major
>> refactoring task that needs to touch all the drivers (even if some of
>> the logic can be moved into the core code in the process). So not really
>> sure if this approach is feasible, TBH.
>
> This does not work because bh_disable() does not disable scheduling.
> Scheduling may happen. bh_disable() acquires a lock which is currently
> the only synchronisation point between two say network driver doing
> NAPI. And this what I want to get rid of.
> Regarding expensive bit as in XDP_PASS: This doesn't need locking as per
> proposal, just the REDIRECT piece.

Right, well s/bh_disable()/lock()/; my main point was splitting up the
processing so that the XDP processing itself and the stack activation on
XDP_PASS are not interleaved. This will make it possible to hold the lock
around the whole XDP batch, not just individual packets, and so retain
the performance we gain from amortising expensive operations over
multiple packets.
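
I.e. with your lock, the earlier sketch would become something like:

run_napi()
  lock()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  unlock()
  // sched point
  for pkt in to_stack_list:
    run_netstack(pkt)  // the expensive bit, now outside the lock

so the lock covers the whole XDP batch but never the stack processing.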

-Toke


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
@ 2024-01-18 16:37               ` Jakub Kicinski
  2024-01-20 14:41                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 60+ messages in thread
From: Jakub Kicinski @ 2024-01-18 16:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
> I do agree that conceptually it makes a lot of sense to encapsulate the
> budget like this so drivers don't have to do all this state tracking
> themselves. It does appear that drivers are doing different things with
> the budget as it is today, though. For instance, the intel drivers seem
> to divide the budget over all the enabled RX rings(?); so I'm wondering
> if it'll be possible to unify drivers around a more opaque NAPI poll API?

We can come up with APIs which would cater to multi-queue cases.
The bigger question is what the sensible polling strategy for those is;
just dividing the budget seems, hm, crude.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
@ 2024-01-18 16:38               ` Jakub Kicinski
  2024-01-18 16:50                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 60+ messages in thread
From: Jakub Kicinski @ 2024-01-18 16:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On Thu, 18 Jan 2024 09:27:54 +0100 Sebastian Andrzej Siewior wrote:
> On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> > Oh, and I'm bringing it up here, because CONFIG_RT can throw
> > in "need_resched()" into the napi_rx_has_budget(), obviously.  
> 
> need_resched() does not work on PREEMPT_RT the way you think. This
> context (the NAPI poll callback) is preemptible and (by default) runs at
> SCHED_FIFO 50 (within a threaded IRQ), so a context switch to a task
> with higher priority can happen at any time.
> If threadA gets preempted and owns a lock that threadB, with higher
> priority, wants, then threadA will get back on the CPU, inherit the
> priority of threadB and continue to run until it releases the lock.
> 
> If this is the per-CPU BH lock (which I want to remove) then it will
> continue until all softirqs complete.

So there's no way for a process to know on RT that someone with higher
prio is waiting for it to release its locks? :(

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 16:38               ` Jakub Kicinski
@ 2024-01-18 16:50                 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18 16:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On 2024-01-18 08:38:12 [-0800], Jakub Kicinski wrote:
> > If this is the per-CPU BH lock (which I want to remove) then it will
> > continue until all softirqs complete.
> 
> So there's no way for a process to know on RT that someone with higher
> prio is waiting for it to release its locks? :(

You could add a function to check if your current priority is inherited
from someone else and, if so, start dropping the locks you think are
responsible for it.
I made a PoC that appears to work for the timer_list timers, which is
one of the softirqs. This made me realise that I need it in more spots
and that I am doing it for the wrong reasons…
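
The check itself could be something as simple as this (just a sketch; it
relies on ->prio being boosted past ->normal_prio when priority
inheritance kicks in):

	static inline bool task_is_pi_boosted(const struct task_struct *p)
	{
		/* a lower numerical prio means a higher priority */
		return p->prio < p->normal_prio;
	}

and the loop holding the lock could poll it and leave the locked section
early once it returns true.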

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 16:37               ` Jakub Kicinski
@ 2024-01-20 14:41                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 60+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-20 14:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Paolo Abeni, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, Cong Wang,
	Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa,
	Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
>> I do agree that conceptually it makes a lot of sense to encapsulate the
>> budget like this so drivers don't have to do all this state tracking
>> themselves. It does appear that drivers are doing different things with
>> the budget as it is today, though. For instance, the intel drivers seem
>> to divide the budget over all the enabled RX rings(?); so I'm wondering
>> if it'll be possible to unify drivers around a more opaque NAPI poll API?
>
> We can come up with APIs which would cater to multi-queue cases.
> The bigger question is what the sensible polling strategy for those is;
> just dividing the budget seems, hm, crude.

Right, agreed, though I don't have a good answer for what else to do off
the top of my head...

-Toke


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2024-01-20 14:41 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-15 17:07 [PATCH net-next 00/24] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 01/24] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
2023-12-18  8:16   ` Paolo Abeni
2024-01-11 16:19     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 02/24] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 03/24] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
2023-12-18  7:48   ` Paolo Abeni
2024-01-12  9:01     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 04/24] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
2023-12-16  4:43   ` kernel test robot
2024-01-12 10:58     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 05/24] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 06/24] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 07/24] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 08/24] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 09/24] dev: Use the RPS lock for softnet_data::input_pkt_queue on PREEMPT_RT Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 10/24] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2023-12-16  3:39   ` kernel test robot
2023-12-18  8:33   ` Paolo Abeni
2024-01-12 11:23     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
2023-12-16  4:12   ` kernel test robot
2023-12-20  0:25   ` Alexei Starovoitov
2024-01-04 19:29     ` Toke Høiland-Jørgensen
2024-01-12 17:41       ` Sebastian Andrzej Siewior
2024-01-17 16:37         ` Toke Høiland-Jørgensen
2024-01-18  2:04           ` Jakub Kicinski
2024-01-18  8:27             ` Sebastian Andrzej Siewior
2024-01-18 16:38               ` Jakub Kicinski
2024-01-18 16:50                 ` Sebastian Andrzej Siewior
2024-01-18 11:51             ` Toke Høiland-Jørgensen
2024-01-18 16:37               ` Jakub Kicinski
2024-01-20 14:41                 ` Toke Høiland-Jørgensen
2024-01-18  7:35           ` Sebastian Andrzej Siewior
2024-01-18 11:58             ` Toke Høiland-Jørgensen
2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
2023-12-16 19:28   ` kernel test robot
2023-12-18  8:52   ` Daniel Borkmann
2024-01-12 15:37     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 17/24] net: amazon, aquanti, broadcom, cavium, engleder: " Sebastian Andrzej Siewior
2023-12-16 22:09   ` Kiyanovski, Arthur
2024-01-12 17:53     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
2023-12-16  4:53   ` kernel test robot
2023-12-19  0:01     ` Nathan Chancellor
2023-12-19 16:55       ` Nick Desaulniers
2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
2023-12-15 22:50 ` [PATCH net-next 00/24] locking: Introduce nested-BH locking Jakub Kicinski
2023-12-18 17:23   ` Sebastian Andrzej Siewior
2023-12-19  0:41     ` Jakub Kicinski
2023-12-21 20:46       ` Sebastian Andrzej Siewior
