[PATCH RFC v2 net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT

bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v2 net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT
@ 2024-02-29 18:17 Sebastian Andrzej Siewior
  2024-02-29 18:17 ` [PATCH RFC v2 net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
  2024-02-29 18:17 ` [PATCH RFC v2 net-next 2/2] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
  0 siblings, 2 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-29 18:17 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
	Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
	Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
	Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song

In [0] I introduced explicit locking for resources which are otherwise
locked implicit locked by local_bh_disable() and this protections goes
away if the lock in local_bh_disable() is removed on PREEMPT_RT.

There was a big complained once it came to the locking of XDP related
resources during XDP-redirect because it was mostly per-packet and the
locking, even not contended, was seen as a problem [1]. Another XDP
related problem was that I also touched every NIC-driver using XDP. This
complicated the "how to write a NIC driver" further.

To solve both problems I was thinking about an alternative and ended up
with moving the data structure from per-CPU to per-task on PREEMPT_RT.
Instead of adding it to task_struct, I added a pointer there and setup
the struct on stack. In my debug build on x86-64 the struct
bpf_xdp_storage has 112 bytes and collected the per-CPU ressources.

I've been benchmark the outcome on !RT. I meassured 14,880,077 pkt/s
with pktgen + "xdp-bench drop" on a ixgbe which is the interface limit.
I then lowered the clockspeed of the CPU (on the RX side) and received
approx between 12,156,279 and 12,476,759 pkt/s. The results were not
consistent and varied between runs mostly in that range with and without
the patches.

v1…v2 https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de
  - Renamed the container struct from xdp_storage to bpf_net_context. Suggested
    by Toke Høiland-Jørgensen.
  - Use the container struct also on !PREEMPT_RT builds. Store the pointer to
    the on-stack struct in a per-CPU variable. Suggested by Toke
    Høiland-Jørgensen.

I'm posting here only two patches which replace the XDP redirect part
(patches 14 to 24) from the original series.

[0] https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/
[1] https://lore.kernel.org/all/CAADnVQKJBpvfyvmgM29FLv+KpLwBBRggXWzwKzaCT9U-4bgxjA@mail.gmail.com/
[2] https://lore.kernel.org/all/20231215145059.3b42ee35@kernel.org

Sebastian

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH RFC v2 net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-02-29 18:17 [PATCH RFC v2 net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-02-29 18:17 ` Sebastian Andrzej Siewior
  2024-02-29 18:17 ` [PATCH RFC v2 net-next 2/2] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
  1 sibling, 0 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-29 18:17 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
	Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
	Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
	Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
	Sebastian Andrzej Siewior

The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it. Use the __free() annotation to automatically reset the pointer once
function returns.
The bpf_net_ctx_set() may nest. For instance local_bh_enable() in
bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which
also uses bpf_net_ctx_set(). Therefore only the first invocations
updates the per-task pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

On PREEMPT_RT the pointer to bpf_net_context is saved task's
task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
variable (which is always NODE-local memory). Using always the
bpf_net_context approach has the advantage that there is almost zero
differences between PREEMPT_RT and non-PREEMPT_RT builds.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 96 ++++++++++++++++++++++++++++++++++++++---
 include/linux/sched.h  |  5 +++
 kernel/bpf/cpumap.c    |  2 +
 kernel/fork.c          |  3 ++
 net/bpf/test_run.c     |  9 +++-
 net/core/dev.c         | 17 ++++++++
 net/core/filter.c      | 97 ++++++++++++++++++++++++++----------------
 net/core/lwt_bpf.c     |  3 ++
 8 files changed, 187 insertions(+), 45 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 68fb6c8142fec..6d065991f9e9c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -704,7 +704,83 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+#ifndef CONFIG_PREEMPT_RT
+DECLARE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != NULL)
+		return NULL;
+	this_cpu_write(bpf_net_context, bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct bpf_net_context *ctx;
+
+	ctx = this_cpu_read(bpf_net_context);
+	if (ctx != bpf_net_ctx)
+		return;
+	this_cpu_write(bpf_net_context, NULL);
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = this_cpu_read(bpf_net_context);
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#else
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != bpf_net_ctx)
+		return;
+	tsk->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	struct bpf_net_context *bpf_net_ctx = current->bpf_net_context;
+
+	WARN_ON_ONCE(!bpf_net_ctx);
+	return bpf_net_ctx;
+}
+
+#endif
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -974,23 +1050,27 @@ void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
+	if (!ri)
+		return false;
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+	if (ri)
+		ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
@@ -1544,9 +1624,11 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
+	if (!ri)
+		return XDP_ABORTED;
 	/* Lower bits of the flags are used as return code on lookup failure */
 	if (unlikely(flags & ~(action_mask | flag_mask)))
 		return XDP_ABORTED;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 111f388d65323..e3a50de62015e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1501,6 +1502,10 @@ struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
+#endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..6d140f0769d4c 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -261,10 +261,12 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 static int cpu_map_kthread_run(void *data)
 {
+	struct bpf_net_context bpf_net_ctx __cleanup(bpf_net_ctx_clear);
 	struct bpf_cpu_map_entry *rcpu = data;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
+	bpf_net_ctx_set(&bpf_net_ctx);
 
 	/* When kthread gives stop order, then rcpu have been disconnected
 	 * from map, thus no new packets can enter. Remaining in-flight
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43ff..73fe3bf7b8838 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2462,6 +2462,9 @@ __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+#ifdef CONFIG_PREEMPT_RT
+	p->bpf_net_context =  NULL;
+#endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd9193740178..3e7f5404d177f 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -281,7 +281,7 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
 	struct xdp_page_head *head;
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -365,8 +366,10 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 				 u32 repeat, u32 batch_size, u32 *time)
 
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct xdp_test_data xdp = { .batch_size = batch_size };
 	struct bpf_test_timer t = { .mode = NO_MIGRATE };
+	struct bpf_net_context __bpf_net_ctx;
 	int ret;
 
 	if (!repeat)
@@ -376,6 +379,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 	if (ret)
 		return ret;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_test_timer_enter(&t);
 	do {
 		xdp.frame_cnt = 0;
@@ -392,7 +396,9 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_prog_array_item item = {.prog = prog};
+	struct bpf_net_context __bpf_net_ctx;
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
 	struct bpf_test_timer t = { NO_MIGRATE };
@@ -412,6 +418,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	if (!repeat)
 		repeat = 1;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_test_timer_enter(&t);
 	old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	do {
diff --git a/net/core/dev.c b/net/core/dev.c
index b94ce3a394b7c..95c2cd971e03c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4043,13 +4047,17 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6239,11 +6247,14 @@ void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
 
@@ -6716,10 +6727,13 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 
 static int napi_threaded_poll(void *data)
 {
+	struct bpf_net_context bpf_net_ctx __cleanup(bpf_net_ctx_clear);
 	struct napi_struct *napi = data;
 	struct softnet_data *sd;
 	void *have;
 
+	bpf_net_ctx_set(&bpf_net_ctx);
+
 	while (!napi_thread_wait(napi)) {
 		for (;;) {
 			bool repoll = false;
@@ -6753,13 +6767,16 @@ static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(netdev_budget_usecs));
 	int budget = READ_ONCE(netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index eb8d5a0a0ec8f..3ea58b830caa8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2473,8 +2473,10 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
+#ifndef CONFIG_PREEMPT_RT
+DEFINE_PER_CPU(struct bpf_net_context *, bpf_net_context);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_net_context);
+#endif
 
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
@@ -2488,11 +2490,15 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net *net = dev_net(skb->dev);
+	struct bpf_redirect_info *ri;
 	struct net_device *dev;
-	u32 flags = ri->flags;
+	u32 flags;
 
+	ri = bpf_net_ctx_get_ri();
+	if (!ri)
+		goto out_drop;
+	flags = ri->flags;
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2521,9 +2527,9 @@ int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = flags;
@@ -2542,9 +2548,9 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_PEER;
@@ -2564,9 +2570,9 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) || flags || !ri))
 		return TC_ACT_SHOT;
 
 	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
@@ -4292,19 +4298,17 @@ void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
+	 * program/ during NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protocted and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exist.  dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4313,11 +4317,14 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri;
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
+		ri = bpf_net_ctx_get_ri();
+		if (!ri)
+			return XDP_ABORTED;
 		/* The target device is different from the receiving device, so
 		 * redirect it to the new device.
 		 * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
@@ -4419,10 +4426,12 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
@@ -4433,10 +4442,12 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (map_type == BPF_MAP_TYPE_XSKMAP)
+	if (!ri)
+		return -EINVAL;
+
+	if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
 	return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
@@ -4450,10 +4461,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       void *fwd,
 				       enum bpf_map_type map_type, u32 map_id)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
 	switch (map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 		fallthrough;
@@ -4495,12 +4510,20 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	enum bpf_map_type map_type = ri->map_type;
-	void *fwd = ri->tgt_value;
-	u32 map_id = ri->map_id;
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
+	enum bpf_map_type map_type;
+	void *fwd;
+	u32 map_id;
 	int err;
 
+	if (!ri) {
+		err = -EINVAL;
+		goto err;
+	}
+	map_type = ri->map_type;
+	fwd = ri->tgt_value;
+	map_id = ri->map_id;
+
 	ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
 	ri->map_type = BPF_MAP_TYPE_UNSPEC;
 
@@ -4529,9 +4552,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
-	if (unlikely(flags))
+	if (unlikely(flags || !ri))
 		return XDP_ABORTED;
 
 	/* NB! Map type UNSPEC and map_id == INT_MAX (never generated
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..09012d0daabe9 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,15 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
+	struct bpf_net_context __bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH RFC v2 net-next 2/2] net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.
  2024-02-29 18:17 [PATCH RFC v2 net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
  2024-02-29 18:17 ` [PATCH RFC v2 net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
@ 2024-02-29 18:17 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-29 18:17 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
	Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
	Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
	Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
	Sebastian Andrzej Siewior

The per-CPU flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are per-CPU. There are subject to the
same problem as struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
access.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c    | 24 ++++++++----------------
 kernel/bpf/devmap.c    | 16 ++++++++--------
 net/xdp/xsk.c          | 19 +++++++++++--------
 4 files changed, 65 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6d065991f9e9c..41c3dbbc4d163 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -706,6 +706,9 @@ struct bpf_redirect_info {
 
 struct bpf_net_context {
 	struct bpf_redirect_info ri;
+	struct list_head cpu_map_flush_list;
+	struct list_head dev_map_flush_list;
+	struct list_head xskmap_map_flush_list;
 };
 
 #ifndef CONFIG_PREEMPT_RT
@@ -718,6 +721,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 	ctx = this_cpu_read(bpf_net_context);
 	if (ctx != NULL)
 		return NULL;
+	INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
 	this_cpu_write(bpf_net_context, bpf_net_ctx);
 	return bpf_net_ctx;
 }
@@ -748,6 +755,10 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 
 	if (tsk->bpf_net_context != NULL)
 		return NULL;
+	INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+	INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
 	tsk->bpf_net_context = bpf_net_ctx;
 	return bpf_net_ctx;
 }
@@ -780,6 +791,33 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
 	return &bpf_net_ctx->ri;
 }
 
+static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->cpu_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->dev_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	if (!bpf_net_ctx)
+		return NULL;
+	return &bpf_net_ctx->xskmap_map_flush_list;
+}
+
 DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, if (_T) bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 6d140f0769d4c..59677cea28e18 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -78,8 +78,6 @@ struct bpf_cpu_map {
 	struct bpf_cpu_map_entry __rcu **cpu_map;
 };
 
-static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
-
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
@@ -703,7 +701,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -755,9 +753,12 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 
 void __cpu_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_flush_to_queue(bq);
 
@@ -769,20 +770,11 @@ void __cpu_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool cpu_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
+
+	if (!flush_list || list_empty(bpf_net_ctx_get_cpu_map_flush_list()))
 		return false;
 	__cpu_map_flush();
 	return true;
 }
 #endif
-
-static int __init cpu_map_init(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
-	return 0;
-}
-
-subsys_initcall(cpu_map_init);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a936c704d4e77..463f3c5eaa895 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,6 @@ struct bpf_dtab {
 	u32 n_buckets;
 };
 
-static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
@@ -407,9 +406,12 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
  */
 void __dev_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
+	if (!flush_list)
+		return;
+
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_xmit_all(bq, XDP_XMIT_FLUSH);
 		bq->dev_rx = NULL;
@@ -421,7 +423,9 @@ void __dev_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool dev_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&dev_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
+
+	if (!flush_list || list_empty(bpf_net_ctx_get_dev_flush_list()))
 		return false;
 	__dev_flush();
 	return true;
@@ -452,7 +456,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1155,15 +1159,11 @@ static struct notifier_block dev_map_notifier = {
 
 static int __init dev_map_init(void)
 {
-	int cpu;
-
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
 		     offsetof(struct _bpf_dtab_netdev, dev));
 	register_netdevice_notifier(&dev_map_notifier);
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
 	return 0;
 }
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b78c0e095e221..7a130d0871de1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,6 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
 
-static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -372,9 +370,12 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	int err;
 
+	if (!flush_list)
+		return -EINVAL;
+
 	err = xsk_rcv(xs, xdp);
 	if (err)
 		return err;
@@ -387,9 +388,11 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 void __xsk_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	struct xdp_sock *xs, *tmp;
 
+	if (!flush_list)
+		return;
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
 		xsk_flush(xs);
 		__list_del_clearprev(&xs->flush_node);
@@ -399,7 +402,9 @@ void __xsk_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool xsk_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
+
+	if (!flush_list || list_empty(flush_list))
 		return false;
 	__xsk_map_flush();
 	return true;
@@ -1770,7 +1775,7 @@ static struct pernet_operations xsk_net_ops = {
 
 static int __init xsk_init(void)
 {
-	int err, cpu;
+	int err;
 
 	err = proto_register(&xsk_proto, 0 /* no slab */);
 	if (err)
@@ -1788,8 +1793,6 @@ static int __init xsk_init(void)
 	if (err)
 		goto out_pernet;
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
 	return 0;
 
 out_pernet:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-02-29 18:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-29 18:17 [PATCH RFC v2 net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
2024-02-29 18:17 ` [PATCH RFC v2 net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
2024-02-29 18:17 ` [PATCH RFC v2 net-next 2/2] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).