* [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT @ 2024-02-13 14:58 Sebastian Andrzej Siewior 2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior 2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior 0 siblings, 2 replies; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw) To: bpf, netdev Cc: Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song In [0] I introduced explicit locking for resources which are otherwise implicitly locked by local_bh_disable(), and this protection goes away if the lock in local_bh_disable() is removed on PREEMPT_RT. There were loud complaints once it came to the locking of XDP-related resources during XDP-redirect, because it is mostly per-packet and the locking, even when not contended, was seen as a problem [1]. Another XDP-related problem was that I also touched every NIC driver using XDP, which complicated the "how to write a NIC driver" story further. To solve both problems I looked for an alternative and ended up moving the data structure from per-CPU to per-task on PREEMPT_RT. Instead of adding the struct to task_struct, I added only a pointer there and set up the struct on the stack. In my debug build on x86-64, struct bpf_xdp_storage is 112 bytes and collects the former per-CPU resources. I'm posting here only two patches which replace the XDP-redirect part (patches 14 to 24) of the original series. 
[0] https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/ [1] https://lore.kernel.org/all/CAADnVQKJBpvfyvmgM29FLv+KpLwBBRggXWzwKzaCT9U-4bgxjA@mail.gmail.com/ [2] https://lore.kernel.org/all/20231215145059.3b42ee35@kernel.org Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
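The per-task pointer scheme from the cover letter can be illustrated with a small userspace sketch. This is not the kernel code: the struct and function names merely mirror the patch for readability, the stored fields are placeholders, and the per-task pointer (current->bpf_xdp_storage in the patch) is modeled as a plain global.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for the redirect state that the patch collects. */
struct bpf_redirect_info_sketch {
	unsigned int kern_flags;
};

struct bpf_xdp_storage_sketch {
	struct bpf_redirect_info_sketch ri;
};

/* Stand-in for the kernel's per-task pointer. */
static struct bpf_xdp_storage_sketch *task_storage;

/* First caller wins; nested callers get NULL back and thus never
 * overwrite or clear the outer frame's pointer. */
static struct bpf_xdp_storage_sketch *
xdp_storage_set(struct bpf_xdp_storage_sketch *store)
{
	if (task_storage)
		return NULL;	/* already set by an outer frame */
	task_storage = store;
	return store;
}

/* Only the frame that owns the pointer may reset it. */
static void xdp_storage_clear(struct bpf_xdp_storage_sketch *store)
{
	if (task_storage != store)
		return;
	task_storage = NULL;
}
```

Because the struct lives on the stack of the thread that set the pointer, no two threads ever share one instance, which is what makes the locking unnecessary.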
* [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior @ 2024-02-13 14:58 ` Sebastian Andrzej Siewior 2024-02-13 20:50 ` Jesper Dangaard Brouer 2024-02-14 16:13 ` Toke Høiland-Jørgensen 2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior 1 sibling, 2 replies; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw) To: bpf, netdev Cc: Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song, Sebastian Andrzej Siewior The XDP redirect process is two-staged: - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the packet and makes decisions. While doing that, the per-CPU variable bpf_redirect_info is used. - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info and it may also access other per-CPU variables like xskmap_flush_list. At the very end of the NAPI callback, xdp_do_flush() is invoked which does not access bpf_redirect_info but will touch the individual per-CPU lists. The per-CPU variables are only used in the NAPI callback, hence disabling bottom halves is the only protection mechanism. Users from preemptible context (like cpu_map_kthread_run()) explicitly disable bottom halves for protection reasons. Without locking in local_bh_disable() on PREEMPT_RT, this data structure requires explicit locking to avoid corruption if preemption occurs. PREEMPT_RT has forced-threaded interrupts enabled and every NAPI-callback runs in a thread. 
If each thread has its own data structure then locking can be avoided and data corruption is also avoided. Create a struct bpf_xdp_storage which contains struct bpf_redirect_info. Define the variable on the stack, use xdp_storage_set() to set a pointer to it in the task_struct of the current task. Use the __free() annotation to automatically reset the pointer once the function returns. Use a pointer which can be used by the __free() annotation to avoid invoking the callback if the pointer is NULL. This helps the compiler to optimize the code. The xdp_storage_set() can nest. For instance local_bh_enable() in bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which also uses xdp_storage_set(). Therefore only the first invocation updates the per-task pointer. Use xdp_storage_get_ri() as a wrapper to retrieve the current struct bpf_redirect_info. This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the per-CPU variable instead. This should also work for !PREEMPT_RT but isn't needed. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> --- include/linux/filter.h | 79 ++++++++++++++++++++++++++++++++++++--- include/linux/sched.h | 5 +++ kernel/bpf/cpumap.c | 2 + kernel/fork.c | 3 ++ net/bpf/test_run.c | 9 ++++- net/core/dev.c | 17 +++++++++ net/core/filter.c | 85 +++++++++++++++++++++++++++++++----------- net/core/lwt_bpf.c | 3 ++ 8 files changed, 174 insertions(+), 29 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 68fb6c8142fec..97c9be9cabfd6 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -704,8 +704,69 @@ struct bpf_redirect_info { struct bpf_nh_params nh; }; +#ifndef CONFIG_PREEMPT_RT DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); +struct bpf_xdp_storage { }; + +static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store) +{ + return NULL; +} + +static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store) { } + +static inline struct bpf_redirect_info *xdp_storage_get_ri(void) +{ + return this_cpu_ptr(&bpf_redirect_info); +} + +#else + +struct bpf_xdp_storage { + struct bpf_redirect_info ri; +}; + +static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store) +{ + struct task_struct *tsk; + + tsk = current; + if (tsk->bpf_xdp_storage != NULL) + return NULL; + tsk->bpf_xdp_storage = xdp_store; + return xdp_store; +} + +static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store) +{ + struct task_struct *tsk; + + tsk = current; + if (tsk->bpf_xdp_storage != xdp_store) + return; + tsk->bpf_xdp_storage = NULL; +} + +static inline struct bpf_xdp_storage *xdp_storage_get(void) +{ + struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage; + + WARN_ON_ONCE(!xdp_store); + return xdp_store; +} + +static inline struct bpf_redirect_info *xdp_storage_get_ri(void) +{ + struct bpf_xdp_storage *xdp_store = xdp_storage_get(); + + if (!xdp_store) + return NULL; + return &xdp_store->ri; +} +#endif 
+DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *, if (_T) xdp_storage_clear(_T)); + /* flags for bpf_redirect_info kern_flags */ #define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */ @@ -974,23 +1035,27 @@ void bpf_clear_redirect_map(struct bpf_map *map); static inline bool xdp_return_frame_no_direct(void) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); + if (!ri) + return false; return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT; } static inline void xdp_set_return_frame_no_direct(void) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT; + if (ri) + ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT; } static inline void xdp_clear_return_frame_no_direct(void) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT; + if (ri) + ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT; } static inline int xdp_ok_fwd_dev(const struct net_device *fwd, @@ -1544,9 +1609,11 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde u64 flags, const u64 flag_mask, void *lookup_elem(struct bpf_map *map, u32 key)) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX; + if (!ri) + return XDP_ABORTED; /* Lower bits of the flags are used as return code on lookup failure */ if (unlikely(flags & ~(action_mask | flag_mask))) return XDP_ABORTED; diff --git a/include/linux/sched.h b/include/linux/sched.h index 111f388d65323..179ea10ae1fd1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -53,6 +53,7 @@ struct bio_list; struct blk_plug; struct bpf_local_storage; struct bpf_run_ctx; +struct bpf_xdp_storage; struct 
capture_control; struct cfs_rq; struct fs_struct; @@ -1501,6 +1502,10 @@ struct task_struct { /* Used for BPF run context */ struct bpf_run_ctx *bpf_ctx; #endif +#ifdef CONFIG_PREEMPT_RT + /* Used by BPF for per-TASK xdp storage */ + struct bpf_xdp_storage *bpf_xdp_storage; +#endif #ifdef CONFIG_GCC_PLUGIN_STACKLEAK unsigned long lowest_stack; diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 8a0bb80fe48a3..c40ae831ab1a6 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -261,10 +261,12 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames, static int cpu_map_kthread_run(void *data) { + struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear); struct bpf_cpu_map_entry *rcpu = data; complete(&rcpu->kthread_running); set_current_state(TASK_INTERRUPTIBLE); + xdp_storage_set(&xdp_store); /* When kthread gives stop order, then rcpu have been disconnected * from map, thus no new packets can enter. Remaining in-flight diff --git a/kernel/fork.c b/kernel/fork.c index 0d944e92a43ff..0d8eb8d20963e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2462,6 +2462,9 @@ __latent_entropy struct task_struct *copy_process( RCU_INIT_POINTER(p->bpf_storage, NULL); p->bpf_ctx = NULL; #endif +#ifdef CONFIG_PREEMPT_RT + p->bpf_xdp_storage = NULL; +#endif /* Perform scheduler related setup. Assign this task to a CPU. 
*/ retval = sched_fork(clone_flags, p); diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index dfd9193740178..50902d5254115 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -281,7 +281,7 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes, static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog, u32 repeat) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri; int err = 0, act, ret, i, nframes = 0, batch_sz; struct xdp_frame **frames = xdp->frames; struct xdp_page_head *head; @@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog, batch_sz = min_t(u32, repeat, xdp->batch_size); local_bh_disable(); + ri = xdp_storage_get_ri(); xdp_set_return_frame_no_direct(); for (i = 0; i < batch_sz; i++) { @@ -365,8 +366,10 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx, u32 repeat, u32 batch_size, u32 *time) { + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; struct xdp_test_data xdp = { .batch_size = batch_size }; struct bpf_test_timer t = { .mode = NO_MIGRATE }; + struct bpf_xdp_storage __xdp_store; int ret; if (!repeat) @@ -376,6 +379,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx, if (ret) return ret; + xdp_store = xdp_storage_set(&__xdp_store); bpf_test_timer_enter(&t); do { xdp.frame_cnt = 0; @@ -392,7 +396,9 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx, static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *retval, u32 *time, bool xdp) { + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; struct bpf_prog_array_item item = {.prog = prog}; + struct bpf_xdp_storage __xdp_store; struct bpf_run_ctx *old_ctx; struct bpf_cg_run_ctx run_ctx; struct bpf_test_timer t = { NO_MIGRATE }; @@ -412,6 +418,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, if 
(!repeat) repeat = 1; + xdp_store = xdp_storage_set(&__xdp_store); bpf_test_timer_enter(&t); old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); do { diff --git a/net/core/dev.c b/net/core/dev.c index de362d5f26559..c3f7d2a6b6134 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, struct net_device *orig_dev, bool *another) { struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress); + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS; + struct bpf_xdp_storage __xdp_store; int sch_ret; if (!entry) return skb; + + xdp_store = xdp_storage_set(&__xdp_store); if (*pt_prev) { *ret = deliver_skb(skb, *pt_prev, orig_dev); *pt_prev = NULL; @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff * sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) { struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress); + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS; + struct bpf_xdp_storage __xdp_store; int sch_ret; if (!entry) return skb; + xdp_store = xdp_storage_set(&__xdp_store); + /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was * already set by the caller. */ @@ -6240,10 +6248,13 @@ void napi_busy_loop(unsigned int napi_id, void *loop_end_arg, bool prefer_busy_poll, u16 budget) { unsigned long start_time = loop_end ? 
busy_loop_current_time() : 0; + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; int (*napi_poll)(struct napi_struct *napi, int budget); + struct bpf_xdp_storage __xdp_store; void *have_poll_lock = NULL; struct napi_struct *napi; + xdp_store = xdp_storage_set(&__xdp_store); restart: napi_poll = NULL; @@ -6716,10 +6727,13 @@ static void skb_defer_free_flush(struct softnet_data *sd) static int napi_threaded_poll(void *data) { + struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear); struct napi_struct *napi = data; struct softnet_data *sd; void *have; + xdp_storage_set(&xdp_store); + while (!napi_thread_wait(napi)) { for (;;) { bool repoll = false; @@ -6753,13 +6767,16 @@ static int napi_threaded_poll(void *data) static __latent_entropy void net_rx_action(struct softirq_action *h) { + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; struct softnet_data *sd = this_cpu_ptr(&softnet_data); unsigned long time_limit = jiffies + usecs_to_jiffies(READ_ONCE(netdev_budget_usecs)); int budget = READ_ONCE(netdev_budget); + struct bpf_xdp_storage __xdp_store; LIST_HEAD(list); LIST_HEAD(repoll); + xdp_store = xdp_storage_set(&__xdp_store); start: sd->in_net_rx_action = true; local_irq_disable(); diff --git a/net/core/filter.c b/net/core/filter.c index eb8d5a0a0ec8f..5721acb15d40f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2473,8 +2473,10 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = { .arg3_type = ARG_ANYTHING, }; +#ifndef CONFIG_PREEMPT_RT DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info); +#endif static struct net_device *skb_get_peer_dev(struct net_device *dev) { @@ -2488,11 +2490,15 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev) int skb_do_redirect(struct sk_buff *skb) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); struct net *net = dev_net(skb->dev); + struct bpf_redirect_info *ri; struct 
net_device *dev; - u32 flags = ri->flags; + u32 flags; + ri = xdp_storage_get_ri(); + if (!ri) + goto out_drop; + flags = ri->flags; dev = dev_get_by_index_rcu(net, ri->tgt_index); ri->tgt_index = 0; ri->flags = 0; @@ -2521,9 +2527,9 @@ int skb_do_redirect(struct sk_buff *skb) BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL))) + if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri)) return TC_ACT_SHOT; ri->flags = flags; @@ -2542,9 +2548,9 @@ static const struct bpf_func_proto bpf_redirect_proto = { BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - if (unlikely(flags)) + if (unlikely(flags || !ri)) return TC_ACT_SHOT; ri->flags = BPF_F_PEER; @@ -2564,9 +2570,9 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = { BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params, int, plen, u64, flags) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - if (unlikely((plen && plen < sizeof(*params)) || flags)) + if (unlikely((plen && plen < sizeof(*params)) || flags || !ri)) return TC_ACT_SHOT; ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0); @@ -4292,6 +4298,7 @@ void xdp_do_check_flushed(struct napi_struct *napi) void bpf_clear_redirect_map(struct bpf_map *map) { +#ifndef CONFIG_PREEMPT_RT struct bpf_redirect_info *ri; int cpu; @@ -4305,6 +4312,19 @@ void bpf_clear_redirect_map(struct bpf_map *map) if (unlikely(READ_ONCE(ri->map) == map)) cmpxchg(&ri->map, map, NULL); } +#else + /* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF + * program/ during NAPI callback. 
It is used during + * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the + * redirect callback afterwards. ri->map is cleared after usage. + * The path has no explicit RCU read section but the local_bh_disable() + * is also a RCU read section which makes the complete softirq callback + * RCU protected. This in turn makes ri->map RCU protected and it is + * sufficient to wait a grace period to ensure that no "ri->map == map" + * exists. dev_map_free() removes the map from the list and then + * invokes synchronize_rcu() after calling this function. + */ +#endif } DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key); @@ -4313,11 +4333,14 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key); u32 xdp_master_redirect(struct xdp_buff *xdp) { struct net_device *master, *slave; - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri; master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev); slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp); if (slave && slave != xdp->rxq->dev) { + ri = xdp_storage_get_ri(); + if (!ri) + return XDP_ABORTED; /* The target device is different from the receiving device, so * redirect it to the new device. 
* Using XDP_REDIRECT gets the correct behaviour from XDP enabled @@ -4419,10 +4442,13 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri, int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); - enum bpf_map_type map_type = ri->map_type; + struct bpf_redirect_info *ri; - if (map_type == BPF_MAP_TYPE_XSKMAP) + ri = xdp_storage_get_ri(); + if (!ri) + return -EINVAL; + + if (ri->map_type == BPF_MAP_TYPE_XSKMAP) return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog); return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp), @@ -4433,10 +4459,13 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect); int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp, struct xdp_frame *xdpf, struct bpf_prog *xdp_prog) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); - enum bpf_map_type map_type = ri->map_type; + struct bpf_redirect_info *ri; - if (map_type == BPF_MAP_TYPE_XSKMAP) + ri = xdp_storage_get_ri(); + if (!ri) + return -EINVAL; + + if (ri->map_type == BPF_MAP_TYPE_XSKMAP) return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog); return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog); @@ -4450,10 +4479,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev, void *fwd, enum bpf_map_type map_type, u32 map_id) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); struct bpf_map *map; int err; + if (!ri) { + err = -EINVAL; + goto err; + } switch (map_type) { case BPF_MAP_TYPE_DEVMAP: fallthrough; @@ -4495,12 +4528,20 @@ static int xdp_do_generic_redirect_map(struct net_device *dev, int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); - enum bpf_map_type map_type = ri->map_type; - void *fwd = 
ri->tgt_value; - u32 map_id = ri->map_id; + struct bpf_redirect_info *ri = xdp_storage_get_ri(); + enum bpf_map_type map_type; + void *fwd; + u32 map_id; int err; + if (!ri) { + err = -EINVAL; + goto err; + } + map_type = ri->map_type; + fwd = ri->tgt_value; + map_id = ri->map_id; + ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */ ri->map_type = BPF_MAP_TYPE_UNSPEC; @@ -4529,9 +4570,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags) { - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_redirect_info *ri = xdp_storage_get_ri(); - if (unlikely(flags)) + if (unlikely(flags || !ri)) return XDP_ABORTED; /* NB! Map type UNSPEC and map_id == INT_MAX (never generated diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index a94943681e5aa..54690b85f1fe6 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -38,12 +38,15 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt) static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, struct dst_entry *dst, bool can_redirect) { + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; + struct bpf_xdp_storage __xdp_store; int ret; /* Disabling BH is needed to protect per-CPU bpf_redirect_info between * BPF prog and skb_do_redirect(). */ local_bh_disable(); + xdp_store = xdp_storage_set(&__xdp_store); bpf_compute_data_pointers(skb); ret = bpf_prog_run_save_cb(lwt->prog, skb); -- 2.43.0 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior @ 2024-02-13 20:50 ` Jesper Dangaard Brouer 2024-02-14 12:19 ` Sebastian Andrzej Siewior 2024-02-14 16:13 ` Toke Høiland-Jørgensen 1 sibling, 1 reply; 25+ messages in thread From: Jesper Dangaard Brouer @ 2024-02-13 20:50 UTC (permalink / raw) To: Sebastian Andrzej Siewior, bpf, netdev Cc: Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 13/02/2024 15.58, Sebastian Andrzej Siewior wrote: > The XDP redirect process is two staged: > - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the > packet and makes decisions. While doing that, the per-CPU variable > bpf_redirect_info is used. > > - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info > and it may also access other per-CPU variables like xskmap_flush_list. > > At the very end of the NAPI callback, xdp_do_flush() is invoked which > does not access bpf_redirect_info but will touch the individual per-CPU > lists. > > The per-CPU variables are only used in the NAPI callback hence disabling > bottom halves is the only protection mechanism. Users from preemptible > context (like cpu_map_kthread_run()) explicitly disable bottom halves > for protections reasons. > Without locking in local_bh_disable() on PREEMPT_RT this data structure > requires explicit locking to avoid corruption if preemption occurs. > > PREEMPT_RT has forced-threaded interrupts enabled and every > NAPI-callback runs in a thread. 
If each thread has its own data > structure then locking can be avoided and data corruption is also avoided. > > Create a struct bpf_xdp_storage which contains struct bpf_redirect_info. > Define the variable on stack, use xdp_storage_set() to set a pointer to > it in task_struct of the current task. Use the __free() annotation to > automatically reset the pointer once function returns. Use a pointer which can > be used by the __free() annotation to avoid invoking the callback the pointer > is NULL. This helps the compiler to optimize the code. > The xdp_storage_set() can nest. For instance local_bh_enable() in > bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which > also uses xdp_storage_set(). Therefore only the first invocations > updates the per-task pointer. > Use xdp_storage_get_ri() as a wrapper to retrieve the current struct > bpf_redirect_info. > > This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the > per-CPU variable instead. This should also work for !PREEMPT_RT but > isn't needed. > > Signed-off-by: Sebastian Andrzej Siewior<bigeasy@linutronix.de> I generally like the idea around bpf_xdp_storage. I only skimmed the code, but noticed some extra if-statements (for !NULL). I don't think they will make a difference, but I know Toke want me to test it... I'll hopefully have time to look at code closer tomorrow. --Jesper ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-13 20:50 ` Jesper Dangaard Brouer @ 2024-02-14 12:19 ` Sebastian Andrzej Siewior 2024-02-14 13:23 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-14 12:19 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote: > I generally like the idea around bpf_xdp_storage. > > I only skimmed the code, but noticed some extra if-statements (for > !NULL). I don't think they will make a difference, but I know Toke want > me to test it... I've been looking at the assembly for the return value of bpf_redirect_info() and there is a NULL pointer check. I hoped it was obvious to be non-NULL because it is a static struct. Should this become a problem, I could add "__attribute__((returns_nonnull))" to the declaration of the function which will optimize the NULL check away. > I'll hopefully have time to look at code closer tomorrow. > > --Jesper Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
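The returns_nonnull attribute mentioned above is a GCC/clang extension; a minimal userspace sketch of the idea (with illustrative names, not the patch's functions) looks like this:

```c
#include <assert.h>
#include <stddef.h>

static int storage_obj;

/* With this attribute the compiler may assume the return value is
 * never NULL and drop redundant NULL checks at the call sites -
 * exactly the optimization discussed in the thread. */
__attribute__((returns_nonnull))
static int *get_storage(void)
{
	/* The address of a static object can never be NULL. */
	return &storage_obj;
}
```

A caller-side check like `if (!get_storage())` can then be eliminated as dead code, which is why the attribute would silence the extra branch Jesper noticed.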
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 12:19 ` Sebastian Andrzej Siewior @ 2024-02-14 13:23 ` Toke Høiland-Jørgensen 2024-02-14 14:28 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2024-02-14 13:23 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Jesper Dangaard Brouer Cc: bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote: >> I generally like the idea around bpf_xdp_storage. >> >> I only skimmed the code, but noticed some extra if-statements (for >> !NULL). I don't think they will make a difference, but I know Toke want >> me to test it... > > I've been looking at the assembly for the return value of > bpf_redirect_info() and there is a NULL pointer check. I hoped it was > obvious to be nun-NULL because it is a static struct. > > Should this become a problem I could add > "__attribute__((returns_nonnull))" to the declaration of the function > which will optimize the NULL check away. If we know the function will never return NULL (I was wondering about that, actually), why have the check in the C code at all? Couldn't we just omit it entirely instead of relying on the compiler to optimise it out? -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 13:23 ` Toke Høiland-Jørgensen @ 2024-02-14 14:28 ` Sebastian Andrzej Siewior 2024-02-14 16:08 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-14 14:28 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote: > Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > > > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote: > >> I generally like the idea around bpf_xdp_storage. > >> > >> I only skimmed the code, but noticed some extra if-statements (for > >> !NULL). I don't think they will make a difference, but I know Toke want > >> me to test it... > > > > I've been looking at the assembly for the return value of > > bpf_redirect_info() and there is a NULL pointer check. I hoped it was > > obvious to be nun-NULL because it is a static struct. > > > > Should this become a problem I could add > > "__attribute__((returns_nonnull))" to the declaration of the function > > which will optimize the NULL check away. > > If we know the function will never return NULL (I was wondering about > that, actually), why have the check in the C code at all? Couldn't we just > omit it entirely instead of relying on the compiler to optimise it out? 
The !RT version does: | static inline struct bpf_redirect_info *xdp_storage_get_ri(void) | { | return this_cpu_ptr(&bpf_redirect_info); | } which is static and can't be NULL (unless by mysterious ways the per-CPU offset + bpf_redirect_info offset is NULL). Maybe I can put this in this_cpu_ptr()… Let me think about it. For RT I have: | static inline struct bpf_xdp_storage *xdp_storage_get(void) | { | struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage; | | WARN_ON_ONCE(!xdp_store); | return xdp_store; | } | | static inline struct bpf_redirect_info *xdp_storage_get_ri(void) | { | struct bpf_xdp_storage *xdp_store = xdp_storage_get(); | | if (!xdp_store) | return NULL; | return &xdp_store->ri; | } so if current->bpf_xdp_storage is NULL then we get a warning and a NULL pointer. This *should* not happen due to xdp_storage_set() which assigns the pointer. However if I missed a spot then there is the check which aborts further processing. During testing I forgot a spot in egress and the test module. You could argue that the warning is enough since it should pop up in testing and not in production, because a missed spot fails deterministically rather than by chance (go boom, send a report). I *think* I covered all spots, at least the test suite didn't point anything out to me. I was unsure if I need something around net_tx_action() due to TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by sch_handle_egress(). > -Toke Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 14:28 ` Sebastian Andrzej Siewior @ 2024-02-14 16:08 ` Toke Høiland-Jørgensen 2024-02-14 16:36 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2024-02-14 16:08 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote: >> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: >> >> > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote: >> >> I generally like the idea around bpf_xdp_storage. >> >> >> >> I only skimmed the code, but noticed some extra if-statements (for >> >> !NULL). I don't think they will make a difference, but I know Toke wants >> >> me to test it... >> > >> > I've been looking at the assembly for the return value of >> > bpf_redirect_info() and there is a NULL pointer check. I hoped it was >> > obvious to be non-NULL because it is a static struct. >> > >> > Should this become a problem I could add >> > "__attribute__((returns_nonnull))" to the declaration of the function >> > which will optimize the NULL check away. >> >> If we know the function will never return NULL (I was wondering about >> that, actually), why have the check in the C code at all? Couldn't we just >> omit it entirely instead of relying on the compiler to optimise it out? 
> > The !RT version does: > | static inline struct bpf_redirect_info *xdp_storage_get_ri(void) > | { > | return this_cpu_ptr(&bpf_redirect_info); > | } > > which is static and can't be NULL (unless by mysterious ways the per-CPU > offset + bpf_redirect_info offset is NULL). Maybe I can put this in > this_cpu_ptr()… Let me think about it. > > For RT I have: > | static inline struct bpf_xdp_storage *xdp_storage_get(void) > | { > | struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage; > | > | WARN_ON_ONCE(!xdp_store); > | return xdp_store; > | } > | > | static inline struct bpf_redirect_info *xdp_storage_get_ri(void) > | { > | struct bpf_xdp_storage *xdp_store = xdp_storage_get(); > | > | if (!xdp_store) > | return NULL; > | return &xdp_store->ri; > | } > > so if current->bpf_xdp_storage is NULL then we get a warning and a NULL > pointer. This *should* not happen due to xdp_storage_set() which > assigns the pointer. However if I missed a spot then there is the check > which aborts further processing. > > During testing I forgot a spot in egress and the test module. You could > argue that the warning is enough since it should pop up in testing and > not production because the code is always missed and not by chance (go > boom, send a report). I *think* I covered all spots, at least the test > suite didn't point anything out to me. Well, I would prefer if we could make sure we covered everything and not have this odd failure mode where redirect just mysteriously stops working. At the very least, if we keep the check we should have a WARN_ON in there to make it really obvious that something needs to be fixed. This brings me to another thing I was going to point out separately, but may as well mention it here: It would be good if we could keep the difference between the RT and !RT versions as small as possible to avoid having subtle bugs that only appear in one configuration. 
I agree with Jesper that the concept of a stack-allocated "run context" for the XDP program makes sense in general (and I have some vague ideas about other things that may be useful to stick in there). So I'm wondering if it makes sense to do that even in the !RT case? We can't stick the pointer to it into 'current' when running in softirq, but we could change the per-cpu variable to just be a pointer that gets populated by xdp_storage_set()? I'm not really sure if this would be performance neutral (it's just moving around a few bits of memory, but we do gain an extra pointer deref), but it should be simple enough to benchmark. > I was unsure if I need something around net_tx_action() due to > TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by > sch_handle_egress(). Yup, I believe you're correct. -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 16:08 ` Toke Høiland-Jørgensen @ 2024-02-14 16:36 ` Sebastian Andrzej Siewior 2024-02-15 20:23 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-14 16:36 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote: > > During testing I forgot a spot in egress and the test module. You could > > argue that the warning is enough since it should pop up in testing and > > not production because the code is always missed and not by chance (go > > boom, send a report). I *think* I covered all spots, at least the test > > suite didn't point anything out to me. > > Well, I would prefer if we could make sure we covered everything and not > have this odd failure mode where redirect just mysteriously stops > working. At the very least, if we keep the check we should have a > WARN_ON in there to make it really obvious that something needs to be > fixed. Agree. > This brings me to another thing I was going to point out separately, but > may as well mention it here: It would be good if we could keep the > difference between the RT and !RT versions as small as possible to avoid > having subtle bugs that only appear in one configuration. Yes. I do so, too. > I agree with Jesper that the concept of a stack-allocated "run context" > for the XDP program makes sense in general (and I have some vague ideas > about other things that may be useful to stick in there). 
So I'm > wondering if it makes sense to do that even in the !RT case? We can't > stick the pointer to it into 'current' when running in softirq, but we > could change the per-cpu variable to just be a pointer that gets > populated by xdp_storage_set()? I *think* that it could be added to current. The assignment currently allows nesting so that is not a problem. Only the outermost set/clear would do something. If you run in softirq, you would hijack a random task_struct. If the pointer is already assigned then the list and so on must be empty because access is only allowed in BH-disabled sections. However, using per-CPU for the pointer (instead of task_struct) would have the advantage that it is always CPU/node local memory while the random task_struct could have been allocated on a different NUMA node. > I'm not really sure if this would be performance neutral (it's just > moving around a few bits of memory, but we do gain an extra pointer > deref), but it should be simple enough to benchmark. My guess is that we remain with one per-CPU dereference and an additional "add offset". That is why I kept the !RT bits as they are before getting yelled at. I could prepare something and run a specific test if you have one. > -Toke Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 16:36 ` Sebastian Andrzej Siewior @ 2024-02-15 20:23 ` Toke Høiland-Jørgensen 2024-02-16 16:57 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2024-02-15 20:23 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote: >> > During testing I forgot a spot in egress and the test module. You could >> > argue that the warning is enough since it should pop up in testing and >> > not production because the code is always missed and not by chance (go >> > boom, send a report). I *think* I covered all spots, at least the test >> > suite didn't point anything out to me. >> >> Well, I would prefer if we could make sure we covered everything and not >> have this odd failure mode where redirect just mysteriously stops >> working. At the very least, if we keep the check we should have a >> WARN_ON in there to make it really obvious that something needs to be >> fixed. > > Agree. > >> This brings me to another thing I was going to point out separately, but >> may as well mention it here: It would be good if we could keep the >> difference between the RT and !RT versions as small as possible to avoid >> having subtle bugs that only appear in one configuration. > > Yes. I do so, too. 
> >> I agree with Jesper that the concept of a stack-allocated "run context" >> for the XDP program makes sense in general (and I have some vague ideas >> about other things that may be useful to stick in there). So I'm >> wondering if it makes sense to do that even in the !RT case? We can't >> stick the pointer to it into 'current' when running in softirq, but we >> could change the per-cpu variable to just be a pointer that gets >> populated by xdp_storage_set()? > > I *think* that it could be added to current. The assignment currently > allows nesting so that is not a problem. Only the outermost set/clear > would do something. If you run in softirq, you would hijack a random > task_struct. If the pointer is already assigned then the list and so on > must be empty because access is only allowed in BH-disabled sections. > > However, using per-CPU for the pointer (instead of task_struct) would > have the advantage that it is always CPU/node local memory while the > random task_struct could have been allocated on a different NUMA node. Ah yes, good point, that's probably desirable :) >> I'm not really sure if this would be performance neutral (it's just >> moving around a few bits of memory, but we do gain an extra pointer >> deref), but it should be simple enough to benchmark. > > My guess is that we remain with one per-CPU dereference and an > additional "add offset". That is why I kept the !RT bits as they are > before getting yelled at. > > I could prepare something and run a specific test if you have one. The test itself is simple enough: Simply run xdp-bench (from xdp-tools[0]) in drop mode, serve some traffic to the machine and observe the difference in PPS before and after the patch. The tricky part is that the traffic actually has to stress the CPU, which means that the offered load has to be higher than what the CPU can handle. 
Which generally means running on high-speed NICs with small packets: a modern server CPU has no problem keeping up with a 10G link even at 64-byte packet size, so a 100G NIC is needed, or the test needs to be run on a low-powered machine. As a traffic generator, the xdp-trafficgen utility also in xdp-tools can be used, or the in-kernel pktgen, or something like T-rex or Moongen. Generally serving UDP traffic with 64-byte packets on a single port is enough to make sure the traffic is serviced by a single CPU, although some configuration may be needed to steer IRQs as well. -Toke [0] https://github.com/xdp-project/xdp-tools ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-15 20:23 ` Toke Høiland-Jørgensen @ 2024-02-16 16:57 ` Sebastian Andrzej Siewior 2024-02-19 19:01 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-16 16:57 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote: > The tricky part is that the traffic actually has to stress the CPU, > which means that the offered load has to be higher than what the CPU can > handle. Which generally means running on high-speed NICs with small > packets: a modern server CPU has no problem keeping up with a 10G link > even at 64-byte packet size, so a 100G NIC is needed, or the test needs > to be run on a low-powered machine. I have 10G box. I can tell cpufreq to go down to 1.1Ghz and I could reduce the queues to one and hope that it is slow enough. > As a traffic generator, the xdp-trafficgen utility also in xdp-tools can > be used, or the in-kernel pktgen, or something like T-rex or Moongen. > Generally serving UDP traffic with 64-byte packets on a single port > is enough to make sure the traffic is serviced by a single CPU, although > some configuration may be needed to steer IRQs as well. I played with xdp-trafficgen: | # xdp-trafficgen udp eno2 -v | Current rlimit is infinity or 0. 
Not raising | Kernel supports 5-arg xdp_cpumap_kthread tracepoint | Error in ethtool ioctl: Operation not supported | Got -95 queues for ifname lo | Kernel supports 5-arg xdp_cpumap_kthread tracepoint | Got 94 queues for ifname eno2 | Transmitting on eno2 (ifindex 3) | lo->eno2 0 err/s 0 xmit/s | lo->eno2 0 err/s 0 xmit/s | lo->eno2 0 err/s 0 xmit/s I even tried to set the MAC address with -M/ -m but nothing happens. I do see something happening on the drop side when I issue a ping command. Does something ring a bell? Otherwise I'll try pktgen. This is a Debian kernel (just to ensure I didn't break something or forget a config switch). > -Toke Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-16 16:57 ` Sebastian Andrzej Siewior @ 2024-02-19 19:01 ` Toke Høiland-Jørgensen 2024-02-20 9:17 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2024-02-19 19:01 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote: >> The tricky part is that the traffic actually has to stress the CPU, >> which means that the offered load has to be higher than what the CPU can >> handle. Which generally means running on high-speed NICs with small >> packets: a modern server CPU has no problem keeping up with a 10G link >> even at 64-byte packet size, so a 100G NIC is needed, or the test needs >> to be run on a low-powered machine. > > I have 10G box. I can tell cpufreq to go down to 1.1Ghz and I could > reduce the queues to one and hope that it is slow enough. Yeah, that may work. As long as the baseline performance is below the ~14Mpps that's 10G line rate for small packets. >> As a traffic generator, the xdp-trafficgen utility also in xdp-tools can >> be used, or the in-kernel pktgen, or something like T-rex or Moongen. >> Generally serving UDP traffic with 64-byte packets on a single port >> is enough to make sure the traffic is serviced by a single CPU, although >> some configuration may be needed to steer IRQs as well. > > I played with xdp-trafficgen: > | # xdp-trafficgen udp eno2 -v > | Current rlimit is infinity or 0. 
Not raising > | Kernel supports 5-arg xdp_cpumap_kthread tracepoint > | Error in ethtool ioctl: Operation not supported > | Got -95 queues for ifname lo > | Kernel supports 5-arg xdp_cpumap_kthread tracepoint > | Got 94 queues for ifname eno2 > | Transmitting on eno2 (ifindex 3) > | lo->eno2 0 err/s 0 xmit/s > | lo->eno2 0 err/s 0 xmit/s > | lo->eno2 0 err/s 0 xmit/s > > I even tried to set the MAC address with -M/ -m but nothing happens. I do > see something happening on the drop side when I issue a ping command. > Does something ring a bell? Otherwise I'll try pktgen. This is a Debian > kernel (just to ensure I didn't break something or forget a config > switch). Hmm, how old a kernel? And on what hardware? xdp-trafficgen requires a relatively new kernel, and the driver needs to support XDP_REDIRECT. It may be simpler to use pktgen, and at 10G rates that shouldn't become a bottleneck either. The pktgen_sample03_burst_single_flow.sh script in samples/pktgen in the kernel source tree is fine for this usage. -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-19 19:01 ` Toke Høiland-Jørgensen @ 2024-02-20 9:17 ` Jesper Dangaard Brouer 2024-02-20 10:17 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Jesper Dangaard Brouer @ 2024-02-20 9:17 UTC (permalink / raw) To: Toke Høiland-Jørgensen, Sebastian Andrzej Siewior; +Cc: bpf, netdev On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote: > may be simpler to use pktgen, and at 10G rates that shouldn't become a > bottleneck either. The pktgen_sample03_burst_single_flow.sh script in > samples/pktgen in the kernel source tree is fine for this usage. Example of running script: ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m ec:0d:9a:db:11:c4 -t 12 Notice the last parameter, which is number threads to start. If you have a ixgbe NIC driver, then I recommend -t 2 even if you have more CPUs. --Jesper ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 9:17 ` Jesper Dangaard Brouer @ 2024-02-20 10:17 ` Sebastian Andrzej Siewior 2024-02-20 10:42 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-20 10:17 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote: > > > On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote: > > may be simpler to use pktgen, and at 10G rates that shouldn't become a > > bottleneck either. The pktgen_sample03_burst_single_flow.sh script in > > samples/pktgen in the kernel source tree is fine for this usage. > > Example of running script: > ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m > ec:0d:9a:db:11:c4 -t 12 > > Notice the last parameter, which is number threads to start. > If you have a ixgbe NIC driver, then I recommend -t 2 even if you have more > CPUs. I get | Summary 8,435,690 rx/s 0 err/s | Summary 8,436,294 rx/s 0 err/s with "-t 8 -b 64". I started with 2 and then increased until rx/s was falling again. I have ixgbe on the sending side and i40e on the receiving side. I tried to receive on ixgbe but this ended with -ENOMEM | # xdp-bench drop eth1 | Failed to attach XDP program: Cannot allocate memory This is v6.8-rc5 on both sides. Let me see where this is coming from… > --Jesper Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 10:17 ` Sebastian Andrzej Siewior @ 2024-02-20 10:42 ` Jesper Dangaard Brouer 2024-02-20 12:08 ` Sebastian Andrzej Siewior 2024-02-20 12:10 ` Dave Taht 0 siblings, 2 replies; 25+ messages in thread From: Jesper Dangaard Brouer @ 2024-02-20 10:42 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 20/02/2024 11.17, Sebastian Andrzej Siewior wrote: > On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote: >> >> >> On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote: >>> may be simpler to use pktgen, and at 10G rates that shouldn't become a >>> bottleneck either. The pktgen_sample03_burst_single_flow.sh script in >>> samples/pktgen in the kernel source tree is fine for this usage. >> >> Example of running script: >> ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 -m >> ec:0d:9a:db:11:c4 -t 12 >> >> Notice the last parameter, which is number threads to start. >> If you have a ixgbe NIC driver, then I recommend -t 2 even if you have more >> CPUs. > > I get > | Summary 8,435,690 rx/s 0 err/s This seems low... Have you remembered to disable Ethernet flow-control? # ethtool -A ixgbe1 rx off tx off # ethtool -A i40e2 rx off tx off > | Summary 8,436,294 rx/s 0 err/s You want to see the "extended" info via cmdline (or Ctrl+\) # xdp-bench drop -e eth1 > > with "-t 8 -b 64". I started with 2 and then increased until rx/s was > falling again. I have ixgbe on the sending side and i40e on the With ixgbe on the sending side, my testlab shows I need -t 2. 
With -t 2 : Summary 14,678,170 rx/s 0 err/s receive total 14,678,170 pkt/s 14,678,170 drop/s 0 error/s cpu:1 14,678,170 pkt/s 14,678,170 drop/s 0 error/s xdp_exception 0 hit/s with -t 4: Summary 10,255,385 rx/s 0 err/s receive total 10,255,385 pkt/s 10,255,385 drop/s 0 error/s cpu:1 10,255,385 pkt/s 10,255,385 drop/s 0 error/s xdp_exception 0 hit/s > receiving side. I tried to receive on ixgbe but this ended with -ENOMEM > | # xdp-bench drop eth1 > | Failed to attach XDP program: Cannot allocate memory > > This is v6.8-rc5 on both sides. Let me see where this is coming from… > Another pitfall with ixgbe is that it does a full link reset when adding/removing XDP prog on device. This can be annoying if connected back-to-back, because "remote" pktgen will stop on link reset. --Jesper ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 10:42 ` Jesper Dangaard Brouer @ 2024-02-20 12:08 ` Sebastian Andrzej Siewior 2024-02-20 12:57 ` Jesper Dangaard Brouer 2024-02-20 12:10 ` Dave Taht 1 sibling, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-20 12:08 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote: > This seems low... > Have you remembered to disable Ethernet flow-control? No but one side says: | i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None but I did this > # ethtool -A ixgbe1 rx off tx off > # ethtool -A i40e2 rx off tx off and it didn't change much. > > > | Summary 8,436,294 rx/s 0 err/s > > You want to see the "extended" info via cmdline (or Ctrl+\) > > # xdp-bench drop -e eth1 > > > > > > with "-t 8 -b 64". I started with 2 and then increased until rx/s was > > falling again. I have ixgbe on the sending side and i40e on the > > With ixgbe on the sending side, my testlab shows I need -t 2. > > With -t 2 : > Summary 14,678,170 rx/s 0 err/s > receive total 14,678,170 pkt/s 14,678,170 drop/s 0 > error/s > cpu:1 14,678,170 pkt/s 14,678,170 drop/s 0 > error/s > xdp_exception 0 hit/s > > with -t 4: > > Summary 10,255,385 rx/s 0 err/s > receive total 10,255,385 pkt/s 10,255,385 drop/s 0 > error/s > cpu:1 10,255,385 pkt/s 10,255,385 drop/s 0 > error/s > xdp_exception 0 hit/s > > > receiving side. I tried to receive on ixgbe but this ended with -ENOMEM > > | # xdp-bench drop eth1 > > | Failed to attach XDP program: Cannot allocate memory > > > > This is v6.8-rc5 on both sides. Let me see where this is coming from… > > > > Another pitfall with ixgbe is that it does a full link reset when > adding/removing XDP prog on device. This can be annoying if connected > back-to-back, because "remote" pktgen will stop on link reset. 
so I replaced nr_cpu_ids with 64 and booted maxcpus=64 so that I can run xdp-bench on the ixgbe. So: i40e sends, ixgbe receives. -t 2 | Summary 2,348,800 rx/s 0 err/s | receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s | cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s | xdp_exception 0 hit/s -t 4 | Summary 4,158,199 rx/s 0 err/s | receive total 4,158,199 pkt/s 4,158,199 drop/s 0 error/s | cpu:0 4,158,199 pkt/s 4,158,199 drop/s 0 error/s | xdp_exception 0 hit/s -t 8 | Summary 5,612,861 rx/s 0 err/s | receive total 5,612,861 pkt/s 5,612,861 drop/s 0 error/s | cpu:0 5,612,861 pkt/s 5,612,861 drop/s 0 error/s | xdp_exception 0 hit/s going higher makes the rate drop. With 8 it floats between 5,5… 5,7… Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40e makes it worse, using the default 512/512 gets the numbers from above, going below 256 makes it worse. receiving on i40e, sending on ixgbe: -t 2 | Summary 3,042,957 rx/s 0 err/s | receive total 3,042,957 pkt/s 3,042,957 drop/s 0 error/s | cpu:60 3,042,957 pkt/s 3,042,957 drop/s 0 error/s | xdp_exception 0 hit/s -t 4 | Summary 5,442,166 rx/s 0 err/s | receive total 5,442,166 pkt/s 5,442,166 drop/s 0 error/s | cpu:60 5,442,166 pkt/s 5,442,166 drop/s 0 error/s | xdp_exception 0 hit/s -t 6 | Summary 7,023,406 rx/s 0 err/s | receive total 7,023,406 pkt/s 7,023,406 drop/s 0 error/s | cpu:60 7,023,406 pkt/s 7,023,406 drop/s 0 error/s | xdp_exception 0 hit/s -t 8 | Summary 7,540,915 rx/s 0 err/s | receive total 7,540,915 pkt/s 7,540,915 drop/s 0 error/s | cpu:60 7,540,915 pkt/s 7,540,915 drop/s 0 error/s | xdp_exception 0 hit/s -t 10 | Summary 7,699,143 rx/s 0 err/s | receive total 7,699,143 pkt/s 7,699,143 drop/s 0 error/s | cpu:60 7,699,143 pkt/s 7,699,143 drop/s 0 error/s | xdp_exception 0 hit/s -t 18 | Summary 7,784,946 rx/s 0 err/s | receive total 7,784,946 pkt/s 7,784,946 drop/s 0 error/s | cpu:60 7,784,946 pkt/s 7,784,946 drop/s 0 error/s | xdp_exception 0 hit/s after -t 18 it drops down to 2,… Now I got worse results than before,
since -t 8 says 7,5… and it did 8,4 in the morning. Do you maybe have a .config for me, in case I did not enable some performance switch? > --Jesper Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 12:08 ` Sebastian Andrzej Siewior @ 2024-02-20 12:57 ` Jesper Dangaard Brouer 2024-02-20 15:32 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Jesper Dangaard Brouer @ 2024-02-20 12:57 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 20/02/2024 13.08, Sebastian Andrzej Siewior wrote: > On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote: >> This seems low... >> Have you remembered to disable Ethernet flow-control? > > No but one side says: > | i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None > > but I did this > >> # ethtool -A ixgbe1 rx off tx off >> # ethtool -A i40e2 rx off tx off > > and it didn't change much. > >> >>> | Summary 8,436,294 rx/s 0 err/s >> >> You want to see the "extended" info via cmdline (or Ctrl+\) >> >> # xdp-bench drop -e eth1 >> >> >>> >>> with "-t 8 -b 64". I started with 2 and then increased until rx/s was >>> falling again. I have ixgbe on the sending side and i40e on the >> >> With ixgbe on the sending side, my testlab shows I need -t 2. >> >> With -t 2 : >> Summary 14,678,170 rx/s 0 err/s >> receive total 14,678,170 pkt/s 14,678,170 drop/s 0 >> error/s >> cpu:1 14,678,170 pkt/s 14,678,170 drop/s 0 >> error/s >> xdp_exception 0 hit/s >> >> with -t 4: >> >> Summary 10,255,385 rx/s 0 err/s >> receive total 10,255,385 pkt/s 10,255,385 drop/s 0 >> error/s >> cpu:1 10,255,385 pkt/s 10,255,385 drop/s 0 >> error/s >> xdp_exception 0 hit/s >> >>> receiving side. I tried to receive on ixgbe but this ended with -ENOMEM >>> | # xdp-bench drop eth1 >>> | Failed to attach XDP program: Cannot allocate memory >>> >>> This is v6.8-rc5 on both sides. Let me see where this is coming from… >>> >> >> Another pitfall with ixgbe is that it does a full link reset when >> adding/removing XDP prog on device. 
This can be annoying if connected >> back-to-back, because "remote" pktgen will stop on link reset. > > so I replaced nr_cpu_ids with 64 and booted maxcpus=64 so that I can run > xdp-bench on the ixgbe. > Yes, ixgbe HW has limited TX queues, and XDP tries to allocate a hardware TX queue for every CPU in the system. So, I guess you have too many CPUs in your system - lol. Other drivers have a fallback to a locked XDP TX path, so this is also something to look out for in the machine with i40e. > So: i40e sends, ixgbe receives. > > -t 2 > > | Summary 2,348,800 rx/s 0 err/s > | receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s > | cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s > | xdp_exception 0 hit/s > This is way too low, with i40e sending. On my system with only -t 1 my i40e driver can send with approx 15Mpps: Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec > -t 4 > | Summary 4,158,199 rx/s 0 err/s > | receive total 4,158,199 pkt/s 4,158,199 drop/s 0 error/s > | cpu:0 4,158,199 pkt/s 4,158,199 drop/s 0 error/s > | xdp_exception 0 hit/s > Do notice that this is all hitting CPU:0 (this is by design from pktgen_sample03_burst_single_flow.sh) > -t 8 > | Summary 5,612,861 rx/s 0 err/s > | receive total 5,612,861 pkt/s 5,612,861 drop/s 0 error/s > | cpu:0 5,612,861 pkt/s 5,612,861 drop/s 0 error/s > | xdp_exception 0 hit/s > > going higher makes the rate drop. With 8 it floats between 5,5… 5,7… > At -t 8 we seem to have hit a limit on the RX side, which also seems too low. I recommend checking what speed the packet generator is actually sending at. I use this tool: ethtool_stats.pl https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl What we are basically asking: make sure the packet generator is not the limiting factor in your tests, as we need to make sure the DUT (Device Under Test) is being overloaded 100%. 
I often also check (via perf record) that the DUT doesn't have idle CPU cycles (yes, this can easily happen... and happens when we hit a limit in hardware, either the NIC or the PCIe slot) > Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40e makes it worse, > using the default 512/512 gets the numbers from above, going below 256 > makes it worse. > > receiving on i40e, sending on ixgbe: > > -t 2 > | Summary 3,042,957 rx/s 0 err/s > | receive total 3,042,957 pkt/s 3,042,957 drop/s 0 error/s > | cpu:60 3,042,957 pkt/s 3,042,957 drop/s 0 error/s > | xdp_exception 0 hit/s > > -t 4 > | Summary 5,442,166 rx/s 0 err/s > | receive total 5,442,166 pkt/s 5,442,166 drop/s 0 error/s > | cpu:60 5,442,166 pkt/s 5,442,166 drop/s 0 error/s > | xdp_exception 0 hit/s > > > -t 6 > | Summary 7,023,406 rx/s 0 err/s > | receive total 7,023,406 pkt/s 7,023,406 drop/s 0 error/s > | cpu:60 7,023,406 pkt/s 7,023,406 drop/s 0 error/s > | xdp_exception 0 hit/s > > > -t 8 > | Summary 7,540,915 rx/s 0 err/s > | receive total 7,540,915 pkt/s 7,540,915 drop/s 0 error/s > | cpu:60 7,540,915 pkt/s 7,540,915 drop/s 0 error/s > | xdp_exception 0 hit/s > > -t 10 > | Summary 7,699,143 rx/s 0 err/s > | receive total 7,699,143 pkt/s 7,699,143 drop/s 0 error/s > | cpu:60 7,699,143 pkt/s 7,699,143 drop/s 0 error/s > | xdp_exception 0 hit/s > At this level, if you can verify that CPU:60 is 100% loaded, and the packet generator is sending more than the rx number, then it could work as a valid experiment. > -t 18 > | Summary 7,784,946 rx/s 0 err/s > | receive total 7,784,946 pkt/s 7,784,946 drop/s 0 error/s > | cpu:60 7,784,946 pkt/s 7,784,946 drop/s 0 error/s > | xdp_exception 0 hit/s > > after -t 18 it drops down to 2,… > Now I got worse results than before, since -t 8 says 7,5… and it did 8,4 in the > morning. Do you maybe have a .config for me, in case I did not enable some > performance switch? 
> I would look for root-cause with perf record + perf report --sort cpu,comm,dso,symbol --no-children --Jesper ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 12:57 ` Jesper Dangaard Brouer @ 2024-02-20 15:32 ` Sebastian Andrzej Siewior 2024-02-22 9:22 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-20 15:32 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 2024-02-20 13:57:24 [+0100], Jesper Dangaard Brouer wrote: > > so I replaced nr_cpu_ids with 64 and bootet maxcpus=64 so that I can run > > xdp-bench on the ixgbe. > > > > Yes, ixgbe HW have limited TX queues, and XDP tries to allocate a > hardware TX queue for every CPU in the system. So, I guess you have too > many CPUs in your system - lol. > > Other drivers have a fallback to a locked XDP TX path, so this is also > something to lookout for in the machine with i40e. this locked XDP TX path starts at 64 but xdp_progs are rejected > 64 * 2. > > so. i40 send, ixgbe receive. > > > > -t 2 > > > > | Summary 2,348,800 rx/s 0 err/s > > | receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s > > | cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s > > | xdp_exception 0 hit/s > > > > This is way too low, with i40e sending. > > On my system with only -t 1 my i40e driver can send with approx 15Mpps: > > Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec > Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec -t1 in ixgbe Show adapter(s) (eth1) statistics (ONLY that changed!) 
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec

-t i40e
Ethtool(eno2np1 ) stat:           90 (             90) <= port.rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_size_127 /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_unicast /sec
Ethtool(eno2np1 ) stat:     79554379 (     79,554,379) <= port.tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_size_64 /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_unicast /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx-32.bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx-32.packets /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_cache_waive /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_packets /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_unicast /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx-0.bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx-0.packets /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx_packets /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= tx_unicast /sec

mine is slightly slower. But this seems to match what I see on the RX
side.

> At this level, if you can verify that CPU:60 is 100% loaded, and packet
> generator is sending more than rx number, then it could work as a valid
> experiment.

i40e receiving on 8:
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi, 15.2 si,  0.0 st

ixgbe receiving on 13:
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni, 56.7 id,  0.0 wa,  0.0 hi, 43.3 si,  0.0 st

looks idle. On the sending side kpktgend_0 is always at 100%.
> > -t 18
> > | Summary                 7,784,946 rx/s            0 err/s
> > |   receive total         7,784,946 pkt/s   7,784,946 drop/s   0 error/s
> > |     cpu:60              7,784,946 pkt/s   7,784,946 drop/s   0 error/s
> > |   xdp_exception                 0 hit/s
> >
> > after t18 it drop down to 2,…
> > Now I got worse than before since -t8 says 7,5… and it did 8,4 in the
> > morning. Do you have maybe a .config for me in case I did not enable the
> > performance switch?
>
> I would look for root-cause with perf record +
>  perf report --sort cpu,comm,dso,symbol --no-children

while sending with ixgbe while running perf top on the box:

| Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
| Overhead  CPU  Command     Shared Object  Symbol
|  31.98%  000  kpktgend_0  [kernel]       [k] xas_find
|   6.72%  000  kpktgend_0  [kernel]       [k] pfn_to_dma_pte
|   5.63%  000  kpktgend_0  [kernel]       [k] ixgbe_xmit_frame_ring
|   4.78%  000  kpktgend_0  [kernel]       [k] dma_pte_clear_level
|   3.16%  000  kpktgend_0  [kernel]       [k] __iommu_dma_unmap
|   2.30%  000  kpktgend_0  [kernel]       [k] fq_ring_free_locked
|   1.99%  000  kpktgend_0  [kernel]       [k] __domain_mapping
|   1.82%  000  kpktgend_0  [kernel]       [k] iommu_dma_alloc_iova
|   1.80%  000  kpktgend_0  [kernel]       [k] __iommu_map
|   1.72%  000  kpktgend_0  [kernel]       [k] iommu_pgsize.isra.0
|   1.70%  000  kpktgend_0  [kernel]       [k] __iommu_dma_map
|   1.63%  000  kpktgend_0  [kernel]       [k] alloc_iova_fast
|   1.59%  000  kpktgend_0  [kernel]       [k] _raw_spin_lock_irqsave
|   1.32%  000  kpktgend_0  [kernel]       [k] iommu_map
|   1.30%  000  kpktgend_0  [kernel]       [k] iommu_dma_map_page
|   1.23%  000  kpktgend_0  [kernel]       [k] intel_iommu_iotlb_sync_map
|   1.21%  000  kpktgend_0  [kernel]       [k] xa_find_after
|   1.17%  000  kpktgend_0  [kernel]       [k] ixgbe_poll
|   1.06%  000  kpktgend_0  [kernel]       [k] __iommu_unmap
|   1.04%  000  kpktgend_0  [kernel]       [k] intel_iommu_unmap_pages
|   1.01%  000  kpktgend_0  [kernel]       [k] free_iova_fast
|   0.96%  000  kpktgend_0  [pktgen]       [k] pktgen_thread_worker

the i40e box while sending:

| Samples: 400K of event 'cycles:P', 4000 Hz, Event count (approx.): 80512443924 lost: 0/0 drop: 0/0
| Overhead  CPU  Command     Shared Object  Symbol
|  24.04%  000  kpktgend_0  [kernel]       [k] i40e_lan_xmit_frame
|  17.20%  019  swapper     [kernel]       [k] i40e_napi_poll
|   4.84%  019  swapper     [kernel]       [k] intel_idle_irq
|   4.20%  019  swapper     [kernel]       [k] napi_consume_skb
|   3.00%  000  kpktgend_0  [pktgen]       [k] pktgen_thread_worker
|   2.76%  008  swapper     [kernel]       [k] i40e_napi_poll
|   2.36%  000  kpktgend_0  [kernel]       [k] dma_map_page_attrs
|   1.93%  019  swapper     [kernel]       [k] dma_unmap_page_attrs
|   1.70%  008  swapper     [kernel]       [k] intel_idle_irq
|   1.44%  008  swapper     [kernel]       [k] __udp4_lib_rcv
|   1.44%  008  swapper     [kernel]       [k] __netif_receive_skb_core.constprop.0
|   1.40%  008  swapper     [kernel]       [k] napi_build_skb
|   1.28%  000  kpktgend_0  [kernel]       [k] kfree_skb_reason
|   1.27%  008  swapper     [kernel]       [k] ip_rcv_core
|   1.19%  008  swapper     [kernel]       [k] inet_gro_receive
|   1.01%  008  swapper     [kernel]       [k] kmem_cache_free.part.0

> --Jesper

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 15:32 ` Sebastian Andrzej Siewior @ 2024-02-22 9:22 ` Sebastian Andrzej Siewior 2024-02-22 10:10 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-22 9:22 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote: > > > > Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec > > Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec > > -t1 in ixgbe > Show adapter(s) (eth1) statistics (ONLY that changed!) > Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_bytes /sec > Ethtool(eth1 ) stat: 115047684 ( 115,047,684) <= tx_bytes_nic /sec > Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_packets /sec > Ethtool(eth1 ) stat: 1797636 ( 1,797,636) <= tx_pkts_nic /sec > Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_queue_0_bytes /sec > Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_queue_0_packets /sec … > while sending with ixgbe while running perf top on the box: > | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0 > | Overhead CPU Command Shared Object Symbol > | 31.98% 000 kpktgend_0 [kernel] [k] xas_find > | 6.72% 000 kpktgend_0 [kernel] [k] pfn_to_dma_pte > | 5.63% 000 kpktgend_0 [kernel] [k] ixgbe_xmit_frame_ring > | 4.78% 000 kpktgend_0 [kernel] [k] dma_pte_clear_level > | 3.16% 000 kpktgend_0 [kernel] [k] __iommu_dma_unmap I disabled the iommu and I get to Ethtool(eth1 ) stat: 14158562 ( 14,158,562) <= tx_packets /sec Ethtool(eth1 ) stat: 14158685 ( 14,158,685) <= tx_pkts_nic /sec looks like a small improvement… It is not your 15 but close. -t2 does improve the situation. There is a warning from DMA mapping code but ;) > > --Jesper Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-02-22  9:22 ` Sebastian Andrzej Siewior
@ 2024-02-22 10:10 ` Jesper Dangaard Brouer
  2024-02-22 10:58 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-22 10:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev

On 22/02/2024 10.22, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote:
>>>
>>> Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
>>> Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec
>>
>> -t1 in ixgbe
>> Show adapter(s) (eth1) statistics (ONLY that changed!)
>> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
>> Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
>> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
>> Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
>> Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
>> Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec
> …
>> while sending with ixgbe while running perf top on the box:
>> | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
>> | Overhead  CPU  Command     Shared Object  Symbol
>> |  31.98%  000  kpktgend_0  [kernel]       [k] xas_find
>> |   6.72%  000  kpktgend_0  [kernel]       [k] pfn_to_dma_pte
>> |   5.63%  000  kpktgend_0  [kernel]       [k] ixgbe_xmit_frame_ring
>> |   4.78%  000  kpktgend_0  [kernel]       [k] dma_pte_clear_level
>> |   3.16%  000  kpktgend_0  [kernel]       [k] __iommu_dma_unmap
>
> I disabled the iommu and I get to

Yes, it is clearly the IOMMU code that causes the performance issue for
you.

This driver doesn't use page_pool, so I want to point out (for people
finding this post in the future) that page_pool keeps DMA mappings for
recycled frames, which should address the IOMMU overhead issue here.
>
> Ethtool(eth1    ) stat:     14158562 (     14,158,562) <= tx_packets /sec
> Ethtool(eth1    ) stat:     14158685 (     14,158,685) <= tx_pkts_nic /sec
>
> looks like a small improvement… It is not your 15 but close. -t2 does
> improve the situation.

You cannot reach 15Mpps on 10Gbit/s as wirespeed for 10G is 14.88Mpps.

Congratulations, I think this 14.15 Mpps is as close to wirespeed as is
possible on your hardware.

BTW what CPU are you using?

> There is a warning from DMA mapping code but ;)

It is a warning from IOMMU code?
It usually means there is a real DMA unmap bug (which we should fix).

--Jesper

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-02-22 10:10 ` Jesper Dangaard Brouer
@ 2024-02-22 10:58 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-22 10:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev

On 2024-02-22 11:10:44 [+0100], Jesper Dangaard Brouer wrote:
> > Ethtool(eth1    ) stat:     14158562 (     14,158,562) <= tx_packets /sec
> > Ethtool(eth1    ) stat:     14158685 (     14,158,685) <= tx_pkts_nic /sec
> >
> > looks like a small improvement… It is not your 15 but close. -t2 does
> > improve the situation.
>
> You cannot reach 15Mpps on 10Gbit/s as wirespeed for 10G is 14.88Mpps.

Oh, my bad.

> Congratulations, I think this 14.15 Mpps is as close to wirespeed as it
> possible on your hardware.
>
> BTW what CPU are you using?

"Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz"

The "performance" governor is used, I lowered the number of CPUs and
disabled SMT.

> > There is a warning from DMA mapping code but ;)
>
> It is a warning from IOMMU code?
> It usually means there is a real DMA unmap bug (which we should fix).

Not sure, I don't think so:

| ------------[ cut here ]------------
| ehci-pci 0000:00:1a.0: DMA addr 0x0000000105016ce8+8 overflow (mask ffffffff, bus limit 0).
| WARNING: CPU: 0 PID: 1029 at kernel/dma/direct.h:105 dma_map_page_attrs+0x1e8/0x1f0
| RIP: 0010:dma_map_page_attrs+0x1e8/0x1f0
| Call Trace:
|  <TASK>
|  usb_hcd_map_urb_for_dma+0x1b0/0x4d0
|  usb_hcd_submit_urb+0x342/0x9b0
|  usb_start_wait_urb+0x50/0xc0
|  usb_control_msg+0xc8/0x110
|  get_bMaxPacketSize0+0x5a/0xb0

and USB isn't working. I *think* it is the "memory above 4G" thing, not
sure where it took the wrong turn.

> --Jesper

Sebastian

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-20 10:42 ` Jesper Dangaard Brouer 2024-02-20 12:08 ` Sebastian Andrzej Siewior @ 2024-02-20 12:10 ` Dave Taht 1 sibling, 0 replies; 25+ messages in thread From: Dave Taht @ 2024-02-20 12:10 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Sebastian Andrzej Siewior, Toke Høiland-Jørgensen, bpf, netdev I am actually rather interested in what granularity ethernet flow control takes place at these days at these speeds. Anyone know? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
  2024-02-13 20:50 ` Jesper Dangaard Brouer
@ 2024-02-14 16:13 ` Toke Høiland-Jørgensen
  2024-02-15  9:04 ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-14 16:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, bpf, netdev
  Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
	Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
	Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
	Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
	Sebastian Andrzej Siewior

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects
>   the packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback, hence
> disabling bottom halves is the only protection mechanism. Users from
> preemptible context (like cpu_map_kthread_run()) explicitly disable
> bottom halves for protection reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking to avoid corruption if preemption occurs.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided and data corruption is also
> avoided.
>
> Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
> Define the variable on stack, use xdp_storage_set() to set a pointer to
> it in the task_struct of the current task. Use the __free() annotation
> to automatically reset the pointer once the function returns. Use a
> pointer which can be used by the __free() annotation to avoid invoking
> the callback when the pointer is NULL. This helps the compiler to
> optimize the code.
> The xdp_storage_set() can nest. For instance local_bh_enable() in
> bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/net_rx_action() which
> also uses xdp_storage_set(). Therefore only the first invocation
> updates the per-task pointer.
> Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
> per-CPU variable instead. This should also work for !PREEMPT_RT but
> isn't needed.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index de362d5f26559..c3f7d2a6b6134 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>  			  struct net_device *orig_dev, bool *another)
>  {
>  	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
> +	struct bpf_xdp_storage __xdp_store;
>  	int sch_ret;
>  
>  	if (!entry)
>  		return skb;
> +
> +	xdp_store = xdp_storage_set(&__xdp_store);
>  	if (*pt_prev) {
>  		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>  		*pt_prev = NULL;
> @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
>  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
>  	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> +	struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>  	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
> +	struct bpf_xdp_storage __xdp_store;
>  	int sch_ret;
>  
>  	if (!entry)
>  		return skb;
>  
> +	xdp_store = xdp_storage_set(&__xdp_store);
> +
>  	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>  	 * already set by the caller.
>  	 */

These, and the LWT code, don't actually have anything to do with XDP,
which indicates that the 'xdp_storage' name is misleading. Maybe
'bpf_net_context' or something along those lines? Or maybe we could just
move the flush lists into bpf_redirect_info itself and just keep that as
the top-level name?

-Toke

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-14 16:13 ` Toke Høiland-Jørgensen @ 2024-02-15 9:04 ` Sebastian Andrzej Siewior 2024-02-15 12:11 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-15 9:04 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote: > > diff --git a/net/core/dev.c b/net/core/dev.c > > index de362d5f26559..c3f7d2a6b6134 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff * > > sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > > { > > struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress); > > + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; > > enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS; > > + struct bpf_xdp_storage __xdp_store; > > int sch_ret; > > > > if (!entry) > > return skb; > > > > + xdp_store = xdp_storage_set(&__xdp_store); > > + > > /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was > > * already set by the caller. > > */ > > > These, and the LWT code, don't actually have anything to do with XDP, > which indicates that the 'xdp_storage' name misleading. Maybe > 'bpf_net_context' or something along those lines? Or maybe we could just > move the flush lists into bpf_redirect_info itself and just keep that as > the top-level name? I'm going to rename it for now as suggested for now. 
If it is a better fit to include the lists into bpf_redirect_info then I will update accordingly. > -Toke Sebastian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. 2024-02-15 9:04 ` Sebastian Andrzej Siewior @ 2024-02-15 12:11 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 25+ messages in thread From: Toke Høiland-Jørgensen @ 2024-02-15 12:11 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: bpf, netdev, Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote: >> > diff --git a/net/core/dev.c b/net/core/dev.c >> > index de362d5f26559..c3f7d2a6b6134 100644 >> > --- a/net/core/dev.c >> > +++ b/net/core/dev.c >> > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff * >> > sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) >> > { >> > struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress); >> > + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL; >> > enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS; >> > + struct bpf_xdp_storage __xdp_store; >> > int sch_ret; >> > >> > if (!entry) >> > return skb; >> > >> > + xdp_store = xdp_storage_set(&__xdp_store); >> > + >> > /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was >> > * already set by the caller. >> > */ >> >> >> These, and the LWT code, don't actually have anything to do with XDP, >> which indicates that the 'xdp_storage' name misleading. Maybe >> 'bpf_net_context' or something along those lines? Or maybe we could just >> move the flush lists into bpf_redirect_info itself and just keep that as >> the top-level name? > > I'm going to rename it for now as suggested for now. 
If it is a better > fit to include the lists into bpf_redirect_info then I will update > accordingly. OK, SGTM. -Toke ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage on PREEMPT_RT. 2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior 2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior @ 2024-02-13 14:58 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 25+ messages in thread From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw) To: bpf, netdev Cc: Björn Töpel, David S. Miller, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song, Sebastian Andrzej Siewior The per-CPU flush lists are accessed from within the NAPI callback (xdp_do_flush() for instance). They are subject to the same problem as struct bpf_redirect_info. Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and xskmap_map_flush_list to struct bpf_xdp_storage. Add wrappers for the access. Use it only on PREEMPT_RT, keep the per-CPU lists on non-PREEMPT_RT builds. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 34 ++++++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c    | 33 ++++++++++++++++++++++++++-------
 kernel/bpf/devmap.c    | 33 ++++++++++++++++++++++++++-------
 net/xdp/xsk.c          | 33 +++++++++++++++++++++++++------
 4 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 97c9be9cabfd6..231ecdc431a00 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -725,6 +725,9 @@ static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
 
 struct bpf_xdp_storage {
 	struct bpf_redirect_info ri;
+	struct list_head cpu_map_flush_list;
+	struct list_head dev_map_flush_list;
+	struct list_head xskmap_map_flush_list;
 };
 
 static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
@@ -734,6 +737,9 @@ static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xd
 	tsk = current;
 	if (tsk->bpf_xdp_storage != NULL)
 		return NULL;
+	INIT_LIST_HEAD(&xdp_store->cpu_map_flush_list);
+	INIT_LIST_HEAD(&xdp_store->dev_map_flush_list);
+	INIT_LIST_HEAD(&xdp_store->xskmap_map_flush_list);
 	tsk->bpf_xdp_storage = xdp_store;
 	return xdp_store;
 }
@@ -764,6 +770,34 @@ static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
 		return NULL;
 	return &xdp_store->ri;
 }
+
+static inline struct list_head *xdp_storage_get_cpu_map_flush_list(void)
+{
+	struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+	if (!xdp_store)
+		return NULL;
+	return &xdp_store->cpu_map_flush_list;
+}
+
+static inline struct list_head *xdp_storage_get_dev_flush_list(void)
+{
+	struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+	if (!xdp_store)
+		return NULL;
+	return &xdp_store->dev_map_flush_list;
+}
+
+static inline struct list_head *xdp_storage_get_xskmap_flush_list(void)
+{
+	struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+	if (!xdp_store)
+		return NULL;
+	return &xdp_store->xskmap_map_flush_list;
+}
+
 #endif
 
 DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *,
	    if (_T) xdp_storage_clear(_T));

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index c40ae831ab1a6..ec5be37399c82 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -78,8 +78,30 @@ struct bpf_cpu_map {
 	struct bpf_cpu_map_entry __rcu **cpu_map;
 };
 
+#ifndef CONFIG_PREEMPT_RT
 static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
 
+static struct list_head *xdp_storage_get_cpu_map_flush_list(void)
+{
+	return this_cpu_ptr(&cpu_map_flush_list);
+}
+
+static void init_cpu_map_flush_list(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
+}
+
+#else
+
+static void init_cpu_map_flush_list(void)
+{
+}
+
+#endif
+
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
@@ -703,7 +725,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = xdp_storage_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -755,7 +777,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 
 void __cpu_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = xdp_storage_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -769,7 +791,7 @@ void __cpu_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool cpu_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+	if (list_empty(xdp_storage_get_cpu_map_flush_list()))
 		return false;
 	__cpu_map_flush();
 	return true;
@@ -778,10 +800,7 @@ bool cpu_map_check_flush(void)
 
 static int __init cpu_map_init(void)
 {
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
+	init_cpu_map_flush_list();
 	return 0;
 }

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a936c704d4e77..c1af5d9d60381 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,29 @@ struct bpf_dtab {
 	u32 n_buckets;
 };
 
+#ifndef CONFIG_PREEMPT_RT
 static DEFINE_PER_CPU(struct list_head, dev_flush_list);
+
+static struct list_head *xdp_storage_get_dev_flush_list(void)
+{
+	return this_cpu_ptr(&dev_flush_list);
+}
+
+static void init_dev_flush_list(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
+}
+
+#else
+
+static void init_dev_flush_list(void)
+{
+}
+#endif
+
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
@@ -407,7 +429,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
  */
 void __dev_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = xdp_storage_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -421,7 +443,7 @@ void __dev_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool dev_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&dev_flush_list)))
+	if (list_empty(xdp_storage_get_dev_flush_list()))
 		return false;
 	__dev_flush();
 	return true;
@@ -452,7 +474,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = xdp_storage_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1155,15 +1177,12 @@ static struct notifier_block dev_map_notifier = {
 
 static int __init dev_map_init(void)
 {
-	int cpu;
-
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
		     offsetof(struct _bpf_dtab_netdev, dev));
 	register_netdevice_notifier(&dev_map_notifier);
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
+	init_dev_flush_list();
 	return 0;
 }

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b78c0e095e221..3050739cfe1e0 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,30 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
 
+#ifndef CONFIG_PREEMPT_RT
 static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
 
+static struct list_head *xdp_storage_get_xskmap_flush_list(void)
+{
+	return this_cpu_ptr(&xskmap_flush_list);
+}
+
+static void init_xskmap_flush_list(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
+}
+
+#else
+
+static void init_xskmap_flush_list(void)
+{
+}
+
+#endif
+
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -372,7 +394,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = xdp_storage_get_xskmap_flush_list();
 	int err;
 
 	err = xsk_rcv(xs, xdp);
@@ -387,7 +409,7 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 void __xsk_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = xdp_storage_get_xskmap_flush_list();
 	struct xdp_sock *xs, *tmp;
 
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
@@ -399,7 +421,7 @@ void __xsk_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool xsk_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+	if (list_empty(xdp_storage_get_xskmap_flush_list()))
 		return false;
 	__xsk_map_flush();
 	return true;
@@ -1770,7 +1792,7 @@ static struct pernet_operations xsk_net_ops = {
 
 static int __init xsk_init(void)
 {
-	int err, cpu;
+	int err;
 
 	err = proto_register(&xsk_proto, 0 /* no slab */);
 	if (err)
@@ -1788,8 +1810,7 @@ static int __init xsk_init(void)
 	if (err)
 		goto out_pernet;
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
+	init_xskmap_flush_list();
 	return 0;
 
 out_pernet:
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2024-02-22 10:58 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior 2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior 2024-02-13 20:50 ` Jesper Dangaard Brouer 2024-02-14 12:19 ` Sebastian Andrzej Siewior 2024-02-14 13:23 ` Toke Høiland-Jørgensen 2024-02-14 14:28 ` Sebastian Andrzej Siewior 2024-02-14 16:08 ` Toke Høiland-Jørgensen 2024-02-14 16:36 ` Sebastian Andrzej Siewior 2024-02-15 20:23 ` Toke Høiland-Jørgensen 2024-02-16 16:57 ` Sebastian Andrzej Siewior 2024-02-19 19:01 ` Toke Høiland-Jørgensen 2024-02-20 9:17 ` Jesper Dangaard Brouer 2024-02-20 10:17 ` Sebastian Andrzej Siewior 2024-02-20 10:42 ` Jesper Dangaard Brouer 2024-02-20 12:08 ` Sebastian Andrzej Siewior 2024-02-20 12:57 ` Jesper Dangaard Brouer 2024-02-20 15:32 ` Sebastian Andrzej Siewior 2024-02-22 9:22 ` Sebastian Andrzej Siewior 2024-02-22 10:10 ` Jesper Dangaard Brouer 2024-02-22 10:58 ` Sebastian Andrzej Siewior 2024-02-20 12:10 ` Dave Taht 2024-02-14 16:13 ` Toke Høiland-Jørgensen 2024-02-15 9:04 ` Sebastian Andrzej Siewior 2024-02-15 12:11 ` Toke Høiland-Jørgensen 2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior