* [RFC bpf-next 0/7] bpf: netdev TX metadata
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, willemb, dsahern, magnus.karlsson,
	bjorn, maciej.fijalkowski, netdev

--- Use cases ---

The goal of this series is to add two new standard-ish places
in the transmit path:

1. Right before the packet is transmitted (with access to TX
   descriptors)
2. Right after the packet is actually transmitted and we've received the
   completion (again, with access to TX completion descriptors)

Accessing TX descriptors unlocks the following use-cases:

- Setting device hints at TX: XDP/AF_XDP programs might use these new
hooks to trigger device offloads. The series implements TX timestamp
as the initial use case.
- Observability: global per-netdev hooks can be used for tracing
the packets and exploring completion descriptors for all sorts of
device errors.

Accessing TX descriptors also means that the hooks have to be called
from the drivers.

The hooks are a light-weight alternative to XDP at egress and currently
don't provide any packet modification abilities. However, eventually,
we can expose new kfuncs that operate on the packet (or, rather, on the
actual descriptors, for performance's sake).

--- UAPI ---

The hooks are implemented in HID-BPF style, meaning they don't
expose any UAPI and are implemented as tracing programs that call
a set of kfuncs. The attach/detach operations happen via BPF syscall
programs. The series extends the device-bound infrastructure to
tracing programs.
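
To give a rough idea (condensed from the selftests in patches 6-7;
devtx_attach_args is a selftest-local context struct, not UAPI), the
BPF side of the attach flow looks like:

extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;

SEC("syscall")
int attach_prog(struct devtx_attach_args *ctx)
{
	/* dev-bound tracing programs are attached via a syscall program */
	ctx->devtx_sb_retval = bpf_devtx_sb_attach(ctx->ifindex, ctx->devtx_sb_prog_fd);
	ctx->devtx_cp_retval = bpf_devtx_cp_attach(ctx->ifindex, ctx->devtx_cp_prog_fd);
	return 0;
}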

--- skb vs xdp ---

The hooks operate on a new light-weight devtx_frame which contains:
- data
- len
- sinfo

This should allow us to have a unified (from the BPF PoV) place at TX
and not be super-taxing (we only need to copy 2 pointers + len to the
stack for each invocation).
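
Concretely, the wrapper added in patch 3 is:

struct devtx_frame {
	void *data;
	u16 len;
	struct skb_shared_info *sinfo; /* for frags */
};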

--- Multiprog attachment ---

Currently, attach/detach don't expose links and don't support multiple
programs. I'm planning to use Daniel's bpf_mprog once it lands.

--- TODO ---

Things that I'm planning to do for the non-RFC series:
- have some real device support to verify xdp_hw_metadata works
- freplace
- documentation similar to Documentation/networking/xdp-rx-metadata.rst

--- CC ---

CC'ing people only on the cover letter; hopefully the rest of the
series can be found via lore.

Cc: willemb@google.com
Cc: dsahern@kernel.org
Cc: john.fastabend@gmail.com
Cc: magnus.karlsson@intel.com
Cc: bjorn@kernel.org
Cc: maciej.fijalkowski@intel.com
Cc: netdev@vger.kernel.org

Stanislav Fomichev (7):
  bpf: rename some xdp-metadata functions into dev-bound
  bpf: resolve single typedef when walking structs
  bpf: implement devtx hook points
  bpf: implement devtx timestamp kfunc
  net: veth: implement devtx timestamp kfuncs
  selftests/bpf: extend xdp_metadata with devtx kfuncs
  selftests/bpf: extend xdp_hw_metadata with devtx kfuncs

 MAINTAINERS                                   |   2 +
 drivers/net/veth.c                            |  94 ++++++-
 include/linux/netdevice.h                     |   6 +
 include/net/devtx.h                           |  76 +++++
 include/net/offload.h                         |  38 +++
 include/net/xdp.h                             |  18 +-
 kernel/bpf/btf.c                              |   2 +
 kernel/bpf/offload.c                          |  40 ++-
 kernel/bpf/verifier.c                         |   4 +-
 net/core/Makefile                             |   1 +
 net/core/dev.c                                |   2 +
 net/core/devtx.c                              | 266 ++++++++++++++++++
 net/core/xdp.c                                |  20 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  82 +++++-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  59 ++++
 .../selftests/bpf/progs/xdp_metadata.c        | 101 +++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 160 ++++++++++-
 tools/testing/selftests/bpf/xdp_metadata.h    |  13 +
 18 files changed, 934 insertions(+), 50 deletions(-)
 create mode 100644 include/net/devtx.h
 create mode 100644 include/net/offload.h
 create mode 100644 net/core/devtx.c

-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 1/7] bpf: rename some xdp-metadata functions into dev-bound
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

No functional changes.

To make the existing dev-bound infrastructure more generic and less
tightly coupled to the XDP layer, rename some functions and move the
kfunc-related bits into kernel/bpf/offload.c.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/offload.h | 28 ++++++++++++++++++++++++++++
 include/net/xdp.h     | 18 +-----------------
 kernel/bpf/offload.c  | 26 ++++++++++++++++++++++++--
 kernel/bpf/verifier.c |  4 ++--
 net/core/xdp.c        | 20 ++------------------
 5 files changed, 57 insertions(+), 39 deletions(-)
 create mode 100644 include/net/offload.h

diff --git a/include/net/offload.h b/include/net/offload.h
new file mode 100644
index 000000000000..264a35881473
--- /dev/null
+++ b/include/net/offload.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_OFFLOAD_H__
+#define __LINUX_NET_OFFLOAD_H__
+
+#include <linux/types.h>
+
+#define XDP_METADATA_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			      bpf_xdp_metadata_rx_timestamp) \
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			      bpf_xdp_metadata_rx_hash)
+
+enum {
+#define NETDEV_METADATA_KFUNC(name, _) name,
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+MAX_NETDEV_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 bpf_dev_bound_kfunc_id(int id);
+bool bpf_is_dev_bound_kfunc(u32 btf_id);
+#else
+static inline u32 bpf_dev_bound_kfunc_id(int id) { return 0; }
+static inline bool bpf_is_dev_bound_kfunc(u32 btf_id) { return false; }
+#endif
+
+#endif /* __LINUX_NET_OFFLOAD_H__ */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..de4c3b70abde 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -9,6 +9,7 @@
 #include <linux/skbuff.h> /* skb_shared_info */
 #include <uapi/linux/netdev.h>
 #include <linux/bitfield.h>
+#include <net/offload.h>
 
 /**
  * DOC: XDP RX-queue information
@@ -384,19 +385,6 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
-#define XDP_METADATA_KFUNC_xxx	\
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
-			   bpf_xdp_metadata_rx_timestamp) \
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
-			   bpf_xdp_metadata_rx_hash) \
-
-enum {
-#define XDP_METADATA_KFUNC(name, _) name,
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-MAX_XDP_METADATA_KFUNC,
-};
-
 enum xdp_rss_hash_type {
 	/* First part: Individual bits for L3/L4 types */
 	XDP_RSS_L3_IPV4		= BIT(0),
@@ -444,14 +432,10 @@ enum xdp_rss_hash_type {
 };
 
 #ifdef CONFIG_NET
-u32 bpf_xdp_metadata_kfunc_id(int id);
-bool bpf_dev_bound_kfunc_id(u32 btf_id);
 void xdp_set_features_flag(struct net_device *dev, xdp_features_t val);
 void xdp_features_set_redirect_target(struct net_device *dev, bool support_sg);
 void xdp_features_clear_redirect_target(struct net_device *dev);
 #else
-static inline u32 bpf_xdp_metadata_kfunc_id(int id) { return 0; }
-static inline bool bpf_dev_bound_kfunc_id(u32 btf_id) { return false; }
 
 static inline void
 xdp_set_features_flag(struct net_device *dev, xdp_features_t val)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8a26cd8814c1..235d81f7e0ed 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -844,9 +844,9 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	if (!ops)
 		goto out;
 
-	if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+	if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
 		p = ops->xmo_rx_timestamp;
-	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
 out:
 	up_read(&bpf_devs_lock);
@@ -854,6 +854,28 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	return p;
 }
 
+BTF_SET_START(dev_bound_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET_END(dev_bound_kfunc_ids)
+
+BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+
+u32 bpf_dev_bound_kfunc_id(int id)
+{
+	/* dev_bound_kfunc_ids is sorted and can't be used */
+	return dev_bound_kfunc_ids_unsorted[id];
+}
+
+bool bpf_is_dev_bound_kfunc(u32 btf_id)
+{
+	return btf_id_set_contains(&dev_bound_kfunc_ids, btf_id);
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1e38584d497c..4db48b5af47e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2721,7 +2721,7 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 		}
 	}
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		err = bpf_dev_bound_kfunc_check(&env->log, prog_aux);
 		if (err)
 			return err;
@@ -17757,7 +17757,7 @@ static void specialize_kfunc(struct bpf_verifier_env *env,
 	void *xdp_kfunc;
 	bool is_rdonly;
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		xdp_kfunc = bpf_dev_bound_resolve_kfunc(prog, func_id);
 		if (xdp_kfunc) {
 			*addr = (unsigned long)xdp_kfunc;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 41e5ca8643ec..819767697370 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -741,9 +741,9 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-#define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
 XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
+#undef NETDEV_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
 
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
@@ -751,22 +751,6 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
 	.set   = &xdp_metadata_kfunc_ids,
 };
 
-BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
-#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-
-u32 bpf_xdp_metadata_kfunc_id(int id)
-{
-	/* xdp_metadata_kfunc_ids is sorted and can't be used */
-	return xdp_metadata_kfunc_ids_unsorted[id];
-}
-
-bool bpf_dev_bound_kfunc_id(u32 btf_id)
-{
-	return btf_id_set8_contains(&xdp_metadata_kfunc_ids, btf_id);
-}
-
 static int __init xdp_metadata_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 2/7] bpf: resolve single typedef when walking structs
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

It is impossible to use skb_frag_t in a tracing program because it is
a typedef, so let's resolve a single level of typedef when walking
structs.
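
For example, the devtx selftest in patch 6 walks the frags via the
skb_frag_t typedef (a typedef over struct bio_vec); without this
change, btf_struct_walk() stops at the typedef and rejects the access
with "Type 'skb_frag_t' is not a struct":

	/* one level of typedef resolution makes the bio_vec
	 * members of skb_frag_t walkable from a tracing program
	 */
	if (frame->sinfo->frags[0].bv_len != 0)
		return -1;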

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/btf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index bd2cac057928..9bdaa1225e8a 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6140,6 +6140,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
 	*flag = 0;
 again:
 	tname = __btf_name_by_offset(btf, t->name_off);
+	if (btf_type_is_typedef(t))
+		t = btf_type_by_id(btf, t->type);
 	if (!btf_type_is_struct(t)) {
 		bpf_log(log, "Type '%s' is not a struct\n", tname);
 		return -EINVAL;
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 3/7] bpf: implement devtx hook points
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

devtx is a lightweight set of hooks before and after packet transmission.
The hooks are supposed to work for both the skb and xdp paths by
exposing a light-weight packet wrapper, devtx_frame (header portion +
frags).

devtx programs are implemented as tracing programs which have access
to XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
in the next patch, but the idea is similar to XDP metadata: the kfuncs
have netdev-specific implementations but a common interface. Upon
loading, the kfuncs are resolved into direct calls against the
per-netdev implementation. This is achieved by marking devtx tracing
programs as dev-bound (largely reusing the xdp-dev-bound program
infrastructure).

Attachment and detachment are implemented via a syscall BPF program
that calls bpf_devtx_sb_attach (attach to TX submission) or
bpf_devtx_cp_attach (attach to TX completion). Right now, the
attachment does not return a link and doesn't support multiple
programs; I plan to switch to Daniel's bpf_mprog infra once it's
available.
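
On the driver side, the invocation is guarded by a static key plus a
per-netdev program check and looks roughly like this (sketch; see the
veth conversion in patch 5 for the real thing):

	if (devtx_submit_enabled(dev)) {
		struct devtx_frame frame;

		devtx_frame_from_skb(&frame, skb);
		devtx_submit(dev, &frame);
	}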

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 MAINTAINERS               |   2 +
 include/linux/netdevice.h |   2 +
 include/net/devtx.h       |  76 ++++++++++++++
 kernel/bpf/offload.c      |   6 ++
 net/core/Makefile         |   1 +
 net/core/dev.c            |   2 +
 net/core/devtx.c          | 208 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 297 insertions(+)
 create mode 100644 include/net/devtx.h
 create mode 100644 net/core/devtx.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c904dba1733b..516529b42e66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22976,11 +22976,13 @@ L:	bpf@vger.kernel.org
 S:	Supported
 F:	drivers/net/ethernet/*/*/*/*/*xdp*
 F:	drivers/net/ethernet/*/*/*xdp*
+F:	include/net/devtx.h
 F:	include/net/xdp.h
 F:	include/net/xdp_priv.h
 F:	include/trace/events/xdp.h
 F:	kernel/bpf/cpumap.c
 F:	kernel/bpf/devmap.c
+F:	net/core/devtx.c
 F:	net/core/xdp.c
 F:	samples/bpf/xdp*
 F:	tools/testing/selftests/bpf/*/*xdp*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 08fbd4622ccf..e08e3fd39dfc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2238,6 +2238,8 @@ struct net_device {
 	unsigned int		real_num_rx_queues;
 
 	struct bpf_prog __rcu	*xdp_prog;
+	struct bpf_prog	__rcu	*devtx_sb;
+	struct bpf_prog	__rcu	*devtx_cp;
 	unsigned long		gro_flush_timeout;
 	int			napi_defer_hard_irqs;
 #define GRO_LEGACY_MAX_SIZE	65536u
diff --git a/include/net/devtx.h b/include/net/devtx.h
new file mode 100644
index 000000000000..7eab66d0ce80
--- /dev/null
+++ b/include/net/devtx.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_DEVTX_H__
+#define __LINUX_NET_DEVTX_H__
+
+#include <linux/jump_label.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <net/xdp.h>
+
+struct devtx_frame {
+	void *data;
+	u16 len;
+	struct skb_shared_info *sinfo; /* for frags */
+};
+
+#ifdef CONFIG_NET
+void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx);
+void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx);
+bool is_devtx_kfunc(u32 kfunc_id);
+void devtx_shutdown(struct net_device *netdev);
+
+static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
+{
+	ctx->data = skb->data;
+	ctx->len = skb_headlen(skb);
+	ctx->sinfo = skb_shinfo(skb);
+}
+
+static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
+{
+	ctx->data = xdpf->data;
+	ctx->len = xdpf->len;
+	ctx->sinfo = xdp_frame_has_frags(xdpf) ? xdp_get_shared_info_from_frame(xdpf) : NULL;
+}
+
+DECLARE_STATIC_KEY_FALSE(devtx_enabled);
+
+static inline bool devtx_submit_enabled(struct net_device *netdev)
+{
+	return static_branch_unlikely(&devtx_enabled) &&
+	       rcu_access_pointer(netdev->devtx_sb);
+}
+
+static inline bool devtx_complete_enabled(struct net_device *netdev)
+{
+	return static_branch_unlikely(&devtx_enabled) &&
+	       rcu_access_pointer(netdev->devtx_cp);
+}
+#else
+static inline void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
+{
+}
+
+static inline void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
+{
+}
+
+static inline bool is_devtx_kfunc(u32 kfunc_id)
+{
+	return false;
+}
+
+static inline void devtx_shutdown(struct net_device *netdev)
+{
+}
+
+static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
+{
+}
+
+static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
+{
+}
+#endif
+
+#endif /* __LINUX_NET_DEVTX_H__ */
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 235d81f7e0ed..9cfe96422c80 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -25,6 +25,7 @@
 #include <linux/rhashtable.h>
 #include <linux/rtnetlink.h>
 #include <linux/rwsem.h>
+#include <net/devtx.h>
 
 /* Protects offdevs, members of bpf_offload_netdev and offload members
  * of all progs.
@@ -228,6 +229,7 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	int err;
 
 	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
+	    attr->prog_type != BPF_PROG_TYPE_TRACING &&
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
@@ -238,6 +240,10 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	    attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY)
 		return -EINVAL;
 
+	if (attr->prog_type == BPF_PROG_TYPE_TRACING &&
+	    !is_devtx_kfunc(prog->aux->attach_btf_id))
+		return -EINVAL;
+
 	netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex);
 	if (!netdev)
 		return -EINVAL;
diff --git a/net/core/Makefile b/net/core/Makefile
index 8f367813bc68..c1db05ccfac7 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -39,4 +39,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
 obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += devtx.o
 obj-$(CONFIG_OF)	+= of_net.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 3393c2f3dbe8..ef0e65e68024 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -150,6 +150,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/prandom.h>
 #include <linux/once_lite.h>
+#include <net/devtx.h>
 
 #include "dev.h"
 #include "net-sysfs.h"
@@ -10875,6 +10876,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 		dev_shutdown(dev);
 
 		dev_xdp_uninstall(dev);
+		devtx_shutdown(dev);
 		bpf_dev_bound_netdev_unregister(dev);
 
 		netdev_offload_xstats_disable_all(dev);
diff --git a/net/core/devtx.c b/net/core/devtx.c
new file mode 100644
index 000000000000..b7cbc26d1c01
--- /dev/null
+++ b/net/core/devtx.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <net/devtx.h>
+#include <linux/filter.h>
+
+DEFINE_STATIC_KEY_FALSE(devtx_enabled);
+EXPORT_SYMBOL_GPL(devtx_enabled);
+
+static void devtx_run(struct net_device *netdev, struct devtx_frame *ctx, struct bpf_prog **pprog)
+{
+	struct bpf_prog *prog;
+	void *real_ctx[1] = {ctx};
+
+	prog = rcu_dereference(*pprog);
+	if (likely(prog))
+		bpf_prog_run(prog, real_ctx);
+}
+
+void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
+{
+	rcu_read_lock();
+	devtx_run(netdev, ctx, &netdev->devtx_sb);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(devtx_submit);
+
+void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
+{
+	rcu_read_lock();
+	devtx_run(netdev, ctx, &netdev->devtx_cp);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(devtx_complete);
+
+/**
+ * devtx_sb - Called for every egress netdev packet
+ *
+ * Note: this function is never actually called by the kernel; it only
+ * exists to allow loading and attaching the corresponding programs.
+ */
+__weak noinline void devtx_sb(struct devtx_frame *ctx)
+{
+}
+
+/**
+ * devtx_cp - Called upon egress netdev packet completion
+ *
+ * Note: this function is never actually called by the kernel; it only
+ * exists to allow loading and attaching the corresponding programs.
+ */
+__weak noinline void devtx_cp(struct devtx_frame *ctx)
+{
+}
+
+BTF_SET8_START(bpf_devtx_hook_ids)
+BTF_ID_FLAGS(func, devtx_sb)
+BTF_ID_FLAGS(func, devtx_cp)
+BTF_SET8_END(bpf_devtx_hook_ids)
+
+static const struct btf_kfunc_id_set bpf_devtx_hook_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_devtx_hook_ids,
+};
+
+static DEFINE_MUTEX(devtx_attach_lock);
+
+static int __bpf_devtx_detach(struct net_device *netdev, struct bpf_prog **pprog)
+{
+	if (!*pprog)
+		return -EINVAL;
+	bpf_prog_put(*pprog);
+	*pprog = NULL;
+
+	static_branch_dec(&devtx_enabled);
+	return 0;
+}
+
+static int __bpf_devtx_attach(struct net_device *netdev, int prog_fd,
+			      const char *attach_func_name, struct bpf_prog **pprog)
+{
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	if (prog_fd < 0)
+		return __bpf_devtx_detach(netdev, pprog);
+
+	if (*pprog)
+		return -EBUSY;
+
+	prog = bpf_prog_get(prog_fd);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	if (prog->type != BPF_PROG_TYPE_TRACING ||
+	    prog->expected_attach_type != BPF_TRACE_FENTRY ||
+	    !bpf_prog_is_dev_bound(prog->aux) ||
+	    !bpf_offload_dev_match(prog, netdev) ||
+	    strcmp(prog->aux->attach_func_name, attach_func_name)) {
+		bpf_prog_put(prog);
+		return -EINVAL;
+	}
+
+	*pprog = prog;
+	static_branch_inc(&devtx_enabled);
+
+	return ret;
+}
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * bpf_devtx_sb_attach - Attach devtx 'packet submit' program
+ * @ifindex: netdev interface index.
+ * @prog_fd: BPF program file descriptor.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
+{
+	struct net_device *netdev;
+	int ret;
+
+	netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
+	if (!netdev)
+		return -EINVAL;
+
+	mutex_lock(&devtx_attach_lock);
+	ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_sb", &netdev->devtx_sb);
+	mutex_unlock(&devtx_attach_lock);
+
+	dev_put(netdev);
+
+	return ret;
+}
+
+/**
+ * bpf_devtx_cp_attach - Attach devtx 'packet complete' program
+ * @ifindex: netdev interface index.
+ * @prog_fd: BPF program file descriptor.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
+{
+	struct net_device *netdev;
+	int ret;
+
+	netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
+	if (!netdev)
+		return -EINVAL;
+
+	mutex_lock(&devtx_attach_lock);
+	ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_cp", &netdev->devtx_cp);
+	mutex_unlock(&devtx_attach_lock);
+
+	dev_put(netdev);
+
+	return ret;
+}
+
+__diag_pop();
+
+bool is_devtx_kfunc(u32 kfunc_id)
+{
+	return !!btf_id_set8_contains(&bpf_devtx_hook_ids, kfunc_id);
+}
+
+void devtx_shutdown(struct net_device *netdev)
+{
+	mutex_lock(&devtx_attach_lock);
+	__bpf_devtx_detach(netdev, &netdev->devtx_sb);
+	__bpf_devtx_detach(netdev, &netdev->devtx_cp);
+	mutex_unlock(&devtx_attach_lock);
+}
+
+BTF_SET8_START(bpf_devtx_syscall_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_devtx_sb_attach)
+BTF_ID_FLAGS(func, bpf_devtx_cp_attach)
+BTF_SET8_END(bpf_devtx_syscall_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_devtx_syscall_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_devtx_syscall_kfunc_ids,
+};
+
+static int __init devtx_init(void)
+{
+	int ret;
+
+	ret = register_btf_fmodret_id_set(&bpf_devtx_hook_set);
+	if (ret) {
+		pr_warn("failed to register devtx hooks: %d\n", ret);
+		return ret;
+	}
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_devtx_syscall_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register syscall kfuncs: %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+late_initcall(devtx_init);
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 4/7] bpf: implement devtx timestamp kfunc
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Two kfuncs, one per hook point:

1. at submit time - bpf_devtx_sb_request_timestamp - to request HW
   to put TX timestamp into TX completion descriptors

2. at completion time - bpf_devtx_cp_timestamp - to read out
   TX timestamp
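
A minimal pair of programs exercising these kfuncs might look like
this (sketch, mirroring the selftests later in the series):

extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;

SEC("fentry/devtx_sb")
int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
{
	/* arm the TX timestamp at submission time */
	bpf_devtx_sb_request_timestamp(frame);
	return 0;
}

SEC("fentry/devtx_cp")
int BPF_PROG(devtx_cp, const struct devtx_frame *frame)
{
	__u64 tstamp = 0;

	/* read the timestamp back out of the completion */
	bpf_devtx_cp_timestamp(frame, &tstamp);
	return 0;
}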

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/netdevice.h |  4 +++
 include/net/offload.h     | 10 +++++++
 kernel/bpf/offload.c      |  8 ++++++
 net/core/devtx.c          | 58 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 80 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e08e3fd39dfc..6e42e62fd1bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1651,10 +1651,14 @@ struct net_device_ops {
 						  bool cycles);
 };
 
+struct devtx_frame;
+
 struct xdp_metadata_ops {
 	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
 	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
 			       enum xdp_rss_hash_type *rss_type);
+	int	(*xmo_sb_request_timestamp)(const struct devtx_frame *ctx);
+	int	(*xmo_cp_timestamp)(const struct devtx_frame *ctx, u64 *timestamp);
 };
 
 /**
diff --git a/include/net/offload.h b/include/net/offload.h
index 264a35881473..36899b64f4c8 100644
--- a/include/net/offload.h
+++ b/include/net/offload.h
@@ -10,9 +10,19 @@
 	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
 			      bpf_xdp_metadata_rx_hash)
 
+#define DEVTX_SB_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_SB_KFUNC_REQUEST_TIMESTAMP, \
+			      bpf_devtx_sb_request_timestamp)
+
+#define DEVTX_CP_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_CP_KFUNC_TIMESTAMP, \
+			      bpf_devtx_cp_timestamp)
+
 enum {
 #define NETDEV_METADATA_KFUNC(name, _) name,
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 MAX_NETDEV_METADATA_KFUNC,
 };
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 9cfe96422c80..91dc7b7e3684 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -854,6 +854,10 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_timestamp;
 	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
+	else if (func_id == bpf_dev_bound_kfunc_id(DEVTX_SB_KFUNC_REQUEST_TIMESTAMP))
+		p = ops->xmo_sb_request_timestamp;
+	else if (func_id == bpf_dev_bound_kfunc_id(DEVTX_CP_KFUNC_TIMESTAMP))
+		p = ops->xmo_cp_timestamp;
 out:
 	up_read(&bpf_devs_lock);
 
@@ -863,12 +867,16 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 BTF_SET_START(dev_bound_kfunc_ids)
 #define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 BTF_SET_END(dev_bound_kfunc_ids)
 
 BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
 #define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 
 u32 bpf_dev_bound_kfunc_id(int id)
diff --git a/net/core/devtx.c b/net/core/devtx.c
index b7cbc26d1c01..2b37c31f0912 100644
--- a/net/core/devtx.c
+++ b/net/core/devtx.c
@@ -162,6 +162,30 @@ __bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
 	return ret;
 }
 
+/**
+ * bpf_devtx_sb_request_timestamp - Request TX timestamp on the packet.
+ * Callable only from the devtx-submit hook.
+ * @ctx: devtx context pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * bpf_devtx_cp_timestamp - Read TX timestamp of the packet. Callable
+ * only from the devtx-complete hook.
+ * @ctx: devtx context pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, u64 *timestamp)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 bool is_devtx_kfunc(u32 kfunc_id)
@@ -187,6 +211,28 @@ static const struct btf_kfunc_id_set bpf_devtx_syscall_kfunc_set = {
 	.set   = &bpf_devtx_syscall_kfunc_ids,
 };
 
+BTF_SET8_START(devtx_sb_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+DEVTX_SB_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_sb_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_sb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_sb_kfunc_ids,
+};
+
+BTF_SET8_START(devtx_cp_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+DEVTX_CP_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_cp_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_cp_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_cp_kfunc_ids,
+};
+
 static int __init devtx_init(void)
 {
 	int ret;
@@ -197,6 +243,18 @@ static int __init devtx_init(void)
 		return ret;
 	}
 
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_sb_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_sb kfuncs: %d\n", ret);
+		return ret;
+	}
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_cp_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_cp kfuncs: %d\n", ret);
+		return ret;
+	}
+
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_devtx_syscall_kfunc_set);
 	if (ret) {
 		pr_warn("failed to register syscall kfuncs: %d\n", ret);
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 5/7] net: veth: implement devtx timestamp kfuncs
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Have a software-based example for the kfuncs to showcase how they can
be used in real devices, and to have something to test against in the
selftests.

Both paths (skb & xdp) are covered. Only the skb path is really
tested though.
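
The pattern here (embedding devtx_frame at the start of a
driver-specific context and casting back in the kfunc implementation)
is presumably what a real device would follow as well; a hypothetical
sketch (all mynic_* names are made up):

	struct mynic_devtx_frame {
		struct devtx_frame frame; /* must stay first for the cast */
		struct mynic_tx_desc *desc;
	};

	static int mynic_devtx_sb_request_timestamp(const struct devtx_frame *_ctx)
	{
		struct mynic_devtx_frame *ctx = (struct mynic_devtx_frame *)_ctx;

		/* set the timestamp-request bit in the real TX descriptor */
		mynic_desc_request_tstamp(ctx->desc);
		return 0;
	}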

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 94 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 4 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..eb78d51d8352 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <net/page_pool.h>
+#include <net/devtx.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -123,6 +124,13 @@ struct veth_xdp_buff {
 	struct sk_buff *skb;
 };
 
+struct veth_devtx_frame {
+	struct devtx_frame frame;
+	bool request_timestamp;
+	ktime_t xdp_tx_timestamp;
+	struct sk_buff *skb;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -314,9 +322,29 @@ static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool request_timestamp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	struct net_device *src_dev = skb->dev;
+	int ret;
+
+	ret = __dev_forward_skb(dev, skb);
+	if (ret)
+		return ret;
+
+	if (devtx_complete_enabled(src_dev)) {
+		struct veth_devtx_frame ctx;
+
+		if (unlikely(request_timestamp))
+			__net_timestamp(skb);
+
+		devtx_frame_from_skb(&ctx.frame, skb);
+		ctx.frame.data -= ETH_HLEN; /* undo eth_type_trans pull */
+		ctx.frame.len += ETH_HLEN;
+		ctx.skb = skb;
+		devtx_complete(src_dev, &ctx.frame);
+	}
+
+	return xdp ?
 		veth_xdp_rx(rq, skb) :
 		__netif_rx(skb);
 }
@@ -343,6 +371,7 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	bool request_timestamp = false;
 	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
@@ -356,6 +385,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	if (devtx_submit_enabled(dev)) {
+		struct veth_devtx_frame ctx;
+
+		devtx_frame_from_skb(&ctx.frame, skb);
+		ctx.request_timestamp = false;
+		devtx_submit(dev, &ctx.frame);
+		request_timestamp = ctx.request_timestamp;
+	}
+
 	rcv_priv = netdev_priv(rcv);
 	rxq = skb_get_queue_mapping(skb);
 	if (rxq < rcv->real_num_rx_queues) {
@@ -370,7 +408,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb, rq, use_napi, request_timestamp) == NET_RX_SUCCESS)) {
 		if (!use_napi)
 			dev_lstats_add(dev, length);
 	} else {
@@ -483,6 +521,7 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	int i, ret = -ENXIO, nxmit = 0;
+	ktime_t tx_timestamp = 0;
 	struct net_device *rcv;
 	unsigned int max_len;
 	struct veth_rq *rq;
@@ -511,9 +550,32 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 		void *ptr = veth_xdp_to_ptr(frame);
 
 		if (unlikely(xdp_get_frame_len(frame) > max_len ||
-			     __ptr_ring_produce(&rq->xdp_ring, ptr)))
+			     __ptr_ring_full(&rq->xdp_ring)))
+			break;
+
+		if (devtx_submit_enabled(dev)) {
+			struct veth_devtx_frame ctx;
+
+			devtx_frame_from_xdp(&ctx.frame, frame);
+			ctx.request_timestamp = false;
+			devtx_submit(dev, &ctx.frame);
+
+			if (unlikely(ctx.request_timestamp))
+				tx_timestamp = ktime_get_real();
+		}
+
+		if (unlikely(__ptr_ring_produce(&rq->xdp_ring, ptr)))
 			break;
 		nxmit++;
+
+		if (devtx_complete_enabled(dev)) {
+			struct veth_devtx_frame ctx;
+
+			devtx_frame_from_xdp(&ctx.frame, frame);
+			ctx.xdp_tx_timestamp = tx_timestamp;
+			ctx.skb = NULL;
+			devtx_complete(dev, &ctx.frame);
+		}
 	}
 	spin_unlock(&rq->xdp_ring.producer_lock);
 
@@ -1732,6 +1794,28 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int veth_devtx_sb_request_timestamp(const struct devtx_frame *_ctx)
+{
+	struct veth_devtx_frame *ctx = (struct veth_devtx_frame *)_ctx;
+
+	ctx->request_timestamp = true;
+
+	return 0;
+}
+
+static int veth_devtx_cp_timestamp(const struct devtx_frame *_ctx, u64 *timestamp)
+{
+	struct veth_devtx_frame *ctx = (struct veth_devtx_frame *)_ctx;
+
+	if (ctx->skb) {
+		*timestamp = ctx->skb->tstamp;
+		return 0;
+	}
+
+	*timestamp = ctx->xdp_tx_timestamp;
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1756,6 +1840,8 @@ static const struct net_device_ops veth_netdev_ops = {
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
 	.xmo_rx_hash			= veth_xdp_rx_hash,
+	.xmo_sb_request_timestamp	= veth_devtx_sb_request_timestamp,
+	.xmo_cp_timestamp		= veth_devtx_cp_timestamp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 6/7] selftests/bpf: extend xdp_metadata with devtx kfuncs
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Attach devtx programs which use the kfuncs to request and report the
TX timestamp via a ringbuf. Confirm on the userspace side that the
programs have triggered and that the timestamp is non-zero.

Also make sure devtx_frame has sensible pointers and data.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  82 +++++++++++++-
 .../selftests/bpf/progs/xdp_metadata.c        | 101 ++++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |  13 +++
 3 files changed, 194 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..ebaa50293f85 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -42,6 +42,9 @@ struct xsk {
 	struct xsk_ring_prod tx;
 	struct xsk_ring_cons rx;
 	struct xsk_socket *socket;
+	int tx_completions;
+	u32 last_tx_timestamp_retval;
+	u64 last_tx_timestamp;
 };
 
 static int open_xsk(int ifindex, struct xsk *xsk)
@@ -192,7 +195,8 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 	return 0;
 }
 
-static void complete_tx(struct xsk *xsk)
+static void complete_tx(struct xsk *xsk, struct xdp_metadata *bpf_obj,
+			struct ring_buffer *ringbuf)
 {
 	__u32 idx;
 	__u64 addr;
@@ -202,6 +206,13 @@ static void complete_tx(struct xsk *xsk)
 
 		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
 		xsk_ring_cons__release(&xsk->comp, 1);
+
+		ring_buffer__poll(ringbuf, 1000);
+
+		ASSERT_EQ(bpf_obj->bss->pkts_fail_tx, 0, "pkts_fail_tx");
+		ASSERT_GE(xsk->tx_completions, 1, "tx_completions");
+		ASSERT_EQ(xsk->last_tx_timestamp_retval, 0, "last_tx_timestamp_retval");
+		ASSERT_GT(xsk->last_tx_timestamp, 0, "last_tx_timestamp");
 	}
 }
 
@@ -276,8 +287,24 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	return 0;
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+	struct xsk *xsk = ctx;
+
+	printf("%p: got tx timestamp sample %u %llu\n",
+	       xsk, sample->timestamp_retval, sample->timestamp);
+
+	xsk->tx_completions++;
+	xsk->last_tx_timestamp_retval = sample->timestamp_retval;
+	xsk->last_tx_timestamp = sample->timestamp;
+
+	return 0;
+}
+
 void test_xdp_metadata(void)
 {
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	struct xdp_metadata2 *bpf_obj2 = NULL;
 	struct xdp_metadata *bpf_obj = NULL;
 	struct bpf_program *new_prog, *prog;
@@ -290,6 +317,7 @@ void test_xdp_metadata(void)
 	int retries = 10;
 	int rx_ifindex;
 	int tx_ifindex;
+	int syscall_fd;
 	int sock_fd;
 	int ret;
 
@@ -323,6 +351,16 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
 		goto out;
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "devtx_sb");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	bpf_program__set_autoattach(prog, false);
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "devtx_cp");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	bpf_program__set_autoattach(prog, false);
+
 	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
 	bpf_program__set_ifindex(prog, rx_ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
@@ -330,6 +368,15 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
 		goto out;
 
+	ret = xdp_metadata__attach(bpf_obj);
+	if (!ASSERT_OK(ret, "xdp_metadata__attach"))
+		goto out;
+
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, &tx_xsk, NULL);
+	if (!ASSERT_OK_PTR(tx_compl_ringbuf, "ring_buffer__new"))
+		goto out;
+
 	/* Make sure we can't add dev-bound programs to prog maps. */
 	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
 	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
@@ -341,6 +388,26 @@ void test_xdp_metadata(void)
 			"update prog_arr"))
 		goto out;
 
+	/* Attach egress BPF programs to interface. */
+	struct devtx_attach_args args = {
+		.ifindex = tx_ifindex,
+		.devtx_sb_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_sb),
+		.devtx_cp_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_cp),
+		.devtx_sb_retval = -1,
+		.devtx_cp_retval = -1,
+	};
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, tattr,
+		.ctx_in = &args,
+		.ctx_size_in = sizeof(args),
+	);
+
+	syscall_fd = bpf_program__fd(bpf_obj->progs.attach_prog);
+	ret = bpf_prog_test_run_opts(syscall_fd, &tattr);
+	if (!ASSERT_GE(ret, 0, "bpf_prog_test_run_opts(attach_prog)"))
+		goto out;
+	ASSERT_GE(args.devtx_sb_retval, 0, "bpf_prog_test_run_opts(attach_prog) devtx_sb_retval");
+	ASSERT_GE(args.devtx_cp_retval, 0, "bpf_prog_test_run_opts(attach_prog) devtx_cp_retval");
+
 	/* Attach BPF program to RX interface. */
 
 	ret = bpf_xdp_attach(rx_ifindex,
@@ -364,7 +431,16 @@ void test_xdp_metadata(void)
 		       "verify_xsk_metadata"))
 		goto out;
 
-	complete_tx(&tx_xsk);
+	/* Verify AF_XDP TX packet has completion event with a timestamp. */
+	complete_tx(&tx_xsk, bpf_obj, tx_compl_ringbuf);
+
+	/* Detach egress program. */
+	syscall_fd = bpf_program__fd(bpf_obj->progs.detach_prog);
+	ret = bpf_prog_test_run_opts(syscall_fd, &tattr);
+	if (!ASSERT_GE(ret, 0, "bpf_prog_test_run_opts(detach_prog)"))
+		goto out;
+	ASSERT_GE(args.devtx_sb_retval, 0, "bpf_prog_test_run_opts(detach_prog) devtx_sb_retval");
+	ASSERT_GE(args.devtx_cp_retval, 0, "bpf_prog_test_run_opts(detach_prog) devtx_cp_retval");
 
 	/* Make sure freplace correctly picks up original bound device
 	 * and doesn't crash.
@@ -402,5 +478,7 @@ void test_xdp_metadata(void)
 	xdp_metadata__destroy(bpf_obj);
 	if (tok)
 		close_netns(tok);
+	if (tx_compl_ringbuf)
+		ring_buffer__free(tx_compl_ringbuf);
 	SYS_NOFAIL("ip netns del xdp_metadata");
 }
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index d151d406a123..5b815bd03fe7 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -4,6 +4,11 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -19,10 +24,22 @@ struct {
 	__type(value, __u32);
 } prog_arr SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 10);
+} tx_compl_buf SEC(".maps");
+
+__u64 pkts_fail_tx = 0;
+
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
+extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
+
+extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
+extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -61,4 +78,88 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+static inline int verify_frame(const struct devtx_frame *frame)
+{
+	struct ethhdr eth = {};
+
+	/* all the pointers are set up correctly */
+	if (!frame->data)
+		return -1;
+	if (!frame->sinfo)
+		return -1;
+
+	/* the frag array is walkable (this test packet has no frags) */
+	if (frame->sinfo->nr_frags != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_page != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_len != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_offset != 0)
+		return -1;
+
+	/* the data has something that looks like ethernet */
+	if (frame->len != 46)
+		return -1;
+	bpf_probe_read_kernel(&eth, sizeof(eth), frame->data);
+
+	if (eth.h_proto != bpf_htons(ETH_P_IP))
+		return -1;
+
+	return 0;
+}
+
+SEC("fentry/devtx_sb")
+int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
+{
+	int ret;
+
+	ret = verify_frame(frame);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	ret = bpf_devtx_sb_request_timestamp(frame);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	return 0;
+}
+
+SEC("fentry/devtx_cp")
+int BPF_PROG(devtx_cp, const struct devtx_frame *frame)
+{
+	struct devtx_sample *sample;
+	int ret;
+
+	ret = verify_frame(frame);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample)
+		return 0;
+
+	sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
+SEC("syscall")
+int attach_prog(struct devtx_attach_args *ctx)
+{
+	ctx->devtx_sb_retval = bpf_devtx_sb_attach(ctx->ifindex, ctx->devtx_sb_prog_fd);
+	ctx->devtx_cp_retval = bpf_devtx_cp_attach(ctx->ifindex, ctx->devtx_cp_prog_fd);
+	return 0;
+}
+
+SEC("syscall")
+int detach_prog(struct devtx_attach_args *ctx)
+{
+	ctx->devtx_sb_retval = bpf_devtx_sb_attach(ctx->ifindex, -1);
+	ctx->devtx_cp_retval = bpf_devtx_cp_attach(ctx->ifindex, -1);
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 938a729bd307..add900c9035c 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -18,3 +18,16 @@ struct xdp_meta {
 		__s32 rx_hash_err;
 	};
 };
+
+struct devtx_sample {
+	int timestamp_retval;
+	__u64 timestamp;
+};
+
+struct devtx_attach_args {
+	int ifindex;
+	int devtx_sb_prog_fd;
+	int devtx_cp_prog_fd;
+	int devtx_sb_retval;
+	int devtx_cp_retval;
+};
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata with devtx kfuncs
From: Stanislav Fomichev @ 2023-06-12 17:23 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

When we get a packet on port 9091, we swap its src/dst and send it
back out. At this point, we also request the TX timestamp and plumb
it back to userspace. The userspace side simply prints the timestamp.

Haven't really tested, still working on mlx5 patches...

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  59 +++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 160 +++++++++++++++++-
 2 files changed, 214 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index b2dfd7066c6e..e27823b755ef 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -4,6 +4,7 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -12,14 +13,26 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 10);
+} tx_compl_buf SEC(".maps");
+
 __u64 pkts_skip = 0;
 __u64 pkts_fail = 0;
 __u64 pkts_redir = 0;
+__u64 pkts_fail_tx = 0;
+__u64 pkts_ringbuf_full = 0;
 
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
+extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
+
+extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
+extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -90,4 +103,50 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+SEC("fentry/devtx_sb")
+int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
+{
+	int ret;
+
+	ret = bpf_devtx_sb_request_timestamp(frame);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	return 0;
+}
+
+SEC("fentry/devtx_cp")
+int BPF_PROG(devtx_cp, const struct devtx_frame *frame)
+{
+	struct devtx_sample *sample;
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample) {
+		__sync_add_and_fetch(&pkts_ringbuf_full, 1);
+		return 0;
+	}
+
+	sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
+SEC("syscall")
+int attach_prog(struct devtx_attach_args *ctx)
+{
+	ctx->devtx_sb_retval = bpf_devtx_sb_attach(ctx->ifindex, ctx->devtx_sb_prog_fd);
+	ctx->devtx_cp_retval = bpf_devtx_cp_attach(ctx->ifindex, ctx->devtx_cp_prog_fd);
+	return 0;
+}
+
+SEC("syscall")
+int detach_prog(struct devtx_attach_args *ctx)
+{
+	ctx->devtx_sb_retval = bpf_devtx_sb_attach(ctx->ifindex, -1);
+	ctx->devtx_cp_retval = bpf_devtx_cp_attach(ctx->ifindex, -1);
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..6cc364c2af8a 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,8 @@
  *   - rx_hash
  *
  * TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
  */
 
 #include <test_progs.h>
@@ -228,7 +229,83 @@ static void verify_skb_metadata(int fd)
 	printf("skb hwtstamp is not found!\n");
 }
 
-static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
+static void complete_tx(struct xsk *xsk, struct ring_buffer *ringbuf)
+{
+	__u32 idx;
+	__u64 addr;
+
+	ring_buffer__poll(ringbuf, 1000);
+
+	if (xsk_ring_cons__peek(&xsk->comp, 1, &idx)) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+		xsk_ring_cons__release(&xsk->comp, 1);
+	}
+}
+
+#define swap(a, b, len) do { \
+	for (int i = 0; i < len; i++) { \
+		__u8 tmp = ((__u8 *)a)[i]; \
+		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+		((__u8 *)b)[i] = tmp; \
+	} \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+	struct ipv6hdr *ip6h = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	void *data;
+	__u32 idx;
+	int ret;
+	int len;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (ret != 1) {
+		printf("%p: failed to reserve tx slot\n", xsk);
+		return;
+	}
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+
+	if (eth->h_proto == htons(ETH_P_IP)) {
+		iph = (void *)(eth + 1);
+		udph = (void *)(iph + 1);
+	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
+		ip6h = (void *)(eth + 1);
+		udph = (void *)(ip6h + 1);
+	} else {
+		xsk_ring_prod__cancel(&xsk->tx, 1);
+		return;
+	}
+
+	len = ETH_HLEN;
+	if (ip6h)
+		len += ntohs(ip6h->payload_len);
+	if (iph)
+		len += ntohs(iph->tot_len);
+
+	memcpy(data, rx_packet, len);
+	swap(eth->h_dest, eth->h_source, ETH_ALEN);
+	if (iph)
+		swap(&iph->saddr, &iph->daddr, 4);
+	else
+		swap(&ip6h->saddr, &ip6h->daddr, 16);
+	swap(&udph->source, &udph->dest, 2);
+
+	xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id,
+			   struct ring_buffer *ringbuf)
 {
 	const struct xdp_desc *rx_desc;
 	struct pollfd fds[rxq + 1];
@@ -280,6 +357,11 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 			       xsk, idx, rx_desc->addr, addr, comp_addr);
 			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
 					    clock_id);
+
+			/* mirror packet back */
+			ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+			complete_tx(xsk, ringbuf);
+
 			xsk_ring_cons__release(&xsk->rx, 1);
 			refill_rx(xsk, comp_addr);
 		}
@@ -370,6 +452,7 @@ static void hwtstamp_enable(const char *ifname)
 static void cleanup(void)
 {
 	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
+	int syscall_fd;
 	int ret;
 	int i;
 
@@ -379,8 +462,26 @@ static void cleanup(void)
 			printf("detaching bpf program....\n");
 			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
 			if (ret)
-				printf("failed to detach XDP program: %d\n", ret);
+				printf("failed to detach RX XDP program: %d\n", ret);
 		}
+
+		struct devtx_attach_args args = {
+			.ifindex = ifindex,
+			.devtx_sb_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_sb),
+			.devtx_cp_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_cp),
+			.devtx_sb_retval = -1,
+			.devtx_cp_retval = -1,
+		};
+		DECLARE_LIBBPF_OPTS(bpf_test_run_opts, tattr,
+			.ctx_in = &args,
+			.ctx_size_in = sizeof(args),
+		);
+
+		syscall_fd = bpf_program__fd(bpf_obj->progs.detach_prog);
+		ret = bpf_prog_test_run_opts(syscall_fd, &tattr);
+		if (ret < 0 || args.devtx_sb_retval < 0 || args.devtx_cp_retval < 0)
+			printf("failed to detach TX XDP programs: %d %d %d\n",
+			       ret, args.devtx_sb_retval, args.devtx_cp_retval);
 	}
 
 	for (i = 0; i < rxq; i++)
@@ -404,10 +505,22 @@ static void timestamping_enable(int fd, int val)
 		error(1, errno, "setsockopt(SO_TIMESTAMPING)");
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+
+	printf("got tx timestamp sample %u %llu\n",
+	       sample->timestamp_retval, sample->timestamp);
+
+	return 0;
+}
+
 int main(int argc, char *argv[])
 {
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	clockid_t clock_id = CLOCK_TAI;
 	int server_fd = -1;
+	int syscall_fd;
 	int ret;
 	int i;
 
@@ -448,11 +561,26 @@ int main(int argc, char *argv[])
 	bpf_program__set_ifindex(prog, ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "devtx_sb");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	bpf_program__set_autoattach(prog, false);
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "devtx_cp");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	bpf_program__set_autoattach(prog, false);
+
 	printf("load bpf program...\n");
 	ret = xdp_hw_metadata__load(bpf_obj);
 	if (ret)
 		error(1, -ret, "xdp_hw_metadata__load");
 
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, NULL, NULL);
+	if (libbpf_get_error(tx_compl_ringbuf))
+		error(1, -libbpf_get_error(tx_compl_ringbuf), "ring_buffer__new");
+
 	printf("prepare skb endpoint...\n");
 	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
 	if (server_fd < 0)
@@ -472,15 +600,37 @@ int main(int argc, char *argv[])
 			error(1, -ret, "bpf_map_update_elem");
 	}
 
-	printf("attach bpf program...\n");
+	printf("attach rx bpf program...\n");
 	ret = bpf_xdp_attach(ifindex,
 			     bpf_program__fd(bpf_obj->progs.rx),
 			     XDP_FLAGS, NULL);
 	if (ret)
 		error(1, -ret, "bpf_xdp_attach");
 
+	printf("attach tx bpf programs...\n");
+	struct devtx_attach_args args = {
+		.ifindex = ifindex,
+		.devtx_sb_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_sb),
+		.devtx_cp_prog_fd = bpf_program__fd(bpf_obj->progs.devtx_cp),
+		.devtx_sb_retval = -1,
+		.devtx_cp_retval = -1,
+	};
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, tattr,
+		.ctx_in = &args,
+		.ctx_size_in = sizeof(args),
+	);
+
+	syscall_fd = bpf_program__fd(bpf_obj->progs.attach_prog);
+	ret = bpf_prog_test_run_opts(syscall_fd, &tattr);
+	if (ret)
+		error(1, -ret, "bpf_prog_test_run_opts");
+	if (args.devtx_sb_retval < 0)
+		error(1, args.devtx_sb_retval, "devtx_sb_retval");
+	if (args.devtx_cp_retval < 0)
+		error(1, args.devtx_cp_retval, "devtx_cp_retval");
+
 	signal(SIGINT, handle_signal);
-	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id);
+	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id, tx_compl_ringbuf);
 	close(server_fd);
 	cleanup();
 	if (ret)
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-12 17:23 [RFC bpf-next 0/7] bpf: netdev TX metadata Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2023-06-12 17:23 ` [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata " Stanislav Fomichev
@ 2023-06-12 21:00 ` Toke Høiland-Jørgensen
  2023-06-13 16:32   ` Stanislav Fomichev
  2023-06-16  0:09   ` Stanislav Fomichev
  2023-06-14  3:31 ` Jakub Kicinski
  8 siblings, 2 replies; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-12 21:00 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, willemb, dsahern, magnus.karlsson,
	bjorn, maciej.fijalkowski, netdev

Some immediate thoughts after glancing through this:

> --- Use cases ---
>
> The goal of this series is to add two new standard-ish places
> in the transmit path:
>
> 1. Right before the packet is transmitted (with access to TX
>    descriptors)
> 2. Right after the packet is actually transmitted and we've received the
>    completion (again, with access to TX completion descriptors)
>
> Accessing TX descriptors unlocks the following use-cases:
>
> - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> use device offloads. The existing case implements TX timestamp.
> - Observability: global per-netdev hooks can be used for tracing
> the packets and exploring completion descriptors for all sorts of
> device errors.
>
> Accessing TX descriptors also means that the hooks have to be called
> from the drivers.
>
> The hooks are a light-weight alternative to XDP at egress and currently
> don't provide any packet modification abilities. However, eventually,
> can expose new kfuncs to operate on the packet (or, rather, the actual
> descriptors; for performance sake).

dynptr?

> --- UAPI ---
>
> The hooks are implemented in a HID-BPF style. Meaning they don't
> expose any UAPI and are implemented as tracing programs that call
> a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> programs. The series expands device-bound infrastructure to tracing
> programs.

Not a fan of the "attach from BPF syscall program" thing. These are part
of the XDP data path API, and I think we should expose them as proper
bpf_link attachments from userspace with introspection etc. But I guess
the bpf_mprog thing will give us that?

> --- skb vs xdp ---
>
> The hooks operate on a new light-weight devtx_frame which contains:
> - data
> - len
> - sinfo
>
> This should allow us to have a unified (from BPF POV) place at TX
> and not be super-taxing (we need to copy 2 pointers + len to the stack
> for each invocation).

Not sure what I think about this one. At the very least I think we
should expose xdp->data_meta as well. I'm not sure what the use case for
accessing skbs is? If that *is* indeed useful, probably there will also
end up being a use case for accessing the full skb?

> --- Multiprog attachment ---
>
> Currently, attach/detach don't expose links and don't support multiple
> programs. I'm planning to use Daniel's bpf_mprog once it lands.
>
> --- TODO ---
>
> Things that I'm planning to do for the non-RFC series:
> - have some real device support to verify xdp_hw_metadata works

Would be good to see some performance numbers as well :)

> - freplace
> - Documentation/networking/xdp-rx-metadata.rst - like documentation
>
> --- CC ---
>
> CC'ing people only on the cover letter. Hopefully can find the rest via
> lore.

Well, I found it there, even though I was apparently left off the Cc
list :(

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 6/7] selftests/bpf: extend xdp_metadata with devtx kfuncs
  2023-06-12 17:23 ` [RFC bpf-next 6/7] selftests/bpf: extend xdp_metadata with devtx kfuncs Stanislav Fomichev
@ 2023-06-13 14:47   ` Willem de Bruijn
  2023-06-13 19:00     ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2023-06-13 14:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 12, 2023 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Attach kfuncs that request and report TX timestamp via ringbuf.
> Confirm on the userspace side that the program has triggered
> and the timestamp is non-zero.
>
> Also make sure devtx_frame has sensible pointers and data.
>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

> +SEC("fentry/devtx_sb")
> +int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
> +{
> +       int ret;
> +
> +       ret = verify_frame(frame);
> +       if (ret < 0)
> +               __sync_add_and_fetch(&pkts_fail_tx, 1);

intend to return in these error cases? (in this and the following patch)
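
e.g. (untested sketch, bumping the counter and bailing out early):

	ret = verify_frame(frame);
	if (ret < 0) {
		__sync_add_and_fetch(&pkts_fail_tx, 1);
		return 0;
	}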

> +
> +       ret = bpf_devtx_sb_request_timestamp(frame);
> +       if (ret < 0)
> +               __sync_add_and_fetch(&pkts_fail_tx, 1);
> +
> +       return 0;
> +}
> +
> +SEC("fentry/devtx_cp")
> +int BPF_PROG(devtx_cp, const struct devtx_frame *frame)
> +{
> +       struct devtx_sample *sample;
> +       int ret;
> +
> +       ret = verify_frame(frame);
> +       if (ret < 0)
> +               __sync_add_and_fetch(&pkts_fail_tx, 1);
> +
> +       sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
> +       if (!sample)
> +               return 0;

return non-zero?

> +
> +       sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
> +
> +       bpf_ringbuf_submit(sample, 0);
> +
> +       return 0;
> +}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-12 17:23 ` [RFC bpf-next 3/7] bpf: implement devtx hook points Stanislav Fomichev
@ 2023-06-13 14:54   ` Willem de Bruijn
  2023-06-13 19:00     ` Stanislav Fomichev
  2023-06-13 15:08   ` Simon Horman
  2023-06-16  5:46   ` Kui-Feng Lee
  2 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2023-06-13 14:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 12, 2023 at 7:24 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> devtx is a lightweight set of hooks before and after packet transmission.
> The hook is supposed to work for both skb and xdp paths by exposing
> a light-weight packet wrapper via devtx_frame (header portion + frags).
>
> devtx is implemented as a tracing program which has access to the
> XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
> in the next patch, but the idea is similar to XDP metadata:
> the kfuncs have netdev-specific implementation, but common
> interface. Upon loading, the kfuncs are resolved to direct
> calls against per-netdev implementation. This can be achieved
> by marking devtx-tracing programs as dev-bound (largely
> reusing xdp-dev-bound program infrastructure).
>
> Attachment and detachment is implemented via syscall BPF program
> by calling bpf_devtx_sb_attach (attach to tx-submission)
> or bpf_devtx_cp_attach (attach to tx completion). Right now,
> the attachment does not return a link and doesn't support
> multiple programs. I plan to switch to Daniel's bpf_mprog infra
> once it's available.
>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>


> @@ -2238,6 +2238,8 @@ struct net_device {
>         unsigned int            real_num_rx_queues;
>
>         struct bpf_prog __rcu   *xdp_prog;
> +       struct bpf_prog __rcu   *devtx_sb;
> +       struct bpf_prog __rcu   *devtx_cp;

nit/subjective: non-obvious two-letter acronyms are hard to parse. How
about tx and txc (or txcomp)?

> +static int __bpf_devtx_attach(struct net_device *netdev, int prog_fd,
> +                             const char *attach_func_name, struct bpf_prog **pprog)
> +{
> +       struct bpf_prog *prog;
> +       int ret = 0;
> +
> +       if (prog_fd < 0)
> +               return __bpf_devtx_detach(netdev, pprog);
> +
> +       if (*pprog)
> +               return -EBUSY;
> +
> +       prog = bpf_prog_get(prog_fd);
> +       if (IS_ERR(prog))
> +               return PTR_ERR(prog);
> +
> +       if (prog->type != BPF_PROG_TYPE_TRACING ||
> +           prog->expected_attach_type != BPF_TRACE_FENTRY ||
> +           !bpf_prog_is_dev_bound(prog->aux) ||
> +           !bpf_offload_dev_match(prog, netdev) ||
> +           strcmp(prog->aux->attach_func_name, attach_func_name)) {
> +               bpf_prog_put(prog);
> +               return -EINVAL;
> +       }
> +
> +       *pprog = prog;
> +       static_branch_inc(&devtx_enabled);
> +
> +       return ret;

nit: just return 0, no variable needed

> +}
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +                 "Global functions as their definitions will be in vmlinux BTF");
> +
> +/**
> + * bpf_devtx_sb_attach - Attach devtx 'packet submit' program
> + * @ifindex: netdev interface index.
> + * @prog_fd: BPF program file descriptor.
> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + */
> +__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
> +{
> +       struct net_device *netdev;
> +       int ret;
> +
> +       netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> +       if (!netdev)
> +               return -EINVAL;
> +
> +       mutex_lock(&devtx_attach_lock);
> +       ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_sb", &netdev->devtx_sb);
> +       mutex_unlock(&devtx_attach_lock);
> +
> +       dev_put(netdev);
> +
> +       return ret;
> +}
> +
> +/**
> + * bpf_devtx_cp_attach - Attach devtx 'packet complete' program
> + * @ifindex: netdev interface index.
> + * @prog_fd: BPF program file descriptor.
> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + */
> +__bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
> +{
> +       struct net_device *netdev;
> +       int ret;
> +
> +       netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> +       if (!netdev)
> +               return -EINVAL;
> +
> +       mutex_lock(&devtx_attach_lock);
> +       ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_cp", &netdev->devtx_cp);
> +       mutex_unlock(&devtx_attach_lock);
> +
> +       dev_put(netdev);
> +
> +       return ret;
> +}

These two functions are near duplicates, aside from the arguments to
their inner call to __bpf_devtx_attach. Can be dedup-ed further?
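
For example, a common wrapper that picks the right program slot
(untested sketch, reusing the helpers from this patch):

	static int bpf_devtx_attach_one(int ifindex, int prog_fd,
					const char *attach_func_name,
					bool submit)
	{
		struct net_device *netdev;
		int ret;

		netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
		if (!netdev)
			return -EINVAL;

		mutex_lock(&devtx_attach_lock);
		ret = __bpf_devtx_attach(netdev, prog_fd, attach_func_name,
					 submit ? &netdev->devtx_sb :
						  &netdev->devtx_cp);
		mutex_unlock(&devtx_attach_lock);

		dev_put(netdev);

		return ret;
	}

with bpf_devtx_sb_attach() and bpf_devtx_cp_attach() reduced to
one-line calls into it.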

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata with devtx kfuncs
  2023-06-12 17:23 ` [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata " Stanislav Fomichev
@ 2023-06-13 15:03   ` Willem de Bruijn
  2023-06-13 19:00     ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2023-06-13 15:03 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 12, 2023 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> When we get packets on port 9091, we swap src/dst and send it out.
> At this point, we also request the timestamp and plumb it back
> to the userspace. The userspace simply prints the timestamp.
>
> Haven't really tested, still working on mlx5 patches...
>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>


> +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> @@ -10,7 +10,8 @@
>   *   - rx_hash
>   *
>   * TX:
> - * - TBD
> + * - UDP 9091 packets trigger TX reply

This branch on port is missing?
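
e.g. an early bail-out in ping_pong() once udph is known (untested
sketch):

	if (udph->dest != htons(9091)) {
		xsk_ring_prod__cancel(&xsk->tx, 1);
		return;
	}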

> +static void ping_pong(struct xsk *xsk, void *rx_packet)
> +{
> +       struct ipv6hdr *ip6h = NULL;
> +       struct iphdr *iph = NULL;
> +       struct xdp_desc *tx_desc;
> +       struct udphdr *udph;
> +       struct ethhdr *eth;
> +       void *data;
> +       __u32 idx;
> +       int ret;
> +       int len;
> +
> +       ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> +       if (ret != 1) {
> +               printf("%p: failed to reserve tx slot\n", xsk);
> +               return;
> +       }
> +
> +       tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> +       tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
> +       data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> +
> +       eth = data;
> +
> +       if (eth->h_proto == htons(ETH_P_IP)) {
> +               iph = (void *)(eth + 1);
> +               udph = (void *)(iph + 1);
> +       } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> +               ip6h = (void *)(eth + 1);
> +               udph = (void *)(ip6h + 1);
> +       } else {
> +               xsk_ring_prod__cancel(&xsk->tx, 1);
> +               return;
> +       }
> +
> +       len = ETH_HLEN;
> +       if (ip6h)
> +               len += ntohs(ip6h->payload_len);

sizeof(*ip6h) + ntohs(ip6h->payload_len) ?

> +       if (iph)
> +               len += ntohs(iph->tot_len);
> +
> +       memcpy(data, rx_packet, len);
> +       swap(eth->h_dest, eth->h_source, ETH_ALEN);
> +       if (iph)
> +               swap(&iph->saddr, &iph->daddr, 4);

need to recompute ipv4 checksum?
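
(A pure saddr/daddr swap keeps the 16-bit one's complement sum the
same, so the existing checksum should still be valid after the swap.
If other header fields were rewritten, a minimal userspace recompute
could look like this sketch:)

	/* RFC 1071 sum over the IPv4 header; assumes no options (ihl == 5)
	 * and that iph->check has been zeroed by the caller.
	 */
	static __u16 ipv4_csum(const struct iphdr *iph)
	{
		const __u16 *p = (const __u16 *)iph;
		__u32 sum = 0;
		int i;

		for (i = 0; i < sizeof(*iph) / 2; i++)
			sum += p[i];
		while (sum >> 16)
			sum = (sum & 0xffff) + (sum >> 16);
		return ~sum;
	}

	iph->check = 0;
	iph->check = ipv4_csum(iph);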

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-12 17:23 ` [RFC bpf-next 3/7] bpf: implement devtx hook points Stanislav Fomichev
  2023-06-13 14:54   ` Willem de Bruijn
@ 2023-06-13 15:08   ` Simon Horman
  2023-06-13 19:00     ` Stanislav Fomichev
  2023-06-16  5:46   ` Kui-Feng Lee
  2 siblings, 1 reply; 47+ messages in thread
From: Simon Horman @ 2023-06-13 15:08 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 12, 2023 at 10:23:03AM -0700, Stanislav Fomichev wrote:
> devtx is a lightweight set of hooks before and after packet transmission.
> The hook is supposed to work for both skb and xdp paths by exposing
> a light-weight packet wrapper via devtx_frame (header portion + frags).
> 
> devtx is implemented as a tracing program which has access to the
> XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
> in the next patch, but the idea is similar to XDP metadata:
> the kfuncs have netdev-specific implementation, but common
> interface. Upon loading, the kfuncs are resolved to direct
> calls against per-netdev implementation. This can be achieved
> by marking devtx-tracing programs as dev-bound (largely
> reusing xdp-dev-bound program infrastructure).
> 
> Attachment and detachment is implemented via syscall BPF program
> by calling bpf_devtx_sb_attach (attach to tx-submission)
> or bpf_devtx_cp_attach (attach to tx completion). Right now,
> the attachment does not return a link and doesn't support
> multiple programs. I plan to switch to Daniel's bpf_mprog infra
> once it's available.
> 
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

...

> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22976,11 +22976,13 @@ L:	bpf@vger.kernel.org
>  S:	Supported
>  F:	drivers/net/ethernet/*/*/*/*/*xdp*
>  F:	drivers/net/ethernet/*/*/*xdp*
> +F:	include/net/devtx.h
>  F:	include/net/xdp.h
>  F:	include/net/xdp_priv.h
>  F:	include/trace/events/xdp.h
>  F:	kernel/bpf/cpumap.c
>  F:	kernel/bpf/devmap.c
> +F:	net/core/devtx.c
>  F:	net/core/xdp.c
>  F:	samples/bpf/xdp*
>  F:	tools/testing/selftests/bpf/*/*xdp*

Hi Stan,

some feedback from my side.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 08fbd4622ccf..e08e3fd39dfc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2238,6 +2238,8 @@ struct net_device {
>  	unsigned int		real_num_rx_queues;
>  
>  	struct bpf_prog __rcu	*xdp_prog;
> +	struct bpf_prog	__rcu	*devtx_sb;
> +	struct bpf_prog	__rcu	*devtx_cp;

It would be good to add these new fields to the kernel doc
for struct net_device.

>  	unsigned long		gro_flush_timeout;
>  	int			napi_defer_hard_irqs;
>  #define GRO_LEGACY_MAX_SIZE	65536u
> diff --git a/include/net/devtx.h b/include/net/devtx.h
> new file mode 100644
> index 000000000000..7eab66d0ce80
> --- /dev/null
> +++ b/include/net/devtx.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __LINUX_NET_DEVTX_H__
> +#define __LINUX_NET_DEVTX_H__
> +
> +#include <linux/jump_label.h>
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <net/xdp.h>
> +
> +struct devtx_frame {
> +	void *data;
> +	u16 len;
> +	struct skb_shared_info *sinfo; /* for frags */
> +};
> +
> +#ifdef CONFIG_NET
> +void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx);
> +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx);
> +bool is_devtx_kfunc(u32 kfunc_id);
> +void devtx_shutdown(struct net_device *netdev);
> +
> +static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
> +{
> +	ctx->data = skb->data;
> +	ctx->len = skb_headlen(skb);
> +	ctx->sinfo = skb_shinfo(skb);
> +}
> +
> +static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
> +{
> +	ctx->data = xdpf->data;
> +	ctx->len = xdpf->len;
> +	ctx->sinfo = xdp_frame_has_frags(xdpf) ? xdp_get_shared_info_from_frame(xdpf) : NULL;
> +}
> +
> +DECLARE_STATIC_KEY_FALSE(devtx_enabled);
> +
> +static inline bool devtx_submit_enabled(struct net_device *netdev)
> +{
> +	return static_branch_unlikely(&devtx_enabled) &&
> +	       rcu_access_pointer(netdev->devtx_sb);
> +}
> +
> +static inline bool devtx_complete_enabled(struct net_device *netdev)
> +{
> +	return static_branch_unlikely(&devtx_enabled) &&
> +	       rcu_access_pointer(netdev->devtx_cp);
> +}
> +#else
> +static inline void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
> +{
> +}
> +
> +static inline void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> +{
> +}
> +
> +static inline bool is_devtx_kfunc(u32 kfunc_id)
> +{
> +	return false;
> +}
> +
> +static inline void devtx_shutdown(struct net_device *netdev)
> +{
> +}
> +
> +static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
> +{
> +}
> +
> +static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
> +{
> +}
> +#endif
> +
> +#endif /* __LINUX_NET_DEVTX_H__ */
> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> index 235d81f7e0ed..9cfe96422c80 100644
> --- a/kernel/bpf/offload.c
> +++ b/kernel/bpf/offload.c
> @@ -25,6 +25,7 @@
>  #include <linux/rhashtable.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/rwsem.h>
> +#include <net/devtx.h>
>  
>  /* Protects offdevs, members of bpf_offload_netdev and offload members
>   * of all progs.
> @@ -228,6 +229,7 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
>  	int err;
>  
>  	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
> +	    attr->prog_type != BPF_PROG_TYPE_TRACING &&
>  	    attr->prog_type != BPF_PROG_TYPE_XDP)
>  		return -EINVAL;
>  
> @@ -238,6 +240,10 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
>  	    attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY)
>  		return -EINVAL;
>  
> +	if (attr->prog_type == BPF_PROG_TYPE_TRACING &&
> +	    !is_devtx_kfunc(prog->aux->attach_btf_id))
> +		return -EINVAL;
> +
>  	netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex);
>  	if (!netdev)
>  		return -EINVAL;
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 8f367813bc68..c1db05ccfac7 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -39,4 +39,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
>  obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
>  obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
> +obj-$(CONFIG_BPF_SYSCALL) += devtx.o
>  obj-$(CONFIG_OF)	+= of_net.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3393c2f3dbe8..ef0e65e68024 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -150,6 +150,7 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/prandom.h>
>  #include <linux/once_lite.h>
> +#include <net/devtx.h>
>  
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -10875,6 +10876,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
>  		dev_shutdown(dev);
>  
>  		dev_xdp_uninstall(dev);
> +		devtx_shutdown(dev);
>  		bpf_dev_bound_netdev_unregister(dev);
>  
>  		netdev_offload_xstats_disable_all(dev);
> diff --git a/net/core/devtx.c b/net/core/devtx.c
> new file mode 100644
> index 000000000000..b7cbc26d1c01
> --- /dev/null
> +++ b/net/core/devtx.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <net/devtx.h>
> +#include <linux/filter.h>
> +
> +DEFINE_STATIC_KEY_FALSE(devtx_enabled);
> +EXPORT_SYMBOL_GPL(devtx_enabled);
> +
> +static void devtx_run(struct net_device *netdev, struct devtx_frame *ctx, struct bpf_prog **pprog)

Is an __rcu annotation appropriate for prog here?
Also elsewhere in this patch.

> +{
> +	struct bpf_prog *prog;
> +	void *real_ctx[1] = {ctx};
> +
> +	prog = rcu_dereference(*pprog);
> +	if (likely(prog))
> +		bpf_prog_run(prog, real_ctx);
> +}
> +
> +void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
> +{
> +	rcu_read_lock();
> +	devtx_run(netdev, ctx, &netdev->devtx_sb);
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(devtx_submit);
> +
> +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> +{
> +	rcu_read_lock();
> +	devtx_run(netdev, ctx, &netdev->devtx_cp);
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(devtx_complete);
> +
> +/**
> + * devtx_sb - Called for every egress netdev packet

As this is a kernel doc, it would be good to document the ctx parameter here.
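
e.g. (sketch):

	/**
	 * devtx_sb - Called for every egress netdev packet
	 * @ctx: devtx_frame context (head data + frags) of the packet being sent
	 *
	 * Note: this function is never actually called by the kernel and declared
	 * only to allow loading and attaching appropriate tracepoints.
	 */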

> + *
> + * Note: this function is never actually called by the kernel and declared
> + * only to allow loading and attaching appropriate tracepoints.
> + */
> +__weak noinline void devtx_sb(struct devtx_frame *ctx)

I guess this is intentional.
But gcc complains that this is neither static nor is a forward
declaration provided. Likewise for devtx_cp()

> +{
> +}
> +
> +/**
> + * devtx_cp - Called upon egress netdev packet completion

Likewise, here too.

> + *
> + * Note: this function is never actually called by the kernel and declared
> + * only to allow loading and attaching appropriate tracepoints.
> + */
> +__weak noinline void devtx_cp(struct devtx_frame *ctx)
> +{
> +}
> +
> +BTF_SET8_START(bpf_devtx_hook_ids)
> +BTF_ID_FLAGS(func, devtx_sb)
> +BTF_ID_FLAGS(func, devtx_cp)
> +BTF_SET8_END(bpf_devtx_hook_ids)
> +
> +static const struct btf_kfunc_id_set bpf_devtx_hook_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &bpf_devtx_hook_ids,
> +};
> +
> +static DEFINE_MUTEX(devtx_attach_lock);
> +
> +static int __bpf_devtx_detach(struct net_device *netdev, struct bpf_prog **pprog)
> +{

As per my prior comment about *prog and __rcu annotations.
I'm genuinely unsure how the usage of **pprog in this function sits with RCU.

> +	if (!*pprog)
> +		return -EINVAL;
> +	bpf_prog_put(*pprog);
> +	*pprog = NULL;
> +
> +	static_branch_dec(&devtx_enabled);
> +	return 0;
> +}
> +
> +static int __bpf_devtx_attach(struct net_device *netdev, int prog_fd,
> +			      const char *attach_func_name, struct bpf_prog **pprog)
> +{
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	if (prog_fd < 0)
> +		return __bpf_devtx_detach(netdev, pprog);
> +
> +	if (*pprog)
> +		return -EBUSY;
> +
> +	prog = bpf_prog_get(prog_fd);
> +	if (IS_ERR(prog))
> +		return PTR_ERR(prog);
> +
> +	if (prog->type != BPF_PROG_TYPE_TRACING ||
> +	    prog->expected_attach_type != BPF_TRACE_FENTRY ||
> +	    !bpf_prog_is_dev_bound(prog->aux) ||
> +	    !bpf_offload_dev_match(prog, netdev) ||
> +	    strcmp(prog->aux->attach_func_name, attach_func_name)) {
> +		bpf_prog_put(prog);
> +		return -EINVAL;
> +	}
> +
> +	*pprog = prog;
> +	static_branch_inc(&devtx_enabled);
> +
> +	return ret;
> +}

...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 4/7] bpf: implement devtx timestamp kfunc
  2023-06-12 17:23 ` [RFC bpf-next 4/7] bpf: implement devtx timestamp kfunc Stanislav Fomichev
@ 2023-06-13 15:14   ` Simon Horman
  2023-06-13 18:39     ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Simon Horman @ 2023-06-13 15:14 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 12, 2023 at 10:23:04AM -0700, Stanislav Fomichev wrote:

...

> +/**
> + * bpf_devtx_cp_timestamp - Read TX timestamp of the packet. Callable
> + * only from the devtx-complete hook.
> + * @ctx: devtx context pointer.

Hi Stan,

one minor nit: documentation of the timestamp parameter should go here.

> + *
> + * Returns 0 on success or ``-errno`` on error.
> + */
> +__bpf_kfunc int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  __diag_pop();

...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-12 21:00 ` [RFC bpf-next 0/7] bpf: netdev TX metadata Toke Høiland-Jørgensen
@ 2023-06-13 16:32   ` Stanislav Fomichev
  2023-06-13 17:18     ` Toke Høiland-Jørgensen
  2023-06-16  0:09   ` Stanislav Fomichev
  1 sibling, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 16:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On 06/12, Toke Høiland-Jørgensen wrote:
> Some immediate thoughts after glancing through this:
> 
> > --- Use cases ---
> >
> > The goal of this series is to add two new standard-ish places
> > in the transmit path:
> >
> > 1. Right before the packet is transmitted (with access to TX
> >    descriptors)
> > 2. Right after the packet is actually transmitted and we've received the
> >    completion (again, with access to TX completion descriptors)
> >
> > Accessing TX descriptors unlocks the following use-cases:
> >
> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > use device offloads. The existing case implements TX timestamp.
> > - Observability: global per-netdev hooks can be used for tracing
> > the packets and exploring completion descriptors for all sorts of
> > device errors.
> >
> > Accessing TX descriptors also means that the hooks have to be called
> > from the drivers.
> >
> > The hooks are a light-weight alternative to XDP at egress and currently
> > don't provide any packet modification abilities. However, eventually,
> > can expose new kfuncs to operate on the packet (or, rather, the actual
> > descriptors; for performance sake).
> 
> dynptr?

Haven't considered, let me explore, but not sure what it buys us
here?

> > --- UAPI ---
> >
> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > expose any UAPI and are implemented as tracing programs that call
> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > programs. The series expands device-bound infrastructure to tracing
> > programs.
> 
> Not a fan of the "attach from BPF syscall program" thing. These are part
> of the XDP data path API, and I think we should expose them as proper
> bpf_link attachments from userspace with introspection etc. But I guess
> the bpf_mprog thing will give us that?

bpf_mprog will just make those attach kfuncs return the link fd. The
syscall program will still stay :-(

> > --- skb vs xdp ---
> >
> > The hooks operate on a new light-weight devtx_frame which contains:
> > - data
> > - len
> > - sinfo
> >
> > This should allow us to have a unified (from BPF POV) place at TX
> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > for each invocation).
> 
> Not sure what I think about this one. At the very least I think we
> should expose xdp->data_meta as well. I'm not sure what the use case for
> accessing skbs is? If that *is* indeed useful, probably there will also
> end up being a use case for accessing the full skb?

skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
a good opportunity to unify? Or probably won't work, because if an
xdp_frame doesn't have frags, it won't have sinfo?

> > --- Multiprog attachment ---
> >
> > Currently, attach/detach don't expose links and don't support multiple
> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
> >
> > --- TODO ---
> >
> > Things that I'm planning to do for the non-RFC series:
> > - have some real device support to verify xdp_hw_metadata works
> 
> Would be good to see some performance numbers as well :)

+1 :-)

> > - freplace
> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
> >
> > --- CC ---
> >
> > CC'ing people only on the cover letter. Hopefully can find the rest via
> > lore.
> 
> Well, I found it there, even though I was apparently left off the Cc
> list :(
> 
> -Toke

Sure, I'll CC you explicitly next time! But I know you diligently follow bpf
list, so decided to explicitly cc mostly netdev folks that might miss
it otherwise.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 16:32   ` Stanislav Fomichev
@ 2023-06-13 17:18     ` Toke Høiland-Jørgensen
  2023-06-13 18:39       ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-13 17:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On 06/12, Toke Høiland-Jørgensen wrote:
>> Some immediate thoughts after glancing through this:
>> 
>> > --- Use cases ---
>> >
>> > The goal of this series is to add two new standard-ish places
>> > in the transmit path:
>> >
>> > 1. Right before the packet is transmitted (with access to TX
>> >    descriptors)
>> > 2. Right after the packet is actually transmitted and we've received the
>> >    completion (again, with access to TX completion descriptors)
>> >
>> > Accessing TX descriptors unlocks the following use-cases:
>> >
>> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
>> > use device offloads. The existing case implements TX timestamp.
>> > - Observability: global per-netdev hooks can be used for tracing
>> > the packets and exploring completion descriptors for all sorts of
>> > device errors.
>> >
>> > Accessing TX descriptors also means that the hooks have to be called
>> > from the drivers.
>> >
>> > The hooks are a light-weight alternative to XDP at egress and currently
>> > don't provide any packet modification abilities. However, eventually,
>> > can expose new kfuncs to operate on the packet (or, rather, the actual
>> > descriptors; for performance sake).
>> 
>> dynptr?
>
> Haven't considered, let me explore, but not sure what it buys us
> here?

API consistency, certainly. Possibly also performance, if using the
slice thing that gets you a direct pointer to the pkt data? Not sure
about that, though, haven't done extensive benchmarking of dynptr yet...

>> > --- UAPI ---
>> >
>> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> > expose any UAPI and are implemented as tracing programs that call
>> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> > programs. The series expands device-bound infrastructure to tracing
>> > programs.
>> 
>> Not a fan of the "attach from BPF syscall program" thing. These are part
>> of the XDP data path API, and I think we should expose them as proper
>> bpf_link attachments from userspace with introspection etc. But I guess
>> the bpf_mprog thing will give us that?
>
> bpf_mprog will just make those attach kfuncs return the link fd. The
> syscall program will still stay :-(

Why does the attachment have to be done this way, exactly? Couldn't we
just use the regular bpf_link attachment from userspace? AFAICT it's not
really piggy-backing on the function override thing anyway when the
attachment is per-dev? Or am I misunderstanding how all this works?

>> > --- skb vs xdp ---
>> >
>> > The hooks operate on a new light-weight devtx_frame which contains:
>> > - data
>> > - len
>> > - sinfo
>> >
> >> > This should allow us to have a unified (from BPF POV) place at TX
>> > and not be super-taxing (we need to copy 2 pointers + len to the stack
>> > for each invocation).
>> 
>> Not sure what I think about this one. At the very least I think we
>> should expose xdp->data_meta as well. I'm not sure what the use case for
>> accessing skbs is? If that *is* indeed useful, probably there will also
>> end up being a use case for accessing the full skb?
>
> skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
> a good opportunity to unify? Or probably won't work, because if an
> xdp_frame doesn't have frags, it won't have sinfo?

No, it won't. But why do we need this unification between the skb and
xdp paths in the first place? Doesn't the skb path already have support
for these things? Seems like we could just stick to making this xdp-only
and keeping xdp_frame as the ctx argument?

>> > --- Multiprog attachment ---
>> >
>> > Currently, attach/detach don't expose links and don't support multiple
>> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
>> >
>> > --- TODO ---
>> >
>> > Things that I'm planning to do for the non-RFC series:
>> > - have some real device support to verify xdp_hw_metadata works
>> 
>> Would be good to see some performance numbers as well :)
>
> +1 :-)
>
>> > - freplace
>> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
>> >
>> > --- CC ---
>> >
>> > CC'ing people only on the cover letter. Hopefully can find the rest via
>> > lore.
>> 
>> Well, I found it there, even though I was apparently left off the Cc
>> list :(
>> 
>> -Toke
>
> Sure, I'll CC you explicitly next time! But I know you diligently follow bpf
> list, so decided to explicitly cc mostly netdev folks that might miss
> it otherwise.

Haha, fair point! And no big deal, I did obviously see it. I was just
feeling a bit left out, that's all ;)

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 17:18     ` Toke Høiland-Jørgensen
@ 2023-06-13 18:39       ` Stanislav Fomichev
  2023-06-13 19:10         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 18:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On 06/12, Toke Høiland-Jørgensen wrote:
> >> Some immediate thoughts after glancing through this:
> >>
> >> > --- Use cases ---
> >> >
> >> > The goal of this series is to add two new standard-ish places
> >> > in the transmit path:
> >> >
> >> > 1. Right before the packet is transmitted (with access to TX
> >> >    descriptors)
> >> > 2. Right after the packet is actually transmitted and we've received the
> >> >    completion (again, with access to TX completion descriptors)
> >> >
> >> > Accessing TX descriptors unlocks the following use-cases:
> >> >
> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> >> > use device offloads. The existing case implements TX timestamp.
> >> > - Observability: global per-netdev hooks can be used for tracing
> >> > the packets and exploring completion descriptors for all sorts of
> >> > device errors.
> >> >
> >> > Accessing TX descriptors also means that the hooks have to be called
> >> > from the drivers.
> >> >
> >> > The hooks are a light-weight alternative to XDP at egress and currently
> >> > don't provide any packet modification abilities. However, eventually,
> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
> >> > descriptors; for performance sake).
> >>
> >> dynptr?
> >
> > Haven't considered, let me explore, but not sure what it buys us
> > here?
>
> API consistency, certainly. Possibly also performance, if using the
> slice thing that gets you a direct pointer to the pkt data? Not sure
> about that, though, haven't done extensive benchmarking of dynptr yet...

Same. Let's keep it on the table, I'll try to explore. I was just
thinking that having less abstraction here might be better
performance-wise.
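
Something like the following is what I'd imagine (sketch;
bpf_dynptr_from_devtx() and the pkts_ipv4_tx counter are hypothetical,
bpf_dynptr_slice() is the existing kfunc):

	SEC("fentry/devtx_sb")
	int BPF_PROG(devtx_sb_dynptr, const struct devtx_frame *frame)
	{
		struct ethhdr *eth, eth_buf;
		struct bpf_dynptr ptr;

		if (bpf_dynptr_from_devtx(frame, &ptr)) /* hypothetical kfunc */
			return 0;

		/* direct pointer into the frame, or a copy into eth_buf */
		eth = bpf_dynptr_slice(&ptr, 0, &eth_buf, sizeof(eth_buf));
		if (!eth)
			return 0;

		if (eth->h_proto == bpf_htons(ETH_P_IP))
			__sync_add_and_fetch(&pkts_ipv4_tx, 1);

		return 0;
	}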

> >> > --- UAPI ---
> >> >
> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> >> > expose any UAPI and are implemented as tracing programs that call
> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> >> > programs. The series expands device-bound infrastructure to tracing
> >> > programs.
> >>
> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> >> of the XDP data path API, and I think we should expose them as proper
> >> bpf_link attachments from userspace with introspection etc. But I guess
> >> the bpf_mprog thing will give us that?
> >
> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > syscall program will still stay :-(
>
> Why does the attachment have to be done this way, exactly? Couldn't we
> just use the regular bpf_link attachment from userspace? AFAICT it's not
> really piggy-backing on the function override thing anyway when the
> attachment is per-dev? Or am I misunderstanding how all this works?

It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
us an opportunity to fix things.
We can do it via a regular syscall path if there is a consensus.

> >> > --- skb vs xdp ---
> >> >
> >> > The hooks operate on a new light-weight devtx_frame which contains:
> >> > - data
> >> > - len
> >> > - sinfo
> >> >
> >> > This should allow us to have a unified (from BPF POV) place at TX
> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> >> > for each invocation).
> >>
> >> Not sure what I think about this one. At the very least I think we
> >> should expose xdp->data_meta as well. I'm not sure what the use case for
> >> accessing skbs is? If that *is* indeed useful, probably there will also
> >> end up being a use case for accessing the full skb?
> >
> > skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
> > a good opportunity to unify? Or probably won't work, because if an
> > xdp_frame doesn't have frags, it won't have sinfo?
>
> No, it won't. But why do we need this unification between the skb and
> xdp paths in the first place? Doesn't the skb path already have support
> for these things? Seems like we could just stick to making this xdp-only
> and keeping xdp_frame as the ctx argument?

For skb path, I'm assuming we can read sinfo->meta_len; it feels nice
to make it work for both cases?
We can always export metadata len via some kfunc, sure.
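
e.g. following the pattern of the other devtx kfuncs (sketch;
bpf_devtx_sb_meta_len() is made up):

	/* default stub, resolved to a per-netdev implementation on load */
	__bpf_kfunc int bpf_devtx_sb_meta_len(const struct devtx_frame *ctx)
	{
		return -EOPNOTSUPP;
	}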

> >> > --- Multiprog attachment ---
> >> >
> >> > Currently, attach/detach don't expose links and don't support multiple
> >> > programs. I'm planning to use Daniel's bpf_mprog once it lands.
> >> >
> >> > --- TODO ---
> >> >
> >> > Things that I'm planning to do for the non-RFC series:
> >> > - have some real device support to verify xdp_hw_metadata works
> >>
> >> Would be good to see some performance numbers as well :)
> >
> > +1 :-)
> >
> >> > - freplace
> >> > - Documentation/networking/xdp-rx-metadata.rst - like documentation
> >> >
> >> > --- CC ---
> >> >
> >> > CC'ing people only on the cover letter. Hopefully can find the rest via
> >> > lore.
> >>
> >> Well, I found it there, even though I was apparently left off the Cc
> >> list :(
> >>
> >> -Toke
> >
> > Sure, I'll CC you explicitly next time! But I know you diligently follow bpf
> > list, so decided to explicitly cc mostly netdev folks that might miss
> > it otherwise.
>
> Haha, fair point! And no big deal, I did obviously see it. I was just
> feeling a bit left out, that's all ;)
>
> -Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 4/7] bpf: implement devtx timestamp kfunc
  2023-06-13 15:14   ` Simon Horman
@ 2023-06-13 18:39     ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 18:39 UTC (permalink / raw)
  To: Simon Horman
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 8:14 AM Simon Horman <simon.horman@corigine.com> wrote:
>
> On Mon, Jun 12, 2023 at 10:23:04AM -0700, Stanislav Fomichev wrote:
>
> ...
>
> > +/**
> > + * bpf_devtx_cp_timestamp - Read TX timestamp of the packet. Callable
> > + * only from the devtx-complete hook.
> > + * @ctx: devtx context pointer.
>
> Hi Stan,
>
> one minor nit: documentation of the timestamp parameter should go here.

Thanks, will add!

> > + *
> > + * Returns 0 on success or ``-errno`` on error.
> > + */
> > +__bpf_kfunc int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp)
> > +{
> > +     return -EOPNOTSUPP;
> > +}
> > +
> >  __diag_pop();
>
> ...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-13 15:08   ` Simon Horman
@ 2023-06-13 19:00     ` Stanislav Fomichev
  2023-06-14  7:02       ` Simon Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 19:00 UTC (permalink / raw)
  To: Simon Horman
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 8:08 AM Simon Horman <simon.horman@corigine.com> wrote:
>
> On Mon, Jun 12, 2023 at 10:23:03AM -0700, Stanislav Fomichev wrote:
> > devtx is a lightweight set of hooks before and after packet transmission.
> > The hook is supposed to work for both skb and xdp paths by exposing
> > a light-weight packet wrapper via devtx_frame (header portion + frags).
> >
> > devtx is implemented as a tracing program which has access to the
> > XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
> > in the next patch, but the idea is similar to XDP metadata:
> > the kfuncs have netdev-specific implementation, but common
> > interface. Upon loading, the kfuncs are resolved to direct
> > calls against per-netdev implementation. This can be achieved
> > by marking devtx-tracing programs as dev-bound (largely
> > reusing xdp-dev-bound program infrastructure).
> >
> > Attachment and detachment is implemented via syscall BPF program
> > by calling bpf_devtx_sb_attach (attach to tx-submission)
> > or bpf_devtx_cp_attach (attach to tx completion). Right now,
> > the attachment does not return a link and doesn't support
> > multiple programs. I plan to switch to Daniel's bpf_mprog infra
> > once it's available.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
> ...
>
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -22976,11 +22976,13 @@ L:  bpf@vger.kernel.org
> >  S:   Supported
> >  F:   drivers/net/ethernet/*/*/*/*/*xdp*
> >  F:   drivers/net/ethernet/*/*/*xdp*
> > +F:   include/net/devtx.h
> >  F:   include/net/xdp.h
> >  F:   include/net/xdp_priv.h
> >  F:   include/trace/events/xdp.h
> >  F:   kernel/bpf/cpumap.c
> >  F:   kernel/bpf/devmap.c
> > +F:   net/core/devtx.c
> >  F:   net/core/xdp.c
> >  F:   samples/bpf/xdp*
> >  F:   tools/testing/selftests/bpf/*/*xdp*
>
> Hi Stan,
>
> some feedback from my side.
>
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 08fbd4622ccf..e08e3fd39dfc 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2238,6 +2238,8 @@ struct net_device {
> >       unsigned int            real_num_rx_queues;
> >
> >       struct bpf_prog __rcu   *xdp_prog;
> > +     struct bpf_prog __rcu   *devtx_sb;
> > +     struct bpf_prog __rcu   *devtx_cp;
>
> It would be good to add these new fields to the kernel doc
> for struct net_device.

Sure, will do!

> >       unsigned long           gro_flush_timeout;
> >       int                     napi_defer_hard_irqs;
> >  #define GRO_LEGACY_MAX_SIZE  65536u
> > diff --git a/include/net/devtx.h b/include/net/devtx.h
> > new file mode 100644
> > index 000000000000..7eab66d0ce80
> > --- /dev/null
> > +++ b/include/net/devtx.h
> > @@ -0,0 +1,76 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +#ifndef __LINUX_NET_DEVTX_H__
> > +#define __LINUX_NET_DEVTX_H__
> > +
> > +#include <linux/jump_label.h>
> > +#include <linux/skbuff.h>
> > +#include <linux/netdevice.h>
> > +#include <net/xdp.h>
> > +
> > +struct devtx_frame {
> > +     void *data;
> > +     u16 len;
> > +     struct skb_shared_info *sinfo; /* for frags */
> > +};
> > +
> > +#ifdef CONFIG_NET
> > +void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx);
> > +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx);
> > +bool is_devtx_kfunc(u32 kfunc_id);
> > +void devtx_shutdown(struct net_device *netdev);
> > +
> > +static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
> > +{
> > +     ctx->data = skb->data;
> > +     ctx->len = skb_headlen(skb);
> > +     ctx->sinfo = skb_shinfo(skb);
> > +}
> > +
> > +static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
> > +{
> > +     ctx->data = xdpf->data;
> > +     ctx->len = xdpf->len;
> > +     ctx->sinfo = xdp_frame_has_frags(xdpf) ? xdp_get_shared_info_from_frame(xdpf) : NULL;
> > +}
> > +
> > +DECLARE_STATIC_KEY_FALSE(devtx_enabled);
> > +
> > +static inline bool devtx_submit_enabled(struct net_device *netdev)
> > +{
> > +     return static_branch_unlikely(&devtx_enabled) &&
> > +            rcu_access_pointer(netdev->devtx_sb);
> > +}
> > +
> > +static inline bool devtx_complete_enabled(struct net_device *netdev)
> > +{
> > +     return static_branch_unlikely(&devtx_enabled) &&
> > +            rcu_access_pointer(netdev->devtx_cp);
> > +}
> > +#else
> > +static inline void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
> > +{
> > +}
> > +
> > +static inline void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> > +{
> > +}
> > +
> > +static inline bool is_devtx_kfunc(u32 kfunc_id)
> > +{
> > +     return false;
> > +}
> > +
> > +static inline void devtx_shutdown(struct net_device *netdev)
> > +{
> > +}
> > +
> > +static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb)
> > +{
> > +}
> > +
> > +static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf)
> > +{
> > +}
> > +#endif
> > +
> > +#endif /* __LINUX_NET_DEVTX_H__ */
> > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > index 235d81f7e0ed..9cfe96422c80 100644
> > --- a/kernel/bpf/offload.c
> > +++ b/kernel/bpf/offload.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/rhashtable.h>
> >  #include <linux/rtnetlink.h>
> >  #include <linux/rwsem.h>
> > +#include <net/devtx.h>
> >
> >  /* Protects offdevs, members of bpf_offload_netdev and offload members
> >   * of all progs.
> > @@ -228,6 +229,7 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
> >       int err;
> >
> >       if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
> > +         attr->prog_type != BPF_PROG_TYPE_TRACING &&
> >           attr->prog_type != BPF_PROG_TYPE_XDP)
> >               return -EINVAL;
> >
> > @@ -238,6 +240,10 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
> >           attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY)
> >               return -EINVAL;
> >
> > +     if (attr->prog_type == BPF_PROG_TYPE_TRACING &&
> > +         !is_devtx_kfunc(prog->aux->attach_btf_id))
> > +             return -EINVAL;
> > +
> >       netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex);
> >       if (!netdev)
> >               return -EINVAL;
> > diff --git a/net/core/Makefile b/net/core/Makefile
> > index 8f367813bc68..c1db05ccfac7 100644
> > --- a/net/core/Makefile
> > +++ b/net/core/Makefile
> > @@ -39,4 +39,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
> >  obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
> >  obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
> >  obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
> > +obj-$(CONFIG_BPF_SYSCALL) += devtx.o
> >  obj-$(CONFIG_OF)     += of_net.o
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 3393c2f3dbe8..ef0e65e68024 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -150,6 +150,7 @@
> >  #include <linux/pm_runtime.h>
> >  #include <linux/prandom.h>
> >  #include <linux/once_lite.h>
> > +#include <net/devtx.h>
> >
> >  #include "dev.h"
> >  #include "net-sysfs.h"
> > @@ -10875,6 +10876,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
> >               dev_shutdown(dev);
> >
> >               dev_xdp_uninstall(dev);
> > +             devtx_shutdown(dev);
> >               bpf_dev_bound_netdev_unregister(dev);
> >
> >               netdev_offload_xstats_disable_all(dev);
> > diff --git a/net/core/devtx.c b/net/core/devtx.c
> > new file mode 100644
> > index 000000000000..b7cbc26d1c01
> > --- /dev/null
> > +++ b/net/core/devtx.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <net/devtx.h>
> > +#include <linux/filter.h>
> > +
> > +DEFINE_STATIC_KEY_FALSE(devtx_enabled);
> > +EXPORT_SYMBOL_GPL(devtx_enabled);
> > +
> > +static void devtx_run(struct net_device *netdev, struct devtx_frame *ctx, struct bpf_prog **pprog)
>
> Is an __rcu annotation appropriate for prog here?
> Also elsewhere in this patch.

Good point. Maybe I should rcu_dereference them somewhere on top.
Let me try to find the best place..
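
e.g. annotating the parameter itself and keeping the one dereference
inside the read-side section (sketch):

	static void devtx_run(struct net_device *netdev, struct devtx_frame *ctx,
			      struct bpf_prog __rcu **pprog)
	{
		void *real_ctx[1] = {ctx};
		struct bpf_prog *prog;

		prog = rcu_dereference(*pprog);
		if (likely(prog))
			bpf_prog_run(prog, real_ctx);
	}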

> > +{
> > +     struct bpf_prog *prog;
> > +     void *real_ctx[1] = {ctx};
> > +
> > +     prog = rcu_dereference(*pprog);
> > +     if (likely(prog))
> > +             bpf_prog_run(prog, real_ctx);
> > +}
> > +
> > +void devtx_submit(struct net_device *netdev, struct devtx_frame *ctx)
> > +{
> > +     rcu_read_lock();
> > +     devtx_run(netdev, ctx, &netdev->devtx_sb);
> > +     rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL_GPL(devtx_submit);
> > +
> > +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> > +{
> > +     rcu_read_lock();
> > +     devtx_run(netdev, ctx, &netdev->devtx_cp);
> > +     rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL_GPL(devtx_complete);
> > +
> > +/**
> > + * devtx_sb - Called for every egress netdev packet
>
> As this is a kernel doc, it would be good to document the ctx parameter here.

I didn't really find a convincing way to add a comment; I had the
following, which I removed prior to submission:
@ctx devtx_frame context

But it doesn't seem like it brings anything useful? Or ok to keep it that way?

> > + *
> > + * Note: this function is never actually called by the kernel and declared
> > + * only to allow loading and attaching appropriate tracepoints.
> > + */
> > +__weak noinline void devtx_sb(struct devtx_frame *ctx)
>
> I guess this is intentional.
> But gcc complains that this is neither static nor is a forward
> declaration provided. Likewise for devtx_cp()

Copy-pasted from hid-bpf; let me see if they have a forward decl somewhere..
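
A forward declaration next to the other devtx helpers in
include/net/devtx.h should be enough to quiet -Wmissing-prototypes,
e.g.:

	void devtx_sb(struct devtx_frame *ctx);
	void devtx_cp(struct devtx_frame *ctx);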

> > +{
> > +}
> > +
> > +/**
> > + * devtx_cp - Called upon egress netdev packet completion
>
> Likewise, here too.
>
> > + *
> > + * Note: this function is never actually called by the kernel and declared
> > + * only to allow loading and attaching appropriate tracepoints.
> > + */
> > +__weak noinline void devtx_cp(struct devtx_frame *ctx)
> > +{
> > +}
> > +
> > +BTF_SET8_START(bpf_devtx_hook_ids)
> > +BTF_ID_FLAGS(func, devtx_sb)
> > +BTF_ID_FLAGS(func, devtx_cp)
> > +BTF_SET8_END(bpf_devtx_hook_ids)
> > +
> > +static const struct btf_kfunc_id_set bpf_devtx_hook_set = {
> > +     .owner = THIS_MODULE,
> > +     .set   = &bpf_devtx_hook_ids,
> > +};
> > +
> > +static DEFINE_MUTEX(devtx_attach_lock);
> > +
> > +static int __bpf_devtx_detach(struct net_device *netdev, struct bpf_prog **pprog)
> > +{
>
> As per my prior comment about *prog and __rcu annotations.
> I'm genuinely unsure how the usage of **pprog in this function sits with RCU.

Yeah, I'm a bit sloppy, let me figure out a proper way to annotate it.
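
One option, given that everything is serialized by devtx_attach_lock
(untested sketch):

	static int __bpf_devtx_detach(struct net_device *netdev,
				      struct bpf_prog __rcu **pprog)
	{
		struct bpf_prog *prog;

		prog = rcu_replace_pointer(*pprog, NULL,
					   lockdep_is_held(&devtx_attach_lock));
		if (!prog)
			return -EINVAL;

		/* wait for in-flight devtx_run() before dropping the ref */
		synchronize_rcu();
		bpf_prog_put(prog);

		static_branch_dec(&devtx_enabled);
		return 0;
	}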


> > +     if (!*pprog)
> > +             return -EINVAL;
> > +     bpf_prog_put(*pprog);
> > +     *pprog = NULL;
> > +
> > +     static_branch_dec(&devtx_enabled);
> > +     return 0;
> > +}
> > +
> > +static int __bpf_devtx_attach(struct net_device *netdev, int prog_fd,
> > +                           const char *attach_func_name, struct bpf_prog **pprog)
> > +{
> > +     struct bpf_prog *prog;
> > +     int ret = 0;
> > +
> > +     if (prog_fd < 0)
> > +             return __bpf_devtx_detach(netdev, pprog);
> > +
> > +     if (*pprog)
> > +             return -EBUSY;
> > +
> > +     prog = bpf_prog_get(prog_fd);
> > +     if (IS_ERR(prog))
> > +             return PTR_ERR(prog);
> > +
> > +     if (prog->type != BPF_PROG_TYPE_TRACING ||
> > +         prog->expected_attach_type != BPF_TRACE_FENTRY ||
> > +         !bpf_prog_is_dev_bound(prog->aux) ||
> > +         !bpf_offload_dev_match(prog, netdev) ||
> > +         strcmp(prog->aux->attach_func_name, attach_func_name)) {
> > +             bpf_prog_put(prog);
> > +             return -EINVAL;
> > +     }
> > +
> > +     *pprog = prog;
> > +     static_branch_inc(&devtx_enabled);
> > +
> > +     return ret;
> > +}
>
> ...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-13 14:54   ` Willem de Bruijn
@ 2023-06-13 19:00     ` Stanislav Fomichev
  2023-06-13 19:29       ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 19:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 7:55 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 7:24 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > devtx is a lightweight set of hooks before and after packet transmission.
> > The hook is supposed to work for both skb and xdp paths by exposing
> > a light-weight packet wrapper via devtx_frame (header portion + frags).
> >
> > devtx is implemented as a tracing program which has access to the
> > XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
> > in the next patch, but the idea is similar to XDP metadata:
> > the kfuncs have netdev-specific implementation, but common
> > interface. Upon loading, the kfuncs are resolved to direct
> > calls against per-netdev implementation. This can be achieved
> > by marking devtx-tracing programs as dev-bound (largely
> > reusing xdp-dev-bound program infrastructure).
> >
> > Attachment and detachment are implemented via a syscall BPF program
> > by calling bpf_devtx_sb_attach (attach to tx-submission)
> > or bpf_devtx_cp_attach (attach to tx completion). Right now,
> > the attachment does not return a link and doesn't support
> > multiple programs. I plan to switch to Daniel's bpf_mprog infra
> > once it's available.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
>
> > @@ -2238,6 +2238,8 @@ struct net_device {
> >         unsigned int            real_num_rx_queues;
> >
> >         struct bpf_prog __rcu   *xdp_prog;
> > +       struct bpf_prog __rcu   *devtx_sb;
> > +       struct bpf_prog __rcu   *devtx_cp;
>
> nit/subjective: non-obvious two letter acronyms are nr. How about tx
> and txc (or txcomp)

devtx and devtxc? I was using devtxs vs devtxc initially, but that
seems confusing. I can probably spell them out here:
devtx_submit
devtx_complete

That should probably be better?
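
I.e., purely a rename in struct net_device (sketch):

        struct bpf_prog __rcu   *devtx_submit;
        struct bpf_prog __rcu   *devtx_complete;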

> > +static int __bpf_devtx_attach(struct net_device *netdev, int prog_fd,
> > +                             const char *attach_func_name, struct bpf_prog **pprog)
> > +{
> > +       struct bpf_prog *prog;
> > +       int ret = 0;
> > +
> > +       if (prog_fd < 0)
> > +               return __bpf_devtx_detach(netdev, pprog);
> > +
> > +       if (*pprog)
> > +               return -EBUSY;
> > +
> > +       prog = bpf_prog_get(prog_fd);
> > +       if (IS_ERR(prog))
> > +               return PTR_ERR(prog);
> > +
> > +       if (prog->type != BPF_PROG_TYPE_TRACING ||
> > +           prog->expected_attach_type != BPF_TRACE_FENTRY ||
> > +           !bpf_prog_is_dev_bound(prog->aux) ||
> > +           !bpf_offload_dev_match(prog, netdev) ||
> > +           strcmp(prog->aux->attach_func_name, attach_func_name)) {
> > +               bpf_prog_put(prog);
> > +               return -EINVAL;
> > +       }
> > +
> > +       *pprog = prog;
> > +       static_branch_inc(&devtx_enabled);
> > +
> > +       return ret;
>
> nit: just return 0, no variable needed

Ack.

> > +}
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +                 "Global functions as their definitions will be in vmlinux BTF");
> > +
> > +/**
> > + * bpf_devtx_sb_attach - Attach devtx 'packet submit' program
> > + * @ifindex: netdev interface index.
> > + * @prog_fd: BPF program file descriptor.
> > + *
> > + * Return:
> > + * * Returns 0 on success or ``-errno`` on error.
> > + */
> > +__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
> > +{
> > +       struct net_device *netdev;
> > +       int ret;
> > +
> > +       netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> > +       if (!netdev)
> > +               return -EINVAL;
> > +
> > +       mutex_lock(&devtx_attach_lock);
> > +       ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_sb", &netdev->devtx_sb);
> > +       mutex_unlock(&devtx_attach_lock);
> > +
> > +       dev_put(netdev);
> > +
> > +       return ret;
> > +}
> > +
> > +/**
> > + * bpf_devtx_cp_attach - Attach devtx 'packet complete' program
> > + * @ifindex: netdev interface index.
> > + * @prog_fd: BPF program file descriptor.
> > + *
> > + * Return:
> > + * * Returns 0 on success or ``-errno`` on error.
> > + */
> > +__bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
> > +{
> > +       struct net_device *netdev;
> > +       int ret;
> > +
> > +       netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> > +       if (!netdev)
> > +               return -EINVAL;
> > +
> > +       mutex_lock(&devtx_attach_lock);
> > +       ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_cp", &netdev->devtx_cp);
> > +       mutex_unlock(&devtx_attach_lock);
> > +
> > +       dev_put(netdev);
> > +
> > +       return ret;
> > +}
>
> These two functions are near duplicates, aside from the arguments to
> their inner call to __bpf_devtx_attach. Can be dedup-ed further?

I've deduped most of the stuff via __bpf_devtx_attach; can probably
dedup the rest with a bool argument, let me try...
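
Rough idea, a common helper plus two thin kfunc wrappers (untested
sketch):

static int bpf_devtx_attach(int ifindex, int prog_fd, bool complete)
{
        struct net_device *netdev;
        int ret;

        netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
        if (!netdev)
                return -EINVAL;

        mutex_lock(&devtx_attach_lock);
        ret = __bpf_devtx_attach(netdev, prog_fd,
                                 complete ? "devtx_cp" : "devtx_sb",
                                 complete ? &netdev->devtx_cp : &netdev->devtx_sb);
        mutex_unlock(&devtx_attach_lock);

        dev_put(netdev);

        return ret;
}

__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
{
        return bpf_devtx_attach(ifindex, prog_fd, false);
}

__bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
{
        return bpf_devtx_attach(ifindex, prog_fd, true);
}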

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata with devtx kfuncs
  2023-06-13 15:03   ` Willem de Bruijn
@ 2023-06-13 19:00     ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 19:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 8:03 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > When we get packets on port 9091, we swap src/dst and send it out.
> > At this point, we also request the timestamp and plumb it back
> > to the userspace. The userspace simply prints the timestamp.
> >
> > Haven't really tested, still working on mlx5 patches...
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
>
> > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > @@ -10,7 +10,8 @@
> >   *   - rx_hash
> >   *
> >   * TX:
> > - * - TBD
> > + * - UDP 9091 packets trigger TX reply
>
> This branch on port is missing?

That's the ping_pong part. Every packet arriving on port 9091 gets
received by AF_XDP and is sent back.

> > +static void ping_pong(struct xsk *xsk, void *rx_packet)
> > +{
> > +       struct ipv6hdr *ip6h = NULL;
> > +       struct iphdr *iph = NULL;
> > +       struct xdp_desc *tx_desc;
> > +       struct udphdr *udph;
> > +       struct ethhdr *eth;
> > +       void *data;
> > +       __u32 idx;
> > +       int ret;
> > +       int len;
> > +
> > +       ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > +       if (ret != 1) {
> > +               printf("%p: failed to reserve tx slot\n", xsk);
> > +               return;
> > +       }
> > +
> > +       tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > +       tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
> > +       data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > +
> > +       eth = data;
> > +
> > +       if (eth->h_proto == htons(ETH_P_IP)) {
> > +               iph = (void *)(eth + 1);
> > +               udph = (void *)(iph + 1);
> > +       } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> > +               ip6h = (void *)(eth + 1);
> > +               udph = (void *)(ip6h + 1);
> > +       } else {
> > +               xsk_ring_prod__cancel(&xsk->tx, 1);
> > +               return;
> > +       }
> > +
> > +       len = ETH_HLEN;
> > +       if (ip6h)
> > +               len += ntohs(ip6h->payload_len);
>
> sizeof(*ip6h) + ntohs(ip6h->payload_len) ?

Ooop, thanks, that's clearly not tested :-)
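
For the record, the fixed computation should look roughly like this
(untested):

        len = ETH_HLEN;
        if (ip6h)
                /* payload_len excludes the fixed IPv6 header */
                len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
        if (iph)
                /* tot_len already includes the IPv4 header */
                len += ntohs(iph->tot_len);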

> > +       if (iph)
> > +               len += ntohs(iph->tot_len);
> > +
> > +       memcpy(data, rx_packet, len);
> > +       swap(eth->h_dest, eth->h_source, ETH_ALEN);
> > +       if (iph)
> > +               swap(&iph->saddr, &iph->daddr, 4);
>
> need to recompute ipv4 checksum?

Discussed offline: swapping the aligned-u16 chunks preserves the checksum
(the checksum is a one's-complement sum over 16-bit words, which is
order-independent, so exchanging whole 16-bit-aligned words leaves the
sum, and hence the checksum, unchanged).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 6/7] selftests/bpf: extend xdp_metadata with devtx kfuncs
  2023-06-13 14:47   ` Willem de Bruijn
@ 2023-06-13 19:00     ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 19:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 7:48 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 7:26 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Attach kfuncs that request and report TX timestamp via ringbuf.
> > Confirm on the userspace side that the program has triggered
> > and the timestamp is non-zero.
> >
> > Also make sure devtx_frame has sensible pointers and data.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
> > +SEC("fentry/devtx_sb")
> > +int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
> > +{
> > +       int ret;
> > +
> > +       ret = verify_frame(frame);
> > +       if (ret < 0)
> > +               __sync_add_and_fetch(&pkts_fail_tx, 1);
>
> intend to return in these error cases? in this and following patch

Yeah, let's do it.
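
I.e., something like this (sketch; fentry still has to return 0, we just
stop processing the frame early):

SEC("fentry/devtx_sb")
int BPF_PROG(devtx_sb, const struct devtx_frame *frame)
{
        int ret;

        ret = verify_frame(frame);
        if (ret < 0) {
                __sync_add_and_fetch(&pkts_fail_tx, 1);
                return 0;
        }

        ret = bpf_devtx_sb_request_timestamp(frame);
        if (ret < 0)
                __sync_add_and_fetch(&pkts_fail_tx, 1);

        return 0;
}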

> > +
> > +       ret = bpf_devtx_sb_request_timestamp(frame);
> > +       if (ret < 0)
> > +               __sync_add_and_fetch(&pkts_fail_tx, 1);
> > +
> > +       return 0;
> > +}
> > +
> > +SEC("fentry/devtx_cp")
> > +int BPF_PROG(devtx_cp, const struct devtx_frame *frame)
> > +{
> > +       struct devtx_sample *sample;
> > +       int ret;
> > +
> > +       ret = verify_frame(frame);
> > +       if (ret < 0)
> > +               __sync_add_and_fetch(&pkts_fail_tx, 1);
> > +
> > +       sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
> > +       if (!sample)
> > +               return 0;
>
> return non-zero?

Can't return non-zero from fentry :-( The userspace verifies that
pkts_fail_tx stays zero, so that should be enough.


> > +
> > +       sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
> > +
> > +       bpf_ringbuf_submit(sample, 0);
> > +
> > +       return 0;
> > +}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 18:39       ` Stanislav Fomichev
@ 2023-06-13 19:10         ` Toke Høiland-Jørgensen
  2023-06-13 21:17           ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-13 19:10 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On 06/12, Toke Høiland-Jørgensen wrote:
>> >> Some immediate thoughts after glancing through this:
>> >>
>> >> > --- Use cases ---
>> >> >
>> >> > The goal of this series is to add two new standard-ish places
>> >> > in the transmit path:
>> >> >
>> >> > 1. Right before the packet is transmitted (with access to TX
>> >> >    descriptors)
>> >> > 2. Right after the packet is actually transmitted and we've received the
>> >> >    completion (again, with access to TX completion descriptors)
>> >> >
>> >> > Accessing TX descriptors unlocks the following use-cases:
>> >> >
>> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
>> >> > use device offloads. The existing case implements TX timestamp.
>> >> > - Observability: global per-netdev hooks can be used for tracing
>> >> > the packets and exploring completion descriptors for all sorts of
>> >> > device errors.
>> >> >
>> >> > Accessing TX descriptors also means that the hooks have to be called
>> >> > from the drivers.
>> >> >
>> >> > The hooks are a light-weight alternative to XDP at egress and currently
>> >> > don't provide any packet modification abilities. However, eventually,
>> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
>> >> > descriptors; for performance sake).
>> >>
>> >> dynptr?
>> >
>> > Haven't considered, let me explore, but not sure what it buys us
>> > here?
>>
>> API consistency, certainly. Possibly also performance, if using the
>> slice thing that gets you a direct pointer to the pkt data? Not sure
>> about that, though, haven't done extensive benchmarking of dynptr yet...
>
> Same. Let's keep it on the table, I'll try to explore. I was just
> thinking that having less abstraction here might be better
> performance-wise.

Sure, let's evaluate this once we have performance numbers.

>> >> > --- UAPI ---
>> >> >
>> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> >> > expose any UAPI and are implemented as tracing programs that call
>> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> >> > programs. The series expands device-bound infrastructure to tracing
>> >> > programs.
>> >>
>> >> Not a fan of the "attach from BPF syscall program" thing. These are part
>> >> of the XDP data path API, and I think we should expose them as proper
>> >> bpf_link attachments from userspace with introspection etc. But I guess
>> >> the bpf_mprog thing will give us that?
>> >
>> > bpf_mprog will just make those attach kfuncs return the link fd. The
>> > syscall program will still stay :-(
>>
>> Why does the attachment have to be done this way, exactly? Couldn't we
>> just use the regular bpf_link attachment from userspace? AFAICT it's not
>> really piggy-backing on the function override thing anyway when the
>> attachment is per-dev? Or am I misunderstanding how all this works?
>
> It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> us an opportunity to fix things.
> We can do it via a regular syscall path if there is a consensus.

Yeah, the API exposed to the BPF program is kfunc-based in any case. If
we were to at some point conclude that this whole thing was not useful
at all and deprecate it, whether that means "you can no longer create a
link attachment of this type via BPF_LINK_CREATE" or "you can no longer
create a link attachment of this type via BPF_PROG_RUN of a syscall type
program" doesn't really seem like a significant detail to me...

>> >> > --- skb vs xdp ---
>> >> >
>> >> > The hooks operate on a new light-weight devtx_frame which contains:
>> >> > - data
>> >> > - len
>> >> > - sinfo
>> >> >
>> >> > This should allow us to have a unified (from BPF POV) place at TX
>> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
>> >> > for each invocation).
>> >>
>> >> Not sure what I think about this one. At the very least I think we
>> >> should expose xdp->data_meta as well. I'm not sure what the use case for
>> >> accessing skbs is? If that *is* indeed useful, probably there will also
>> >> end up being a use case for accessing the full skb?
>> >
>> > skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
>> > a good opportunity to unify? Or probably won't work because if
>> > xdp_frame doesn't have frags, it won't have sinfo?
>>
>> No, it won't. But why do we need this unification between the skb and
>> xdp paths in the first place? Doesn't the skb path already have support
>> for these things? Seems like we could just stick to making this xdp-only
>> and keeping xdp_frame as the ctx argument?
>
> For skb path, I'm assuming we can read sinfo->meta_len; it feels nice
> to make it work for both cases?
> We can always export metadata len via some kfunc, sure.

I wasn't referring to the metadata field specifically when talking about
the skb path. I'm wondering why we need these hooks to work for the skb
path at all? :)

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-13 19:00     ` Stanislav Fomichev
@ 2023-06-13 19:29       ` Willem de Bruijn
  0 siblings, 0 replies; 47+ messages in thread
From: Willem de Bruijn @ 2023-06-13 19:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 9:00 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jun 13, 2023 at 7:55 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 7:24 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > devtx is a lightweight set of hooks before and after packet transmission.
> > > The hook is supposed to work for both skb and xdp paths by exposing
> > > a light-weight packet wrapper via devtx_frame (header portion + frags).
> > >
> > > devtx is implemented as a tracing program which has access to the
> > > XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
> > > in the next patch, but the idea is similar to XDP metadata:
> > > the kfuncs have netdev-specific implementation, but common
> > > interface. Upon loading, the kfuncs are resolved to direct
> > > calls against per-netdev implementation. This can be achieved
> > > by marking devtx-tracing programs as dev-bound (largely
> > > reusing xdp-dev-bound program infrastructure).
> > >
> > > Attachment and detachment are implemented via a syscall BPF program
> > > by calling bpf_devtx_sb_attach (attach to tx-submission)
> > > or bpf_devtx_cp_attach (attach to tx completion). Right now,
> > > the attachment does not return a link and doesn't support
> > > multiple programs. I plan to switch to Daniel's bpf_mprog infra
> > > once it's available.
> > >
> > > Cc: netdev@vger.kernel.org
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >
> >
> > > @@ -2238,6 +2238,8 @@ struct net_device {
> > >         unsigned int            real_num_rx_queues;
> > >
> > >         struct bpf_prog __rcu   *xdp_prog;
> > > +       struct bpf_prog __rcu   *devtx_sb;
> > > +       struct bpf_prog __rcu   *devtx_cp;
> >
> > nit/subjective: non-obvious two letter acronyms are nr. How about tx
> > and txc (or txcomp)
>
> devtx and devtxc? I was using devtxs vs devtxc initially, but that
> seems confusing. I can probably spell them out here:
> devtx_submit
> devtx_complete
>
> That should probably be better?

That's more clear, thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 19:10         ` Toke Høiland-Jørgensen
@ 2023-06-13 21:17           ` Stanislav Fomichev
  2023-06-13 22:32             ` Alexei Starovoitov
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 21:17 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On Tue, Jun 13, 2023 at 12:10 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Tue, Jun 13, 2023 at 10:18 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On 06/12, Toke Høiland-Jørgensen wrote:
> >> >> Some immediate thoughts after glancing through this:
> >> >>
> >> >> > --- Use cases ---
> >> >> >
> >> >> > The goal of this series is to add two new standard-ish places
> >> >> > in the transmit path:
> >> >> >
> >> >> > 1. Right before the packet is transmitted (with access to TX
> >> >> >    descriptors)
> >> >> > 2. Right after the packet is actually transmitted and we've received the
> >> >> >    completion (again, with access to TX completion descriptors)
> >> >> >
> >> >> > Accessing TX descriptors unlocks the following use-cases:
> >> >> >
> >> >> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> >> >> > use device offloads. The existing case implements TX timestamp.
> >> >> > - Observability: global per-netdev hooks can be used for tracing
> >> >> > the packets and exploring completion descriptors for all sorts of
> >> >> > device errors.
> >> >> >
> >> >> > Accessing TX descriptors also means that the hooks have to be called
> >> >> > from the drivers.
> >> >> >
> >> >> > The hooks are a light-weight alternative to XDP at egress and currently
> >> >> > don't provide any packet modification abilities. However, eventually,
> >> >> > can expose new kfuncs to operate on the packet (or, rather, the actual
> >> >> > descriptors; for performance sake).
> >> >>
> >> >> dynptr?
> >> >
> >> > Haven't considered, let me explore, but not sure what it buys us
> >> > here?
> >>
> >> API consistency, certainly. Possibly also performance, if using the
> >> slice thing that gets you a direct pointer to the pkt data? Not sure
> >> about that, though, haven't done extensive benchmarking of dynptr yet...
> >
> > Same. Let's keep it on the table, I'll try to explore. I was just
> > thinking that having less abstraction here might be better
> > performance-wise.
>
> Sure, let's evaluate this once we have performance numbers.
>
> >> >> > --- UAPI ---
> >> >> >
> >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> >> >> > expose any UAPI and are implemented as tracing programs that call
> >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> >> >> > programs. The series expands device-bound infrastructure to tracing
> >> >> > programs.
> >> >>
> >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> >> >> of the XDP data path API, and I think we should expose them as proper
> >> >> bpf_link attachments from userspace with introspection etc. But I guess
> >> >> the bpf_mprog thing will give us that?
> >> >
> >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> >> > syscall program will still stay :-(
> >>
> >> Why does the attachment have to be done this way, exactly? Couldn't we
> >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> >> really piggy-backing on the function override thing anyway when the
> >> attachment is per-dev? Or am I misunderstanding how all this works?
> >
> > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > us an opportunity to fix things.
> > We can do it via a regular syscall path if there is a consensus.
>
> Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> we were to at some point conclude that this whole thing was not useful
> at all and deprecate it, whether that means "you can no longer create a
> link attachment of this type via BPF_LINK_CREATE" or "you can no longer
> create a link attachment of this type via BPF_PROG_RUN of a syscall type
> program" doesn't really seem like a significant detail to me...

In this case, why do you prefer it to go via regular syscall? Seems
like we can avoid a bunch of boilerplate syscall work with a kfunc that
does the attachment?
We might as well abstract it at, say, libbpf layer which would
generate/load this small bpf program to call a kfunc.
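
For context, the syscall-program shape being discussed is roughly this
(sketch; the context struct and program name are made up, the kfunc is
the one from this series):

extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;

struct attach_args {
        int ifindex;
        int prog_fd;
};

SEC("syscall")
int do_attach(struct attach_args *ctx)
{
        /* run via BPF_PROG_RUN with ctx passed in as ctx_in */
        return bpf_devtx_sb_attach(ctx->ifindex, ctx->prog_fd);
}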

> >> >> > --- skb vs xdp ---
> >> >> >
> >> >> > The hooks operate on a new light-weight devtx_frame which contains:
> >> >> > - data
> >> >> > - len
> >> >> > - sinfo
> >> >> >
> >> >> > This should allow us to have a unified (from BPF POV) place at TX
> >> >> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> >> >> > for each invocation).
> >> >>
> >> >> Not sure what I think about this one. At the very least I think we
> >> >> should expose xdp->data_meta as well. I'm not sure what the use case for
> >> >> accessing skbs is? If that *is* indeed useful, probably there will also
> >> >> end up being a use case for accessing the full skb?
> >> >
> >> > skb_shared_info has meta_len, but afaik xdp doesn't use it. Maybe it's
> >> > a good opportunity to unify? Or probably won't work because if
> >> > xdp_frame doesn't have frags, it won't have sinfo?
> >>
> >> No, it won't. But why do we need this unification between the skb and
> >> xdp paths in the first place? Doesn't the skb path already have support
> >> for these things? Seems like we could just stick to making this xdp-only
> >> and keeping xdp_frame as the ctx argument?
> >
> > For skb path, I'm assuming we can read sinfo->meta_len; it feels nice
> > to make it work for both cases?
> > We can always export metadata len via some kfunc, sure.
>
> I wasn't referring to the metadata field specifically when talking about
> the skb path. I'm wondering why we need these hooks to work for the skb
> path at all? :)

Aaah. I think John wanted them to trigger for skb path, so I'm trying
to explore whether having both makes sense.
But also, if we go purely xdp_frame, what about af_xdp in copy mode?
That's still skb-driven, right?
Not sure this skb vs xdp split is clear-cut :-/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 21:17           ` Stanislav Fomichev
@ 2023-06-13 22:32             ` Alexei Starovoitov
  2023-06-13 23:16               ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2023-06-13 22:32 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Willem de Bruijn, David Ahern, Karlsson, Magnus,
	Björn Töpel, Fijalkowski, Maciej, Network Development

On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> > >> >> > --- UAPI ---
> > >> >> >
> > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > >> >> > expose any UAPI and are implemented as tracing programs that call
> > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > >> >> > programs. The series expands device-bound infrastructure to tracing
> > >> >> > programs.
> > >> >>
> > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > >> >> of the XDP data path API, and I think we should expose them as proper
> > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > >> >> the bpf_mprog thing will give us that?
> > >> >
> > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > >> > syscall program will still stay :-(
> > >>
> > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > >> really piggy-backing on the function override thing anyway when the
> > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > >
> > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > us an opportunity to fix things.
> > > We can do it via a regular syscall path if there is a consensus.
> >
> > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > we were to at some point conclude that this whole thing was not useful
> > at all and deprecate it, whether that means "you can no longer create a
> > link attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > program" doesn't really seem like a significant detail to me...
>
> In this case, why do you prefer it to go via regular syscall? Seems
> > like we can avoid a bunch of boilerplate syscall work with a kfunc that
> does the attachment?
> We might as well abstract it at, say, libbpf layer which would
> generate/load this small bpf program to call a kfunc.

I'm not sure we're on the same page here.
imo using a syscall bpf prog that calls a kfunc to do a per-device attach
is overkill here.
It's an experimental feature, but you're already worried about
multiple netdevs?

Can you add an empty nop function and attach to it tracing style
with fentry?
It won't be per-netdev, but do you have to do per-device demux
by the kernel? Can your tracing bpf prog do that instead?
It's just an ifindex compare.
This way the non-uapi bits will be even smaller and no need
to change struct net_device.
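
Something along these lines (sketch; assumes the nop hook passes the
ifindex alongside the frame, which the current devtx_frame doesn't
carry):

const volatile int target_ifindex;

SEC("fentry/devtx_submit")
int BPF_PROG(tx_submit, const struct devtx_frame *frame, int ifindex)
{
        if (ifindex != target_ifindex)
                return 0;

        /* per-device logic goes here */
        return 0;
}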

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 22:32             ` Alexei Starovoitov
@ 2023-06-13 23:16               ` Stanislav Fomichev
  2023-06-14  4:19                 ` Alexei Starovoitov
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-13 23:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Willem de Bruijn, David Ahern, Karlsson, Magnus,
	Björn Töpel, Fijalkowski, Maciej, Network Development

On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > > >> >> > --- UAPI ---
> > > >> >> >
> > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > >> >> > expose any UAPI and are implemented as tracing programs that call
> > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > > >> >> > programs. The series expands device-bound infrastructure to tracing
> > > >> >> > programs.
> > > >> >>
> > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > > >> >> of the XDP data path API, and I think we should expose them as proper
> > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > > >> >> the bpf_mprog thing will give us that?
> > > >> >
> > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > > >> > syscall program will still stay :-(
> > > >>
> > > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > > >> really piggy-backing on the function override thing anyway when the
> > > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > > >
> > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > > us an opportunity to fix things.
> > > > We can do it via a regular syscall path if there is a consensus.
> > >
> > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > > we were to at some point conclude that this whole thing was not useful
> > > at all and deprecate it, whether that means "you can no longer create a
> > > link attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > > program" doesn't really seem like a significant detail to me...
> >
> > In this case, why do you prefer it to go via regular syscall? Seems
> > > like we can avoid a bunch of boilerplate syscall work with a kfunc that
> > does the attachment?
> > We might as well abstract it at, say, libbpf layer which would
> > generate/load this small bpf program to call a kfunc.
>
> I'm not sure we're on the same page here.
> imo using a syscall bpf prog that calls a kfunc to do a per-device attach
> is overkill here.
> It's an experimental feature, but you're already worried about
> multiple netdevs?
>
> > Can you add an empty nop function and attach to it tracing style
> > with fentry?
> > It won't be per-netdev, but do you have to do per-device demux
> > by the kernel? Can your tracing bpf prog do that instead?
> > It's just an ifindex compare.
> > This way the non-uapi bits will be even smaller and no need
> > to change struct net_device.

It's probably going to work if each driver has a separate set of tx
fentry points, something like:
  {veth,mlx5,etc}_devtx_submit()
  {veth,mlx5,etc}_devtx_complete()
Because I still need to have those netdev-bound tracing programs to
get access to driver kfuncs.

I can try to sketch something together for a v2.
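
E.g., for veth, the attach point would look something like this (sketch;
none of these hooks exist yet):

SEC("fentry/veth_devtx_submit")
int BPF_PROG(veth_tx_submit, const struct devtx_frame *frame)
{
        /* loading dev-bound lets the verifier resolve veth-specific kfuncs */
        return 0;
}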

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-12 17:23 [RFC bpf-next 0/7] bpf: netdev TX metadata Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2023-06-12 21:00 ` [RFC bpf-next 0/7] bpf: netdev TX metadata Toke Høiland-Jørgensen
@ 2023-06-14  3:31 ` Jakub Kicinski
  2023-06-14  3:54   ` David Ahern
  8 siblings, 1 reply; 47+ messages in thread
From: Jakub Kicinski @ 2023-06-14  3:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:
> The goal of this series is to add two new standard-ish places
> in the transmit path:
> 
> 1. Right before the packet is transmitted (with access to TX
>    descriptors)

I'm not sure that the Tx descriptors can be populated piecemeal.
If we were ever to support more standard offload features, which
require packet geometry (hdr offsets etc.) to be described, "call
per feature" will end up duplicating arguments, and there will be
a lot of args..

And if there is an SKB path in the future, combining the normal SKB
offloads with the half-rendered descriptors may be a pain.

> 2. Right after the packet is actually transmitted and we've received the
>    completion (again, with access to TX completion descriptors)

For completions, trace-style attach makes perfect sense.

> Accessing TX descriptors unlocks the following use-cases:
> 
> - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> use device offloads. The existing case implements TX timestamp.
> - Observability: global per-netdev hooks can be used for tracing
> the packets and exploring completion descriptors for all sorts of
> device errors.
> 
> Accessing TX descriptors also means that the hooks have to be called
> from the drivers.
> 
> The hooks are a light-weight alternative to XDP at egress and currently

Hm, a lot of lightweight features lately. Unbearable lightness of
features.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14  3:31 ` Jakub Kicinski
@ 2023-06-14  3:54   ` David Ahern
  2023-06-14  5:05     ` Jakub Kicinski
  0 siblings, 1 reply; 47+ messages in thread
From: David Ahern @ 2023-06-14  3:54 UTC (permalink / raw)
  To: Jakub Kicinski, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:
>> The goal of this series is to add two new standard-ish places
>> in the transmit path:
>>
>> 1. Right before the packet is transmitted (with access to TX
>>    descriptors)

If a device requires multiple Tx descriptors per skb or multibuf frame,
how would that be handled within the XDP API?

> 
> I'm not sure that the Tx descriptors can be populated piecemeal.

If it is host memory before the pidx move, why would that matter? Do you
have a specific example in mind?

> If we were ever to support more standard offload features, which
> require packet geometry (hdr offsets etc.) to be described "call
> per feature" will end up duplicating arguments, and there will be
> a lot of args..
> 
> And if there is an SKB path in the future combining the normal SKB
> offloads with the half-rendered descriptors may be a pain.

Once the descriptor(s) is (are) populated, the skb is irrelevant, is it
not? The only complication that comes to mind is wanting to add or remove
headers (e.g., tunnels), which will be much more complicated at this
point, but might still be possible on a per-NIC (and maybe version) basis.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-13 23:16               ` Stanislav Fomichev
@ 2023-06-14  4:19                 ` Alexei Starovoitov
  2023-06-14 11:59                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2023-06-14  4:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Willem de Bruijn, David Ahern, Karlsson, Magnus,
	Björn Töpel, Fijalkowski, Maciej, Network Development

On Tue, Jun 13, 2023 at 4:16 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > > >> >> > --- UAPI ---
> > > > >> >> >
> > > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > > > >> >> > expose any UAPI and are implemented as tracing programs that call
> > > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
> > > > >> >> > programs. The series expands device-bound infrastructure to tracing
> > > > >> >> > programs.
> > > > >> >>
> > > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
> > > > >> >> of the XDP data path API, and I think we should expose them as proper
> > > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
> > > > >> >> the bpf_mprog thing will give us that?
> > > > >> >
> > > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
> > > > >> > syscall program will still stay :-(
> > > > >>
> > > > >> Why does the attachment have to be done this way, exactly? Couldn't we
> > > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
> > > > >> really piggy-backing on the function override thing anyway when the
> > > > >> attachment is per-dev? Or am I misunderstanding how all this works?
> > > > >
> > > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
> > > > > us an opportunity to fix things.
> > > > > We can do it via a regular syscall path if there is a consensus.
> > > >
> > > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
> > > > we were to at some point conclude that this whole thing was not useful
> > > > at all and deprecate it, whether that means "you can no longer create a
> > > > link attachment of this type via BPF_LINK_CREATE" or "you can no longer
> > > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
> > > > program" doesn't really seem like a significant detail to me...
> > >
> > > In this case, why do you prefer it to go via regular syscall? Seems
> > > like we can avoid a bunch of boilerplate syscall work with a kfunc that
> > > does the attachment?
> > > We might as well abstract it at, say, libbpf layer which would
> > > generate/load this small bpf program to call a kfunc.
> >
> > I'm not sure we're on the same page here.
> > imo using a syscall bpf prog that calls a kfunc to do a per-device attach
> > is overkill here.
> > It's an experimental feature, but you're already worried about
> > multiple netdevs?
> >
> > Can you add an empty nop function and attach to it tracing style
> > with fentry?
> > It won't be per-netdev, but do you have to do per-device demux
> > by the kernel? Can your tracing bpf prog do that instead?
> > It's just an ifindex compare.
> > This way the non-uapi bits will be even smaller and no need
> > to change struct net_device.
>
> It's probably going to work if each driver has a separate set of tx
> fentry points, something like:
>   {veth,mlx5,etc}_devtx_submit()
>   {veth,mlx5,etc}_devtx_complete()

Right. And per-driver descriptors.
The 'generic' xdptx metadata is unlikely to be practical.
Marshaling in and out of it is going to be too perf sensitive.
I'd just add an attach point in the driver with enough
args for bpf progs to make sense of the context and extend
the verifier to make a few safe fields writeable.
kfuncs to read/request timestamp are probably too slow.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14  3:54   ` David Ahern
@ 2023-06-14  5:05     ` Jakub Kicinski
  2023-06-14 17:17       ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Kicinski @ 2023-06-14  5:05 UTC (permalink / raw)
  To: David Ahern
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, willemb,
	magnus.karlsson, bjorn, maciej.fijalkowski, netdev

On Tue, 13 Jun 2023 20:54:26 -0700 David Ahern wrote:
> On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> > On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:  
> >> The goal of this series is to add two new standard-ish places
> >> in the transmit path:
> >>
> >> 1. Right before the packet is transmitted (with access to TX
> >>    descriptors)  
> 
> If a device requires multiple Tx descriptors per skb or multibuf frame,
> how would that be handled within the XDP API?
> 
> > I'm not sure that the Tx descriptors can be populated piecemeal.  
> 
> If it is host memory before the pidx move, why would that matter? Do you
> have a specific example in mind?

I don't mean it's impossible to implement, but it may get cumbersome.
TSO/CSO/crypto may all need to know where the L4 header starts, f.e.
Some ECN marking in the NIC may also want to know where L3 is.
So the offsets will get duplicated in each API.

> > If we were ever to support more standard offload features, which
> > require packet geometry (hdr offsets etc.) to be described "call
> > per feature" will end up duplicating arguments, and there will be
> > a lot of args..
> > 
> > And if there is an SKB path in the future combining the normal SKB
> > offloads with the half-rendered descriptors may be a pain.  
> 
> Once the descriptor(s) is (are) populated, the skb is irrelevant is it
> not? Only complication that comes to mind is wanting to add or remove
> headers (e.g., tunnels) which will be much more complicated at this
> point, but might still be possible on a per NIC (and maybe version) basis.

I guess one can write the skb descriptors first, then modify them from
BPF. Either way, I feel like the helper approach for Tx will result
in drivers saving the info into some local struct and then rendering
the descriptors after. We'll see.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-13 19:00     ` Stanislav Fomichev
@ 2023-06-14  7:02       ` Simon Horman
  2023-06-14 17:18         ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Simon Horman @ 2023-06-14  7:02 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Tue, Jun 13, 2023 at 12:00:25PM -0700, Stanislav Fomichev wrote:
> On Tue, Jun 13, 2023 at 8:08 AM Simon Horman <simon.horman@corigine.com> wrote:
> > On Mon, Jun 12, 2023 at 10:23:03AM -0700, Stanislav Fomichev wrote:

...

> > > +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> > > +{
> > > +     rcu_read_lock();
> > > +     devtx_run(netdev, ctx, &netdev->devtx_cp);
> > > +     rcu_read_unlock();
> > > +}
> > > +EXPORT_SYMBOL_GPL(devtx_complete);
> > > +
> > > +/**
> > > + * devtx_sb - Called for every egress netdev packet
> >
> > As this is a kernel doc, it would be good to document the ctx parameter here.
> 
> I didn't really find a convincing way to add a comment. I had the
> following, which I removed prior to submission:
> @ctx devtx_frame context
> 
> But it doesn't seem like it brings anything useful? Or is it ok to keep it that way?

Thanks Stan,

I see what you are saying wrt it not bringing much value.
But I'm more thinking that something is better than nothing.
Anyway, I'll drop this topic if you prefer.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14  4:19                 ` Alexei Starovoitov
@ 2023-06-14 11:59                   ` Toke Høiland-Jørgensen
  2023-06-14 16:27                     ` Alexei Starovoitov
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-14 11:59 UTC (permalink / raw)
  To: Alexei Starovoitov, Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Jun 13, 2023 at 4:16 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Tue, Jun 13, 2023 at 3:32 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> >
>> > On Tue, Jun 13, 2023 at 2:17 PM Stanislav Fomichev <sdf@google.com> wrote:
>> > >
>> > > > >> >> > --- UAPI ---
>> > > > >> >> >
>> > > > >> >> > The hooks are implemented in a HID-BPF style. Meaning they don't
>> > > > >> >> > expose any UAPI and are implemented as tracing programs that call
>> > > > >> >> > a bunch of kfuncs. The attach/detach operation happen via BPF syscall
>> > > > >> >> > programs. The series expands device-bound infrastructure to tracing
>> > > > >> >> > programs.
>> > > > >> >>
>> > > > >> >> Not a fan of the "attach from BPF syscall program" thing. These are part
>> > > > >> >> of the XDP data path API, and I think we should expose them as proper
>> > > > >> >> bpf_link attachments from userspace with introspection etc. But I guess
>> > > > >> >> the bpf_mprog thing will give us that?
>> > > > >> >
>> > > > >> > bpf_mprog will just make those attach kfuncs return the link fd. The
>> > > > >> > syscall program will still stay :-(
>> > > > >>
>> > > > >> Why does the attachment have to be done this way, exactly? Couldn't we
>> > > > >> just use the regular bpf_link attachment from userspace? AFAICT it's not
>> > > > >> really piggy-backing on the function override thing anyway when the
>> > > > >> attachment is per-dev? Or am I misunderstanding how all this works?
>> > > > >
>> > > > > It's UAPI vs non-UAPI. I'm assuming kfunc makes it non-UAPI and gives
>> > > > > us an opportunity to fix things.
>> > > > > We can do it via a regular syscall path if there is a consensus.
>> > > >
>> > > > Yeah, the API exposed to the BPF program is kfunc-based in any case. If
>> > > > we were to at some point conclude that this whole thing was not useful
>> > > > at all and deprecate it, whether that means "you can no longer create a
>> > > > link attachment of this type via BPF_LINK_CREATE" or "you can no longer
>> > > > create a link attachment of this type via BPF_PROG_RUN of a syscall type
>> > > > program" doesn't really seem like a significant detail to me...
>> > >
>> > > In this case, why do you prefer it to go via regular syscall? Seems
>> > > like we can avoid a bunch of boilerplate syscall work with a kfunc that
>> > > does the attachment?
>> > > We might as well abstract it at, say, libbpf layer which would
>> > > generate/load this small bpf program to call a kfunc.
>> >
>> > I'm not sure we're on the same page here.
>> > imo using a syscall bpf prog that calls a kfunc to do a per-device attach
>> > is overkill here.
>> > It's an experimental feature, but you're already worried about
>> > multiple netdevs?
>> >
>> > Can you add an empty nop function and attach to it tracing style
>> > with fentry?
>> > It won't be per-netdev, but do you have to do per-device demux
>> > by the kernel? Can your tracing bpf prog do that instead?
>> > It's just an ifindex compare.
>> > This way the non-uapi bits will be even smaller and no need
>> > to change struct net_device.
>>
>> It's probably going to work if each driver has a separate set of tx
>> fentry points, something like:
>>   {veth,mlx5,etc}_devtx_submit()
>>   {veth,mlx5,etc}_devtx_complete()

I really don't get the opposition to exposing proper APIs; as a
dataplane developer I want to attach a program to an interface. The
kernel's role is to provide a consistent interface for this, not to
require users to become driver developers just to get at the required
details.

> Right. And per-driver descriptors.
> The 'generic' xdptx metadata is unlikely to be practical.
> Marshaling in and out of it is going to be too perf sensitive.
> I'd just add an attach point in the driver with enough
> args for bpf progs to make sense of the context and extend
> the verifier to make a few safe fields writeable.

This is a rehashing of the argument we had on the RX side: just exposing
descriptors is a bad API because it forces BPF programs to deal with
hardware errata - which is exactly the kind of thing that belongs in a
driver. Exposing kfuncs allows drivers to deal with hardware quirks
while exposing a consistent API to (BPF) users.

> kfuncs to read/request timestamp are probably too slow.

Which is why we should be inlining them :)

-Toke

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14 11:59                   ` Toke Høiland-Jørgensen
@ 2023-06-14 16:27                     ` Alexei Starovoitov
  2023-06-15 12:36                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2023-06-14 16:27 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Willem de Bruijn,
	David Ahern, Karlsson, Magnus, Björn Töpel,
	Fijalkowski, Maciej, Network Development

On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> >>
> >> It's probably going to work if each driver has a separate set of tx
> >> fentry points, something like:
> >>   {veth,mlx5,etc}_devtx_submit()
> >>   {veth,mlx5,etc}_devtx_complete()
>
> I really don't get the opposition to exposing proper APIs; as a
> dataplane developer I want to attach a program to an interface. The
> kernel's role is to provide a consistent interface for this, not to
> require users to become driver developers just to get at the required
> details.

A consistent interface can appear only when there is consistency
across NIC manufacturers.
I'm suggesting experimenting in the most unstable way and
generalizing if/when consistency is discovered.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14  5:05     ` Jakub Kicinski
@ 2023-06-14 17:17       ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-14 17:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, willemb, magnus.karlsson,
	bjorn, maciej.fijalkowski, netdev

On 06/13, Jakub Kicinski wrote:
> On Tue, 13 Jun 2023 20:54:26 -0700 David Ahern wrote:
> > On 6/13/23 9:31 PM, Jakub Kicinski wrote:
> > > On Mon, 12 Jun 2023 10:23:00 -0700 Stanislav Fomichev wrote:  
> > >> The goal of this series is to add two new standard-ish places
> > >> in the transmit path:
> > >>
> > >> 1. Right before the packet is transmitted (with access to TX
> > >>    descriptors)  
> > 
> > If a device requires multiple Tx descriptors per skb or multibuf frame,
> > how would that be handled within the XDP API?
> > 
> > > I'm not sure that the Tx descriptors can be populated piecemeal.  
> > 
> > If it is host memory before the pidx move, why would that matter? Do you
> > have a specific example in mind?
> 
> I don't mean it's impossible to implement, but it may get cumbersome.
> TSO/CSO/crypto may all need to know where the L4 header starts, f.e.
> Some ECN marking in the NIC may also want to know where L3 is.
> So the offsets will get duplicated in each API.
> 
> > > If we were ever to support more standard offload features, which
> > > require packet geometry (hdr offsets etc.) to be described "call
> > > per feature" will end up duplicating arguments, and there will be
> > > a lot of args..
> > > 
> > > And if there is an SKB path in the future combining the normal SKB
> > > offloads with the half-rendered descriptors may be a pain.  
> > 
> > Once the descriptor(s) is (are) populated, the skb is irrelevant is it
> > not? Only complication that comes to mind is wanting to add or remove
> > headers (e.g., tunnels) which will be much more complicated at this
> > point, but might still be possible on a per NIC (and maybe version) basis.
> 
> I guess one can write the skb descriptors first, then modify them from
> the BPF. Either way I feel like the helper approach for Tx will result
> in drivers saving the info into some local struct and then rendering
> the descriptors after. We'll see.

I agree that it's probably the "easiest" option to implement for the
majority of the devices that were designed without much
programmability this late in the stack. But maybe some devices can,
or at least we can try to influence future designs :-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-14  7:02       ` Simon Horman
@ 2023-06-14 17:18         ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-14 17:18 UTC (permalink / raw)
  To: Simon Horman
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On 06/14, Simon Horman wrote:
> On Tue, Jun 13, 2023 at 12:00:25PM -0700, Stanislav Fomichev wrote:
> > On Tue, Jun 13, 2023 at 8:08 AM Simon Horman <simon.horman@corigine.com> wrote:
> > > On Mon, Jun 12, 2023 at 10:23:03AM -0700, Stanislav Fomichev wrote:
> 
> ...
> 
> > > > +void devtx_complete(struct net_device *netdev, struct devtx_frame *ctx)
> > > > +{
> > > > +     rcu_read_lock();
> > > > +     devtx_run(netdev, ctx, &netdev->devtx_cp);
> > > > +     rcu_read_unlock();
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(devtx_complete);
> > > > +
> > > > +/**
> > > > + * devtx_sb - Called for every egress netdev packet
> > >
> > > As this is a kernel doc, it would be good to document the ctx parameter here.
> > 
> > I didn't really find a convincing way to add a comment, I've had the
> > following which I've removed prio to submission:
> > @ctx devtx_frame context
> > 
> > But it doesn't seem like it brings anything useful? Or ok to keep it that way?
> 
> Thanks Stan,
> 
> I see what you are saying wrt it not bringing much value.
> But I'm more thinking that something is better than nothing.
> Anyway, I'll drop this topic if you prefer.

Ack, thanks, I'll put it in for consistency's sake!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-14 16:27                     ` Alexei Starovoitov
@ 2023-06-15 12:36                       ` Toke Høiland-Jørgensen
  2023-06-15 16:10                         ` Alexei Starovoitov
  0 siblings, 1 reply; 47+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-15 12:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Willem de Bruijn,
	David Ahern, Karlsson, Magnus, Björn Töpel,
	Fijalkowski, Maciej, Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> >>
>> >> It's probably going to work if each driver has a separate set of tx
>> >> fentry points, something like:
>> >>   {veth,mlx5,etc}_devtx_submit()
>> >>   {veth,mlx5,etc}_devtx_complete()
>>
>> I really don't get the opposition to exposing proper APIs; as a
>> dataplane developer I want to attach a program to an interface. The
>> kernel's role is to provide a consistent interface for this, not to
>> require users to become driver developers just to get at the required
>> details.
>
> A consistent interface can appear only when there is consistency
> across NIC manufacturers.
> I'm suggesting we experiment in the most unstable way and,
> if/when consistency is discovered, then generalize.

That would be fine for new experimental HW features, but we're talking
about timestamps here: a feature that is already supported by multiple
drivers and for which the stack has a working abstraction. There's no
reason why we can't have that for the XDP path as well.
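
(For reference, the skb-path abstraction in question boils down to the
following; a simplified sketch, where tx_desc and DRV_TSTAMP_BIT stand
in for device-specific descriptor bits:)

	/* Userspace opts in per socket: */
	int on = SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &on, sizeof(on));

	/* The driver then checks, per packet, at submission time: */
	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
		tx_desc->flags |= DRV_TSTAMP_BIT;

	/* ...and reports the stamp back on completion, where hwtstamps is a
	 * struct skb_shared_hwtstamps filled from the completion descriptor:
	 */
	skb_tstamp_tx(skb, &hwtstamps);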

-Toke


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-15 12:36                       ` Toke Høiland-Jørgensen
@ 2023-06-15 16:10                         ` Alexei Starovoitov
  2023-06-15 16:31                           ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2023-06-15 16:10 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Willem de Bruijn,
	David Ahern, Karlsson, Magnus, Björn Töpel,
	Fijalkowski, Maciej, Network Development

On Thu, Jun 15, 2023 at 5:36 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> >>
> >> >> It's probably going to work if each driver has a separate set of tx
> >> >> fentry points, something like:
> >> >>   {veth,mlx5,etc}_devtx_submit()
> >> >>   {veth,mlx5,etc}_devtx_complete()
> >>
> >> I really don't get the opposition to exposing proper APIs; as a
> >> dataplane developer I want to attach a program to an interface. The
> >> kernel's role is to provide a consistent interface for this, not to
> >> require users to become driver developers just to get at the required
> >> details.
> >
> > A consistent interface can appear only when there is consistency
> > across NIC manufacturers.
> > I'm suggesting we experiment in the most unstable way and,
> > if/when consistency is discovered, then generalize.
>
> That would be fine for new experimental HW features, but we're talking
> about timestamps here: a feature that is already supported by multiple
> drivers and for which the stack has a working abstraction. There's no
> reason why we can't have that for the XDP path as well.

... has an abstraction to receive, but has no mechanism to set it
selectively per packet and read it on completion.


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-15 16:10                         ` Alexei Starovoitov
@ 2023-06-15 16:31                           ` Stanislav Fomichev
  2023-06-16  1:50                             ` Jakub Kicinski
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-15 16:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Willem de Bruijn, David Ahern, Karlsson, Magnus,
	Björn Töpel, Fijalkowski, Maciej, Network Development

On Thu, Jun 15, 2023 at 9:10 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 15, 2023 at 5:36 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Wed, Jun 14, 2023 at 5:00 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > >>
> > >> >>
> > >> >> It's probably going to work if each driver has a separate set of tx
> > >> >> fentry points, something like:
> > >> >>   {veth,mlx5,etc}_devtx_submit()
> > >> >>   {veth,mlx5,etc}_devtx_complete()
> > >>
> > >> I really don't get the opposition to exposing proper APIs; as a
> > >> dataplane developer I want to attach a program to an interface. The
> > >> kernel's role is to provide a consistent interface for this, not to
> > >> require users to become driver developers just to get at the required
> > >> details.
> > >
> > > A consistent interface can appear only when there is consistency
> > > across NIC manufacturers.
> > > I'm suggesting we experiment in the most unstable way and,
> > > if/when consistency is discovered, then generalize.
> >
> > That would be fine for new experimental HW features, but we're talking
> > about timestamps here: a feature that is already supported by multiple
> > drivers and for which the stack has a working abstraction. There's no
> > reason why we can't have that for the XDP path as well.
>
> ... has an abstraction to receive, but has no mechanism to set it
> selectively per packet and read it on completion.

Timestamp might be a bit of an outlier here where it's just setting
some bit in some existing descriptor.
For some features, the drivers might have to reserve extra descriptors
and I'm not sure how safe it would be to let the programs arbitrarily
mess with the descriptor queues like that.
I'll probably keep this kfunc approach for v2 rfc for now (will try to
get rid of the complicated attachment at least), but let's keep
discussing.


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-12 21:00 ` [RFC bpf-next 0/7] bpf: netdev TX metadata Toke Høiland-Jørgensen
  2023-06-13 16:32   ` Stanislav Fomichev
@ 2023-06-16  0:09   ` Stanislav Fomichev
  2023-06-16  8:12     ` Magnus Karlsson
  1 sibling, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-16  0:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, willemb, dsahern, magnus.karlsson, bjorn,
	maciej.fijalkowski, netdev

On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Some immediate thoughts after glancing through this:
>
> > --- Use cases ---
> >
> > The goal of this series is to add two new standard-ish places
> > in the transmit path:
> >
> > 1. Right before the packet is transmitted (with access to TX
> >    descriptors)
> > 2. Right after the packet is actually transmitted and we've received the
> >    completion (again, with access to TX completion descriptors)
> >
> > Accessing TX descriptors unlocks the following use-cases:
> >
> > - Setting device hints at TX: XDP/AF_XDP might use these new hooks to
> > use device offloads. The existing case implements TX timestamp.
> > - Observability: global per-netdev hooks can be used for tracing
> > the packets and exploring completion descriptors for all sorts of
> > device errors.
> >
> > Accessing TX descriptors also means that the hooks have to be called
> > from the drivers.
> >
> > The hooks are a light-weight alternative to XDP at egress and currently
> > don't provide any packet modification abilities. However, eventually,
> > they can expose new kfuncs to operate on the packet (or, rather, the
> > actual descriptors, for performance's sake).
>
> dynptr?
>
> > --- UAPI ---
> >
> > The hooks are implemented in a HID-BPF style. Meaning they don't
> > expose any UAPI and are implemented as tracing programs that call
> > a bunch of kfuncs. The attach/detach operations happen via BPF syscall
> > programs. The series expands device-bound infrastructure to tracing
> > programs.
>
> Not a fan of the "attach from BPF syscall program" thing. These are part
> of the XDP data path API, and I think we should expose them as proper
> bpf_link attachments from userspace with introspection etc. But I guess
> the bpf_mprog thing will give us that?
>
> > --- skb vs xdp ---
> >
> > The hooks operate on a new light-weight devtx_frame which contains:
> > - data
> > - len
> > - sinfo
> >
> > > This should allow us to have a unified (from a BPF PoV) place at TX
> > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > for each invocation).
>
> Not sure what I think about this one. At the very least I think we
> should expose xdp->data_meta as well. I'm not sure what the use case for
> accessing skbs is? If that *is* indeed useful, probably there will also
> end up being a use case for accessing the full skb?

I spent some time looking at the data_meta story on AF_XDP TX and it
doesn't look like it's supported (at least in a general way).
You obviously get some data_meta when you do XDP_TX, but if you want
to pass something to the bpf prog when doing TX via the AF_XDP ring,
it gets complicated.
In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
and pass something in the headroom.
In copy mode, there is no support for doing skb_metadata_set.

It probably makes sense to have something like tx_metalen on the xsk? And
call skb_metadata_set in copy mode and skip it in zero-copy mode?
Or maybe I'm missing something?


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-15 16:31                           ` Stanislav Fomichev
@ 2023-06-16  1:50                             ` Jakub Kicinski
  0 siblings, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2023-06-16  1:50 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Network Development

On Thu, 15 Jun 2023 09:31:19 -0700 Stanislav Fomichev wrote:
> Timestamp might be a bit of an outlier here where it's just setting
> some bit in some existing descriptor.
> For some features, the drivers might have to reserve extra descriptors
> and I'm not sure how safe it would be to let the programs arbitrarily
> mess with the descriptor queues like that.

I was gonna say, most NICs will have some form of descriptor chaining,
with strict ordering, to perform reasonably with small packets.

> I'll probably keep this kfunc approach for v2 rfc for now (will try to
> get rid of the complicated attachment at least), but let's keep
> discussing.

SGTM, FWIW.


* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-12 17:23 ` [RFC bpf-next 3/7] bpf: implement devtx hook points Stanislav Fomichev
  2023-06-13 14:54   ` Willem de Bruijn
  2023-06-13 15:08   ` Simon Horman
@ 2023-06-16  5:46   ` Kui-Feng Lee
  2023-06-16 17:32     ` Stanislav Fomichev
  2 siblings, 1 reply; 47+ messages in thread
From: Kui-Feng Lee @ 2023-06-16  5:46 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev



On 6/12/23 10:23, Stanislav Fomichev wrote:
..... cut .....
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/**
> + * bpf_devtx_sb_attach - Attach devtx 'packet submit' program
> + * @ifindex: netdev interface index.
> + * @prog_fd: BPF program file descriptor.
> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + */
> +__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
> +{
> +	struct net_device *netdev;
> +	int ret;
> +
> +	netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> +	if (!netdev)
> +		return -EINVAL;
> +
> +	mutex_lock(&devtx_attach_lock);
> +	ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_sb", &netdev->devtx_sb);
> +	mutex_unlock(&devtx_attach_lock);
> +
> +	dev_put(netdev);
> +
> +	return ret;
> +}

How about adding another detach kfunc instead of overloading
this one? That would be easier to understand.
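
(For illustration, a standalone detach kfunc reusing the helpers from
this patch could be as small as the sketch below; bpf_devtx_sb_detach
is a made-up name:)

	__bpf_kfunc int bpf_devtx_sb_detach(int ifindex)
	{
		struct net_device *netdev;

		netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
		if (!netdev)
			return -EINVAL;

		mutex_lock(&devtx_attach_lock);
		__bpf_devtx_detach(netdev, &netdev->devtx_sb);
		mutex_unlock(&devtx_attach_lock);

		dev_put(netdev);
		return 0;
	}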

> +
> +/**
> + * bpf_devtx_cp_attach - Attach devtx 'packet complete' program
> + * @ifindex: netdev interface index.
> + * @prog_fd: BPF program file descriptor.
> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + */
> +__bpf_kfunc int bpf_devtx_cp_attach(int ifindex, int prog_fd)
> +{
> +	struct net_device *netdev;
> +	int ret;
> +
> +	netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> +	if (!netdev)
> +		return -EINVAL;
> +
> +	mutex_lock(&devtx_attach_lock);
> +	ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_cp", &netdev->devtx_cp);
> +	mutex_unlock(&devtx_attach_lock);
> +
> +	dev_put(netdev);
> +
> +	return ret;
> +}
> +
> +__diag_pop();
> +
> +bool is_devtx_kfunc(u32 kfunc_id)
> +{
> +	return !!btf_id_set8_contains(&bpf_devtx_hook_ids, kfunc_id);
> +}
> +
> +void devtx_shutdown(struct net_device *netdev)
> +{
> +	mutex_lock(&devtx_attach_lock);
> +	__bpf_devtx_detach(netdev, &netdev->devtx_sb);
> +	__bpf_devtx_detach(netdev, &netdev->devtx_cp);
> +	mutex_unlock(&devtx_attach_lock);
> +}
> +
> +BTF_SET8_START(bpf_devtx_syscall_kfunc_ids)
> +BTF_ID_FLAGS(func, bpf_devtx_sb_attach)
> +BTF_ID_FLAGS(func, bpf_devtx_cp_attach)
> +BTF_SET8_END(bpf_devtx_syscall_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set bpf_devtx_syscall_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &bpf_devtx_syscall_kfunc_ids,
> +};
> +
> +static int __init devtx_init(void)
> +{
> +	int ret;
> +
> +	ret = register_btf_fmodret_id_set(&bpf_devtx_hook_set);
> +	if (ret) {
> +		pr_warn("failed to register devtx hooks: %d", ret);
> +		return ret;
> +	}
> +
> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_devtx_syscall_kfunc_set);
> +	if (ret) {
> +		pr_warn("failed to register syscall kfuncs: %d", ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +late_initcall(devtx_init);


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-16  0:09   ` Stanislav Fomichev
@ 2023-06-16  8:12     ` Magnus Karlsson
  2023-06-16 17:32       ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Magnus Karlsson @ 2023-06-16  8:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	willemb, dsahern, magnus.karlsson, bjorn, maciej.fijalkowski,
	netdev

On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
> > Some immediate thoughts after glancing through this:
> >
> > [...]
> >
> > > --- skb vs xdp ---
> > >
> > > The hooks operate on a new light-weight devtx_frame which contains:
> > > - data
> > > - len
> > > - sinfo
> > >
> > > This should allow us to have a unified (from a BPF PoV) place at TX
> > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > for each invocation).
> >
> > Not sure what I think about this one. At the very least I think we
> > should expose xdp->data_meta as well. I'm not sure what the use case for
> > accessing skbs is? If that *is* indeed useful, probably there will also
> > end up being a use case for accessing the full skb?
>
> I spent some time looking at the data_meta story on AF_XDP TX and it
> doesn't look like it's supported (at least in a general way).
> You obviously get some data_meta when you do XDP_TX, but if you want
> to pass something to the bpf prog when doing TX via the AF_XDP ring,
> it gets complicated.

When we designed this some 5 - 6 years ago, we thought that there
would be an XDP for egress action in the "nearish" future that could
be used to interpret the metadata field in front of the packet.
Basically, the user would load an XDP egress program that would define
the metadata layout by the operations it would perform on the metadata
area. But since XDP on egress has not happened, you are right, there
is definitely something missing to be able to use metadata on Tx. Or
could your proposed hook points be used for something like this?

> In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> and pass something in the headroom.

This feature is mainly used to allow for multiple packets on the same
chunk (to save space) and also to be able to have packets spanning two
chunks. Even in aligned mode, you can start a packet at an arbitrary
address in the chunk as long as the whole packet fits into the chunk.
So no problem having headroom in any of the modes.

* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-16  8:12     ` Magnus Karlsson
@ 2023-06-16 17:32       ` Stanislav Fomichev
  2023-06-16 23:10         ` Stanislav Fomichev
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-16 17:32 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Toke Høiland-Jørgensen, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	willemb, dsahern, magnus.karlsson, bjorn, maciej.fijalkowski,
	netdev

On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 2:01 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > >
> > > Some immediate thoughts after glancing through this:
> > >
> > > [...]
> > >
> > > > --- skb vs xdp ---
> > > >
> > > > The hooks operate on a new light-weight devtx_frame which contains:
> > > > - data
> > > > - len
> > > > - sinfo
> > > >
> > > > This should allow us to have a unified (from a BPF PoV) place at TX
> > > > and not be super-taxing (we need to copy 2 pointers + len to the stack
> > > > for each invocation).
> > >
> > > Not sure what I think about this one. At the very least I think we
> > > should expose xdp->data_meta as well. I'm not sure what the use case for
> > > accessing skbs is? If that *is* indeed useful, probably there will also
> > > end up being a use case for accessing the full skb?
> >
> > I spent some time looking at the data_meta story on AF_XDP TX and it
> > doesn't look like it's supported (at least in a general way).
> > You obviously get some data_meta when you do XDP_TX, but if you want
> > to pass something to the bpf prog when doing TX via the AF_XDP ring,
> > it gets complicated.
>
> When we designed this some 5 - 6 years ago, we thought that there
> would be an XDP for egress action in the "nearish" future that could
> be used to interpret the metadata field in front of the packet.
> Basically, the user would load an XDP egress program that would define
> the metadata layout by the operations it would perform on the metadata
> area. But since XDP on egress has not happened, you are right, there
> is definitely something missing to be able to use metadata on Tx. Or
> could your proposed hook points be used for something like this?

Thanks for the context!
Yes, the proposal is to use these new tx hooks to read out af_xdp
metadata and apply it to the packet via a bunch of tbd kfuncs.
AF_XDP and BPF programs would have to have a contract about the
metadata layout (same as we have on rx).
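
(Such a contract could be as simple as a fixed struct that the
application writes into the headroom right before the packet and the
devtx program reads back; a hypothetical layout, for the sake of
example:)

	/* Agreed on out of band between the AF_XDP app and the BPF prog: */
	struct xsk_tx_metadata {
		__u64 flags;		/* e.g. "request TX timestamp" */
		__u16 csum_start;	/* for a future csum offload kfunc */
		__u16 csum_offset;
	};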

> > In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> > and pass something in the headroom.
>
> This feature is mainly used to allow for multiple packets on the same
> chunk (to save space) and also to be able to have packets spanning two
> chunks. Even in aligned mode, you can start a packet at an arbitrary
> address in the chunk as long as the whole packet fits into the chunk.
> So no problem having headroom in any of the modes.

But if I put it into the headroom it will only be passed down to the
driver in zero-copy mode, right?
If I do tx_desc->addr = packet_start, no metadata (that goes prior to
packet_start) gets copied into the skb in copy mode (it seems).
Or do you suggest that the interface should be tx_desc->addr =
metadata_start and the bpf program should call the equivalent of
bpf_xdp_adjust_head to consume this metadata?
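
(Spelling out the two layouts in question; illustrative only:)

	/*
	 * Option A: desc->addr points at the packet, metadata sits
	 * implicitly in front of it:
	 *
	 *   | headroom | metadata | packet ............. |
	 *                         ^ desc->addr, desc->len = packet only
	 *
	 * Option B: desc->addr points at the metadata and the program
	 * consumes it, bpf_xdp_adjust_head()-style:
	 *
	 *   | headroom | metadata | packet ............. |
	 *              ^ desc->addr, desc->len = metadata + packet
	 */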


* Re: [RFC bpf-next 3/7] bpf: implement devtx hook points
  2023-06-16  5:46   ` Kui-Feng Lee
@ 2023-06-16 17:32     ` Stanislav Fomichev
  0 siblings, 0 replies; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-16 17:32 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Thu, Jun 15, 2023 at 10:47 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 6/12/23 10:23, Stanislav Fomichev wrote:
> ..... cut .....
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +               "Global functions as their definitions will be in vmlinux BTF");
> > +
> > +/**
> > + * bpf_devtx_sb_attach - Attach devtx 'packet submit' program
> > + * @ifindex: netdev interface index.
> > + * @prog_fd: BPF program file descriptor.
> > + *
> > + * Return:
> > + * * Returns 0 on success or ``-errno`` on error.
> > + */
> > +__bpf_kfunc int bpf_devtx_sb_attach(int ifindex, int prog_fd)
> > +{
> > +     struct net_device *netdev;
> > +     int ret;
> > +
> > +     netdev = dev_get_by_index(current->nsproxy->net_ns, ifindex);
> > +     if (!netdev)
> > +             return -EINVAL;
> > +
> > +     mutex_lock(&devtx_attach_lock);
> > +     ret = __bpf_devtx_attach(netdev, prog_fd, "devtx_sb", &netdev->devtx_sb);
> > +     mutex_unlock(&devtx_attach_lock);
> > +
> > +     dev_put(netdev);
> > +
> > +     return ret;
> > +}
>
> How about adding another detach kfunc instead of overloading
> this one? That would be easier to understand.

Originally I was planning to switch to links, which would have avoided
the need for an explicit detach.
But, as discussed, I'll drop it all for v2 and will go with regular fentry.
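
(i.e., instead of the kfunc-based attach above, the program would bind
to the driver's hook directly; roughly like the sketch below, with
veth as the example driver and veth_devtx_submit as the hypothetical
per-driver hook point:)

	SEC("fentry/veth_devtx_submit")
	int BPF_PROG(tx_submit, struct devtx_frame *frame)
	{
		/* frame->data / frame->len / frame->sinfo, per this series */
		return 0;
	}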


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-16 17:32       ` Stanislav Fomichev
@ 2023-06-16 23:10         ` Stanislav Fomichev
  2023-06-19  7:15           ` Magnus Karlsson
  0 siblings, 1 reply; 47+ messages in thread
From: Stanislav Fomichev @ 2023-06-16 23:10 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Toke Høiland-Jørgensen, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	willemb, dsahern, magnus.karlsson, bjorn, maciej.fijalkowski,
	netdev

On 06/16, Stanislav Fomichev wrote:
> On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Fri, 16 Jun 2023 at 02:09, Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > [...]
> >
> > When we designed this some 5 - 6 years ago, we thought that there
> > would be an XDP for egress action in the "nearish" future that could
> > be used to interpret the metadata field in front of the packet.
> > Basically, the user would load an XDP egress program that would define
> > the metadata layout by the operations it would perform on the metadata
> > area. But since XDP on egress has not happened, you are right, there
> > is definitely something missing to be able to use metadata on Tx. Or
> > could your proposed hook points be used for something like this?
> 
> Thanks for the context!
> Yes, the proposal is to use these new tx hooks to read out af_xdp
> metadata and apply it to the packet via a bunch of tbd kfuncs.
> AF_XDP and BPF programs would have to have a contract about the
> metadata layout (same as we have on rx).
> 
> > > In zerocopy mode, we can probably use XDP_UMEM_UNALIGNED_CHUNK_FLAG
> > > and pass something in the headroom.
> >
> > This feature is mainly used to allow for multiple packets on the same
> > chunk (to save space) and also to be able to have packets spanning two
> > chunks. Even in aligned mode, you can start a packet at an arbitrary
> > address in the chunk as long as the whole packet fits into the chunk.
> > So no problem having headroom in any of the modes.
> 
> But if I put it into the headroom it will only be passed down to the
> driver in zero-copy mode, right?
> If I do tx_desc->addr = packet_start, no metadata (that goes prior to
> packet_start) gets copied into the skb in copy mode (it seems).
> Or do you suggest that the interface should be tx_desc->addr =
> metadata_start and the bpf program should call the equivalent of
> bpf_xdp_adjust_head to consume this metadata?

For copy-mode, here is what I've prototyped. That seems to work.
For zero-copy, I don't think we need anything extra (besides exposing
xsk->tx_meta_len at the hook point, tbd). Does the patch below make
sense?

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..30018b3b862d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
 	struct list_head flush_node;
 	struct xsk_buff_pool *pool;
 	u16 queue_id;
+	u8 tx_metadata_len;
 	bool zc;
 	enum {
 		XSK_READY = 0,
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a8096f4ce..2374eafff7db 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc1e7f15fa73..a95872712547 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -493,14 +493,21 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-		skb_put(skb, len);
+		skb_put(skb, len + xs->tx_metadata_len);
 
 		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		buffer -= xs->tx_metadata_len;
+
-		err = skb_store_bits(skb, 0, buffer, len);
+		err = skb_store_bits(skb, 0, buffer, len + xs->tx_metadata_len);
 		if (unlikely(err)) {
 			kfree_skb(skb);
 			return ERR_PTR(err);
 		}
+
+		if (xs->tx_metadata_len) {
+			skb_metadata_set(skb, xs->tx_metadata_len);
+			__skb_pull(skb, xs->tx_metadata_len);
+		}
 	}
 
 	skb->dev = dev;
@@ -1137,6 +1144,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_TX_METADATA_LEN:
+	{
+		int val;
+
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_sockptr(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		if (val >= 256)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		if (xs->state != XSK_READY) {
+			mutex_unlock(&xs->mutex);
+			return -EBUSY;
+		}
+		xs->tx_metadata_len = val;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
 	default:
 		break;
 	}
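
(From the application side, using this would then be a one-time setup
before bind(); a sketch, with xsk_fd being the AF_XDP socket:)

	int meta_len = 16; /* whatever layout the BPF prog expects, < 256 */

	if (setsockopt(xsk_fd, SOL_XDP, XDP_TX_METADATA_LEN,
		       &meta_len, sizeof(meta_len)))
		perror("setsockopt(XDP_TX_METADATA_LEN)");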


* Re: [RFC bpf-next 0/7] bpf: netdev TX metadata
  2023-06-16 23:10         ` Stanislav Fomichev
@ 2023-06-19  7:15           ` Magnus Karlsson
  0 siblings, 0 replies; 47+ messages in thread
From: Magnus Karlsson @ 2023-06-19  7:15 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	willemb, dsahern, magnus.karlsson, bjorn, maciej.fijalkowski,
	netdev

On Sat, 17 Jun 2023 at 01:10, Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/16, Stanislav Fomichev wrote:
> > On Fri, Jun 16, 2023 at 1:13 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > > [...]
>
> For copy-mode, here is what I've prototyped. That seems to work.
> For zero-copy, I don't think we need anything extra (besides exposing
> xsk->tx_meta_len at the hook point, tbd). Does the patch below make
> sense?

Was just going to suggest adding a setsockopt, so this makes perfect
sense to me. Thanks!


Thread overview: 47+ messages
2023-06-12 17:23 [RFC bpf-next 0/7] bpf: netdev TX metadata Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 1/7] bpf: rename some xdp-metadata functions into dev-bound Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 2/7] bpf: resolve single typedef when walking structs Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 3/7] bpf: implement devtx hook points Stanislav Fomichev
2023-06-13 14:54   ` Willem de Bruijn
2023-06-13 19:00     ` Stanislav Fomichev
2023-06-13 19:29       ` Willem de Bruijn
2023-06-13 15:08   ` Simon Horman
2023-06-13 19:00     ` Stanislav Fomichev
2023-06-14  7:02       ` Simon Horman
2023-06-14 17:18         ` Stanislav Fomichev
2023-06-16  5:46   ` Kui-Feng Lee
2023-06-16 17:32     ` Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 4/7] bpf: implement devtx timestamp kfunc Stanislav Fomichev
2023-06-13 15:14   ` Simon Horman
2023-06-13 18:39     ` Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 5/7] net: veth: implement devtx timestamp kfuncs Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 6/7] selftests/bpf: extend xdp_metadata with devtx kfuncs Stanislav Fomichev
2023-06-13 14:47   ` Willem de Bruijn
2023-06-13 19:00     ` Stanislav Fomichev
2023-06-12 17:23 ` [RFC bpf-next 7/7] selftests/bpf: extend xdp_hw_metadata " Stanislav Fomichev
2023-06-13 15:03   ` Willem de Bruijn
2023-06-13 19:00     ` Stanislav Fomichev
2023-06-12 21:00 ` [RFC bpf-next 0/7] bpf: netdev TX metadata Toke Høiland-Jørgensen
2023-06-13 16:32   ` Stanislav Fomichev
2023-06-13 17:18     ` Toke Høiland-Jørgensen
2023-06-13 18:39       ` Stanislav Fomichev
2023-06-13 19:10         ` Toke Høiland-Jørgensen
2023-06-13 21:17           ` Stanislav Fomichev
2023-06-13 22:32             ` Alexei Starovoitov
2023-06-13 23:16               ` Stanislav Fomichev
2023-06-14  4:19                 ` Alexei Starovoitov
2023-06-14 11:59                   ` Toke Høiland-Jørgensen
2023-06-14 16:27                     ` Alexei Starovoitov
2023-06-15 12:36                       ` Toke Høiland-Jørgensen
2023-06-15 16:10                         ` Alexei Starovoitov
2023-06-15 16:31                           ` Stanislav Fomichev
2023-06-16  1:50                             ` Jakub Kicinski
2023-06-16  0:09   ` Stanislav Fomichev
2023-06-16  8:12     ` Magnus Karlsson
2023-06-16 17:32       ` Stanislav Fomichev
2023-06-16 23:10         ` Stanislav Fomichev
2023-06-19  7:15           ` Magnus Karlsson
2023-06-14  3:31 ` Jakub Kicinski
2023-06-14  3:54   ` David Ahern
2023-06-14  5:05     ` Jakub Kicinski
2023-06-14 17:17       ` Stanislav Fomichev
