All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC bpf-next v2 00/11] bpf: Netdev TX metadata
@ 2023-06-21 17:02 Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 01/11] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
                   ` (11 more replies)
  0 siblings, 12 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, brouer, netdev

--- Changes since RFC v1 ---

- Support passing metadata via XSK
  - Showcase how to consume this metadata at TX in the selftests
- Sample untested mlx5 implementation
- Simplify attach/detach story with simple global fentry (Alexei)
- Add 'return 0' in xdp_metadata selftest (Willem)
- Add missing 'sizeof(*ip6h)' in xdp_hw_metadata selftest (Willem)
- Document 'timestamp' argument of kfunc (Simon)
- Not relevant due to attach/detach rework:
  - s/devtx_sb/devtx_submit/ in netdev (Willem)
  - s/devtx_cp/devtx_complete/ in netdev (Willem)
  - Document 'devtx_complete' and 'devtx_submit' in netdev (Simon)
  - Add devtx_sb/devtx_cp forward declaration (Simon)
  - Add missing __rcu/rcu_dereference annotations (Simon)

v1: https://lore.kernel.org/bpf/CAJ8uoz2zOHpBRfKhN97eR0VWipBTxnh=R9G57Z2UUujX4JzneQ@mail.gmail.com/T/#md354573364f75a8598e443dd51114b4feb4c3714

--- Use cases ---

The goal of this series is to add two new standard-ish places
in the transmit path:

1. Right before the packet is transmitted (with access to TX
   descriptors)
2. Right after the packet is actually transmitted and we've received the
   completion (again, with access to TX completion descriptors)

Accessing TX descriptors unlocks the following use-cases:

- Setting device hints at TX: XDP/AF_XDP might use these new hooks to
use device offloads. The existing case implements TX timestamp.
- Observability: global per-netdev hooks can be used for tracing
the packets and exploring completion descriptors for all sorts of
device errors.

Accessing TX descriptors also means that the hooks have to be called
from the drivers.

The hooks are a light-weight alternative to XDP at egress and currently
don't provide any packet modification abilities. However, eventually,
can expose new kfuncs to operate on the packet (or, rather, the actual
descriptors; for performance sake).

--- UAPI ---

The hooks are implemented in a HID-BPF style. Meaning they don't
expose any UAPI and are implemented as tracing programs that call
a bunch of kfuncs. The attach/detach operation happen via regular
global fentry points. Network namespace and ifindex are exposed
to allow filtering out particular netdev.

--- skb vs xdp ---

The hooks operate on a new light-weight devtx_frame which contains:
- data
- len
- metadata_len
- sinfo (frags)
- netdev

This should allow us to have a unified (from BPF POW) place at TX
and not be super-taxing (we need to copy 2 pointers + len to the stack
for each invocation).

--- TODO ---

Things that I'm planning to do for the non-RFC series:
- have some real device support to verify xdp_hw_metadata works
  - performance numbers with/without feature enabled (Toke)
- freplace
- explore dynptr (Toke)
- Documentation/networking/xdp-rx-metadata.rst - like documentation

--- CC ---

CC'ing people only on the cover letter. Hopefully can find the rest via
lore.

Cc: toke@kernel.org
Cc: willemb@google.com
Cc: dsahern@kernel.org
Cc: john.fastabend@gmail.com
Cc: magnus.karlsson@intel.com
Cc: bjorn@kernel.org
Cc: maciej.fijalkowski@intel.com
Cc: brouer@redhat.com
Cc: netdev@vger.kernel.org

Stanislav Fomichev (11):
  bpf: Rename some xdp-metadata functions into dev-bound
  bpf: Resolve single typedef when walking structs
  xsk: Support XDP_TX_METADATA_LEN
  bpf: Implement devtx hook points
  bpf: Implement devtx timestamp kfunc
  net: veth: Implement devtx timestamp kfuncs
  selftests/xsk: Support XDP_TX_METADATA_LEN
  selftests/bpf: Add helper to query current netns cookie
  selftests/bpf: Extend xdp_metadata with devtx kfuncs
  selftests/bpf: Extend xdp_hw_metadata with devtx kfuncs
  net/mlx5e: Support TX timestamp metadata

 MAINTAINERS                                   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  11 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  96 ++++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   9 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |   3 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  16 ++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  26 ++-
 drivers/net/veth.c                            | 116 +++++++++-
 include/linux/netdevice.h                     |   4 +
 include/net/devtx.h                           |  71 +++++++
 include/net/offload.h                         |  38 ++++
 include/net/xdp.h                             |  18 +-
 include/net/xdp_sock.h                        |   1 +
 include/net/xsk_buff_pool.h                   |   1 +
 include/uapi/linux/if_xdp.h                   |   1 +
 kernel/bpf/btf.c                              |   2 +
 kernel/bpf/offload.c                          |  49 ++++-
 kernel/bpf/verifier.c                         |   4 +-
 net/core/Makefile                             |   1 +
 net/core/dev.c                                |   1 +
 net/core/devtx.c                              | 149 +++++++++++++
 net/core/xdp.c                                |  20 +-
 net/xdp/xsk.c                                 |  31 ++-
 net/xdp/xsk_buff_pool.c                       |   1 +
 net/xdp/xsk_queue.h                           |   7 +-
 tools/testing/selftests/bpf/network_helpers.c |  21 ++
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  62 +++++-
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 107 ++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        | 118 +++++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 198 ++++++++++++++++--
 tools/testing/selftests/bpf/xdp_metadata.h    |  14 ++
 tools/testing/selftests/bpf/xsk.c             |  17 ++
 tools/testing/selftests/bpf/xsk.h             |   1 +
 34 files changed, 1142 insertions(+), 75 deletions(-)
 create mode 100644 include/net/devtx.h
 create mode 100644 include/net/offload.h
 create mode 100644 net/core/devtx.c

-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 01/11] bpf: Rename some xdp-metadata functions into dev-bound
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs Stanislav Fomichev
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

No functional changes.

To make existing dev-bound infrastructure more generic and be
less tightly bound to the xdp layer, rename some functions and
move kfunc-related things into kernel/bpf/offload.c

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/offload.h | 28 ++++++++++++++++++++++++++++
 include/net/xdp.h     | 18 +-----------------
 kernel/bpf/offload.c  | 26 ++++++++++++++++++++++++--
 kernel/bpf/verifier.c |  4 ++--
 net/core/xdp.c        | 20 ++------------------
 5 files changed, 57 insertions(+), 39 deletions(-)
 create mode 100644 include/net/offload.h

diff --git a/include/net/offload.h b/include/net/offload.h
new file mode 100644
index 000000000000..264a35881473
--- /dev/null
+++ b/include/net/offload.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_OFFLOAD_H__
+#define __LINUX_NET_OFFLOAD_H__
+
+#include <linux/types.h>
+
+#define XDP_METADATA_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			      bpf_xdp_metadata_rx_timestamp) \
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			      bpf_xdp_metadata_rx_hash)
+
+enum {
+#define NETDEV_METADATA_KFUNC(name, _) name,
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+MAX_NETDEV_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 bpf_dev_bound_kfunc_id(int id);
+bool bpf_is_dev_bound_kfunc(u32 btf_id);
+#else
+static inline u32 bpf_dev_bound_kfunc_id(int id) { return 0; }
+static inline bool bpf_is_dev_bound_kfunc(u32 btf_id) { return false; }
+#endif
+
+#endif /* __LINUX_NET_OFFLOAD_H__ */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..de4c3b70abde 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -9,6 +9,7 @@
 #include <linux/skbuff.h> /* skb_shared_info */
 #include <uapi/linux/netdev.h>
 #include <linux/bitfield.h>
+#include <net/offload.h>
 
 /**
  * DOC: XDP RX-queue information
@@ -384,19 +385,6 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
-#define XDP_METADATA_KFUNC_xxx	\
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
-			   bpf_xdp_metadata_rx_timestamp) \
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
-			   bpf_xdp_metadata_rx_hash) \
-
-enum {
-#define XDP_METADATA_KFUNC(name, _) name,
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-MAX_XDP_METADATA_KFUNC,
-};
-
 enum xdp_rss_hash_type {
 	/* First part: Individual bits for L3/L4 types */
 	XDP_RSS_L3_IPV4		= BIT(0),
@@ -444,14 +432,10 @@ enum xdp_rss_hash_type {
 };
 
 #ifdef CONFIG_NET
-u32 bpf_xdp_metadata_kfunc_id(int id);
-bool bpf_dev_bound_kfunc_id(u32 btf_id);
 void xdp_set_features_flag(struct net_device *dev, xdp_features_t val);
 void xdp_features_set_redirect_target(struct net_device *dev, bool support_sg);
 void xdp_features_clear_redirect_target(struct net_device *dev);
 #else
-static inline u32 bpf_xdp_metadata_kfunc_id(int id) { return 0; }
-static inline bool bpf_dev_bound_kfunc_id(u32 btf_id) { return false; }
 
 static inline void
 xdp_set_features_flag(struct net_device *dev, xdp_features_t val)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8a26cd8814c1..235d81f7e0ed 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -844,9 +844,9 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	if (!ops)
 		goto out;
 
-	if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+	if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
 		p = ops->xmo_rx_timestamp;
-	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
 out:
 	up_read(&bpf_devs_lock);
@@ -854,6 +854,28 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	return p;
 }
 
+BTF_SET_START(dev_bound_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET_END(dev_bound_kfunc_ids)
+
+BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+
+u32 bpf_dev_bound_kfunc_id(int id)
+{
+	/* dev_bound_kfunc_ids is sorted and can't be used */
+	return dev_bound_kfunc_ids_unsorted[id];
+}
+
+bool bpf_is_dev_bound_kfunc(u32 btf_id)
+{
+	return btf_id_set_contains(&dev_bound_kfunc_ids, btf_id);
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1e38584d497c..4db48b5af47e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2721,7 +2721,7 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 		}
 	}
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		err = bpf_dev_bound_kfunc_check(&env->log, prog_aux);
 		if (err)
 			return err;
@@ -17757,7 +17757,7 @@ static void specialize_kfunc(struct bpf_verifier_env *env,
 	void *xdp_kfunc;
 	bool is_rdonly;
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		xdp_kfunc = bpf_dev_bound_resolve_kfunc(prog, func_id);
 		if (xdp_kfunc) {
 			*addr = (unsigned long)xdp_kfunc;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 41e5ca8643ec..819767697370 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -741,9 +741,9 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-#define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
 XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
+#undef NETDEV_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
 
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
@@ -751,22 +751,6 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
 	.set   = &xdp_metadata_kfunc_ids,
 };
 
-BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
-#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-
-u32 bpf_xdp_metadata_kfunc_id(int id)
-{
-	/* xdp_metadata_kfunc_ids is sorted and can't be used */
-	return xdp_metadata_kfunc_ids_unsorted[id];
-}
-
-bool bpf_dev_bound_kfunc_id(u32 btf_id)
-{
-	return btf_id_set8_contains(&xdp_metadata_kfunc_ids, btf_id);
-}
-
 static int __init xdp_metadata_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 01/11] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-22  5:17   ` Alexei Starovoitov
  2023-06-21 17:02 ` [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

It is impossible to use skb_frag_t in the tracing program. So let's
resolve a single typedef when walking the struct.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/btf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index bd2cac057928..9bdaa1225e8a 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6140,6 +6140,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
 	*flag = 0;
 again:
 	tname = __btf_name_by_offset(btf, t->name_off);
+	if (btf_type_is_typedef(t))
+		t = btf_type_by_id(btf, t->type);
 	if (!btf_type_is_struct(t)) {
 		bpf_log(log, "Type '%s' is not a struct\n", tname);
 		return -EINVAL;
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 01/11] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-22  9:11   ` Jesper D. Brouer
  2023-06-22 15:26   ` Simon Horman
  2023-06-21 17:02 ` [RFC bpf-next v2 04/11] bpf: Implement devtx hook points Stanislav Fomichev
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa

For zerocopy mode, tx_desc->addr can point to the arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xdp_sock.h      |  1 +
 include/net/xsk_buff_pool.h |  1 +
 include/uapi/linux/if_xdp.h |  1 +
 net/xdp/xsk.c               | 31 ++++++++++++++++++++++++++++++-
 net/xdp/xsk_buff_pool.c     |  1 +
 net/xdp/xsk_queue.h         |  7 ++++---
 6 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..30018b3b862d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
 	struct list_head flush_node;
 	struct xsk_buff_pool *pool;
 	u16 queue_id;
+	u8 tx_metadata_len;
 	bool zc;
 	enum {
 		XSK_READY = 0,
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index a8d7b8a3688a..751fea51a6af 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -75,6 +75,7 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u8 tx_metadata_len; /* inherited from xsk_sock */
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a8096f4ce..2374eafff7db 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cc1e7f15fa73..c9b2daba7b6d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -485,6 +485,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		int err;
 
 		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+		hr = max(hr, L1_CACHE_ALIGN((u32)xs->tx_metadata_len));
 		tr = dev->needed_tailroom;
 		len = desc->len;
 
@@ -493,14 +494,21 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-		skb_put(skb, len);
+		skb_put(skb, len + xs->tx_metadata_len);
 
 		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		buffer -= xs->tx_metadata_len;
+
 		err = skb_store_bits(skb, 0, buffer, len);
 		if (unlikely(err)) {
 			kfree_skb(skb);
 			return ERR_PTR(err);
 		}
+
+		if (xs->tx_metadata_len) {
+			skb_metadata_set(skb, xs->tx_metadata_len);
+			__skb_pull(skb, xs->tx_metadata_len);
+		}
 	}
 
 	skb->dev = dev;
@@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_TX_METADATA_LEN:
+	{
+		int val;
+
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_sockptr(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		if (val >= 256)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		if (xs->state != XSK_READY) {
+			mutex_unlock(&xs->mutex);
+			return -EBUSY;
+		}
+		xs->tx_metadata_len = val;
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	default:
 		break;
 	}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 26f6d304451e..66ff9c345a67 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 		XDP_PACKET_HEADROOM;
 	pool->umem = umem;
 	pool->addrs = umem->addrs;
+	pool->tx_metadata_len = xs->tx_metadata_len;
 	INIT_LIST_HEAD(&pool->free_list);
 	INIT_LIST_HEAD(&pool->xsk_tx_list);
 	spin_lock_init(&pool->xsk_tx_list_lock);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 6d40a77fccbe..c8d287c18d64 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -133,12 +133,13 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
 static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 					    struct xdp_desc *desc)
 {
-	u64 offset = desc->addr & (pool->chunk_size - 1);
+	u64 addr = desc->addr - pool->tx_metadata_len;
+	u64 offset = addr & (pool->chunk_size - 1);
 
 	if (offset + desc->len > pool->chunk_size)
 		return false;
 
-	if (desc->addr >= pool->addrs_cnt)
+	if (addr >= pool->addrs_cnt)
 		return false;
 
 	if (desc->options)
@@ -149,7 +150,7 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
 					      struct xdp_desc *desc)
 {
-	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
 
 	if (desc->len > pool->chunk_size)
 		return false;
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 04/11] bpf: Implement devtx hook points
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

devtx is a lightweight set of hooks before and after packet transmission.
The hook is supposed to work for both skb and xdp paths by exposing
a light-weight packet wrapper via devtx_frame (header portion + frags).

devtx is implemented as a tracing program which has access to the
XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
in the next patch, but the idea is similar to XDP metadata:
the kfuncs have netdev-specific implementation, but common
interface. Upon loading, the kfuncs are resolved to direct
calls against per-netdev implementation. This can be achieved
by marking devtx-tracing programs as dev-bound (largely
reusing xdp-dev-bound program infrastructure).

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 MAINTAINERS          |  2 ++
 include/net/devtx.h  | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/offload.c | 15 +++++++++
 net/core/Makefile    |  1 +
 net/core/dev.c       |  1 +
 net/core/devtx.c     | 76 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 166 insertions(+)
 create mode 100644 include/net/devtx.h
 create mode 100644 net/core/devtx.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c904dba1733b..516529b42e66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22976,11 +22976,13 @@ L:	bpf@vger.kernel.org
 S:	Supported
 F:	drivers/net/ethernet/*/*/*/*/*xdp*
 F:	drivers/net/ethernet/*/*/*xdp*
+F:	include/net/devtx.h
 F:	include/net/xdp.h
 F:	include/net/xdp_priv.h
 F:	include/trace/events/xdp.h
 F:	kernel/bpf/cpumap.c
 F:	kernel/bpf/devmap.c
+F:	net/core/devtx.c
 F:	net/core/xdp.c
 F:	samples/bpf/xdp*
 F:	tools/testing/selftests/bpf/*/*xdp*
diff --git a/include/net/devtx.h b/include/net/devtx.h
new file mode 100644
index 000000000000..d1c75fd9b377
--- /dev/null
+++ b/include/net/devtx.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_DEVTX_H__
+#define __LINUX_NET_DEVTX_H__
+
+#include <linux/jump_label.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/btf_ids.h>
+#include <net/xdp.h>
+
+struct devtx_frame {
+	void *data;
+	u16 len;
+	u8 meta_len;
+	struct skb_shared_info *sinfo; /* for frags */
+	struct net_device *netdev;
+};
+
+#ifdef CONFIG_NET
+void devtx_hooks_enable(void);
+void devtx_hooks_disable(void);
+bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo);
+int devtx_hooks_register(struct btf_id_set8 *set, const struct xdp_metadata_ops *xmo);
+void devtx_hooks_unregister(struct btf_id_set8 *set);
+
+static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb,
+					struct net_device *netdev)
+{
+	ctx->data = skb->data;
+	ctx->len = skb_headlen(skb);
+	ctx->meta_len = skb_metadata_len(skb);
+	ctx->sinfo = skb_shinfo(skb);
+	ctx->netdev = netdev;
+}
+
+static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf,
+					struct net_device *netdev)
+{
+	ctx->data = xdpf->data;
+	ctx->len = xdpf->len;
+	ctx->meta_len = xdpf->metasize & 0xff;
+	ctx->sinfo = xdp_frame_has_frags(xdpf) ? xdp_get_shared_info_from_frame(xdpf) : NULL;
+	ctx->netdev = netdev;
+}
+
+DECLARE_STATIC_KEY_FALSE(devtx_enabled_key);
+
+static inline bool devtx_enabled(void)
+{
+	return static_branch_unlikely(&devtx_enabled_key);
+}
+#else
+static inline void devtx_hooks_enable(void) {}
+static inline void devtx_hooks_disable(void) {}
+static inline bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo) {}
+static inline int devtx_hooks_register(struct btf_id_set8 *set,
+				       const struct xdp_metadata_ops *xmo) {}
+static inline void devtx_hooks_unregister(struct btf_id_set8 *set) {}
+
+static inline void devtx_frame_from_skb(struct devtx_frame *ctx, struct sk_buff *skb,
+					struct net_device *netdev) {}
+static inline void devtx_frame_from_xdp(struct devtx_frame *ctx, struct xdp_frame *xdpf,
+					struct net_device *netdev) {}
+
+static inline bool devtx_enabled(void)
+{
+	return false;
+}
+#endif
+
+#endif /* __LINUX_NET_DEVTX_H__ */
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 235d81f7e0ed..f01a1aa0f627 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -25,6 +25,7 @@
 #include <linux/rhashtable.h>
 #include <linux/rtnetlink.h>
 #include <linux/rwsem.h>
+#include <net/devtx.h>
 
 /* Protects offdevs, members of bpf_offload_netdev and offload members
  * of all progs.
@@ -228,6 +229,7 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	int err;
 
 	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
+	    attr->prog_type != BPF_PROG_TYPE_TRACING &&
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
@@ -242,6 +244,15 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	if (!netdev)
 		return -EINVAL;
 
+	/* Make sure device-bound tracing programs are being attached
+	 * to the appropriate netdev.
+	 */
+	if (attr->prog_type == BPF_PROG_TYPE_TRACING &&
+	    !devtx_hooks_match(prog->aux->attach_btf_id, netdev->xdp_metadata_ops)) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	err = bpf_dev_offload_check(netdev);
 	if (err)
 		goto out;
@@ -252,6 +263,9 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	err = __bpf_prog_dev_bound_init(prog, netdev);
 	up_write(&bpf_devs_lock);
 
+	if (!err)
+		devtx_hooks_enable();
+
 out:
 	dev_put(netdev);
 	return err;
@@ -384,6 +398,7 @@ void bpf_prog_dev_bound_destroy(struct bpf_prog *prog)
 		ondev = bpf_offload_find_netdev(netdev);
 		if (!ondev->offdev && list_empty(&ondev->progs))
 			__bpf_offload_dev_netdev_unregister(NULL, netdev);
+		devtx_hooks_disable();
 	}
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
diff --git a/net/core/Makefile b/net/core/Makefile
index 8f367813bc68..c1db05ccfac7 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -39,4 +39,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
 obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += devtx.o
 obj-$(CONFIG_OF)	+= of_net.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 3393c2f3dbe8..e2f4618ee1c5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -150,6 +150,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/prandom.h>
 #include <linux/once_lite.h>
+#include <net/devtx.h>
 
 #include "dev.h"
 #include "net-sysfs.h"
diff --git a/net/core/devtx.c b/net/core/devtx.c
new file mode 100644
index 000000000000..bad694439ae3
--- /dev/null
+++ b/net/core/devtx.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <net/devtx.h>
+#include <linux/filter.h>
+
+DEFINE_STATIC_KEY_FALSE(devtx_enabled_key);
+EXPORT_SYMBOL_GPL(devtx_enabled_key);
+
+struct devtx_hook_entry {
+	struct list_head devtx_hooks;
+	struct btf_id_set8 *set;
+	const struct xdp_metadata_ops *xmo;
+};
+
+static LIST_HEAD(devtx_hooks);
+static DEFINE_MUTEX(devtx_hooks_lock);
+
+void devtx_hooks_enable(void)
+{
+	static_branch_inc(&devtx_enabled_key);
+}
+
+void devtx_hooks_disable(void)
+{
+	static_branch_dec(&devtx_enabled_key);
+}
+
+bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo)
+{
+	struct devtx_hook_entry *entry, *tmp;
+	bool match = false;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_for_each_entry_safe(entry, tmp, &devtx_hooks, devtx_hooks) {
+		if (btf_id_set8_contains(entry->set, attach_btf_id)) {
+			match = entry->xmo == xmo;
+			break;
+		}
+	}
+	mutex_unlock(&devtx_hooks_lock);
+
+	return match;
+}
+
+int devtx_hooks_register(struct btf_id_set8 *set, const struct xdp_metadata_ops *xmo)
+{
+	struct devtx_hook_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->set = set;
+	entry->xmo = xmo;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_add(&entry->devtx_hooks, &devtx_hooks);
+	mutex_unlock(&devtx_hooks_lock);
+
+	return 0;
+}
+
+void devtx_hooks_unregister(struct btf_id_set8 *set)
+{
+	struct devtx_hook_entry *entry, *tmp;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_for_each_entry_safe(entry, tmp, &devtx_hooks, devtx_hooks) {
+		if (entry->set == set) {
+			list_del(&entry->devtx_hooks);
+			kfree(entry);
+			break;
+		}
+	}
+	mutex_unlock(&devtx_hooks_lock);
+}
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 04/11] bpf: Implement devtx hook points Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-22 12:07   ` Jesper D. Brouer
  2023-06-21 17:02 ` [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Two kfuncs, one per hook point:

1. at submit time - bpf_devtx_sb_request_timestamp - to request HW
   to put TX timestamp into TX completion descriptors

2. at completion time - bpf_devtx_cp_timestamp - to read out
   TX timestamp

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/netdevice.h |  4 +++
 include/net/offload.h     | 10 ++++++
 kernel/bpf/offload.c      |  8 +++++
 net/core/devtx.c          | 73 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 95 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 08fbd4622ccf..2fdb0731eb67 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1651,10 +1651,14 @@ struct net_device_ops {
 						  bool cycles);
 };
 
+struct devtx_frame;
+
 struct xdp_metadata_ops {
 	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
 	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
 			       enum xdp_rss_hash_type *rss_type);
+	int	(*xmo_sb_request_timestamp)(const struct devtx_frame *ctx);
+	int	(*xmo_cp_timestamp)(const struct devtx_frame *ctx, u64 *timestamp);
 };
 
 /**
diff --git a/include/net/offload.h b/include/net/offload.h
index 264a35881473..36899b64f4c8 100644
--- a/include/net/offload.h
+++ b/include/net/offload.h
@@ -10,9 +10,19 @@
 	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
 			      bpf_xdp_metadata_rx_hash)
 
+#define DEVTX_SB_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_SB_KFUNC_REQUEST_TIMESTAMP, \
+			      bpf_devtx_sb_request_timestamp)
+
+#define DEVTX_CP_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_CP_KFUNC_TIMESTAMP, \
+			      bpf_devtx_cp_timestamp)
+
 enum {
 #define NETDEV_METADATA_KFUNC(name, _) name,
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 MAX_NETDEV_METADATA_KFUNC,
 };
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index f01a1aa0f627..45a243af49be 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -863,6 +863,10 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_timestamp;
 	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
+	else if (func_id == bpf_dev_bound_kfunc_id(DEVTX_SB_KFUNC_REQUEST_TIMESTAMP))
+		p = ops->xmo_sb_request_timestamp;
+	else if (func_id == bpf_dev_bound_kfunc_id(DEVTX_CP_KFUNC_TIMESTAMP))
+		p = ops->xmo_cp_timestamp;
 out:
 	up_read(&bpf_devs_lock);
 
@@ -872,12 +876,16 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 BTF_SET_START(dev_bound_kfunc_ids)
 #define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 BTF_SET_END(dev_bound_kfunc_ids)
 
 BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
 #define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SB_KFUNC_xxx
+DEVTX_CP_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 
 u32 bpf_dev_bound_kfunc_id(int id)
diff --git a/net/core/devtx.c b/net/core/devtx.c
index bad694439ae3..4267a8fe6711 100644
--- a/net/core/devtx.c
+++ b/net/core/devtx.c
@@ -74,3 +74,76 @@ void devtx_hooks_unregister(struct btf_id_set8 *set)
 	}
 	mutex_unlock(&devtx_hooks_lock);
 }
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * bpf_devtx_sb_request_timestamp - Request TX timestamp on the packet.
+ * Callable only from the devtx-submit hook.
+ * @ctx: devtx context pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * bpf_devtx_cp_timestamp - Read TX timestamp of the packet. Callable
+ * only from the devtx-complete hook.
+ * @ctx: devtx context pointer.
+ * @timestamp: Return value pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp)
+{
+	return -EOPNOTSUPP;
+}
+
+__diag_pop();
+
+BTF_SET8_START(devtx_sb_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+DEVTX_SB_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_sb_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_sb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_sb_kfunc_ids,
+};
+
+BTF_SET8_START(devtx_cp_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+DEVTX_CP_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_cp_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_cp_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_cp_kfunc_ids,
+};
+
+static int __init devtx_init(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_sb_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_sb kfuncs: %d", ret);
+		return ret;
+	}
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_cp_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_cp completion kfuncs: %d", ret);
+		return ret;
+	}
+
+	return 0;
+}
+late_initcall(devtx_init);
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-23 23:29   ` Vinicius Costa Gomes
  2023-06-21 17:02 ` [RFC bpf-next v2 07/11] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Have a software-based example for kfuncs to showcase how it
can be used in the real devices and to have something to
test against in the selftests.

Both path (skb & xdp) are covered. Only the skb path is really
tested though.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 116 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 112 insertions(+), 4 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..632f0f3771e4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <net/page_pool.h>
+#include <net/devtx.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -123,6 +124,13 @@ struct veth_xdp_buff {
 	struct sk_buff *skb;
 };
 
+struct veth_devtx_frame {
+	struct devtx_frame frame;
+	bool request_timestamp;
+	ktime_t xdp_tx_timestamp;
+	struct sk_buff *skb;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -313,10 +321,43 @@ static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 	return NET_RX_SUCCESS;
 }
 
+__weak noinline void veth_devtx_submit(struct devtx_frame *ctx)
+{
+}
+
+__weak noinline void veth_devtx_complete(struct devtx_frame *ctx)
+{
+}
+
+BTF_SET8_START(veth_devtx_hook_ids)
+BTF_ID_FLAGS(func, veth_devtx_submit)
+BTF_ID_FLAGS(func, veth_devtx_complete)
+BTF_SET8_END(veth_devtx_hook_ids)
+
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool request_timestamp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	struct net_device *orig_dev = skb->dev;
+	int ret;
+
+	ret = __dev_forward_skb(dev, skb);
+	if (ret)
+		return ret;
+
+	if (devtx_enabled()) {
+		struct veth_devtx_frame ctx;
+
+		if (unlikely(request_timestamp))
+			__net_timestamp(skb);
+
+		devtx_frame_from_skb(&ctx.frame, skb, orig_dev);
+		ctx.frame.data -= ETH_HLEN; /* undo eth_type_trans pull */
+		ctx.frame.len += ETH_HLEN;
+		ctx.skb = skb;
+		veth_devtx_complete(&ctx.frame);
+	}
+
+	return xdp ?
 		veth_xdp_rx(rq, skb) :
 		__netif_rx(skb);
 }
@@ -343,6 +384,7 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	bool request_timestamp = false;
 	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
@@ -356,6 +398,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	if (devtx_enabled()) {
+		struct veth_devtx_frame ctx;
+
+		devtx_frame_from_skb(&ctx.frame, skb, dev);
+		ctx.request_timestamp = false;
+		veth_devtx_submit(&ctx.frame);
+		request_timestamp = ctx.request_timestamp;
+	}
+
 	rcv_priv = netdev_priv(rcv);
 	rxq = skb_get_queue_mapping(skb);
 	if (rxq < rcv->real_num_rx_queues) {
@@ -370,7 +421,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb, rq, use_napi, request_timestamp) == NET_RX_SUCCESS)) {
 		if (!use_napi)
 			dev_lstats_add(dev, length);
 	} else {
@@ -483,6 +534,7 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	int i, ret = -ENXIO, nxmit = 0;
+	ktime_t tx_timestamp = 0;
 	struct net_device *rcv;
 	unsigned int max_len;
 	struct veth_rq *rq;
@@ -511,9 +563,32 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 		void *ptr = veth_xdp_to_ptr(frame);
 
 		if (unlikely(xdp_get_frame_len(frame) > max_len ||
-			     __ptr_ring_produce(&rq->xdp_ring, ptr)))
+			     __ptr_ring_full(&rq->xdp_ring)))
+			break;
+
+		if (devtx_enabled()) {
+			struct veth_devtx_frame ctx;
+
+			devtx_frame_from_xdp(&ctx.frame, frame, dev);
+			ctx.request_timestamp = false;
+			veth_devtx_submit(&ctx.frame);
+
+			if (unlikely(ctx.request_timestamp))
+				tx_timestamp = ktime_get_real();
+		}
+
+		if (unlikely(__ptr_ring_produce(&rq->xdp_ring, ptr)))
 			break;
 		nxmit++;
+
+		if (devtx_enabled()) {
+			struct veth_devtx_frame ctx;
+
+			devtx_frame_from_xdp(&ctx.frame, frame, dev);
+			ctx.xdp_tx_timestamp = tx_timestamp;
+			ctx.skb = NULL;
+			veth_devtx_complete(&ctx.frame);
+		}
 	}
 	spin_unlock(&rq->xdp_ring.producer_lock);
 
@@ -1732,6 +1807,28 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int veth_devtx_sb_request_timestamp(const struct devtx_frame *_ctx)
+{
+	struct veth_devtx_frame *ctx = (struct veth_devtx_frame *)_ctx;
+
+	ctx->request_timestamp = true;
+
+	return 0;
+}
+
+static int veth_devtx_cp_timestamp(const struct devtx_frame *_ctx, u64 *timestamp)
+{
+	struct veth_devtx_frame *ctx = (struct veth_devtx_frame *)_ctx;
+
+	if (ctx->skb) {
+		*timestamp = ctx->skb->tstamp;
+		return 0;
+	}
+
+	*timestamp = ctx->xdp_tx_timestamp;
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1756,6 +1853,8 @@ static const struct net_device_ops veth_netdev_ops = {
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
 	.xmo_rx_hash			= veth_xdp_rx_hash,
+	.xmo_sb_request_timestamp	= veth_devtx_sb_request_timestamp,
+	.xmo_cp_timestamp		= veth_devtx_cp_timestamp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
@@ -2041,11 +2140,20 @@ static struct rtnl_link_ops veth_link_ops = {
 
 static __init int veth_init(void)
 {
+	int ret;
+
+	ret = devtx_hooks_register(&veth_devtx_hook_ids, &veth_xdp_metadata_ops);
+	if (ret) {
+		pr_warn("failed to register devtx hooks: %d", ret);
+		return ret;
+	}
+
 	return rtnl_link_register(&veth_link_ops);
 }
 
 static __exit void veth_exit(void)
 {
+	devtx_hooks_unregister(&veth_devtx_hook_ids);
 	rtnl_link_unregister(&veth_link_ops);
 }
 
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 07/11] selftests/xsk: Support XDP_TX_METADATA_LEN
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 08/11] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa

Add new config field and call setsockopt.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/xsk.c | 17 +++++++++++++++++
 tools/testing/selftests/bpf/xsk.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index 687d83e707f8..c659713e2d43 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -47,6 +47,10 @@
  #define PF_XDP AF_XDP
 #endif
 
+#ifndef XDP_TX_METADATA_LEN
+#define XDP_TX_METADATA_LEN 9
+#endif
+
 #define pr_warn(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
 
 #define XSKMAP_SIZE 1
@@ -124,12 +128,14 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 		cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
 		cfg->bind_flags = 0;
+		cfg->tx_metadata_len = 0;
 		return 0;
 	}
 
 	cfg->rx_size = usr_cfg->rx_size;
 	cfg->tx_size = usr_cfg->tx_size;
 	cfg->bind_flags = usr_cfg->bind_flags;
+	cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
 
 	return 0;
 }
@@ -479,6 +485,17 @@ int xsk_socket__create_shared(struct xsk_socket **xsk_ptr,
 			umem->tx_ring_setup_done = true;
 	}
 
+	if (xsk->config.tx_metadata_len) {
+		int optval = xsk->config.tx_metadata_len;
+
+		err = setsockopt(xsk->fd, SOL_XDP, XDP_TX_METADATA_LEN,
+				 &optval, sizeof(optval));
+		if (err) {
+			err = -errno;
+			goto out_put_ctx;
+		}
+	}
+
 	err = xsk_get_mmap_offsets(xsk->fd, &off);
 	if (err) {
 		err = -errno;
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index 8da8d557768b..57e0af403aa8 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -212,6 +212,7 @@ struct xsk_socket_config {
 	__u32 rx_size;
 	__u32 tx_size;
 	__u16 bind_flags;
+	__u8 tx_metadata_len;
 };
 
 /* Set config to NULL to get the default configuration. */
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 08/11] selftests/bpf: Add helper to query current netns cookie
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 07/11] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa

Will be used by the subsequent selftests.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..34102fce5a88 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -450,3 +450,24 @@ int get_socket_local_port(int sock_fd)
 
 	return -1;
 }
+
+#ifndef SO_NETNS_COOKIE
+#define SO_NETNS_COOKIE 71
+#endif
+
+__u64 get_net_cookie(void)
+{
+	socklen_t optlen;
+	__u64 optval = 0;
+	int fd;
+
+	fd = socket(AF_LOCAL, SOCK_DGRAM, 0);
+	if (fd >= 0) {
+		optlen = sizeof(optval);
+		getsockopt(fd, SOL_SOCKET, SO_NETNS_COOKIE, &optval, &optlen);
+		close(fd);
+	}
+
+	return optval;
+}
+
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..380047161aac 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -57,6 +57,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
 char *ping_command(int family);
 int get_socket_local_port(int sock_fd);
+__u64 get_net_cookie(void);
 
 struct nstoken;
 /**
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 08/11] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-23 11:12   ` Jesper D. Brouer
  2023-06-21 17:02 ` [RFC bpf-next v2 10/11] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Attach kfuncs that request and report TX timestamp via ringbuf.
Confirm on the userspace side that the program has triggered
and the timestamp is non-zero.

Also make sure devtx_frame has a sensible pointers and data.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  62 ++++++++-
 .../selftests/bpf/progs/xdp_metadata.c        | 118 ++++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |  14 +++
 3 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..ca4f3106ce6d 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -42,6 +42,9 @@ struct xsk {
 	struct xsk_ring_prod tx;
 	struct xsk_ring_cons rx;
 	struct xsk_socket *socket;
+	int tx_completions;
+	u32 last_tx_timestamp_retval;
+	u64 last_tx_timestamp;
 };
 
 static int open_xsk(int ifindex, struct xsk *xsk)
@@ -51,6 +54,7 @@ static int open_xsk(int ifindex, struct xsk *xsk)
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.bind_flags = XDP_COPY,
+		.tx_metadata_len = TX_META_LEN,
 	};
 	const struct xsk_umem_config umem_config = {
 		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
@@ -138,6 +142,7 @@ static void ip_csum(struct iphdr *iph)
 
 static int generate_packet(struct xsk *xsk, __u16 dst_port)
 {
+	struct xdp_tx_meta *meta;
 	struct xdp_desc *tx_desc;
 	struct udphdr *udph;
 	struct ethhdr *eth;
@@ -151,10 +156,13 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 		return -1;
 
 	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
-	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + TX_META_LEN;
 	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
 	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
 
+	meta = data - TX_META_LEN;
+	meta->request_timestamp = 1;
+
 	eth = data;
 	iph = (void *)(eth + 1);
 	udph = (void *)(iph + 1);
@@ -192,7 +200,8 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 	return 0;
 }
 
-static void complete_tx(struct xsk *xsk)
+static void complete_tx(struct xsk *xsk, struct xdp_metadata *bpf_obj,
+			struct ring_buffer *ringbuf)
 {
 	__u32 idx;
 	__u64 addr;
@@ -202,6 +211,13 @@ static void complete_tx(struct xsk *xsk)
 
 		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
 		xsk_ring_cons__release(&xsk->comp, 1);
+
+		ring_buffer__poll(ringbuf, 1000);
+
+		ASSERT_EQ(bpf_obj->bss->pkts_fail_tx, 0, "pkts_fail_tx");
+		ASSERT_GE(xsk->tx_completions, 1, "tx_completions");
+		ASSERT_EQ(xsk->last_tx_timestamp_retval, 0, "last_tx_timestamp_retval");
+		ASSERT_GE(xsk->last_tx_timestamp, 0, "last_tx_timestamp");
 	}
 }
 
@@ -276,8 +292,24 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	return 0;
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+	struct xsk *xsk = ctx;
+
+	printf("%p: got tx timestamp sample %u %llu\n",
+	       xsk, sample->timestamp_retval, sample->timestamp);
+
+	xsk->tx_completions++;
+	xsk->last_tx_timestamp_retval = sample->timestamp_retval;
+	xsk->last_tx_timestamp = sample->timestamp;
+
+	return 0;
+}
+
 void test_xdp_metadata(void)
 {
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	struct xdp_metadata2 *bpf_obj2 = NULL;
 	struct xdp_metadata *bpf_obj = NULL;
 	struct bpf_program *new_prog, *prog;
@@ -290,6 +322,7 @@ void test_xdp_metadata(void)
 	int retries = 10;
 	int rx_ifindex;
 	int tx_ifindex;
+	int syscall_fd;
 	int sock_fd;
 	int ret;
 
@@ -323,6 +356,14 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
 		goto out;
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_submit");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_complete");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
 	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
 	bpf_program__set_ifindex(prog, rx_ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
@@ -330,6 +371,18 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
 		goto out;
 
+	bpf_obj->data->ifindex = tx_ifindex;
+	bpf_obj->data->net_cookie = get_net_cookie();
+
+	ret = xdp_metadata__attach(bpf_obj);
+	if (!ASSERT_OK(ret, "xdp_metadata__attach"))
+		goto out;
+
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, &tx_xsk, NULL);
+	if (!ASSERT_OK_PTR(tx_compl_ringbuf, "ring_buffer__new"))
+		goto out;
+
 	/* Make sure we can't add dev-bound programs to prog maps. */
 	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
 	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
@@ -364,7 +417,8 @@ void test_xdp_metadata(void)
 		       "verify_xsk_metadata"))
 		goto out;
 
-	complete_tx(&tx_xsk);
+	/* Verify AF_XDP TX packet has completion event with a timestamp. */
+	complete_tx(&tx_xsk, bpf_obj, tx_compl_ringbuf);
 
 	/* Make sure freplace correctly picks up original bound device
 	 * and doesn't crash.
@@ -402,5 +456,7 @@ void test_xdp_metadata(void)
 	xdp_metadata__destroy(bpf_obj);
 	if (tok)
 		close_netns(tok);
+	if (tx_compl_ringbuf)
+		ring_buffer__free(tx_compl_ringbuf);
 	SYS_NOFAIL("ip netns del xdp_metadata");
 }
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index d151d406a123..fc025183d45a 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -4,6 +4,11 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -19,10 +24,25 @@ struct {
 	__type(value, __u32);
 } prog_arr SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 10);
+} tx_compl_buf SEC(".maps");
+
+__u64 pkts_fail_tx = 0;
+
+int ifindex = -1;
+__u64 net_cookie = -1;
+
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
+extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
+
+extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
+extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -61,4 +81,102 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+static inline int verify_frame(const struct devtx_frame *frame)
+{
+	struct ethhdr eth = {};
+
+	/* all the pointers are set up correctly */
+	if (!frame->data)
+		return -1;
+	if (!frame->sinfo)
+		return -1;
+
+	/* can get to the frags */
+	if (frame->sinfo->nr_frags != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_page != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_len != 0)
+		return -1;
+	if (frame->sinfo->frags[0].bv_offset != 0)
+		return -1;
+
+	/* the data has something that looks like ethernet */
+	if (frame->len != 46)
+		return -1;
+	bpf_probe_read_kernel(&eth, sizeof(eth), frame->data);
+
+	if (eth.h_proto != bpf_htons(ETH_P_IP))
+		return -1;
+
+	return 0;
+}
+
+SEC("fentry/veth_devtx_submit")
+int BPF_PROG(tx_submit, const struct devtx_frame *frame)
+{
+	struct xdp_tx_meta meta = {};
+	int ret;
+
+	if (frame->netdev->ifindex != ifindex)
+		return 0;
+	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
+		return 0;
+	if (frame->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	ret = verify_frame(frame);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	ret = bpf_devtx_sb_request_timestamp(frame);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	return 0;
+}
+
+SEC("fentry/veth_devtx_complete")
+int BPF_PROG(tx_complete, const struct devtx_frame *frame)
+{
+	struct xdp_tx_meta meta = {};
+	struct devtx_sample *sample;
+	int ret;
+
+	if (frame->netdev->ifindex != ifindex)
+		return 0;
+	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
+		return 0;
+	if (frame->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	ret = verify_frame(frame);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample)
+		return 0;
+
+	sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 938a729bd307..e410f2b95e64 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -18,3 +18,17 @@ struct xdp_meta {
 		__s32 rx_hash_err;
 	};
 };
+
+struct devtx_sample {
+	int timestamp_retval;
+	__u64 timestamp;
+};
+
+#define TX_META_LEN	8
+
+struct xdp_tx_meta {
+	__u8 request_timestamp;
+	__u8 padding0;
+	__u16 padding1;
+	__u32 padding2;
+};
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 10/11] selftests/bpf: Extend xdp_hw_metadata with devtx kfuncs
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-21 17:02 ` [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata Stanislav Fomichev
  2023-06-22  8:41 ` [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Jesper Dangaard Brouer
  11 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

When we get packets on port 9091, we swap src/dst and send it out.
At this point, we also request the timestamp and plumb it back
to the userspace. The userspace simply prints the timestamp.

Haven't really tested, still working on mlx5 patches...

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 107 ++++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 198 ++++++++++++++++--
 2 files changed, 285 insertions(+), 20 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index b2dfd7066c6e..84f10d6b11f1 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -4,6 +4,7 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -12,14 +13,30 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 10);
+} tx_compl_buf SEC(".maps");
+
 __u64 pkts_skip = 0;
+__u64 pkts_tx_skip = 0;
 __u64 pkts_fail = 0;
 __u64 pkts_redir = 0;
+__u64 pkts_fail_tx = 0;
+__u64 pkts_ringbuf_full = 0;
+
+int ifindex = -1;
+__u64 net_cookie = -1;
 
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
+extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
+
+extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
+extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -90,4 +107,94 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+/* This is not strictly required; only to showcase how to access the payload. */
+static __always_inline bool tx_filter(const struct devtx_frame *frame)
+{
+	int port_offset = sizeof(struct ethhdr) + offsetof(struct udphdr, source);
+	struct ethhdr eth = {};
+	struct udphdr udp = {};
+
+	bpf_probe_read_kernel(&eth.h_proto, sizeof(eth.h_proto),
+			      frame->data + offsetof(struct ethhdr, h_proto));
+
+	if (eth.h_proto == bpf_htons(ETH_P_IP)) {
+		port_offset += sizeof(struct iphdr);
+	} else if (eth.h_proto == bpf_htons(ETH_P_IPV6)) {
+		port_offset += sizeof(struct ipv6hdr);
+	} else {
+		__sync_add_and_fetch(&pkts_tx_skip, 1);
+		return false;
+	}
+
+	bpf_probe_read_kernel(&udp.source, sizeof(udp.source), frame->data + port_offset);
+
+	/* Replies to UDP:9091 */
+	if (udp.source != bpf_htons(9091)) {
+		__sync_add_and_fetch(&pkts_tx_skip, 1);
+		return false;
+	}
+
+	return true;
+}
+
+SEC("fentry")
+int BPF_PROG(tx_submit, const struct devtx_frame *frame)
+{
+	struct xdp_tx_meta meta = {};
+	int ret;
+
+	if (frame->netdev->ifindex != ifindex)
+		return 0;
+	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
+		return 0;
+	if (frame->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	if (!tx_filter(frame))
+		return 0;
+
+	ret = bpf_devtx_sb_request_timestamp(frame);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	return 0;
+}
+
+SEC("fentry")
+int BPF_PROG(tx_complete, const struct devtx_frame *frame)
+{
+	struct xdp_tx_meta meta = {};
+	struct devtx_sample *sample;
+
+	if (frame->netdev->ifindex != ifindex)
+		return 0;
+	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
+		return 0;
+	if (frame->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	if (!tx_filter(frame))
+		return 0;
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample) {
+		__sync_add_and_fetch(&pkts_ringbuf_full, 1);
+		return 0;
+	}
+
+	sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..0bbe8377a34b 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,8 @@
  *   - rx_hash
  *
  * TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
  */
 
 #include <test_progs.h>
@@ -28,6 +29,8 @@
 #include <net/if.h>
 #include <poll.h>
 #include <time.h>
+#include <unistd.h>
+#include <libgen.h>
 
 #include "xdp_metadata.h"
 
@@ -54,13 +57,14 @@ int rxq;
 
 void test__fail(void) { /* for network_helpers.c */ }
 
-static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
+static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id, int flags)
 {
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	const struct xsk_socket_config socket_config = {
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
-		.bind_flags = XDP_COPY,
+		.bind_flags = flags,
+		.tx_metadata_len = TX_META_LEN,
 	};
 	const struct xsk_umem_config umem_config = {
 		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
@@ -228,7 +232,87 @@ static void verify_skb_metadata(int fd)
 	printf("skb hwtstamp is not found!\n");
 }
 
-static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
+static void complete_tx(struct xsk *xsk, struct ring_buffer *ringbuf)
+{
+	__u32 idx;
+	__u64 addr;
+
+	ring_buffer__poll(ringbuf, 1000);
+
+	if (xsk_ring_cons__peek(&xsk->comp, 1, &idx)) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+		xsk_ring_cons__release(&xsk->comp, 1);
+	}
+}
+
+#define swap(a, b, len) do { \
+	for (int i = 0; i < len; i++) { \
+		__u8 tmp = ((__u8 *)a)[i]; \
+		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+		((__u8 *)b)[i] = tmp; \
+	} \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+	struct ipv6hdr *ip6h = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_tx_meta *meta;
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	void *data;
+	__u32 idx;
+	int ret;
+	int len;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (ret != 1) {
+		printf("%p: failed to reserve tx slot\n", xsk);
+		return;
+	}
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + TX_META_LEN;
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	meta = data - TX_META_LEN;
+	meta->request_timestamp = 1;
+
+	eth = data;
+
+	if (eth->h_proto == htons(ETH_P_IP)) {
+		iph = (void *)(eth + 1);
+		udph = (void *)(iph + 1);
+	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
+		ip6h = (void *)(eth + 1);
+		udph = (void *)(ip6h + 1);
+	} else {
+		xsk_ring_prod__cancel(&xsk->tx, 1);
+		return;
+	}
+
+	len = ETH_HLEN;
+	if (ip6h)
+		len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
+	if (iph)
+		len += ntohs(iph->tot_len);
+
+	memcpy(data, rx_packet, len);
+	swap(eth->h_dest, eth->h_source, ETH_ALEN);
+	if (iph)
+		swap(&iph->saddr, &iph->daddr, 4);
+	else
+		swap(&ip6h->saddr, &ip6h->daddr, 16);
+	swap(&udph->source, &udph->dest, 2);
+
+	xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id,
+			   struct ring_buffer *ringbuf)
 {
 	const struct xdp_desc *rx_desc;
 	struct pollfd fds[rxq + 1];
@@ -251,8 +335,9 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 	while (true) {
 		errno = 0;
 		ret = poll(fds, rxq + 1, 1000);
-		printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
+		printf("poll: %d (%d) skip=%llu/%llu fail=%llu redir=%llu\n",
 		       ret, errno, bpf_obj->bss->pkts_skip,
+		       bpf_obj->bss->pkts_tx_skip,
 		       bpf_obj->bss->pkts_fail, bpf_obj->bss->pkts_redir);
 		if (ret < 0)
 			break;
@@ -280,6 +365,11 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 			       xsk, idx, rx_desc->addr, addr, comp_addr);
 			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
 					    clock_id);
+
+			/* mirror packet back */
+			ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+			complete_tx(xsk, ringbuf);
+
 			xsk_ring_cons__release(&xsk->rx, 1);
 			refill_rx(xsk, comp_addr);
 		}
@@ -373,16 +463,6 @@ static void cleanup(void)
 	int ret;
 	int i;
 
-	if (bpf_obj) {
-		opts.old_prog_fd = bpf_program__fd(bpf_obj->progs.rx);
-		if (opts.old_prog_fd >= 0) {
-			printf("detaching bpf program....\n");
-			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
-			if (ret)
-				printf("failed to detach XDP program: %d\n", ret);
-		}
-	}
-
 	for (i = 0; i < rxq; i++)
 		close_xsk(&rx_xsk[i]);
 
@@ -404,21 +484,69 @@ static void timestamping_enable(int fd, int val)
 		error(1, errno, "setsockopt(SO_TIMESTAMPING)");
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+
+	printf("got tx timestamp sample %u %llu\n",
+	       sample->timestamp_retval, sample->timestamp);
+
+	return 0;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <ifname>\n"
+		"OPTS:\n"
+		"    -s    symbol name for tx_submit\n"
+		"    -c    symbol name for tx_complete\n"
+		"    -C    run in copy mode\n",
+		prog);
+}
+
 int main(int argc, char *argv[])
 {
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	clockid_t clock_id = CLOCK_TAI;
+	char *tx_complete = NULL;
+	char *tx_submit = NULL;
+	int bind_flags = 0;
 	int server_fd = -1;
+	int opt;
 	int ret;
 	int i;
 
 	struct bpf_program *prog;
 
-	if (argc != 2) {
+	while ((opt = getopt(argc, argv, "s:c:C")) != -1) {
+		switch (opt) {
+		case 's':
+			tx_submit = optarg;
+			break;
+		case 'c':
+			tx_complete = optarg;
+			break;
+		case 'C':
+			bind_flags |= XDP_COPY;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (argc < 2) {
 		fprintf(stderr, "pass device name\n");
 		return -1;
 	}
 
-	ifname = argv[1];
+	if (optind >= argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	ifname = argv[optind];
 	ifindex = if_nametoindex(ifname);
 	rxq = rxq_num(ifname);
 
@@ -432,7 +560,7 @@ int main(int argc, char *argv[])
 
 	for (i = 0; i < rxq; i++) {
 		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
-		ret = open_xsk(ifindex, &rx_xsk[i], i);
+		ret = open_xsk(ifindex, &rx_xsk[i], i, bind_flags);
 		if (ret)
 			error(1, -ret, "open_xsk");
 
@@ -444,15 +572,45 @@ int main(int argc, char *argv[])
 	if (libbpf_get_error(bpf_obj))
 		error(1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
 
+	bpf_obj->data->ifindex = ifindex;
+	bpf_obj->data->net_cookie = get_net_cookie();
+
 	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
 	bpf_program__set_ifindex(prog, ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_submit");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	if (tx_submit) {
+		printf("attaching devtx submit program to %s\n", tx_submit);
+		bpf_program__set_attach_target(prog, 0, tx_submit);
+	} else {
+		printf("skipping devtx submit program\n");
+		bpf_program__set_autoattach(prog, false);
+	}
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_complete");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	if (tx_complete) {
+		printf("attaching devtx complete program to %s\n", tx_complete);
+		bpf_program__set_attach_target(prog, 0, tx_complete);
+	} else {
+		printf("skipping devtx complete program\n");
+		bpf_program__set_autoattach(prog, false);
+	}
+
 	printf("load bpf program...\n");
 	ret = xdp_hw_metadata__load(bpf_obj);
 	if (ret)
 		error(1, -ret, "xdp_hw_metadata__load");
 
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, NULL, NULL);
+	if (libbpf_get_error(tx_compl_ringbuf))
+		error(1, -libbpf_get_error(tx_compl_ringbuf), "ring_buffer__new");
+
 	printf("prepare skb endpoint...\n");
 	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
 	if (server_fd < 0)
@@ -472,7 +630,7 @@ int main(int argc, char *argv[])
 			error(1, -ret, "bpf_map_update_elem");
 	}
 
-	printf("attach bpf program...\n");
+	printf("attach rx bpf program...\n");
 	ret = bpf_xdp_attach(ifindex,
 			     bpf_program__fd(bpf_obj->progs.rx),
 			     XDP_FLAGS, NULL);
@@ -480,7 +638,7 @@ int main(int argc, char *argv[])
 		error(1, -ret, "bpf_xdp_attach");
 
 	signal(SIGINT, handle_signal);
-	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id);
+	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id, tx_compl_ringbuf);
 	close(server_fd);
 	cleanup();
 	if (ret)
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 10/11] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev
@ 2023-06-21 17:02 ` Stanislav Fomichev
  2023-06-22 19:57   ` Alexei Starovoitov
  2023-06-22  8:41 ` [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Jesper Dangaard Brouer
  11 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-21 17:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

WIP, not tested, only to show the overall idea.
Non-AF_XDP paths are marked with 'false' for now.

Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
 .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
 6 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 879d698b6119..e4509464e0b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -6,6 +6,7 @@
 
 #include "en.h"
 #include <linux/indirect_call_wrapper.h>
+#include <net/devtx.h>
 
 #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 
@@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
 
 	return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
 }
+
+struct mlx5e_devtx_frame {
+	struct devtx_frame frame;
+	struct mlx5_cqe64 *cqe; /* tx completion */
+	struct mlx5e_tx_wqe *wqe; /* tx */
+};
+
+void mlx5e_devtx_submit(struct devtx_frame *ctx);
+void mlx5e_devtx_complete(struct devtx_frame *ctx);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f0e6095809fa..0cb0f0799cbc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -255,9 +255,30 @@ static int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int mlx5e_devtx_sb_request_timestamp(const struct devtx_frame *ctx)
+{
+	/* Nothing to do here, CQE always has a timestamp. */
+	return 0;
+}
+
+static int mlx5e_devtx_cp_timestamp(const struct devtx_frame *_ctx, u64 *timestamp)
+{
+	const struct mlx5e_devtx_frame *ctx = (void *)_ctx;
+	u64 ts;
+
+	if (unlikely(!ctx->cqe))
+		return -ENODATA;
+
+	ts = get_cqe_ts(ctx->cqe);
+	*timestamp = mlx5_real_time_cyc2time(NULL, ts);
+	return 0;
+}
+
 const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
 	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+	.xmo_sb_request_timestamp	= mlx5e_devtx_sb_request_timestamp,
+	.xmo_cp_timestamp		= mlx5e_devtx_cp_timestamp,
 };
 
 /* returns true if packet was consumed by xdp */
@@ -453,6 +474,23 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
 
 	mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
 
+	if (devtx_enabled()) {
+		struct mlx5e_xmit_data_frags *xdptxdf =
+			container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
+
+		struct mlx5e_devtx_frame ctx = {
+			.frame = {
+				.data = p->data,
+				.len = p->len,
+				.meta_len = sq->xsk_pool->tx_metadata_len,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+				.netdev = sq->cq.netdev,
+			},
+			.wqe = sq->mpwqe.wqe,
+		};
+		mlx5e_devtx_submit(&ctx.frame);
+	}
+
 	if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
 		mlx5e_xdp_mpwqe_complete(sq);
 
@@ -560,6 +598,20 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 		dseg++;
 	}
 
+	if (devtx_enabled()) {
+		struct mlx5e_devtx_frame ctx = {
+			.frame = {
+				.data = xdptxd->data,
+				.len = xdptxd->len,
+				.meta_len = sq->xsk_pool->tx_metadata_len,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+				.netdev = sq->cq.netdev,
+			},
+			.wqe = wqe,
+		};
+		mlx5e_devtx_submit(&ctx.frame);
+	}
+
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
 
 	if (test_bit(MLX5E_SQ_STATE_XDP_MULTIBUF, &sq->state)) {
@@ -607,7 +659,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				  struct mlx5e_xdp_wqe_info *wi,
 				  u32 *xsk_frames,
-				  struct xdp_frame_bulk *bq)
+				  struct xdp_frame_bulk *bq,
+				  struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
 	u16 i;
@@ -626,6 +679,14 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 			dma_addr = xdpi.frame.dma_addr;
 
+			if (false && devtx_enabled()) {
+				struct mlx5e_devtx_frame ctx;
+
+				devtx_frame_from_xdp(&ctx.frame, xdpf, sq->cq.netdev);
+				ctx.cqe = cqe;
+				mlx5e_devtx_complete(&ctx.frame);
+			}
+
 			dma_unmap_single(sq->pdev, dma_addr,
 					 xdpf->len, DMA_TO_DEVICE);
 			if (xdp_frame_has_frags(xdpf)) {
@@ -659,6 +720,20 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 				page = xdpi.page.page;
 
+				if (false && devtx_enabled()) {
+					struct mlx5e_devtx_frame ctx = {
+						.frame = {
+							.data = page,
+							.len = PAGE_SIZE,
+							.meta_len = sq->xsk_pool->tx_metadata_len,
+							.netdev = sq->cq.netdev,
+						},
+						.cqe = cqe,
+					};
+
+					mlx5e_devtx_complete(&ctx.frame);
+				}
+
 				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 				 * as we know this is a page_pool page.
 				 */
@@ -670,6 +745,21 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 		}
 		case MLX5E_XDP_XMIT_MODE_XSK:
 			/* AF_XDP send */
+
+			if (devtx_enabled()) {
+				struct mlx5e_devtx_frame ctx = {
+					.frame = {
+						.data = xdpi.frame.xsk_head,
+						.len = xdpi.page.xsk_head_len,
+						.meta_len = sq->xsk_pool->tx_metadata_len,
+						.netdev = sq->cq.netdev,
+					},
+					.cqe = cqe,
+				};
+
+				mlx5e_devtx_complete(&ctx.frame);
+			}
+
 			(*xsk_frames)++;
 			break;
 		default:
@@ -720,7 +810,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 
 			sqcc += wi->num_wqebbs;
 
-			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cqe);
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -767,7 +857,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 
 		sq->cc += wi->num_wqebbs;
 
-		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL);
 	}
 
 	xdp_flush_frame_bulk(&bq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 9e8e6184f9e4..860638e1209b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -50,6 +50,11 @@ struct mlx5e_xdp_buff {
 	struct mlx5e_rq *rq;
 };
 
+struct mlx5e_xdp_md {
+	struct xdp_md md;
+	struct mlx5_cqe64 *cqe;
+};
+
 /* XDP packets can be transmitted in different ways. On completion, we need to
  * distinguish between them to clean up things in a proper way.
  */
@@ -82,18 +87,20 @@ enum mlx5e_xdp_xmit_mode {
  *    num, page_1, page_2, ... , page_num.
  *
  * MLX5E_XDP_XMIT_MODE_XSK:
- *    none.
+ *    frame.xsk_head + page.xsk_head_len for header portion only.
  */
 union mlx5e_xdp_info {
 	enum mlx5e_xdp_xmit_mode mode;
 	union {
 		struct xdp_frame *xdpf;
 		dma_addr_t dma_addr;
+		void *xsk_head;
 	} frame;
 	union {
 		struct mlx5e_rq *rq;
 		u8 num;
 		struct page *page;
+		u32 xsk_head_len;
 	} page;
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..1b97d6f6a9ba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -96,6 +96,9 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 
 		xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
 
+		xdpi.frame.xsk_head = xdptxd.data;
+		xdpi.page.xsk_head_len = xdptxd.len;
+
 		ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
 				      mlx5e_xmit_xdp_frame, sq, &xdptxd,
 				      check_result);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index c7eb6b238c2b..f8d3e210408a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -758,6 +758,14 @@ static void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq, struct mlx5e_t
 	for (i = 0; i < wi->num_fifo_pkts; i++) {
 		struct sk_buff *skb = mlx5e_skb_fifo_pop(&sq->db.skb_fifo);
 
+		if (false && devtx_enabled()) {
+			struct mlx5e_devtx_frame ctx = {};
+
+			devtx_frame_from_skb(&ctx.frame, skb, sq->cq.netdev);
+			ctx.cqe = cqe;
+			mlx5e_devtx_complete(&ctx.frame);
+		}
+
 		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
 	}
 }
@@ -826,6 +834,14 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			sqcc += wi->num_wqebbs;
 
 			if (likely(wi->skb)) {
+				if (false && devtx_enabled()) {
+					struct mlx5e_devtx_frame ctx = {};
+
+					devtx_frame_from_skb(&ctx.frame, wi->skb, cq->netdev);
+					ctx.cqe = cqe;
+					mlx5e_devtx_complete(&ctx.frame);
+				}
+
 				mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
 				mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index a7eb65cd0bdd..7160389a5bc6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -48,6 +48,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/version.h>
 #include <net/devlink.h>
+#include <net/devtx.h>
 #include "mlx5_core.h"
 #include "thermal.h"
 #include "lib/eq.h"
@@ -73,6 +74,7 @@
 #include "sf/dev/dev.h"
 #include "sf/sf.h"
 #include "mlx5_irq.h"
+#include "en/xdp.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -2132,6 +2134,19 @@ static void mlx5_core_verify_params(void)
 	}
 }
 
+__weak noinline void mlx5e_devtx_submit(struct devtx_frame *ctx)
+{
+}
+
+__weak noinline void mlx5e_devtx_complete(struct devtx_frame *ctx)
+{
+}
+
+BTF_SET8_START(mlx5e_devtx_hook_ids)
+BTF_ID_FLAGS(func, mlx5e_devtx_submit)
+BTF_ID_FLAGS(func, mlx5e_devtx_complete)
+BTF_SET8_END(mlx5e_devtx_hook_ids)
+
 static int __init mlx5_init(void)
 {
 	int err;
@@ -2144,9 +2159,15 @@ static int __init mlx5_init(void)
 	mlx5_core_verify_params();
 	mlx5_register_debugfs();
 
+	err = devtx_hooks_register(&mlx5e_devtx_hook_ids, &mlx5e_xdp_metadata_ops);
+	if (err) {
+		pr_warn("failed to register devtx hooks: %d", err);
+		goto err_debug;
+	}
+
 	err = mlx5e_init();
 	if (err)
-		goto err_debug;
+		goto err_devtx;
 
 	err = mlx5_sf_driver_register();
 	if (err)
@@ -2162,6 +2183,8 @@ static int __init mlx5_init(void)
 	mlx5_sf_driver_unregister();
 err_sf:
 	mlx5e_cleanup();
+err_devtx:
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -2169,6 +2192,7 @@ static int __init mlx5_init(void)
 
 static void __exit mlx5_cleanup(void)
 {
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_sf_driver_unregister();
 	mlx5e_cleanup();
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs
  2023-06-21 17:02 ` [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs Stanislav Fomichev
@ 2023-06-22  5:17   ` Alexei Starovoitov
  2023-06-22 17:55     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-22  5:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Wed, Jun 21, 2023 at 10:02 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> It is impossible to use skb_frag_t in the tracing program. So let's
> resolve a single typedef when walking the struct.
>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Pls send this one separately without RFC, but with a selftest,
so we can land it soon.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 00/11] bpf: Netdev TX metadata
  2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2023-06-21 17:02 ` [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata Stanislav Fomichev
@ 2023-06-22  8:41 ` Jesper Dangaard Brouer
  2023-06-22 17:55   ` Stanislav Fomichev
  11 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-22  8:41 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, netdev, xdp-hints


On 21/06/2023 19.02, Stanislav Fomichev wrote:
> CC'ing people only on the cover letter. Hopefully can find the rest via
> lore.

Could you please Cc me on all the patches, please.
(also please use hawk@kernel.org instead of my RH addr)

Also consider Cc'ing xdp-hints@xdp-project.net as we have end-users and
NIC engineers that can bring value to this conversation.

--Jesper


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-21 17:02 ` [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-06-22  9:11   ` Jesper D. Brouer
  2023-06-22 17:55     ` Stanislav Fomichev
  2023-06-22 15:26   ` Simon Horman
  1 sibling, 1 reply; 72+ messages in thread
From: Jesper D. Brouer @ 2023-06-22  9:11 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints


This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)

On 21/06/2023 19.02, Stanislav Fomichev wrote:
> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> and carry some TX metadata in the headroom. For copy mode, there
> is no way currently to populate skb metadata.
> 
> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> to treat as metadata. Metadata bytes come prior to tx_desc address
> (same as in RX case).

 From looking at the code, this introduces a socket option for this TX 
metadata length (tx_metadata_len).
This implies the same fixed TX metadata size is used for all packets.
Maybe describe this in patch desc.

What is the plan for dealing with cases that doesn't populate same/full
TX metadata size ?


> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   include/net/xdp_sock.h      |  1 +
>   include/net/xsk_buff_pool.h |  1 +
>   include/uapi/linux/if_xdp.h |  1 +
>   net/xdp/xsk.c               | 31 ++++++++++++++++++++++++++++++-
>   net/xdp/xsk_buff_pool.c     |  1 +
>   net/xdp/xsk_queue.h         |  7 ++++---
>   6 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e96a1151ec75..30018b3b862d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -51,6 +51,7 @@ struct xdp_sock {
>   	struct list_head flush_node;
>   	struct xsk_buff_pool *pool;
>   	u16 queue_id;
> +	u8 tx_metadata_len;
>   	bool zc;
>   	enum {
>   		XSK_READY = 0,
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index a8d7b8a3688a..751fea51a6af 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -75,6 +75,7 @@ struct xsk_buff_pool {
>   	u32 chunk_size;
>   	u32 chunk_shift;
>   	u32 frame_len;
> +	u8 tx_metadata_len; /* inherited from xsk_sock */
>   	u8 cached_need_wakeup;
>   	bool uses_need_wakeup;
>   	bool dma_need_sync;
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index a78a8096f4ce..2374eafff7db 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
>   #define XDP_UMEM_COMPLETION_RING	6
>   #define XDP_STATISTICS			7
>   #define XDP_OPTIONS			8
> +#define XDP_TX_METADATA_LEN		9
>   
>   struct xdp_umem_reg {
>   	__u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cc1e7f15fa73..c9b2daba7b6d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -485,6 +485,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>   		int err;
>   
>   		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> +		hr = max(hr, L1_CACHE_ALIGN((u32)xs->tx_metadata_len));
>   		tr = dev->needed_tailroom;
>   		len = desc->len;
>   
> @@ -493,14 +494,21 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>   			return ERR_PTR(err);
>   
>   		skb_reserve(skb, hr);
> -		skb_put(skb, len);
> +		skb_put(skb, len + xs->tx_metadata_len);
>   
>   		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> +		buffer -= xs->tx_metadata_len;
> +
>   		err = skb_store_bits(skb, 0, buffer, len);
>   		if (unlikely(err)) {
>   			kfree_skb(skb);
>   			return ERR_PTR(err);
>   		}
> +
> +		if (xs->tx_metadata_len) {
> +			skb_metadata_set(skb, xs->tx_metadata_len);
> +			__skb_pull(skb, xs->tx_metadata_len);
> +		}
>   	}
>   
>   	skb->dev = dev;
> @@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>   		mutex_unlock(&xs->mutex);
>   		return err;
>   	}
> +	case XDP_TX_METADATA_LEN:
> +	{
> +		int val;
> +
> +		if (optlen < sizeof(val))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&val, optval, sizeof(val)))
> +			return -EFAULT;
> +
> +		if (val >= 256)
> +			return -EINVAL;
> +
> +		mutex_lock(&xs->mutex);
> +		if (xs->state != XSK_READY) {
> +			mutex_unlock(&xs->mutex);
> +			return -EBUSY;
> +		}
> +		xs->tx_metadata_len = val;
> +		mutex_unlock(&xs->mutex);
> +		return err;
> +	}
>   	default:
>   		break;
>   	}
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 26f6d304451e..66ff9c345a67 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>   		XDP_PACKET_HEADROOM;
>   	pool->umem = umem;
>   	pool->addrs = umem->addrs;
> +	pool->tx_metadata_len = xs->tx_metadata_len;
>   	INIT_LIST_HEAD(&pool->free_list);
>   	INIT_LIST_HEAD(&pool->xsk_tx_list);
>   	spin_lock_init(&pool->xsk_tx_list_lock);
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 6d40a77fccbe..c8d287c18d64 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -133,12 +133,13 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
>   static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>   					    struct xdp_desc *desc)
>   {
> -	u64 offset = desc->addr & (pool->chunk_size - 1);
> +	u64 addr = desc->addr - pool->tx_metadata_len;
> +	u64 offset = addr & (pool->chunk_size - 1);
>   
>   	if (offset + desc->len > pool->chunk_size)
>   		return false;
>   
> -	if (desc->addr >= pool->addrs_cnt)
> +	if (addr >= pool->addrs_cnt)
>   		return false;
>   
>   	if (desc->options)
> @@ -149,7 +150,7 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>   static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
>   					      struct xdp_desc *desc)
>   {
> -	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
> +	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
>   
>   	if (desc->len > pool->chunk_size)
>   		return false;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc
  2023-06-21 17:02 ` [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
@ 2023-06-22 12:07   ` Jesper D. Brouer
  2023-06-22 17:55     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper D. Brouer @ 2023-06-22 12:07 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, netdev



On 21/06/2023 19.02, Stanislav Fomichev wrote:
> Two kfuncs, one per hook point:
> 
> 1. at submit time - bpf_devtx_sb_request_timestamp - to request HW
>     to put TX timestamp into TX completion descriptors
> 
> 2. at completion time - bpf_devtx_cp_timestamp - to read out
>     TX timestamp
> 
[...]
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 08fbd4622ccf..2fdb0731eb67 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
[...]
>   struct xdp_metadata_ops {
>   	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
>   	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
>   			       enum xdp_rss_hash_type *rss_type);
> +	int	(*xmo_sb_request_timestamp)(const struct devtx_frame *ctx);
> +	int	(*xmo_cp_timestamp)(const struct devtx_frame *ctx, u64 *timestamp);
>   };

The "sb" and "cp" abbreviations, what do they stand for?

--Jesper

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-21 17:02 ` [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
  2023-06-22  9:11   ` Jesper D. Brouer
@ 2023-06-22 15:26   ` Simon Horman
  2023-06-22 17:55     ` Stanislav Fomichev
  1 sibling, 1 reply; 72+ messages in thread
From: Simon Horman @ 2023-06-22 15:26 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Wed, Jun 21, 2023 at 10:02:36AM -0700, Stanislav Fomichev wrote:

...

> @@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return err;
>  	}
> +	case XDP_TX_METADATA_LEN:
> +	{
> +		int val;
> +
> +		if (optlen < sizeof(val))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&val, optval, sizeof(val)))
> +			return -EFAULT;
> +
> +		if (val >= 256)
> +			return -EINVAL;
> +
> +		mutex_lock(&xs->mutex);
> +		if (xs->state != XSK_READY) {
> +			mutex_unlock(&xs->mutex);
> +			return -EBUSY;
> +		}
> +		xs->tx_metadata_len = val;
> +		mutex_unlock(&xs->mutex);
> +		return err;

Hi Stan,

clang-16 complains that err is used uninitialised here.

 net/xdp/xsk.c:1167:10: warning: variable 'err' is uninitialized when used here [-Wuninitialized]
                 return err;
                        ^~~
 net/xdp/xsk.c:1065:9: note: initialize the variable 'err' to silence this warning
         int err;
                ^
                 = 0

> +	}
>  	default:
>  		break;
>  	}

...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-22 15:26   ` Simon Horman
@ 2023-06-22 17:55     ` Stanislav Fomichev
  0 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC (permalink / raw)
  To: Simon Horman
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Thu, Jun 22, 2023 at 8:26 AM Simon Horman <simon.horman@corigine.com> wrote:
>
> On Wed, Jun 21, 2023 at 10:02:36AM -0700, Stanislav Fomichev wrote:
>
> ...
>
> > @@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> >               mutex_unlock(&xs->mutex);
> >               return err;
> >       }
> > +     case XDP_TX_METADATA_LEN:
> > +     {
> > +             int val;
> > +
> > +             if (optlen < sizeof(val))
> > +                     return -EINVAL;
> > +             if (copy_from_sockptr(&val, optval, sizeof(val)))
> > +                     return -EFAULT;
> > +
> > +             if (val >= 256)
> > +                     return -EINVAL;
> > +
> > +             mutex_lock(&xs->mutex);
> > +             if (xs->state != XSK_READY) {
> > +                     mutex_unlock(&xs->mutex);
> > +                     return -EBUSY;
> > +             }
> > +             xs->tx_metadata_len = val;
> > +             mutex_unlock(&xs->mutex);
> > +             return err;
>
> Hi Stan,
>
> clang-16 complains that err is used uninitialised here.

Oh, thanks, will do 'return 0' instead!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-22  9:11   ` Jesper D. Brouer
@ 2023-06-22 17:55     ` Stanislav Fomichev
  2023-06-23 10:24       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC (permalink / raw)
  To: Jesper D. Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints

On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>
>
> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > and carry some TX metadata in the headroom. For copy mode, there
> > is no way currently to populate skb metadata.
> >
> > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > to treat as metadata. Metadata bytes come prior to tx_desc address
> > (same as in RX case).
>
>  From looking at the code, this introduces a socket option for this TX
> metadata length (tx_metadata_len).
> This implies the same fixed TX metadata size is used for all packets.
> Maybe describe this in patch desc.

I was planning to do a proper documentation page once we settle on all
the details (similar to the one we have for rx).

> What is the plan for dealing with cases that doesn't populate same/full
> TX metadata size ?

Do we need to support that? I was assuming that the TX layout would be
fixed between the userspace and BPF.
If every packet would have a different metadata length, it seems like
a nightmare to parse?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc
  2023-06-22 12:07   ` Jesper D. Brouer
@ 2023-06-22 17:55     ` Stanislav Fomichev
  0 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC (permalink / raw)
  To: Jesper D. Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, netdev

On Thu, Jun 22, 2023 at 5:07 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>
>
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > Two kfuncs, one per hook point:
> >
> > 1. at submit time - bpf_devtx_sb_request_timestamp - to request HW
> >     to put TX timestamp into TX completion descriptors
> >
> > 2. at completion time - bpf_devtx_cp_timestamp - to read out
> >     TX timestamp
> >
> [...]
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 08fbd4622ccf..2fdb0731eb67 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> [...]
> >   struct xdp_metadata_ops {
> >       int     (*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
> >       int     (*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
> >                              enum xdp_rss_hash_type *rss_type);
> > +     int     (*xmo_sb_request_timestamp)(const struct devtx_frame *ctx);
> > +     int     (*xmo_cp_timestamp)(const struct devtx_frame *ctx, u64 *timestamp);
> >   };
>
> The "sb" and "cp" abbreviations, what do they stand for?

SuBmit and ComPlete. Should I spell them out? Or any other suitable
abbreviation?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 00/11] bpf: Netdev TX metadata
  2023-06-22  8:41 ` [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Jesper Dangaard Brouer
@ 2023-06-22 17:55   ` Stanislav Fomichev
  0 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, netdev, xdp-hints

On Thu, Jun 22, 2023 at 1:41 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > CC'ing people only on the cover letter. Hopefully can find the rest via
> > lore.
>
> Could you please Cc me on all the patches, please.
> (also please use hawk@kernel.org instead of my RH addr)
>
> Also consider Cc'ing xdp-hints@xdp-project.net as we have end-users and
> NIC engineers that can bring value to this conversation.

Definitely! Didn't want to spam people too much (assuming they mostly
use the lore for reading).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs
  2023-06-22  5:17   ` Alexei Starovoitov
@ 2023-06-22 17:55     ` Stanislav Fomichev
  0 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Wed, Jun 21, 2023 at 10:18 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 21, 2023 at 10:02 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > It is impossible to use skb_frag_t in the tracing program. So let's
> > resolve a single typedef when walking the struct.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
> Pls send this one separately without RFC, but with a selftest,
> so we can land it soon.

Sure, will do!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-21 17:02 ` [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata Stanislav Fomichev
@ 2023-06-22 19:57   ` Alexei Starovoitov
  2023-06-22 20:13     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-22 19:57 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> WIP, not tested, only to show the overall idea.
> Non-AF_XDP paths are marked with 'false' for now.
> 
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
>  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
>  6 files changed, 156 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> index 879d698b6119..e4509464e0b1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> @@ -6,6 +6,7 @@
>  
>  #include "en.h"
>  #include <linux/indirect_call_wrapper.h>
> +#include <net/devtx.h>
>  
>  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
>  
> @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
>  
>  	return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
>  }
> +
> +struct mlx5e_devtx_frame {
> +	struct devtx_frame frame;
> +	struct mlx5_cqe64 *cqe; /* tx completion */

cqe is only valid at completion.

> +	struct mlx5e_tx_wqe *wqe; /* tx */

wqe is only valid at submission.

imo that's a very clear sign that this is not a generic datastructure.
The code is trying hard to make 'frame' part of it look common,
but it won't help bpf prog to be 'generic'.
It is still going to precisely coded for completion vs submission.
Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-22 19:57   ` Alexei Starovoitov
@ 2023-06-22 20:13     ` Stanislav Fomichev
  2023-06-22 21:47       ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 20:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Thu, Jun 22, 2023 at 12:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> > WIP, not tested, only to show the overall idea.
> > Non-AF_XDP paths are marked with 'false' for now.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
> >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
> >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
> >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
> >  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
> >  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
> >  6 files changed, 156 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > index 879d698b6119..e4509464e0b1 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > @@ -6,6 +6,7 @@
> >
> >  #include "en.h"
> >  #include <linux/indirect_call_wrapper.h>
> > +#include <net/devtx.h>
> >
> >  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> >
> > @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
> >
> >       return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
> >  }
> > +
> > +struct mlx5e_devtx_frame {
> > +     struct devtx_frame frame;
> > +     struct mlx5_cqe64 *cqe; /* tx completion */
>
> cqe is only valid at completion.
>
> > +     struct mlx5e_tx_wqe *wqe; /* tx */
>
> wqe is only valid at submission.
>
> imo that's a very clear sign that this is not a generic datastructure.
> The code is trying hard to make 'frame' part of it look common,
> but it won't help bpf prog to be 'generic'.
> It is still going to precisely coded for completion vs submission.
> Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
> As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.

The reason I went with this abstract context is to allow the programs
to be attached to the different devices.
For example, the xdp_hw_metadata we currently have is not really tied
down to the particular implementation.
If every hook declaration looks different, it seems impossible to
create portable programs.

The frame part is not really needed, we can probably rename it to ctx
and pass data/frags over the arguments?

struct devtx_ctx {
  struct net_device *netdev;
  /* the devices will be able to create wrappers to stash descriptor pointers */
};
void veth_devtx_submit(struct devtx_ctx *ctx, void *data, u16 len, u8
meta_len, struct skb_shared_info *sinfo);

But striving to have a similar hook declaration seems useful to
program portability sake?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-22 20:13     ` Stanislav Fomichev
@ 2023-06-22 21:47       ` Alexei Starovoitov
  2023-06-22 22:13         ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-22 21:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Thu, Jun 22, 2023 at 1:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Thu, Jun 22, 2023 at 12:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> > > WIP, not tested, only to show the overall idea.
> > > Non-AF_XDP paths are marked with 'false' for now.
> > >
> > > Cc: netdev@vger.kernel.org
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
> > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
> > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
> > >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
> > >  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
> > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
> > >  6 files changed, 156 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > index 879d698b6119..e4509464e0b1 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > @@ -6,6 +6,7 @@
> > >
> > >  #include "en.h"
> > >  #include <linux/indirect_call_wrapper.h>
> > > +#include <net/devtx.h>
> > >
> > >  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> > >
> > > @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
> > >
> > >       return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
> > >  }
> > > +
> > > +struct mlx5e_devtx_frame {
> > > +     struct devtx_frame frame;
> > > +     struct mlx5_cqe64 *cqe; /* tx completion */
> >
> > cqe is only valid at completion.
> >
> > > +     struct mlx5e_tx_wqe *wqe; /* tx */
> >
> > wqe is only valid at submission.
> >
> > imo that's a very clear sign that this is not a generic datastructure.
> > The code is trying hard to make 'frame' part of it look common,
> > but it won't help bpf prog to be 'generic'.
> > It is still going to precisely coded for completion vs submission.
> > Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
> > As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.
>
> The reason I went with this abstract context is to allow the programs
> to be attached to the different devices.
> For example, the xdp_hw_metadata we currently have is not really tied
> down to the particular implementation.
> If every hook declaration looks different, it seems impossible to
> create portable programs.
>
> The frame part is not really needed, we can probably rename it to ctx
> and pass data/frags over the arguments?
>
> struct devtx_ctx {
>   struct net_device *netdev;
>   /* the devices will be able to create wrappers to stash descriptor pointers */
> };
> void veth_devtx_submit(struct devtx_ctx *ctx, void *data, u16 len, u8
> meta_len, struct skb_shared_info *sinfo);
>
> But striving to have a similar hook declaration seems useful to
> program portability sake?

portability across what ?
'timestamp' on veth doesn't have a real use. It's testing only.
Even testing is a bit dubious.
I can see a need for bpf prog to run in the datacenter on mlx, brcm
and whatever other nics, but they will have completely different
hw descriptors. timestamp kfuncs to request/read can be common,
but to read the descriptors bpf prog authors would need to write
different code anyway.
So kernel code going out its way to present somewhat common devtx_ctx
just doesn't help. It adds code to the kernel, but bpf prog still
has to be tailored for mlx and brcm differently.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-22 21:47       ` Alexei Starovoitov
@ 2023-06-22 22:13         ` Stanislav Fomichev
  2023-06-23  2:35           ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-22 22:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Thu, Jun 22, 2023 at 2:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 22, 2023 at 1:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Thu, Jun 22, 2023 at 12:58 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> > > > WIP, not tested, only to show the overall idea.
> > > > Non-AF_XDP paths are marked with 'false' for now.
> > > >
> > > > Cc: netdev@vger.kernel.org
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
> > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
> > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
> > > >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
> > > >  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
> > > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
> > > >  6 files changed, 156 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > index 879d698b6119..e4509464e0b1 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > @@ -6,6 +6,7 @@
> > > >
> > > >  #include "en.h"
> > > >  #include <linux/indirect_call_wrapper.h>
> > > > +#include <net/devtx.h>
> > > >
> > > >  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> > > >
> > > > @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
> > > >
> > > >       return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
> > > >  }
> > > > +
> > > > +struct mlx5e_devtx_frame {
> > > > +     struct devtx_frame frame;
> > > > +     struct mlx5_cqe64 *cqe; /* tx completion */
> > >
> > > cqe is only valid at completion.
> > >
> > > > +     struct mlx5e_tx_wqe *wqe; /* tx */
> > >
> > > wqe is only valid at submission.
> > >
> > > imo that's a very clear sign that this is not a generic datastructure.
> > > The code is trying hard to make 'frame' part of it look common,
> > > but it won't help bpf prog to be 'generic'.
> > > It is still going to precisely coded for completion vs submission.
> > > Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
> > > As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.
> >
> > The reason I went with this abstract context is to allow the programs
> > to be attached to the different devices.
> > For example, the xdp_hw_metadata we currently have is not really tied
> > down to the particular implementation.
> > If every hook declaration looks different, it seems impossible to
> > create portable programs.
> >
> > The frame part is not really needed, we can probably rename it to ctx
> > and pass data/frags over the arguments?
> >
> > struct devtx_ctx {
> >   struct net_device *netdev;
> >   /* the devices will be able to create wrappers to stash descriptor pointers */
> > };
> > void veth_devtx_submit(struct devtx_ctx *ctx, void *data, u16 len, u8
> > meta_len, struct skb_shared_info *sinfo);
> >
> > But striving to have a similar hook declaration seems useful to
> > program portability sake?
>
> portability across what ?
> 'timestamp' on veth doesn't have a real use. It's testing only.
> Even testing is a bit dubious.
> I can see a need for bpf prog to run in the datacenter on mlx, brcm
> and whatever other nics, but they will have completely different
> hw descriptors. timestamp kfuncs to request/read can be common,
> but to read the descriptors bpf prog authors would need to write
> different code anyway.
> So kernel code going out its way to present somewhat common devtx_ctx
> just doesn't help. It adds code to the kernel, but bpf prog still
> has to be tailored for mlx and brcm differently.

Isn't it the same discussion/arguments we had during the RX series?
We want to provide common sane interfaces/abstractions via kfuncs.
That will make most BPF programs portable from mlx to brcm (for
example) without doing a rewrite.
We're also exposing raw (readonly) descriptors (via that get_ctx
helper) to the users who know what to do with them.
Most users don't know what to do with raw descriptors; the specs are
not public; things can change depending on fw version/etc/etc.
So the progs that touch raw descriptors are not the primary use-case.
(that was the tl;dr for rx part, seems like it applies here?)

Let's maybe discuss that mlx5 example? Are you proposing to do
something along these lines?

void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);

If yes, I'm missing how we define the common kfuncs in this case. The
kfuncs need to have some common context. We're defining them with:
bpf_devtx_<kfunc>(const struct devtx_frame *ctx);

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-22 22:13         ` Stanislav Fomichev
@ 2023-06-23  2:35           ` Alexei Starovoitov
  2023-06-23 10:16             ` Maryam Tahhan
                               ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-23  2:35 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Thu, Jun 22, 2023 at 3:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Thu, Jun 22, 2023 at 2:47 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 22, 2023 at 1:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Thu, Jun 22, 2023 at 12:58 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> > > > > WIP, not tested, only to show the overall idea.
> > > > > Non-AF_XDP paths are marked with 'false' for now.
> > > > >
> > > > > Cc: netdev@vger.kernel.org
> > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > ---
> > > > >  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
> > > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
> > > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
> > > > >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
> > > > >  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
> > > > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
> > > > >  6 files changed, 156 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > index 879d698b6119..e4509464e0b1 100644
> > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > @@ -6,6 +6,7 @@
> > > > >
> > > > >  #include "en.h"
> > > > >  #include <linux/indirect_call_wrapper.h>
> > > > > +#include <net/devtx.h>
> > > > >
> > > > >  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> > > > >
> > > > > @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
> > > > >
> > > > >       return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
> > > > >  }
> > > > > +
> > > > > +struct mlx5e_devtx_frame {
> > > > > +     struct devtx_frame frame;
> > > > > +     struct mlx5_cqe64 *cqe; /* tx completion */
> > > >
> > > > cqe is only valid at completion.
> > > >
> > > > > +     struct mlx5e_tx_wqe *wqe; /* tx */
> > > >
> > > > wqe is only valid at submission.
> > > >
> > > > imo that's a very clear sign that this is not a generic datastructure.
> > > > The code is trying hard to make 'frame' part of it look common,
> > > > but it won't help bpf prog to be 'generic'.
> > > > It is still going to precisely coded for completion vs submission.
> > > > Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
> > > > As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.
> > >
> > > The reason I went with this abstract context is to allow the programs
> > > to be attached to the different devices.
> > > For example, the xdp_hw_metadata we currently have is not really tied
> > > down to the particular implementation.
> > > If every hook declaration looks different, it seems impossible to
> > > create portable programs.
> > >
> > > The frame part is not really needed, we can probably rename it to ctx
> > > and pass data/frags over the arguments?
> > >
> > > struct devtx_ctx {
> > >   struct net_device *netdev;
> > >   /* the devices will be able to create wrappers to stash descriptor pointers */
> > > };
> > > void veth_devtx_submit(struct devtx_ctx *ctx, void *data, u16 len, u8
> > > meta_len, struct skb_shared_info *sinfo);
> > >
> > > But striving to have a similar hook declaration seems useful to
> > > program portability sake?
> >
> > portability across what ?
> > 'timestamp' on veth doesn't have a real use. It's testing only.
> > Even testing is a bit dubious.
> > I can see a need for bpf prog to run in the datacenter on mlx, brcm
> > and whatever other nics, but they will have completely different
> > hw descriptors. timestamp kfuncs to request/read can be common,
> > but to read the descriptors bpf prog authors would need to write
> > different code anyway.
> > So kernel code going out its way to present somewhat common devtx_ctx
> > just doesn't help. It adds code to the kernel, but bpf prog still
> > has to be tailored for mlx and brcm differently.
>
> Isn't it the same discussion/arguments we had during the RX series?

Right, but there we already have xdp_md as an abstraction.
Extra kfuncs don't change that.
Here is the whole new 'ctx' being proposed with assumption that
it will be shared between completion and submission and will be
useful in both.

But there is skb at submission time and no skb at completion.
xdp_frame is there, but it's the last record of what was sent on the wire.
Parsing it with bpf is like examining steps in a sand. They are gone.
Parsing at submission makes sense, not at completion
and the driver has a way to associate wqe with cqe.

> We want to provide common sane interfaces/abstractions via kfuncs.
> That will make most BPF programs portable from mlx to brcm (for
> example) without doing a rewrite.
> We're also exposing raw (readonly) descriptors (via that get_ctx
> helper) to the users who know what to do with them.
> Most users don't know what to do with raw descriptors;

Why do you think so?
Who are those users?
I see your proposal and thumbs up from onlookers.
afaict there are zero users for rx side hw hints too.

> the specs are
> not public; things can change depending on fw version/etc/etc.
> So the progs that touch raw descriptors are not the primary use-case.
> (that was the tl;dr for rx part, seems like it applies here?)
>
> Let's maybe discuss that mlx5 example? Are you proposing to do
> something along these lines?
>
> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
>
> If yes, I'm missing how we define the common kfuncs in this case. The
> kfuncs need to have some common context. We're defining them with:
> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);

I'm looking at xdp_metadata and wondering who's using it.
I haven't seen a single bug report.
No bugs means no one is using it. There is zero chance that we managed
to implement it bug-free on the first try.
So new tx side things look like a feature creep to me.
rx side is far from proven to be useful for anything.
Yet you want to add new things.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23  2:35           ` Alexei Starovoitov
@ 2023-06-23 10:16             ` Maryam Tahhan
  2023-06-23 16:32               ` Alexei Starovoitov
  2023-06-23 17:24             ` Stanislav Fomichev
  2023-06-23 18:57             ` Donald Hunter
  2 siblings, 1 reply; 72+ messages in thread
From: Maryam Tahhan @ 2023-06-23 10:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development, Wiles, Keith,
	Jesper Brouer

On 23/06/2023 03:35, Alexei Starovoitov wrote:
> Why do you think so?
> Who are those users?
> I see your proposal and thumbs up from onlookers.
> afaict there are zero users for rx side hw hints too.
>
>> the specs are
>> not public; things can change depending on fw version/etc/etc.
>> So the progs that touch raw descriptors are not the primary use-case.
>> (that was the tl;dr for rx part, seems like it applies here?)
>>
>> Let's maybe discuss that mlx5 example? Are you proposing to do
>> something along these lines?
>>
>> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
>> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
>>
>> If yes, I'm missing how we define the common kfuncs in this case. The
>> kfuncs need to have some common context. We're defining them with:
>> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
> I'm looking at xdp_metadata and wondering who's using it.
> I haven't seen a single bug report.
> No bugs means no one is using it. There is zero chance that we managed
> to implement it bug-free on the first try.
> So new tx side things look like a feature creep to me.
> rx side is far from proven to be useful for anything.
> Yet you want to add new things.
>

Hi folks

We in CNDP (https://github.com/CloudNativeDataPlane/cndp) have been 
looking to use xdp_metadata to relay receive side offloads from the NIC 
to our AF_XDP applications. We see this is a key feature that is 
essential for the viability of AF_XDP in the real world. We would love 
to see something adopted for the TX side alongside what's on the RX 
side. We don't want to waste cycles do everything in software when the 
NIC HW supports many features that we need.

Thank you
Maryam


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-22 17:55     ` Stanislav Fomichev
@ 2023-06-23 10:24       ` Jesper Dangaard Brouer
  2023-06-23 17:41         ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-23 10:24 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints



On 22/06/2023 19.55, Stanislav Fomichev wrote:
> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>>
>>
>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>>
>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>>> and carry some TX metadata in the headroom. For copy mode, there
>>> is no way currently to populate skb metadata.
>>>
>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>>> (same as in RX case).
>>
>>   From looking at the code, this introduces a socket option for this TX
>> metadata length (tx_metadata_len).
>> This implies the same fixed TX metadata size is used for all packets.
>> Maybe describe this in patch desc.
> 
> I was planning to do a proper documentation page once we settle on all
> the details (similar to the one we have for rx).
> 
>> What is the plan for dealing with cases that doesn't populate same/full
>> TX metadata size ?
> 
> Do we need to support that? I was assuming that the TX layout would be
> fixed between the userspace and BPF.

I hope you don't mean fixed layout, as the whole point is adding
flexibility and extensibility.

> If every packet would have a different metadata length, it seems like
> a nightmare to parse?
> 

No parsing is really needed.  We can simply use BTF IDs and type cast in
BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
see [1] and [2].

It seems we are talking slightly past each-other(?).  Let me rephrase
and reframe the question, what is your *plan* for dealing with different
*types* of TX metadata.  The different struct *types* will of-cause have
different sizes, but that is okay as long as they fit into the maximum
size set by this new socket option XDP_TX_METADATA_LEN.
Thus, in principle I'm fine with XSK having configured a fixed headroom
for metadata, but we need a plan for handling more than one type and
perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
data ("leftover" since last time this mem was used).

With this kfunc approach, then things in-principle, becomes a contract
between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
components can as illustrated here [1]+[2] can coordinate based on local
BPF-prog BTF IDs.  This approach works as-is today, but patchset
selftests examples don't use this and instead have a very static
approach (that people will copy-paste).

An unsolved problem with TX-hook is that it can also get packets from
XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
BPF-prog know if metadata is valid and intended to be used for e.g.
requesting the timestamp? (imagine metadata size happen to match)

--Jesper

BPF-prog API bpf_core_type_id_local:
  - [1] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80

Userspace API btf__find_by_name_kind:
  - [2] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs
  2023-06-21 17:02 ` [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
@ 2023-06-23 11:12   ` Jesper D. Brouer
  2023-06-23 17:40     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper D. Brouer @ 2023-06-23 11:12 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, netdev, xdp-hints



On 21/06/2023 19.02, Stanislav Fomichev wrote:
> Attach kfuncs that request and report TX timestamp via ringbuf.
> Confirm on the userspace side that the program has triggered
> and the timestamp is non-zero.
> 
> Also make sure devtx_frame has a sensible pointers and data.
> 
[...]


> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index d151d406a123..fc025183d45a 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
[...]
> @@ -19,10 +24,25 @@ struct {
>   	__type(value, __u32);
>   } prog_arr SEC(".maps");
>   
> +struct {
> +	__uint(type, BPF_MAP_TYPE_RINGBUF);
> +	__uint(max_entries, 10);
> +} tx_compl_buf SEC(".maps");
> +
> +__u64 pkts_fail_tx = 0;
> +
> +int ifindex = -1;
> +__u64 net_cookie = -1;
> +
>   extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
>   					 __u64 *timestamp) __ksym;
>   extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
>   				    enum xdp_rss_hash_type *rss_type) __ksym;
> +extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
> +extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
> +
> +extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
> +extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
>   
>   SEC("xdp")
>   int rx(struct xdp_md *ctx)
> @@ -61,4 +81,102 @@ int rx(struct xdp_md *ctx)
>   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>   }
>   
> +static inline int verify_frame(const struct devtx_frame *frame)
> +{
> +	struct ethhdr eth = {};
> +
> +	/* all the pointers are set up correctly */
> +	if (!frame->data)
> +		return -1;
> +	if (!frame->sinfo)
> +		return -1;
> +
> +	/* can get to the frags */
> +	if (frame->sinfo->nr_frags != 0)
> +		return -1;
> +	if (frame->sinfo->frags[0].bv_page != 0)
> +		return -1;
> +	if (frame->sinfo->frags[0].bv_len != 0)
> +		return -1;
> +	if (frame->sinfo->frags[0].bv_offset != 0)
> +		return -1;
> +
> +	/* the data has something that looks like ethernet */
> +	if (frame->len != 46)
> +		return -1;
> +	bpf_probe_read_kernel(&eth, sizeof(eth), frame->data);
> +
> +	if (eth.h_proto != bpf_htons(ETH_P_IP))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +SEC("fentry/veth_devtx_submit")
> +int BPF_PROG(tx_submit, const struct devtx_frame *frame)
> +{
> +	struct xdp_tx_meta meta = {};
> +	int ret;
> +
> +	if (frame->netdev->ifindex != ifindex)
> +		return 0;
> +	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
> +		return 0;
> +	if (frame->meta_len != TX_META_LEN)
> +		return 0;
> +
> +	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
> +	if (!meta.request_timestamp)
> +		return 0;
> +
> +	ret = verify_frame(frame);
> +	if (ret < 0) {
> +		__sync_add_and_fetch(&pkts_fail_tx, 1);
> +		return 0;
> +	}
> +
> +	ret = bpf_devtx_sb_request_timestamp(frame);

My original design thoughts were that BPF-progs would write into
metadata area, with the intend that at TX-complete we can access this
metadata area again.

In this case with request_timestamp it would make sense to me, to store
a sequence number (+ the TX-queue number), such that program code can
correlate on complete event.

Like xdp_hw_metadata example, I would likely also to add a software 
timestamp, what I could check at TX complete hook.

> +	if (ret < 0) {
> +		__sync_add_and_fetch(&pkts_fail_tx, 1);
> +		return 0;
> +	}
> +
> +	return 0;
> +}
> +
> +SEC("fentry/veth_devtx_complete")
> +int BPF_PROG(tx_complete, const struct devtx_frame *frame)
> +{
> +	struct xdp_tx_meta meta = {};
> +	struct devtx_sample *sample;
> +	int ret;
> +
> +	if (frame->netdev->ifindex != ifindex)
> +		return 0;
> +	if (frame->netdev->nd_net.net->net_cookie != net_cookie)
> +		return 0;
> +	if (frame->meta_len != TX_META_LEN)
> +		return 0;
> +
> +	bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
> +	if (!meta.request_timestamp)
> +		return 0;
> +
> +	ret = verify_frame(frame);
> +	if (ret < 0) {
> +		__sync_add_and_fetch(&pkts_fail_tx, 1);
> +		return 0;
> +	}
> +
> +	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
> +	if (!sample)
> +		return 0;

Sending this via a ringbuffer to userspace, will make it hard to
correlate. (For AF_XDP it would help a little to add the TX-queue
number, as this hook isn't queue bound but AF_XDP is).

> +
> +	sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
> +

I were expecting to see, information being written into the metadata 
area of the frame, such that AF_XDP completion-queue handling can 
extract this obtained timestamp.


> +	bpf_ringbuf_submit(sample, 0);
> +
> +	return 0;
> +}
> +
>   char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
> index 938a729bd307..e410f2b95e64 100644
> --- a/tools/testing/selftests/bpf/xdp_metadata.h
> +++ b/tools/testing/selftests/bpf/xdp_metadata.h
> @@ -18,3 +18,17 @@ struct xdp_meta {
>   		__s32 rx_hash_err;
>   	};
>   };
> +
> +struct devtx_sample {
> +	int timestamp_retval;
> +	__u64 timestamp;
> +};
> +
> +#define TX_META_LEN	8

Very static design.

> +
> +struct xdp_tx_meta {
> +	__u8 request_timestamp;
> +	__u8 padding0;
> +	__u16 padding1;
> +	__u32 padding2;
> +};

padding2 could be a btf_id for creating a more flexible design.

--Jesper

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23 10:16             ` Maryam Tahhan
@ 2023-06-23 16:32               ` Alexei Starovoitov
  2023-06-23 17:47                 ` Maryam Tahhan
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-23 16:32 UTC (permalink / raw)
  To: Maryam Tahhan
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Network Development, Wiles, Keith, Jesper Brouer

On Fri, Jun 23, 2023 at 3:16 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> On 23/06/2023 03:35, Alexei Starovoitov wrote:
> > Why do you think so?
> > Who are those users?
> > I see your proposal and thumbs up from onlookers.
> > afaict there are zero users for rx side hw hints too.
> >
> >> the specs are
> >> not public; things can change depending on fw version/etc/etc.
> >> So the progs that touch raw descriptors are not the primary use-case.
> >> (that was the tl;dr for rx part, seems like it applies here?)
> >>
> >> Let's maybe discuss that mlx5 example? Are you proposing to do
> >> something along these lines?
> >>
> >> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
> >> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
> >>
> >> If yes, I'm missing how we define the common kfuncs in this case. The
> >> kfuncs need to have some common context. We're defining them with:
> >> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
> > I'm looking at xdp_metadata and wondering who's using it.
> > I haven't seen a single bug report.
> > No bugs means no one is using it. There is zero chance that we managed
> > to implement it bug-free on the first try.
> > So new tx side things look like a feature creep to me.
> > rx side is far from proven to be useful for anything.
> > Yet you want to add new things.
> >
>
> Hi folks
>
> We in CNDP (https://github.com/CloudNativeDataPlane/cndp) have been

with TCP stack in user space over af_xdp...

> looking to use xdp_metadata to relay receive side offloads from the NIC
> to our AF_XDP applications. We see this is a key feature that is
> essential for the viability of AF_XDP in the real world. We would love
> to see something adopted for the TX side alongside what's on the RX
> side. We don't want to waste cycles do everything in software when the
> NIC HW supports many features that we need.

Please specify "many features". If that means HW TSO to accelerate
your TCP in user space, then sorry, but no.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23  2:35           ` Alexei Starovoitov
  2023-06-23 10:16             ` Maryam Tahhan
@ 2023-06-23 17:24             ` Stanislav Fomichev
  2023-06-23 18:57             ` Donald Hunter
  2 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-23 17:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Thu, Jun 22, 2023 at 7:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 22, 2023 at 3:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Thu, Jun 22, 2023 at 2:47 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jun 22, 2023 at 1:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Thu, Jun 22, 2023 at 12:58 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 21, 2023 at 10:02:44AM -0700, Stanislav Fomichev wrote:
> > > > > > WIP, not tested, only to show the overall idea.
> > > > > > Non-AF_XDP paths are marked with 'false' for now.
> > > > > >
> > > > > > Cc: netdev@vger.kernel.org
> > > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > > ---
> > > > > >  .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 11 +++
> > > > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 96 ++++++++++++++++++-
> > > > > >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  9 +-
> > > > > >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  3 +
> > > > > >  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 16 ++++
> > > > > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 26 ++++-
> > > > > >  6 files changed, 156 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > > index 879d698b6119..e4509464e0b1 100644
> > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > > > > > @@ -6,6 +6,7 @@
> > > > > >
> > > > > >  #include "en.h"
> > > > > >  #include <linux/indirect_call_wrapper.h>
> > > > > > +#include <net/devtx.h>
> > > > > >
> > > > > >  #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> > > > > >
> > > > > > @@ -506,4 +507,14 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
> > > > > >
> > > > > >       return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
> > > > > >  }
> > > > > > +
> > > > > > +struct mlx5e_devtx_frame {
> > > > > > +     struct devtx_frame frame;
> > > > > > +     struct mlx5_cqe64 *cqe; /* tx completion */
> > > > >
> > > > > cqe is only valid at completion.
> > > > >
> > > > > > +     struct mlx5e_tx_wqe *wqe; /* tx */
> > > > >
> > > > > wqe is only valid at submission.
> > > > >
> > > > > imo that's a very clear sign that this is not a generic datastructure.
> > > > > The code is trying hard to make 'frame' part of it look common,
> > > > > but it won't help bpf prog to be 'generic'.
> > > > > It is still going to precisely coded for completion vs submission.
> > > > > Similarly a bpf prog for completion in veth will be different than bpf prog for completion in mlx5.
> > > > > As I stated earlier this 'generalization' and 'common' datastructure only adds code complexity.
> > > >
> > > > The reason I went with this abstract context is to allow the programs
> > > > to be attached to the different devices.
> > > > For example, the xdp_hw_metadata we currently have is not really tied
> > > > down to the particular implementation.
> > > > If every hook declaration looks different, it seems impossible to
> > > > create portable programs.
> > > >
> > > > The frame part is not really needed, we can probably rename it to ctx
> > > > and pass data/frags over the arguments?
> > > >
> > > > struct devtx_ctx {
> > > >   struct net_device *netdev;
> > > >   /* the devices will be able to create wrappers to stash descriptor pointers */
> > > > };
> > > > void veth_devtx_submit(struct devtx_ctx *ctx, void *data, u16 len, u8
> > > > meta_len, struct skb_shared_info *sinfo);
> > > >
> > > > But striving to have a similar hook declaration seems useful to
> > > > program portability sake?
> > >
> > > portability across what ?
> > > 'timestamp' on veth doesn't have a real use. It's testing only.
> > > Even testing is a bit dubious.
> > > I can see a need for bpf prog to run in the datacenter on mlx, brcm
> > > and whatever other nics, but they will have completely different
> > > hw descriptors. timestamp kfuncs to request/read can be common,
> > > but to read the descriptors bpf prog authors would need to write
> > > different code anyway.
> > > So kernel code going out its way to present somewhat common devtx_ctx
> > > just doesn't help. It adds code to the kernel, but bpf prog still
> > > has to be tailored for mlx and brcm differently.
> >
> > Isn't it the same discussion/arguments we had during the RX series?
>
> Right, but there we already have xdp_md as an abstraction.
> Extra kfuncs don't change that.
> Here is the whole new 'ctx' being proposed with assumption that
> it will be shared between completion and submission and will be
> useful in both.
>
> But there is skb at submission time and no skb at completion.
> xdp_frame is there, but it's the last record of what was sent on the wire.
> Parsing it with bpf is like examining steps in a sand. They are gone.
> Parsing at submission makes sense, not at completion
> and the driver has a way to associate wqe with cqe.

Right, and I'm not exposing neither skb nor xdp_md/frame, so we're on
the same page?
Or are you suggesting to further split devtx_frame into two contexts?
One for submit and another for complete?
And don't expose the payload at the complete time?
Having payload at complete might still be useful though, at least the header.
In case the users want only to inspect completion based on some marker/flow.

> > We want to provide common sane interfaces/abstractions via kfuncs.
> > That will make most BPF programs portable from mlx to brcm (for
> > example) without doing a rewrite.
> > We're also exposing raw (readonly) descriptors (via that get_ctx
> > helper) to the users who know what to do with them.
> > Most users don't know what to do with raw descriptors;
>
> Why do you think so?
> Who are those users?
> I see your proposal and thumbs up from onlookers.
> afaict there are zero users for rx side hw hints too.

My bias comes from the point of view of our internal use-cases where
we'd like to have rx/tx timestamps in the device-agnostic fashion.
I'm happy to incorporate other requirements as I did with exposing raw
descriptors at rx using get_ctx helper.
Regarding the usage: for the external ones I'm assuming it will take
time until it all percolates through the distros...

> > the specs are
> > not public; things can change depending on fw version/etc/etc.
> > So the progs that touch raw descriptors are not the primary use-case.
> > (that was the tl;dr for rx part, seems like it applies here?)
> >
> > Let's maybe discuss that mlx5 example? Are you proposing to do
> > something along these lines?
> >
> > void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
> > void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
> >
> > If yes, I'm missing how we define the common kfuncs in this case. The
> > kfuncs need to have some common context. We're defining them with:
> > bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
>
> I'm looking at xdp_metadata and wondering who's using it.
> I haven't seen a single bug report.
> No bugs means no one is using it. There is zero chance that we managed
> to implement it bug-free on the first try.
> So new tx side things look like a feature creep to me.
> rx side is far from proven to be useful for anything.
> Yet you want to add new things.

I've been talking about both tx and rx timestamps right from the
beginning, so it's not really a new feature.
But what's the concern here? IIUC, the whole point of it being
kfunc-based is that we can wipe it all if/when it becomes a dead
weight.

Regarding the users, there is also a bit of a chicken and egg problem:
We have some internal interest in using AF_XDP, but it lacks multibuf
(which is in the review) and the offloads (which I'm trying to move
forward for both rx/tx).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs
  2023-06-23 11:12   ` Jesper D. Brouer
@ 2023-06-23 17:40     ` Stanislav Fomichev
  0 siblings, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-23 17:40 UTC (permalink / raw)
  To: Jesper D. Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, netdev, xdp-hints

On Fri, Jun 23, 2023 at 4:12 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>
>
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > Attach kfuncs that request and report TX timestamp via ringbuf.
> > Confirm on the userspace side that the program has triggered
> > and the timestamp is non-zero.
> >
> > Also make sure devtx_frame has a sensible pointers and data.
> >
> [...]
>
>
> > diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > index d151d406a123..fc025183d45a 100644
> > --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> [...]
> > @@ -19,10 +24,25 @@ struct {
> >       __type(value, __u32);
> >   } prog_arr SEC(".maps");
> >
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_RINGBUF);
> > +     __uint(max_entries, 10);
> > +} tx_compl_buf SEC(".maps");
> > +
> > +__u64 pkts_fail_tx = 0;
> > +
> > +int ifindex = -1;
> > +__u64 net_cookie = -1;
> > +
> >   extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
> >                                        __u64 *timestamp) __ksym;
> >   extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
> >                                   enum xdp_rss_hash_type *rss_type) __ksym;
> > +extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym;
> > +extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym;
> > +
> > +extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym;
> > +extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym;
> >
> >   SEC("xdp")
> >   int rx(struct xdp_md *ctx)
> > @@ -61,4 +81,102 @@ int rx(struct xdp_md *ctx)
> >       return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
> >   }
> >
> > +static inline int verify_frame(const struct devtx_frame *frame)
> > +{
> > +     struct ethhdr eth = {};
> > +
> > +     /* all the pointers are set up correctly */
> > +     if (!frame->data)
> > +             return -1;
> > +     if (!frame->sinfo)
> > +             return -1;
> > +
> > +     /* can get to the frags */
> > +     if (frame->sinfo->nr_frags != 0)
> > +             return -1;
> > +     if (frame->sinfo->frags[0].bv_page != 0)
> > +             return -1;
> > +     if (frame->sinfo->frags[0].bv_len != 0)
> > +             return -1;
> > +     if (frame->sinfo->frags[0].bv_offset != 0)
> > +             return -1;
> > +
> > +     /* the data has something that looks like ethernet */
> > +     if (frame->len != 46)
> > +             return -1;
> > +     bpf_probe_read_kernel(&eth, sizeof(eth), frame->data);
> > +
> > +     if (eth.h_proto != bpf_htons(ETH_P_IP))
> > +             return -1;
> > +
> > +     return 0;
> > +}
> > +
> > +SEC("fentry/veth_devtx_submit")
> > +int BPF_PROG(tx_submit, const struct devtx_frame *frame)
> > +{
> > +     struct xdp_tx_meta meta = {};
> > +     int ret;
> > +
> > +     if (frame->netdev->ifindex != ifindex)
> > +             return 0;
> > +     if (frame->netdev->nd_net.net->net_cookie != net_cookie)
> > +             return 0;
> > +     if (frame->meta_len != TX_META_LEN)
> > +             return 0;
> > +
> > +     bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
> > +     if (!meta.request_timestamp)
> > +             return 0;
> > +
> > +     ret = verify_frame(frame);
> > +     if (ret < 0) {
> > +             __sync_add_and_fetch(&pkts_fail_tx, 1);
> > +             return 0;
> > +     }
> > +
> > +     ret = bpf_devtx_sb_request_timestamp(frame);
>
> My original design thoughts were that BPF-progs would write into
> metadata area, with the intend that at TX-complete we can access this
> metadata area again.
>
> In this case with request_timestamp it would make sense to me, to store
> a sequence number (+ the TX-queue number), such that program code can
> correlate on complete event.

Yeah, we can probably follow up on that. I'm trying to start with a
read-only path for now.
Can we expose metadata mutating operations via some new kfunc helpers?
Something that returns a ptr/dynptr to the metadata portion?

> Like xdp_hw_metadata example, I would likely also to add a software
> timestamp, what I could check at TX complete hook.
>
> > +     if (ret < 0) {
> > +             __sync_add_and_fetch(&pkts_fail_tx, 1);
> > +             return 0;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +SEC("fentry/veth_devtx_complete")
> > +int BPF_PROG(tx_complete, const struct devtx_frame *frame)
> > +{
> > +     struct xdp_tx_meta meta = {};
> > +     struct devtx_sample *sample;
> > +     int ret;
> > +
> > +     if (frame->netdev->ifindex != ifindex)
> > +             return 0;
> > +     if (frame->netdev->nd_net.net->net_cookie != net_cookie)
> > +             return 0;
> > +     if (frame->meta_len != TX_META_LEN)
> > +             return 0;
> > +
> > +     bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN);
> > +     if (!meta.request_timestamp)
> > +             return 0;
> > +
> > +     ret = verify_frame(frame);
> > +     if (ret < 0) {
> > +             __sync_add_and_fetch(&pkts_fail_tx, 1);
> > +             return 0;
> > +     }
> > +
> > +     sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
> > +     if (!sample)
> > +             return 0;
>
> Sending this via a ringbuffer to userspace, will make it hard to
> correlate. (For AF_XDP it would help a little to add the TX-queue
> number, as this hook isn't queue bound but AF_XDP is).

Agreed. I was looking into putting the metadata back into the ring initially.
It's somewhat doable for zero-copy, but needs some special care for copy mode.
So I've decided not to over-complicate the series and land the
read-only hooks at least.
Does it sound fair? We can allow mutating metadata separately.

> > +
> > +     sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp);
> > +
>
> I were expecting to see, information being written into the metadata
> area of the frame, such that AF_XDP completion-queue handling can
> extract this obtained timestamp.

SG, will add!

> > +     bpf_ringbuf_submit(sample, 0);
> > +
> > +     return 0;
> > +}
> > +
> >   char _license[] SEC("license") = "GPL";
> > diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
> > index 938a729bd307..e410f2b95e64 100644
> > --- a/tools/testing/selftests/bpf/xdp_metadata.h
> > +++ b/tools/testing/selftests/bpf/xdp_metadata.h
> > @@ -18,3 +18,17 @@ struct xdp_meta {
> >               __s32 rx_hash_err;
> >       };
> >   };
> > +
> > +struct devtx_sample {
> > +     int timestamp_retval;
> > +     __u64 timestamp;
> > +};
> > +
> > +#define TX_META_LEN  8
>
> Very static design.
>
> > +
> > +struct xdp_tx_meta {
> > +     __u8 request_timestamp;
> > +     __u8 padding0;
> > +     __u16 padding1;
> > +     __u32 padding2;
> > +};
>
> padding2 could be a btf_id for creating a more flexible design.

Right, up to the programs on how to make it more flexible (same as
rx), will add more on that in your other reply.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-23 10:24       ` Jesper Dangaard Brouer
@ 2023-06-23 17:41         ` Stanislav Fomichev
  2023-06-24  9:02           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-23 17:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints

On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> > On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >>
> >>
> >> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >>
> >> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >>> and carry some TX metadata in the headroom. For copy mode, there
> >>> is no way currently to populate skb metadata.
> >>>
> >>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >>> (same as in RX case).
> >>
> >>   From looking at the code, this introduces a socket option for this TX
> >> metadata length (tx_metadata_len).
> >> This implies the same fixed TX metadata size is used for all packets.
> >> Maybe describe this in patch desc.
> >
> > I was planning to do a proper documentation page once we settle on all
> > the details (similar to the one we have for rx).
> >
> >> What is the plan for dealing with cases that doesn't populate same/full
> >> TX metadata size ?
> >
> > Do we need to support that? I was assuming that the TX layout would be
> > fixed between the userspace and BPF.
>
> I hope you don't mean fixed layout, as the whole point is adding
> flexibility and extensibility.

I do mean a fixed layout between the userspace (af_xdp) and devtx program.
At least fixed max size of the metadata. The userspace and the bpf
prog can then use this fixed space to implement some flexibility
(btf_ids, versioned structs, bitmasks, tlv, etc).
If we were to make the metalen vary per packet, we'd have to signal
its size per packet. Probably not worth it?

> > If every packet would have a different metadata length, it seems like
> > a nightmare to parse?
> >
>
> No parsing is really needed.  We can simply use BTF IDs and type cast in
> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> see [1] and [2].
>
> It seems we are talking slightly past each-other(?).  Let me rephrase
> and reframe the question, what is your *plan* for dealing with different
> *types* of TX metadata.  The different struct *types* will of-cause have
> different sizes, but that is okay as long as they fit into the maximum
> size set by this new socket option XDP_TX_METADATA_LEN.
> Thus, in principle I'm fine with XSK having configured a fixed headroom
> for metadata, but we need a plan for handling more than one type and
> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> data ("leftover" since last time this mem was used).

Yeah, I think the above correctly catches my expectation here. Some
headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
offloaded to the bpf program via btf_id/tlv/etc.

Regarding leftover metadata: can we assume the userspace will take
care of setting it up?

> With this kfunc approach, then things in-principle, becomes a contract
> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> components can as illustrated here [1]+[2] can coordinate based on local
> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> selftests examples don't use this and instead have a very static
> approach (that people will copy-paste).
>
> An unsolved problem with TX-hook is that it can also get packets from
> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> BPF-prog know if metadata is valid and intended to be used for e.g.
> requesting the timestamp? (imagine metadata size happen to match)

My assumption was the bpf program can do ifindex/netns filtering. Plus
maybe check that the meta_len is the one that's expected.
Will that be enough to handle XDP_REDIRECT?


> --Jesper
>
> BPF-prog API bpf_core_type_id_local:
>   - [1]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
>
> Userspace API btf__find_by_name_kind:
>   - [2]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23 16:32               ` Alexei Starovoitov
@ 2023-06-23 17:47                 ` Maryam Tahhan
  0 siblings, 0 replies; 72+ messages in thread
From: Maryam Tahhan @ 2023-06-23 17:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Network Development, Wiles, Keith, Jesper Brouer

On 23/06/2023 17:32, Alexei Starovoitov wrote:
> On Fri, Jun 23, 2023 at 3:16 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>> On 23/06/2023 03:35, Alexei Starovoitov wrote:
>>> Why do you think so?
>>> Who are those users?
>>> I see your proposal and thumbs up from onlookers.
>>> afaict there are zero users for rx side hw hints too.
>>>
>>>> the specs are
>>>> not public; things can change depending on fw version/etc/etc.
>>>> So the progs that touch raw descriptors are not the primary use-case.
>>>> (that was the tl;dr for rx part, seems like it applies here?)
>>>>
>>>> Let's maybe discuss that mlx5 example? Are you proposing to do
>>>> something along these lines?
>>>>
>>>> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
>>>> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
>>>>
>>>> If yes, I'm missing how we define the common kfuncs in this case. The
>>>> kfuncs need to have some common context. We're defining them with:
>>>> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
>>> I'm looking at xdp_metadata and wondering who's using it.
>>> I haven't seen a single bug report.
>>> No bugs means no one is using it. There is zero chance that we managed
>>> to implement it bug-free on the first try.
>>> So new tx side things look like a feature creep to me.
>>> rx side is far from proven to be useful for anything.
>>> Yet you want to add new things.
>>>
>> Hi folks
>>
>> We in CNDP (https://github.com/CloudNativeDataPlane/cndp) have been
> with TCP stack in user space over af_xdp...
>
>> looking to use xdp_metadata to relay receive side offloads from the NIC
>> to our AF_XDP applications. We see this is a key feature that is
>> essential for the viability of AF_XDP in the real world. We would love
>> to see something adopted for the TX side alongside what's on the RX
>> side. We don't want to waste cycles do everything in software when the
>> NIC HW supports many features that we need.
> Please specify "many features". If that means HW TSO to accelerate
> your TCP in user space, then sorry, but no.

Our TCP "stack" does NOT work without the kernel, it's a "lightweight 
data plane", the kernel is the control plane you may remember my 
presentation

at FOSDEM 23 in Brussels [1].


We need things as simple as TX check summing and I'm not sure about TSO 
yet (maybe in time). The Hybrid Networking Stack goal is not to compete

with the Kernel but rather provide a new approach to high performance 
Cloud Native networking which uses the Kernel + XDP and AF_XDP. We would

like to show how high performance networking use cases can use the in 
kernel fast path to achieve the performance they are looking for.


You can find more details about what we are trying to do here [2]


[1] https://fosdem.org/2023/schedule/event/hybrid_netstack/

[2] https://next.redhat.com/2022/12/07/the-hybrid-networking-stack/




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23  2:35           ` Alexei Starovoitov
  2023-06-23 10:16             ` Maryam Tahhan
  2023-06-23 17:24             ` Stanislav Fomichev
@ 2023-06-23 18:57             ` Donald Hunter
  2023-06-24  0:25               ` John Fastabend
  2 siblings, 1 reply; 72+ messages in thread
From: Donald Hunter @ 2023-06-23 18:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Jun 22, 2023 at 3:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> We want to provide common sane interfaces/abstractions via kfuncs.
>> That will make most BPF programs portable from mlx to brcm (for
>> example) without doing a rewrite.
>> We're also exposing raw (readonly) descriptors (via that get_ctx
>> helper) to the users who know what to do with them.
>> Most users don't know what to do with raw descriptors;
>
> Why do you think so?
> Who are those users?
> I see your proposal and thumbs up from onlookers.
> afaict there are zero users for rx side hw hints too.

We have customers in various sectors that want to use rx hw timestamps.

There are several use cases especially in Telco where they use DPDK
today and want to move to AF_XDP but they need to be able to benefit
from the hw capabilities of the NICs they purchase. Not having access to
hw offloads on rx and tx is a barrier to AF_XDP adoption.

The most notable gaps in AF_XDP are checksum offloads and multi buffer
support.

>> the specs are
>> not public; things can change depending on fw version/etc/etc.
>> So the progs that touch raw descriptors are not the primary use-case.
>> (that was the tl;dr for rx part, seems like it applies here?)
>>
>> Let's maybe discuss that mlx5 example? Are you proposing to do
>> something along these lines?
>>
>> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
>> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
>>
>> If yes, I'm missing how we define the common kfuncs in this case. The
>> kfuncs need to have some common context. We're defining them with:
>> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
>
> I'm looking at xdp_metadata and wondering who's using it.
> I haven't seen a single bug report.
> No bugs means no one is using it. There is zero chance that we managed
> to implement it bug-free on the first try.

Nobody is using xdp_metadata today, not because they don't want to, but
because there was no consensus for how to use it. We have internal POCs
that use xdp_metadata and are still trying to build the foundations
needed to support it consistently across different hardware. Jesper
Brouer proposed a way to describe xdp_metadata with BTF and it was
rejected. The current plan to use kfuncs for xdp hints is the most
recent attempt to find a solution.

> So new tx side things look like a feature creep to me.
> rx side is far from proven to be useful for anything.
> Yet you want to add new things.

We have telcos and large enterprises that either use DPDK or use
proprietary solutions for getting traffic to user space. They want to
move to AF_XDP but without at least RX and TX checksum offloads they are
paying a CPU tax for using AF_XDP. One customer is also waiting for
multi-buffer support to land before they can adopt AF_XDP.

So, no it's not feature creep, it's a set of required features to reach
minimum viable product to be able to move out of a lab and replace
legacy in production.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-21 17:02 ` [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
@ 2023-06-23 23:29   ` Vinicius Costa Gomes
  2023-06-26 17:00     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Vinicius Costa Gomes @ 2023-06-23 23:29 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, netdev

Stanislav Fomichev <sdf@google.com> writes:

> Have a software-based example for kfuncs to showcase how it
> can be used in the real devices and to have something to
> test against in the selftests.
>
> Both path (skb & xdp) are covered. Only the skb path is really
> tested though.
>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Not really related to this patch, but to how it would work with
different drivers/hardware.

In some of our hardware (the ones handled by igc/igb, for example), the
timestamp notification comes some time after the transmit completion
event.

From what I could gather, the idea would be for the driver to "hold" the
completion until the timestamp is ready and then signal the completion
of the frame. Is that right?


Cheers,
-- 
Vinicius

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-23 18:57             ` Donald Hunter
@ 2023-06-24  0:25               ` John Fastabend
  2023-06-24  2:52                 ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2023-06-24  0:25 UTC (permalink / raw)
  To: Donald Hunter, Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Network Development

Donald Hunter wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Jun 22, 2023 at 3:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >>
> >> We want to provide common sane interfaces/abstractions via kfuncs.
> >> That will make most BPF programs portable from mlx to brcm (for
> >> example) without doing a rewrite.
> >> We're also exposing raw (readonly) descriptors (via that get_ctx
> >> helper) to the users who know what to do with them.
> >> Most users don't know what to do with raw descriptors;
> >
> > Why do you think so?
> > Who are those users?
> > I see your proposal and thumbs up from onlookers.
> > afaict there are zero users for rx side hw hints too.
> 
> We have customers in various sectors that want to use rx hw timestamps.
> 
> There are several use cases especially in Telco where they use DPDK
> today and want to move to AF_XDP but they need to be able to benefit
> from the hw capabilities of the NICs they purchase. Not having access to
> hw offloads on rx and tx is a barrier to AF_XDP adoption.
> 
> The most notable gaps in AF_XDP are checksum offloads and multi buffer
> support.
> 
> >> the specs are
> >> not public; things can change depending on fw version/etc/etc.
> >> So the progs that touch raw descriptors are not the primary use-case.
> >> (that was the tl;dr for rx part, seems like it applies here?)
> >>
> >> Let's maybe discuss that mlx5 example? Are you proposing to do
> >> something along these lines?
> >>
> >> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
> >> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
> >>
> >> If yes, I'm missing how we define the common kfuncs in this case. The
> >> kfuncs need to have some common context. We're defining them with:
> >> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
> >
> > I'm looking at xdp_metadata and wondering who's using it.
> > I haven't seen a single bug report.
> > No bugs means no one is using it. There is zero chance that we managed
> > to implement it bug-free on the first try.
> 
> Nobody is using xdp_metadata today, not because they don't want to, but
> because there was no consensus for how to use it. We have internal POCs
> that use xdp_metadata and are still trying to build the foundations
> needed to support it consistently across different hardware. Jesper
> Brouer proposed a way to describe xdp_metadata with BTF and it was
> rejected. The current plan to use kfuncs for xdp hints is the most
> recent attempt to find a solution.

The hold up on my side is getting it in a LST kernel so we can get it
deployed in real environments. Although my plan is to just cast the
ctx to a kernel ctx and read the data out we need.

> 
> > So new tx side things look like a feature creep to me.
> > rx side is far from proven to be useful for anything.
> > Yet you want to add new things.

From my side if we just had a hook there and could cast the ctx to
something BTF type safe so we can simply read through the descriptor
I think that would sufficient for many use cases. To write into the
descriptor that might take more thought a writeable BTF flag?

> 
> We have telcos and large enterprises that either use DPDK or use
> proprietary solutions for getting traffic to user space. They want to
> move to AF_XDP but without at least RX and TX checksum offloads they are
> paying a CPU tax for using AF_XDP. One customer is also waiting for
> multi-buffer support to land before they can adopt AF_XDP.
> 
> So, no it's not feature creep, it's a set of required features to reach
> minimum viable product to be able to move out of a lab and replace
> legacy in production.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-24  0:25               ` John Fastabend
@ 2023-06-24  2:52                 ` Alexei Starovoitov
  2023-06-24 21:38                   ` Jakub Kicinski
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  2:52 UTC (permalink / raw)
  To: John Fastabend
  Cc: Donald Hunter, Stanislav Fomichev, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Network Development

On Fri, Jun 23, 2023 at 5:25 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Donald Hunter wrote:
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Thu, Jun 22, 2023 at 3:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >>
> > >> We want to provide common sane interfaces/abstractions via kfuncs.
> > >> That will make most BPF programs portable from mlx to brcm (for
> > >> example) without doing a rewrite.
> > >> We're also exposing raw (readonly) descriptors (via that get_ctx
> > >> helper) to the users who know what to do with them.
> > >> Most users don't know what to do with raw descriptors;
> > >
> > > Why do you think so?
> > > Who are those users?
> > > I see your proposal and thumbs up from onlookers.
> > > afaict there are zero users for rx side hw hints too.
> >
> > We have customers in various sectors that want to use rx hw timestamps.
> >
> > There are several use cases especially in Telco where they use DPDK
> > today and want to move to AF_XDP but they need to be able to benefit
> > from the hw capabilities of the NICs they purchase. Not having access to
> > hw offloads on rx and tx is a barrier to AF_XDP adoption.
> >
> > The most notable gaps in AF_XDP are checksum offloads and multi buffer
> > support.
> >
> > >> the specs are
> > >> not public; things can change depending on fw version/etc/etc.
> > >> So the progs that touch raw descriptors are not the primary use-case.
> > >> (that was the tl;dr for rx part, seems like it applies here?)
> > >>
> > >> Let's maybe discuss that mlx5 example? Are you proposing to do
> > >> something along these lines?
> > >>
> > >> void mlx5e_devtx_submit(struct mlx5e_tx_wqe *wqe);
> > >> void mlx5e_devtx_complete(struct mlx5_cqe64 *cqe);
> > >>
> > >> If yes, I'm missing how we define the common kfuncs in this case. The
> > >> kfuncs need to have some common context. We're defining them with:
> > >> bpf_devtx_<kfunc>(const struct devtx_frame *ctx);
> > >
> > > I'm looking at xdp_metadata and wondering who's using it.
> > > I haven't seen a single bug report.
> > > No bugs means no one is using it. There is zero chance that we managed
> > > to implement it bug-free on the first try.
> >
> > Nobody is using xdp_metadata today, not because they don't want to, but
> > because there was no consensus for how to use it. We have internal POCs
> > that use xdp_metadata and are still trying to build the foundations
> > needed to support it consistently across different hardware. Jesper
> > Brouer proposed a way to describe xdp_metadata with BTF and it was
> > rejected. The current plan to use kfuncs for xdp hints is the most
> > recent attempt to find a solution.
>
> The hold up on my side is getting it in a LST kernel so we can get it
> deployed in real environments. Although my plan is to just cast the
> ctx to a kernel ctx and read the data out we need.

+1

> >
> > > So new tx side things look like a feature creep to me.
> > > rx side is far from proven to be useful for anything.
> > > Yet you want to add new things.
>
> From my side if we just had a hook there and could cast the ctx to
> something BTF type safe so we can simply read through the descriptor
> I think that would sufficient for many use cases. To write into the
> descriptor that might take more thought a writeable BTF flag?

That's pretty much what I'm suggesting.
Add two driver specific __weak nop hook points where necessary
with few driver specific kfuncs.
Don't build generic infra when it's too early to generalize.

It would mean that bpf progs will be driver specific,
but when something novel like this is being proposed it's better
to start with minimal code change to core kernel (ideally none)
and when common things are found then generalize.

Sounds like Stanislav use case is timestamps in TX
while Donald's are checksums on RX, TX. These use cases are too different.
To make HW TX checksum compute checksum driven by AF_XDP
a lot more needs to be done than what Stan is proposing for timestamps.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-23 17:41         ` Stanislav Fomichev
@ 2023-06-24  9:02           ` Jesper Dangaard Brouer
  2023-06-26 17:00             ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2023-06-24  9:02 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints



On 23/06/2023 19.41, Stanislav Fomichev wrote:
> On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 22/06/2023 19.55, Stanislav Fomichev wrote:
>>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>>>>
>>>>
>>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>>>>
>>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>>>>> and carry some TX metadata in the headroom. For copy mode, there
>>>>> is no way currently to populate skb metadata.
>>>>>
>>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>>>>> (same as in RX case).
>>>>
>>>>    From looking at the code, this introduces a socket option for this TX
>>>> metadata length (tx_metadata_len).
>>>> This implies the same fixed TX metadata size is used for all packets.
>>>> Maybe describe this in patch desc.
>>>
>>> I was planning to do a proper documentation page once we settle on all
>>> the details (similar to the one we have for rx).
>>>
>>>> What is the plan for dealing with cases that doesn't populate same/full
>>>> TX metadata size ?
>>>
>>> Do we need to support that? I was assuming that the TX layout would be
>>> fixed between the userspace and BPF.
>>
>> I hope you don't mean fixed layout, as the whole point is adding
>> flexibility and extensibility.
> 
> I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> At least fixed max size of the metadata. The userspace and the bpf
> prog can then use this fixed space to implement some flexibility
> (btf_ids, versioned structs, bitmasks, tlv, etc).
> If we were to make the metalen vary per packet, we'd have to signal
> its size per packet. Probably not worth it?

Existing XDP metadata implementation also expand in a fixed/limited
sized memory area, but communicate size per packet in this area (also
for validation purposes).  BUT for AF_XDP we don't have room for another
pointer or size in the AF_XDP descriptor (see struct xdp_desc).


> 
>>> If every packet would have a different metadata length, it seems like
>>> a nightmare to parse?
>>>
>>
>> No parsing is really needed.  We can simply use BTF IDs and type cast in
>> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
>> see [1] and [2].
>>
>> It seems we are talking slightly past each-other(?).  Let me rephrase
>> and reframe the question, what is your *plan* for dealing with different
>> *types* of TX metadata.  The different struct *types* will of-cause have
>> different sizes, but that is okay as long as they fit into the maximum
>> size set by this new socket option XDP_TX_METADATA_LEN.
>> Thus, in principle I'm fine with XSK having configured a fixed headroom
>> for metadata, but we need a plan for handling more than one type and
>> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
>> data ("leftover" since last time this mem was used).
> 
> Yeah, I think the above correctly catches my expectation here. Some
> headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> offloaded to the bpf program via btf_id/tlv/etc.
> 
> Regarding leftover metadata: can we assume the userspace will take
> care of setting it up?
> 
>> With this kfunc approach, then things in-principle, becomes a contract
>> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
>> components can as illustrated here [1]+[2] can coordinate based on local
>> BPF-prog BTF IDs.  This approach works as-is today, but patchset
>> selftests examples don't use this and instead have a very static
>> approach (that people will copy-paste).
>>
>> An unsolved problem with TX-hook is that it can also get packets from
>> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
>> BPF-prog know if metadata is valid and intended to be used for e.g.
>> requesting the timestamp? (imagine metadata size happen to match)
> 
> My assumption was the bpf program can do ifindex/netns filtering. Plus
> maybe check that the meta_len is the one that's expected.
> Will that be enough to handle XDP_REDIRECT?

I don't think so, using the meta_len (+ ifindex/netns) to communicate
activation of TX hardware hints is too weak and not enough.  This is an
implicit API for BPF-programmers to understand and can lead to implicit
activation.

Think about what will happen for your AF_XDP send use-case.  For
performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
is fixed even if not used (and can contain garbage), it can by accident
create hard-to-debug situations.  As discussed with Magnus+Maryam
before, we found it was practical (and faster than mem zero) to extend
AF_XDP descriptor (see struct xdp_desc) with some flags to
indicate/communicate this frame comes with TX metadata hints.

>>
>> BPF-prog API bpf_core_type_id_local:
>>    - [1]
>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
>>
>> Userspace API btf__find_by_name_kind:
>>    - [2]
>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
>>
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-24  2:52                 ` Alexei Starovoitov
@ 2023-06-24 21:38                   ` Jakub Kicinski
  2023-06-25  1:12                     ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Jakub Kicinski @ 2023-06-24 21:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Donald Hunter, Stanislav Fomichev, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On Fri, 23 Jun 2023 19:52:03 -0700 Alexei Starovoitov wrote:
> That's pretty much what I'm suggesting.
> Add two driver specific __weak nop hook points where necessary
> with few driver specific kfuncs.
> Don't build generic infra when it's too early to generalize.
> 
> It would mean that bpf progs will be driver specific,
> but when something novel like this is being proposed it's better
> to start with minimal code change to core kernel (ideally none)
> and when common things are found then generalize.
> 
> Sounds like Stanislav use case is timestamps in TX
> while Donald's are checksums on RX, TX. These use cases are too different.
> To make HW TX checksum compute checksum driven by AF_XDP
> a lot more needs to be done than what Stan is proposing for timestamps.

I'd think HW TX csum is actually simpler than dealing with time,
will you change your mind if Stan posts Tx csum within a few days? :)

The set of offloads is barely changing, the lack of clarity 
on what is needed seems overstated. IMHO AF_XDP is getting no use
today, because everything remotely complex was stripped out of 
the implementation to get it merged. Aren't we hand waving the
complexity away simply because we don't want to deal with it?

These are the features today's devices support (rx/tx is a mirror):
 - L4 csum
 - segmentation
 - time reporting

Some may also support:
 - forwarding md tagging
 - Tx launch time
 - no fcs
Legacy / irrelevant:
 - VLAN insertion

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-24 21:38                   ` Jakub Kicinski
@ 2023-06-25  1:12                     ` Stanislav Fomichev
  2023-06-26 21:36                       ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-25  1:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On 06/24, Jakub Kicinski wrote:
> On Fri, 23 Jun 2023 19:52:03 -0700 Alexei Starovoitov wrote:
> > That's pretty much what I'm suggesting.
> > Add two driver specific __weak nop hook points where necessary
> > with few driver specific kfuncs.
> > Don't build generic infra when it's too early to generalize.
> > 
> > It would mean that bpf progs will be driver specific,
> > but when something novel like this is being proposed it's better
> > to start with minimal code change to core kernel (ideally none)
> > and when common things are found then generalize.
> > 
> > Sounds like Stanislav use case is timestamps in TX
> > while Donald's are checksums on RX, TX. These use cases are too different.
> > To make HW TX checksum compute checksum driven by AF_XDP
> > a lot more needs to be done than what Stan is proposing for timestamps.
> 
> I'd think HW TX csum is actually simpler than dealing with time,
> will you change your mind if Stan posts Tx csum within a few days? :)
> 
> The set of offloads is barely changing, the lack of clarity 
> on what is needed seems overstated. IMHO AF_XDP is getting no use
> today, because everything remotely complex was stripped out of 
> the implementation to get it merged. Aren't we hand waving the
> complexity away simply because we don't want to deal with it?
> 
> These are the features today's devices support (rx/tx is a mirror):
>  - L4 csum
>  - segmentation
>  - time reporting
> 
> Some may also support:
>  - forwarding md tagging
>  - Tx launch time
>  - no fcs
> Legacy / irrelevant:
>  - VLAN insertion

Right, the goal of the series is to lay out the foundation to support
AF_XDP offloads. I'm starting with tx timestamp because that's more
pressing. But, as I mentioned in another thread, we do have other
users that want to adopt AF_XDP, but due to missing tx offloads, they
aren't able to.

IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
timestamp to TX checksum offload, we don't need a lot:
- define another generic kfunc bpf_request_tx_csum(from, to)
- drivers implement it
- af_xdp users call this kfunc from devtx hook

We seem to be arguing over start-with-my-specific-narrow-use-case vs
start-with-generic implementation, so maybe time for the office hours?
I can try to present some cohesive plan of how we start with the framework
plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
commonality in these offloads, so I'm probably not communicating it
properly..

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-24  9:02           ` Jesper Dangaard Brouer
@ 2023-06-26 17:00             ` Stanislav Fomichev
  2023-06-28  8:09               ` Magnus Karlsson
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-26 17:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel,
	Karlsson, Magnus, xdp-hints

On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 23/06/2023 19.41, Stanislav Fomichev wrote:
> > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >>>>
> >>>>
> >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >>>>
> >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >>>>> and carry some TX metadata in the headroom. For copy mode, there
> >>>>> is no way currently to populate skb metadata.
> >>>>>
> >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >>>>> (same as in RX case).
> >>>>
> >>>>    From looking at the code, this introduces a socket option for this TX
> >>>> metadata length (tx_metadata_len).
> >>>> This implies the same fixed TX metadata size is used for all packets.
> >>>> Maybe describe this in patch desc.
> >>>
> >>> I was planning to do a proper documentation page once we settle on all
> >>> the details (similar to the one we have for rx).
> >>>
> >>>> What is the plan for dealing with cases that doesn't populate same/full
> >>>> TX metadata size ?
> >>>
> >>> Do we need to support that? I was assuming that the TX layout would be
> >>> fixed between the userspace and BPF.
> >>
> >> I hope you don't mean fixed layout, as the whole point is adding
> >> flexibility and extensibility.
> >
> > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> > At least fixed max size of the metadata. The userspace and the bpf
> > prog can then use this fixed space to implement some flexibility
> > (btf_ids, versioned structs, bitmasks, tlv, etc).
> > If we were to make the metalen vary per packet, we'd have to signal
> > its size per packet. Probably not worth it?
>
> Existing XDP metadata implementation also expand in a fixed/limited
> sized memory area, but communicate size per packet in this area (also
> for validation purposes).  BUT for AF_XDP we don't have room for another
> pointer or size in the AF_XDP descriptor (see struct xdp_desc).
>
>
> >
> >>> If every packet would have a different metadata length, it seems like
> >>> a nightmare to parse?
> >>>
> >>
> >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> >> see [1] and [2].
> >>
> >> It seems we are talking slightly past each-other(?).  Let me rephrase
> >> and reframe the question, what is your *plan* for dealing with different
> >> *types* of TX metadata.  The different struct *types* will of-cause have
> >> different sizes, but that is okay as long as they fit into the maximum
> >> size set by this new socket option XDP_TX_METADATA_LEN.
> >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> >> for metadata, but we need a plan for handling more than one type and
> >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> >> data ("leftover" since last time this mem was used).
> >
> > Yeah, I think the above correctly catches my expectation here. Some
> > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> > offloaded to the bpf program via btf_id/tlv/etc.
> >
> > Regarding leftover metadata: can we assume the userspace will take
> > care of setting it up?
> >
> >> With this kfunc approach, then things in-principle, becomes a contract
> >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> >> components can as illustrated here [1]+[2] can coordinate based on local
> >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> >> selftests examples don't use this and instead have a very static
> >> approach (that people will copy-paste).
> >>
> >> An unsolved problem with TX-hook is that it can also get packets from
> >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> >> BPF-prog know if metadata is valid and intended to be used for e.g.
> >> requesting the timestamp? (imagine metadata size happen to match)
> >
> > My assumption was the bpf program can do ifindex/netns filtering. Plus
> > maybe check that the meta_len is the one that's expected.
> > Will that be enough to handle XDP_REDIRECT?
>
> I don't think so, using the meta_len (+ ifindex/netns) to communicate
> activation of TX hardware hints is too weak and not enough.  This is an
> implicit API for BPF-programmers to understand and can lead to implicit
> activation.
>
> Think about what will happen for your AF_XDP send use-case.  For
> performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> is fixed even if not used (and can contain garbage), it can by accident
> create hard-to-debug situations.  As discussed with Magnus+Maryam
> before, we found it was practical (and faster than mem zero) to extend
> AF_XDP descriptor (see struct xdp_desc) with some flags to
> indicate/communicate this frame comes with TX metadata hints.

What is that "if not used" situation? Can the metadata itself have
is_used bit? The userspace has to initialize at least that bit.
We can definitely add that extra "has_metadata" bit to the descriptor,
but I'm trying to understand whether we can do without it.


> >>
> >> BPF-prog API bpf_core_type_id_local:
> >>    - [1]
> >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
> >>
> >> Userspace API btf__find_by_name_kind:
> >>    - [2]
> >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
> >>
> >
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-23 23:29   ` Vinicius Costa Gomes
@ 2023-06-26 17:00     ` Stanislav Fomichev
  2023-06-26 22:00       ` Vinicius Costa Gomes
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-26 17:00 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Fri, Jun 23, 2023 at 4:29 PM Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > Have a software-based example for kfuncs to showcase how it
> > can be used in the real devices and to have something to
> > test against in the selftests.
> >
> > Both path (skb & xdp) are covered. Only the skb path is really
> > tested though.
> >
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
> Not really related to this patch, but to how it would work with
> different drivers/hardware.
>
> In some of our hardware (the ones handled by igc/igb, for example), the
> timestamp notification comes some time after the transmit completion
> event.
>
> From what I could gather, the idea would be for the driver to "hold" the
> completion until the timestamp is ready and then signal the completion
> of the frame. Is that right?

Yeah, that might be the option. Do you think it could work?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-25  1:12                     ` Stanislav Fomichev
@ 2023-06-26 21:36                       ` Stanislav Fomichev
  2023-06-26 22:37                         ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-26 21:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On 06/24, Stanislav Fomichev wrote:
> On 06/24, Jakub Kicinski wrote:
> > On Fri, 23 Jun 2023 19:52:03 -0700 Alexei Starovoitov wrote:
> > > That's pretty much what I'm suggesting.
> > > Add two driver specific __weak nop hook points where necessary
> > > with few driver specific kfuncs.
> > > Don't build generic infra when it's too early to generalize.
> > > 
> > > It would mean that bpf progs will be driver specific,
> > > but when something novel like this is being proposed it's better
> > > to start with minimal code change to core kernel (ideally none)
> > > and when common things are found then generalize.
> > > 
> > > Sounds like Stanislav use case is timestamps in TX
> > > while Donald's are checksums on RX, TX. These use cases are too different.
> > > To make HW TX checksum compute checksum driven by AF_XDP
> > > a lot more needs to be done than what Stan is proposing for timestamps.
> > 
> > I'd think HW TX csum is actually simpler than dealing with time,
> > will you change your mind if Stan posts Tx csum within a few days? :)
> > 
> > The set of offloads is barely changing, the lack of clarity 
> > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > today, because everything remotely complex was stripped out of 
> > the implementation to get it merged. Aren't we hand waving the
> > complexity away simply because we don't want to deal with it?
> > 
> > These are the features today's devices support (rx/tx is a mirror):
> >  - L4 csum
> >  - segmentation
> >  - time reporting
> > 
> > Some may also support:
> >  - forwarding md tagging
> >  - Tx launch time
> >  - no fcs
> > Legacy / irrelevant:
> >  - VLAN insertion
> 
> Right, the goal of the series is to lay out the foundation to support
> AF_XDP offloads. I'm starting with tx timestamp because that's more
> pressing. But, as I mentioned in another thread, we do have other
> users that want to adopt AF_XDP, but due to missing tx offloads, they
> aren't able to.
> 
> IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> timestamp to TX checksum offload, we don't need a lot:
> - define another generic kfunc bpf_request_tx_csum(from, to)
> - drivers implement it
> - af_xdp users call this kfunc from devtx hook
> 
> We seem to be arguing over start-with-my-specific-narrow-use-case vs
> start-with-generic implementation, so maybe time for the office hours?
> I can try to present some cohesive plan of how we start with the framework
> plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> commonality in these offloads, so I'm probably not communicating it
> properly..

Or, maybe a better suggestion: let me try to implement TX checksum
kfunc in the v3 (to show how to build on top this series).
Having code is better than doing slides :-D

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-26 17:00     ` Stanislav Fomichev
@ 2023-06-26 22:00       ` Vinicius Costa Gomes
  2023-06-26 23:29         ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Vinicius Costa Gomes @ 2023-06-26 22:00 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Fri, Jun 23, 2023 at 4:29 PM Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > Have a software-based example for kfuncs to showcase how it
>> > can be used in the real devices and to have something to
>> > test against in the selftests.
>> >
>> > Both path (skb & xdp) are covered. Only the skb path is really
>> > tested though.
>> >
>> > Cc: netdev@vger.kernel.org
>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>
>> Not really related to this patch, but to how it would work with
>> different drivers/hardware.
>>
>> In some of our hardware (the ones handled by igc/igb, for example), the
>> timestamp notification comes some time after the transmit completion
>> event.
>>
>> From what I could gather, the idea would be for the driver to "hold" the
>> completion until the timestamp is ready and then signal the completion
>> of the frame. Is that right?
>
> Yeah, that might be the option. Do you think it could work?
>

For the skb and XDP cases, yeah, just holding the completion for a while
seems like it's going to work.

XDP ZC looks more complicated to me, not sure if it's only a matter of
adding something like:

void xsk_tx_completed_one(struct xsk_buff_pool *pool, struct xdp_buff *xdp);

Or if more changes would be needed. I am trying to think about the case
that the user sent a single "timestamp" packet among a bunch of
"non-timestamp" packets.


Cheers,
-- 
Vinicius

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-26 21:36                       ` Stanislav Fomichev
@ 2023-06-26 22:37                         ` Alexei Starovoitov
  2023-06-26 23:29                           ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-26 22:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On Mon, Jun 26, 2023 at 2:36 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> > >
> > > I'd think HW TX csum is actually simpler than dealing with time,
> > > will you change your mind if Stan posts Tx csum within a few days? :)

Absolutely :) Happy to change my mind.

> > > The set of offloads is barely changing, the lack of clarity
> > > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > > today, because everything remotely complex was stripped out of
> > > the implementation to get it merged. Aren't we hand waving the
> > > complexity away simply because we don't want to deal with it?
> > >
> > > These are the features today's devices support (rx/tx is a mirror):
> > >  - L4 csum
> > >  - segmentation
> > >  - time reporting
> > >
> > > Some may also support:
> > >  - forwarding md tagging
> > >  - Tx launch time
> > >  - no fcs
> > > Legacy / irrelevant:
> > >  - VLAN insertion
> >
> > Right, the goal of the series is to lay out the foundation to support
> > AF_XDP offloads. I'm starting with tx timestamp because that's more
> > pressing. But, as I mentioned in another thread, we do have other
> > users that want to adopt AF_XDP, but due to missing tx offloads, they
> > aren't able to.
> >
> > IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> > timestamp to TX checksum offload, we don't need a lot:
> > - define another generic kfunc bpf_request_tx_csum(from, to)
> > - drivers implement it
> > - af_xdp users call this kfunc from devtx hook
> >
> > We seem to be arguing over start-with-my-specific-narrow-use-case vs
> > start-with-generic implementation, so maybe time for the office hours?
> > I can try to present some cohesive plan of how we start with the framework
> > plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> > commonality in these offloads, so I'm probably not communicating it
> > properly..
>
> Or, maybe a better suggestion: let me try to implement TX checksum
> kfunc in the v3 (to show how to build on top this series).
> Having code is better than doing slides :-D

That would certainly help :)
What I got out of your lsfmmbpf talk is that timestamp is your
main and only use case. tx checksum for af_xdp is the other use case,
but it's not yours, so you sort-of use it as an extra justification
for timestamp. Hence my negative reaction to 'generality'.
I think we'll have better results in designing an api
when we look at these two use cases independently.
And implement them in patches solving specifically timestamp
with normal skb traffic and tx checksum for af_xdp as two independent apis.
If it looks like we can extract a common framework out of them. Great.
But trying to generalize before truly addressing both cases
is likely to cripple both apis.

It doesn't have to be only two use cases.
I completely agree with Kuba that:
 - L4 csum
 - segmentation
 - time reporting
are universal HW NIC features and we need to have an api
that exposes these features in programmable way to bpf progs in the kernel
and through af_xdp to user space.
I mainly suggest addressing them one by one and look
for common code bits and api similarities later.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-26 22:37                         ` Alexei Starovoitov
@ 2023-06-26 23:29                           ` Stanislav Fomichev
  2023-06-27 13:35                             ` Toke Høiland-Jørgensen
  2023-06-27 21:43                             ` John Fastabend
  0 siblings, 2 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-26 23:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On Mon, Jun 26, 2023 at 3:37 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 26, 2023 at 2:36 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > > >
> > > > I'd think HW TX csum is actually simpler than dealing with time,
> > > > will you change your mind if Stan posts Tx csum within a few days? :)
>
> Absolutely :) Happy to change my mind.
>
> > > > The set of offloads is barely changing, the lack of clarity
> > > > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > > > today, because everything remotely complex was stripped out of
> > > > the implementation to get it merged. Aren't we hand waving the
> > > > complexity away simply because we don't want to deal with it?
> > > >
> > > > These are the features today's devices support (rx/tx is a mirror):
> > > >  - L4 csum
> > > >  - segmentation
> > > >  - time reporting
> > > >
> > > > Some may also support:
> > > >  - forwarding md tagging
> > > >  - Tx launch time
> > > >  - no fcs
> > > > Legacy / irrelevant:
> > > >  - VLAN insertion
> > >
> > > Right, the goal of the series is to lay out the foundation to support
> > > AF_XDP offloads. I'm starting with tx timestamp because that's more
> > > pressing. But, as I mentioned in another thread, we do have other
> > > users that want to adopt AF_XDP, but due to missing tx offloads, they
> > > aren't able to.
> > >
> > > IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> > > timestamp to TX checksum offload, we don't need a lot:
> > > - define another generic kfunc bpf_request_tx_csum(from, to)
> > > - drivers implement it
> > > - af_xdp users call this kfunc from devtx hook
> > >
> > > We seem to be arguing over start-with-my-specific-narrow-use-case vs
> > > start-with-generic implementation, so maybe time for the office hours?
> > > I can try to present some cohesive plan of how we start with the framework
> > > plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> > > commonality in these offloads, so I'm probably not communicating it
> > > properly..
> >
> > Or, maybe a better suggestion: let me try to implement TX checksum
> > kfunc in the v3 (to show how to build on top this series).
> > Having code is better than doing slides :-D
>
> That would certainly help :)
> What I got out of your lsfmmbpf talk is that timestamp is your
> main and only use case. tx checksum for af_xdp is the other use case,
> but it's not yours, so you sort-of use it as an extra justification
> for timestamp. Hence my negative reaction to 'generality'.
> I think we'll have better results in designing an api
> when we look at these two use cases independently.
> And implement them in patches solving specifically timestamp
> with normal skb traffic and tx checksum for af_xdp as two independent apis.
> If it looks like we can extract a common framework out of them. Great.
> But trying to generalize before truly addressing both cases
> is likely to cripple both apis.

I need timestamps for the af_xdp case and I don't really care about skb :-(
I brought skb into the picture mostly to cover John's cases.
So maybe let's drop the skb case for now and focus on af_xdp?
skb is convenient testing-wise though (with veth), but maybe I can
somehow carve-out af_xdp skbs only out of it..

Regarding timestamp vs checksum: timestamp is more pressing, but I do
have people around that want to use af_xdp but need multibuf + tx
offloads, so I was hoping to at least have a path for more tx offloads
after we're done with tx timestamp "offload"..

> It doesn't have to be only two use cases.
> I completely agree with Kuba that:
>  - L4 csum
>  - segmentation
>  - time reporting
> are universal HW NIC features and we need to have an api
> that exposes these features in programmable way to bpf progs in the kernel
> and through af_xdp to user space.
> I mainly suggest addressing them one by one and look
> for common code bits and api similarities later.

Ack, let me see if I can fit tx csum into the picture. I still feel
like we need these dev-bound tracing programs if we want to trigger
kfuncs safely, but maybe we can simplify further..

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-26 22:00       ` Vinicius Costa Gomes
@ 2023-06-26 23:29         ` Stanislav Fomichev
  2023-06-27  1:38           ` Vinicius Costa Gomes
  0 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-26 23:29 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

On Mon, Jun 26, 2023 at 3:00 PM Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Fri, Jun 23, 2023 at 4:29 PM Vinicius Costa Gomes
> > <vinicius.gomes@intel.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > Have a software-based example for kfuncs to showcase how it
> >> > can be used in the real devices and to have something to
> >> > test against in the selftests.
> >> >
> >> > Both path (skb & xdp) are covered. Only the skb path is really
> >> > tested though.
> >> >
> >> > Cc: netdev@vger.kernel.org
> >> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >>
> >> Not really related to this patch, but to how it would work with
> >> different drivers/hardware.
> >>
> >> In some of our hardware (the ones handled by igc/igb, for example), the
> >> timestamp notification comes some time after the transmit completion
> >> event.
> >>
> >> From what I could gather, the idea would be for the driver to "hold" the
> >> completion until the timestamp is ready and then signal the completion
> >> of the frame. Is that right?
> >
> > Yeah, that might be the option. Do you think it could work?
> >
>
> For the skb and XDP cases, yeah, just holding the completion for a while
> seems like it's going to work.
>
> XDP ZC looks more complicated to me, not sure if it's only a matter of
> adding something like:

[..]

> void xsk_tx_completed_one(struct xsk_buff_pool *pool, struct xdp_buff *xdp);
>
> Or if more changes would be needed. I am trying to think about the case
> that the user sent a single "timestamp" packet among a bunch of
> "non-timestamp" packets.

Since you're passing xdp_buff as an argument I'm assuming that is
suggesting out-of-order completions?
The completion queue is a single index, we can't do ooo stuff.
So you'd have to hold a bunch of packets until you receive the
timestamp completion; after this event, you can complete the whole
batch (1 packet waiting for the timestamp + a bunch that have been
transmitted afterwards but were still unacknowleged in the queue).

(lmk if I've misinterpreted)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs
  2023-06-26 23:29         ` Stanislav Fomichev
@ 2023-06-27  1:38           ` Vinicius Costa Gomes
  0 siblings, 0 replies; 72+ messages in thread
From: Vinicius Costa Gomes @ 2023-06-27  1:38 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Mon, Jun 26, 2023 at 3:00 PM Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Fri, Jun 23, 2023 at 4:29 PM Vinicius Costa Gomes
>> > <vinicius.gomes@intel.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > Have a software-based example for kfuncs to showcase how it
>> >> > can be used in the real devices and to have something to
>> >> > test against in the selftests.
>> >> >
>> >> > Both path (skb & xdp) are covered. Only the skb path is really
>> >> > tested though.
>> >> >
>> >> > Cc: netdev@vger.kernel.org
>> >> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> >>
>> >> Not really related to this patch, but to how it would work with
>> >> different drivers/hardware.
>> >>
>> >> In some of our hardware (the ones handled by igc/igb, for example), the
>> >> timestamp notification comes some time after the transmit completion
>> >> event.
>> >>
>> >> From what I could gather, the idea would be for the driver to "hold" the
>> >> completion until the timestamp is ready and then signal the completion
>> >> of the frame. Is that right?
>> >
>> > Yeah, that might be the option. Do you think it could work?
>> >
>>
>> For the skb and XDP cases, yeah, just holding the completion for a while
>> seems like it's going to work.
>>
>> XDP ZC looks more complicated to me, not sure if it's only a matter of
>> adding something like:
>
> [..]
>
>> void xsk_tx_completed_one(struct xsk_buff_pool *pool, struct xdp_buff *xdp);
>>
>> Or if more changes would be needed. I am trying to think about the case
>> that the user sent a single "timestamp" packet among a bunch of
>> "non-timestamp" packets.
>
> Since you're passing xdp_buff as an argument I'm assuming that is
> suggesting out-of-order completions?
> The completion queue is a single index, we can't do ooo stuff.
> So you'd have to hold a bunch of packets until you receive the
> timestamp completion; after this event, you can complete the whole
> batch (1 packet waiting for the timestamp + a bunch that have been
> transmitted afterwards but were still unacknowleged in the queue).
>
> (lmk if I've misinterpreted)

Not at all, it was me that wasn't aware that out-of-order completions
were out of the picture.

So, yeah, what you are proposing, accumulating the pending completions
while there's a pending timestamp request, seems the only way to go.

The logic seems simple enough, but the fact that the "timestamp ready"
interrupt is not associated with any queue seems that it will make
things a bit "interesting" to get it right :-) 

But I don't have any better suggestions.


Thank you,
-- 
Vinicius

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-26 23:29                           ` Stanislav Fomichev
@ 2023-06-27 13:35                             ` Toke Høiland-Jørgensen
  2023-06-27 21:43                             ` John Fastabend
  1 sibling, 0 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-27 13:35 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov
  Cc: Jakub Kicinski, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

Stanislav Fomichev <sdf@google.com> writes:

> Ack, let me see if I can fit tx csum into the picture. I still feel
> like we need these dev-bound tracing programs if we want to trigger
> kfuncs safely, but maybe we can simplify further..

FWIW, I absolutely think we should go with "attach to ifindex + dev
bound" model instead of the "attach to driver kfunc and check ifindex in
BPF". The latter may be fine for BPF kernel devs, but it's a terrible
API for a dataplane, which IMO is what we're building here...

-Toke

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-26 23:29                           ` Stanislav Fomichev
  2023-06-27 13:35                             ` Toke Høiland-Jørgensen
@ 2023-06-27 21:43                             ` John Fastabend
  2023-06-27 22:56                               ` Stanislav Fomichev
  2023-06-28 18:52                               ` Jakub Kicinski
  1 sibling, 2 replies; 72+ messages in thread
From: John Fastabend @ 2023-06-27 21:43 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov
  Cc: Jakub Kicinski, John Fastabend, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

Stanislav Fomichev wrote:
> On Mon, Jun 26, 2023 at 3:37 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 26, 2023 at 2:36 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > > >
> > > > > I'd think HW TX csum is actually simpler than dealing with time,
> > > > > will you change your mind if Stan posts Tx csum within a few days? :)
> >
> > Absolutely :) Happy to change my mind.
> >
> > > > > The set of offloads is barely changing, the lack of clarity
> > > > > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > > > > today, because everything remotely complex was stripped out of
> > > > > the implementation to get it merged. Aren't we hand waving the
> > > > > complexity away simply because we don't want to deal with it?
> > > > >
> > > > > These are the features today's devices support (rx/tx is a mirror):
> > > > >  - L4 csum
> > > > >  - segmentation
> > > > >  - time reporting
> > > > >
> > > > > Some may also support:
> > > > >  - forwarding md tagging
> > > > >  - Tx launch time
> > > > >  - no fcs
> > > > > Legacy / irrelevant:
> > > > >  - VLAN insertion
> > > >
> > > > Right, the goal of the series is to lay out the foundation to support
> > > > AF_XDP offloads. I'm starting with tx timestamp because that's more
> > > > pressing. But, as I mentioned in another thread, we do have other
> > > > users that want to adopt AF_XDP, but due to missing tx offloads, they
> > > > aren't able to.
> > > >
> > > > IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> > > > timestamp to TX checksum offload, we don't need a lot:
> > > > - define another generic kfunc bpf_request_tx_csum(from, to)
> > > > - drivers implement it
> > > > - af_xdp users call this kfunc from devtx hook
> > > >
> > > > We seem to be arguing over start-with-my-specific-narrow-use-case vs
> > > > start-with-generic implementation, so maybe time for the office hours?
> > > > I can try to present some cohesive plan of how we start with the framework
> > > > plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> > > > commonality in these offloads, so I'm probably not communicating it
> > > > properly..
> > >
> > > Or, maybe a better suggestion: let me try to implement TX checksum
> > > kfunc in the v3 (to show how to build on top this series).
> > > Having code is better than doing slides :-D
> >
> > That would certainly help :)
> > What I got out of your lsfmmbpf talk is that timestamp is your
> > main and only use case. tx checksum for af_xdp is the other use case,
> > but it's not yours, so you sort-of use it as an extra justification
> > for timestamp. Hence my negative reaction to 'generality'.
> > I think we'll have better results in designing an api
> > when we look at these two use cases independently.
> > And implement them in patches solving specifically timestamp
> > with normal skb traffic and tx checksum for af_xdp as two independent apis.
> > If it looks like we can extract a common framework out of them. Great.
> > But trying to generalize before truly addressing both cases
> > is likely to cripple both apis.
> 
> I need timestamps for the af_xdp case and I don't really care about skb :-(
> I brought skb into the picture mostly to cover John's cases.
> So maybe let's drop the skb case for now and focus on af_xdp?
> skb is convenient testing-wise though (with veth), but maybe I can
> somehow carve-out af_xdp skbs only out of it..

I'm ok if your drop my use case but I read above and I seem to have a
slightly different opinion/approach in mind.

What I think would be the most straight-forward thing and most flexible
is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
don't spend any cycles building the metadata thing or have to even
worry about read kfuncs. The BPF program has read access to any
fields they need. And with the skb, xdp pointer we have the context
that created the descriptor and generate meaningful metrics.

I'm clearly sacrificing usability in some sense of a general user that
doesn't know about drivers, hardware and so on for performance,
flexibility and simplicity of implementation. In general I'm OK with
this. I have trouble understanding who the dev is that is coding at
this layer, but can't read kernel driver code or at least has a good
understanding of the hardware. We are deep in optimization and
performance world once we get to putting hooks in the driver we should
expect a certain amount of understanding before we let folks just plant
hooks here. Its going to be very easy to cause all sort of damage
even if we go to the other extreme and make a unified interface and
push the complexity onto kernel devs to maintain. I really believe
folks writing AF_XDP code (for DPDK or otherwise) have a really good
handle on networking, hardware, and drivers. I also expect that
AF_XDP users will mostly be abstracted away from AF_XDP internals
by DPDK and other libs or applications. My $.02 is these will be
primarily middle box type application built for special purpose l2/l3/l4
firewalling, DDOS, etc and telco protocols. Rant off.

But I can admit <drvname>_devtx_submit_xdp(<drvname>descriptor, ...)
is a bit raw. For one its going to require an object file per
driver/hardware and maybe worse multiple objects per driver/hw to
deal with hw descriptor changes with features. My $.02 is we could
solve this with better attach time linking. Now you have to at
compile time know what you are going to attach to and how to parse
the descriptor. If we had attach time linking we could dynamically
link to the hardware specific code at link time. And then its up
to us here or others who really understand the hardware to write
a ice_read_ts, mlx_read_tx but that can all just be normal BPF code.

Also I hand waved through a step where at attach time we have
some way to say link the thing that is associated with the
driver I'm about to attach to. As a first step a loader could
do this.

Its maybe more core work and less wrangling drivers then and
it means kfuncs become just blocks of BPF that anyone can
write. The big win in my mind is I don't need to know now
what I want tomorrow because I should have access. Also we push
the complexity and maintenance out of driver/kernel and into
BPF libs and users. Then we don't have to touch BPF core just
to add new things.

Last bit that complicates things is I need a way to also write
allowed values into the descriptor. We don't have anything that
can do this now. So maybe kfuncs for the write tstamp flags and
friends?

Anyways, my $.02.



> 
> Regarding timestamp vs checksum: timestamp is more pressing, but I do

One request would be to also include a driver that doesn't have
always on timestamps so some writer is needed. CSUM enabling
I'm interested to see what the signature looks like? On the
skb side we use the skb a fair amount to build the checksum
it seems so I guess AF_XDP needs to push down the csum_offset?
In the SKB case its less interesting I think becuase the
stack is already handling it.

> have people around that want to use af_xdp but need multibuf + tx
> offloads, so I was hoping to at least have a path for more tx offloads
> after we're done with tx timestamp "offload"..
> 
> > It doesn't have to be only two use cases.
> > I completely agree with Kuba that:
> >  - L4 csum
> >  - segmentation
> >  - time reporting
> > are universal HW NIC features and we need to have an api
> > that exposes these features in programmable way to bpf progs in the kernel
> > and through af_xdp to user space.
> > I mainly suggest addressing them one by one and look
> > for common code bits and api similarities later.
> 
> Ack, let me see if I can fit tx csum into the picture. I still feel
> like we need these dev-bound tracing programs if we want to trigger
> kfuncs safely, but maybe we can simplify further..

Its not clear to me how you get to a dev specific attach here
without complicating the hot path more. I think we need to
really be careful to not make hotpath more complex. Will
follow along for sure to see what gets created.

Thanks,
John

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-27 21:43                             ` John Fastabend
@ 2023-06-27 22:56                               ` Stanislav Fomichev
  2023-06-27 23:33                                 ` John Fastabend
  2023-06-28 18:52                               ` Jakub Kicinski
  1 sibling, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-27 22:56 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Jakub Kicinski, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On 06/27, John Fastabend wrote:
> Stanislav Fomichev wrote:
> > On Mon, Jun 26, 2023 at 3:37 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 26, 2023 at 2:36 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > > >
> > > > > > I'd think HW TX csum is actually simpler than dealing with time,
> > > > > > will you change your mind if Stan posts Tx csum within a few days? :)
> > >
> > > Absolutely :) Happy to change my mind.
> > >
> > > > > > The set of offloads is barely changing, the lack of clarity
> > > > > > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > > > > > today, because everything remotely complex was stripped out of
> > > > > > the implementation to get it merged. Aren't we hand waving the
> > > > > > complexity away simply because we don't want to deal with it?
> > > > > >
> > > > > > These are the features today's devices support (rx/tx is a mirror):
> > > > > >  - L4 csum
> > > > > >  - segmentation
> > > > > >  - time reporting
> > > > > >
> > > > > > Some may also support:
> > > > > >  - forwarding md tagging
> > > > > >  - Tx launch time
> > > > > >  - no fcs
> > > > > > Legacy / irrelevant:
> > > > > >  - VLAN insertion
> > > > >
> > > > > Right, the goal of the series is to lay out the foundation to support
> > > > > AF_XDP offloads. I'm starting with tx timestamp because that's more
> > > > > pressing. But, as I mentioned in another thread, we do have other
> > > > > users that want to adopt AF_XDP, but due to missing tx offloads, they
> > > > > aren't able to.
> > > > >
> > > > > IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> > > > > timestamp to TX checksum offload, we don't need a lot:
> > > > > - define another generic kfunc bpf_request_tx_csum(from, to)
> > > > > - drivers implement it
> > > > > - af_xdp users call this kfunc from devtx hook
> > > > >
> > > > > We seem to be arguing over start-with-my-specific-narrow-use-case vs
> > > > > start-with-generic implementation, so maybe time for the office hours?
> > > > > I can try to present some cohesive plan of how we start with the framework
> > > > > plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> > > > > commonality in these offloads, so I'm probably not communicating it
> > > > > properly..
> > > >
> > > > Or, maybe a better suggestion: let me try to implement TX checksum
> > > > kfunc in the v3 (to show how to build on top this series).
> > > > Having code is better than doing slides :-D
> > >
> > > That would certainly help :)
> > > What I got out of your lsfmmbpf talk is that timestamp is your
> > > main and only use case. tx checksum for af_xdp is the other use case,
> > > but it's not yours, so you sort-of use it as an extra justification
> > > for timestamp. Hence my negative reaction to 'generality'.
> > > I think we'll have better results in designing an api
> > > when we look at these two use cases independently.
> > > And implement them in patches solving specifically timestamp
> > > with normal skb traffic and tx checksum for af_xdp as two independent apis.
> > > If it looks like we can extract a common framework out of them. Great.
> > > But trying to generalize before truly addressing both cases
> > > is likely to cripple both apis.
> > 
> > I need timestamps for the af_xdp case and I don't really care about skb :-(
> > I brought skb into the picture mostly to cover John's cases.
> > So maybe let's drop the skb case for now and focus on af_xdp?
> > skb is convenient testing-wise though (with veth), but maybe I can
> > somehow carve-out af_xdp skbs only out of it..
> 
> I'm ok if your drop my use case but I read above and I seem to have a
> slightly different opinion/approach in mind.
> 
> What I think would be the most straight-forward thing and most flexible
> is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
> and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
> corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
> don't spend any cycles building the metadata thing or have to even
> worry about read kfuncs. The BPF program has read access to any
> fields they need. And with the skb, xdp pointer we have the context
> that created the descriptor and generate meaningful metrics.
> 
> I'm clearly sacrificing usability in some sense of a general user that
> doesn't know about drivers, hardware and so on for performance,
> flexibility and simplicity of implementation. In general I'm OK with
> this. I have trouble understanding who the dev is that is coding at
> this layer, but can't read kernel driver code or at least has a good
> understanding of the hardware. We are deep in optimization and
> performance world once we get to putting hooks in the driver we should
> expect a certain amount of understanding before we let folks just plant
> hooks here. Its going to be very easy to cause all sort of damage
> even if we go to the other extreme and make a unified interface and
> push the complexity onto kernel devs to maintain. I really believe
> folks writing AF_XDP code (for DPDK or otherwise) have a really good
> handle on networking, hardware, and drivers. I also expect that
> AF_XDP users will mostly be abstracted away from AF_XDP internals
> by DPDK and other libs or applications. My $.02 is these will be
> primarily middle box type application built for special purpose l2/l3/l4
> firewalling, DDOS, etc and telco protocols. Rant off.
> 
> But I can admit <drvname>_devtx_submit_xdp(<drvname>descriptor, ...)
> is a bit raw. For one its going to require an object file per
> driver/hardware and maybe worse multiple objects per driver/hw to
> deal with hw descriptor changes with features. My $.02 is we could
> solve this with better attach time linking. Now you have to at
> compile time know what you are going to attach to and how to parse
> the descriptor. If we had attach time linking we could dynamically
> link to the hardware specific code at link time. And then its up
> to us here or others who really understand the hardware to write
> a ice_read_ts, mlx_read_tx but that can all just be normal BPF code.
> 
> Also I hand waved through a step where at attach time we have
> some way to say link the thing that is associated with the
> driver I'm about to attach to. As a first step a loader could
> do this.
> 
> Its maybe more core work and less wrangling drivers then and
> it means kfuncs become just blocks of BPF that anyone can
> write. The big win in my mind is I don't need to know now
> what I want tomorrow because I should have access. Also we push
> the complexity and maintenance out of driver/kernel and into
> BPF libs and users. Then we don't have to touch BPF core just
> to add new things.
> 
> Last bit that complicates things is I need a way to also write
> allowed values into the descriptor. We don't have anything that
> can do this now. So maybe kfuncs for the write tstamp flags and
> friends?

And in this case, the kfuncs would be non-generic and look something
like the following?

  bpf_devtx_<drvname>_request_timestamp(<drivname>descriptor, xdp_frame)

I feel like this can work, yeah. The interface is raw, but maybe
you are both right in assuming that different nics will
expose different capabilities and we shouldn't try to pretend
there is some commonality. I'll try to explore that idea more with
the tx-csum..

Worst case, userspace can do:

int bpf_devtx_request_timestamp(some-generic-prog-abstraction-to-pass-ctx)
{
#ifdef DEVICE_MLX5
  return mlx5_request_timestamp(...);
#elif DEVICE_VETH
  return veth_request-timestamp(...);
#else
  ...
#endif
}

One thing we should probably spend more time in this case is documenting
it all. Or maybe having some DEVTX_XXX macros for those kfunc definitions
and hooks to make them discoverable.

But yeah, I see the appeal. The only ugly part with this all is that
my xdp_hw_metadata would not be portable at all :-/ But it might be
a good place/reason to actually figure out how to do it :-)

> Anyways, my $.02.
> 
> 
> 
> > 
> > Regarding timestamp vs checksum: timestamp is more pressing, but I do
> 
> One request would be to also include a driver that doesn't have
> always on timestamps so some writer is needed. CSUM enabling
> I'm interested to see what the signature looks like? On the
> skb side we use the skb a fair amount to build the checksum
> it seems so I guess AF_XDP needs to push down the csum_offset?
> In the SKB case its less interesting I think becuase the
> stack is already handling it.

I need access to your lab :-p

Regarding the signature, csum_start + csum_offset maybe? As we have in
skb?

Although, from a quick grepping, I see some of the drivers have only
a fixed set of tx checksum configurations they support:

switch (skb->csum_offset) {
case offsetof(struct tcphdr, check):
	tx_desc->flags |= DO_TCP_IP_TX_CSUM_AT_FIXED_OFFSET;
	break;
default:
	/* sw fallback */
}

So maybe that's another argument in favor of not doing a generic
layer and just expose whatever HW has in a non-portable way...
(otoh, still accepting csum_offset+start + doing that switch inside
is probably an ok common interface)

> > have people around that want to use af_xdp but need multibuf + tx
> > offloads, so I was hoping to at least have a path for more tx offloads
> > after we're done with tx timestamp "offload"..
> > 
> > > It doesn't have to be only two use cases.
> > > I completely agree with Kuba that:
> > >  - L4 csum
> > >  - segmentation
> > >  - time reporting
> > > are universal HW NIC features and we need to have an api
> > > that exposes these features in programmable way to bpf progs in the kernel
> > > and through af_xdp to user space.
> > > I mainly suggest addressing them one by one and look
> > > for common code bits and api similarities later.
> > 
> > Ack, let me see if I can fit tx csum into the picture. I still feel
> > like we need these dev-bound tracing programs if we want to trigger
> > kfuncs safely, but maybe we can simplify further..
> 
> Its not clear to me how you get to a dev specific attach here
> without complicating the hot path more. I think we need to
> really be careful to not make hotpath more complex. Will
> follow along for sure to see what gets created.

Agreed. I've yet to test it with some real HW (still in the process of
trying to get back my lab configuration which was changed recently) :-(

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-27 22:56                               ` Stanislav Fomichev
@ 2023-06-27 23:33                                 ` John Fastabend
  2023-06-27 23:50                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2023-06-27 23:33 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: Alexei Starovoitov, Jakub Kicinski, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

Stanislav Fomichev wrote:
> On 06/27, John Fastabend wrote:
> > Stanislav Fomichev wrote:
> > > On Mon, Jun 26, 2023 at 3:37 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 26, 2023 at 2:36 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > > >
> > > > > > > I'd think HW TX csum is actually simpler than dealing with time,
> > > > > > > will you change your mind if Stan posts Tx csum within a few days? :)
> > > >
> > > > Absolutely :) Happy to change my mind.
> > > >
> > > > > > > The set of offloads is barely changing, the lack of clarity
> > > > > > > on what is needed seems overstated. IMHO AF_XDP is getting no use
> > > > > > > today, because everything remotely complex was stripped out of
> > > > > > > the implementation to get it merged. Aren't we hand waving the
> > > > > > > complexity away simply because we don't want to deal with it?
> > > > > > >
> > > > > > > These are the features today's devices support (rx/tx is a mirror):
> > > > > > >  - L4 csum
> > > > > > >  - segmentation
> > > > > > >  - time reporting
> > > > > > >
> > > > > > > Some may also support:
> > > > > > >  - forwarding md tagging
> > > > > > >  - Tx launch time
> > > > > > >  - no fcs
> > > > > > > Legacy / irrelevant:
> > > > > > >  - VLAN insertion
> > > > > >
> > > > > > Right, the goal of the series is to lay out the foundation to support
> > > > > > AF_XDP offloads. I'm starting with tx timestamp because that's more
> > > > > > pressing. But, as I mentioned in another thread, we do have other
> > > > > > users that want to adopt AF_XDP, but due to missing tx offloads, they
> > > > > > aren't able to.
> > > > > >
> > > > > > IMHO, with pre-tx/post-tx hooks, it's pretty easy to go from TX
> > > > > > timestamp to TX checksum offload, we don't need a lot:
> > > > > > - define another generic kfunc bpf_request_tx_csum(from, to)
> > > > > > - drivers implement it
> > > > > > - af_xdp users call this kfunc from devtx hook
> > > > > >
> > > > > > We seem to be arguing over start-with-my-specific-narrow-use-case vs
> > > > > > start-with-generic implementation, so maybe time for the office hours?
> > > > > > I can try to present some cohesive plan of how we start with the framework
> > > > > > plus tx-timestamp and expand with tx-checksum/etc. There is a lot of
> > > > > > commonality in these offloads, so I'm probably not communicating it
> > > > > > properly..
> > > > >
> > > > > Or, maybe a better suggestion: let me try to implement TX checksum
> > > > > kfunc in the v3 (to show how to build on top this series).
> > > > > Having code is better than doing slides :-D
> > > >
> > > > That would certainly help :)
> > > > What I got out of your lsfmmbpf talk is that timestamp is your
> > > > main and only use case. tx checksum for af_xdp is the other use case,
> > > > but it's not yours, so you sort-of use it as an extra justification
> > > > for timestamp. Hence my negative reaction to 'generality'.
> > > > I think we'll have better results in designing an api
> > > > when we look at these two use cases independently.
> > > > And implement them in patches solving specifically timestamp
> > > > with normal skb traffic and tx checksum for af_xdp as two independent apis.
> > > > If it looks like we can extract a common framework out of them. Great.
> > > > But trying to generalize before truly addressing both cases
> > > > is likely to cripple both apis.
> > > 
> > > I need timestamps for the af_xdp case and I don't really care about skb :-(
> > > I brought skb into the picture mostly to cover John's cases.
> > > So maybe let's drop the skb case for now and focus on af_xdp?
> > > skb is convenient testing-wise though (with veth), but maybe I can
> > > somehow carve-out af_xdp skbs only out of it..
> > 
> > I'm ok if your drop my use case but I read above and I seem to have a
> > slightly different opinion/approach in mind.
> > 
> > What I think would be the most straight-forward thing and most flexible
> > is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
> > and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
> > corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
> > don't spend any cycles building the metadata thing or have to even
> > worry about read kfuncs. The BPF program has read access to any
> > fields they need. And with the skb, xdp pointer we have the context
> > that created the descriptor and generate meaningful metrics.
> > 
> > I'm clearly sacrificing usability in some sense of a general user that
> > doesn't know about drivers, hardware and so on for performance,
> > flexibility and simplicity of implementation. In general I'm OK with
> > this. I have trouble understanding who the dev is that is coding at
> > this layer, but can't read kernel driver code or at least has a good
> > understanding of the hardware. We are deep in optimization and
> > performance world once we get to putting hooks in the driver we should
> > expect a certain amount of understanding before we let folks just plant
> > hooks here. Its going to be very easy to cause all sort of damage
> > even if we go to the other extreme and make a unified interface and
> > push the complexity onto kernel devs to maintain. I really believe
> > folks writing AF_XDP code (for DPDK or otherwise) have a really good
> > handle on networking, hardware, and drivers. I also expect that
> > AF_XDP users will mostly be abstracted away from AF_XDP internals
> > by DPDK and other libs or applications. My $.02 is these will be
> > primarily middle box type application built for special purpose l2/l3/l4
> > firewalling, DDOS, etc and telco protocols. Rant off.
> > 
> > But I can admit <drvname>_devtx_submit_xdp(<drvname>descriptor, ...)
> > is a bit raw. For one its going to require an object file per
> > driver/hardware and maybe worse multiple objects per driver/hw to
> > deal with hw descriptor changes with features. My $.02 is we could
> > solve this with better attach time linking. Now you have to at
> > compile time know what you are going to attach to and how to parse
> > the descriptor. If we had attach time linking we could dynamically
> > link to the hardware specific code at link time. And then its up
> > to us here or others who really understand the hardware to write
> > a ice_read_ts, mlx_read_tx but that can all just be normal BPF code.
> > 
> > Also I hand waved through a step where at attach time we have
> > some way to say link the thing that is associated with the
> > driver I'm about to attach to. As a first step a loader could
> > do this.
> > 
> > Its maybe more core work and less wrangling drivers then and
> > it means kfuncs become just blocks of BPF that anyone can
> > write. The big win in my mind is I don't need to know now
> > what I want tomorrow because I should have access. Also we push
> > the complexity and maintenance out of driver/kernel and into
> > BPF libs and users. Then we don't have to touch BPF core just
> > to add new things.
> > 
> > Last bit that complicates things is I need a way to also write
> > allowed values into the descriptor. We don't have anything that
> > can do this now. So maybe kfuncs for the write tstamp flags and
> > friends?
> 
> And in this case, the kfuncs would be non-generic and look something
> like the following?
> 
>   bpf_devtx_<drvname>_request_timestamp(<drivname>descriptor, xdp_frame)

Yeah, for writing into the descriptor I couldn't think up anythign more
clever. Anyways we will want to JIT that into insns if we are touching
every pkt.

> 
> I feel like this can work, yeah. The interface is raw, but maybe
> you are both right in assuming that different nics will
> expose different capabilities and we shouldn't try to pretend
> there is some commonality. I'll try to explore that idea more with
> the tx-csum..

I think it should be handle at the next level up with BPF libraries
and at BPF community level by building abstractions on top of BPF
instead of trying to bury them into where they feel less natural
to me. A project like Tetragon, DPDK, ... would abstract these behind
their APIs and users writing a AF_XDP widget on top of these would
probably never need to know about it. Certainly we wont have end
users of Tetragon have to care about driver they are attaching to.

> 
> Worst case, userspace can do:
> 
> int bpf_devtx_request_timestamp(some-generic-prog-abstraction-to-pass-ctx)
> {
> #ifdef DEVICE_MLX5
>   return mlx5_request_timestamp(...);
> #elif DEVICE_VETH
>   return veth_request-timestamp(...);
> #else
>   ...
> #endif
> }

Yeah I think so and then carry a couple different object files
for the environment around. We do this already for some things.
Its not ideal but it works. I think a good end goal would be

 int bpf_devtx_request_timestamp(...)
 {
	set_ts = dlsym( dl_handle, request-timestamp);
	return set_ts(...) 
 }

Then we could at attach time take that dlsym and rewrite it.

> 
> One thing we should probably spend more time in this case is documenting
> it all. Or maybe having some DEVTX_XXX macros for those kfunc definitions
> and hooks to make them discoverable.

More docs would be great for sure.

> 
> But yeah, I see the appeal. The only ugly part with this all is that
> my xdp_hw_metadata would not be portable at all :-/ But it might be
> a good place/reason to actually figure out how to do it :-)

Agree you lose portability at the low level BPF, but we could
get it back with BPF libs. I think its the basic tradeoff here
is I want to keep the interface raw as possible and push the
details into the BPF program/loader. So you lose the low level
portability, but I think we get a lot for it and can get it
back if a BPF community builds the libs and tooling to solve
the problem at the next layer up. My thinking is kernel dev
intuition is to solve it in the kernel, but we take on a lot
of complexity to do this when we could push it out to userspace
where its easier to manage versioning and the complexity.

> 
> > Anyways, my $.02.
> > 
> > 
> > 
> > > 
> > > Regarding timestamp vs checksum: timestamp is more pressing, but I do
> > 
> > One request would be to also include a driver that doesn't have
> > always on timestamps so some writer is needed. CSUM enabling
> > I'm interested to see what the signature looks like? On the
> > skb side we use the skb a fair amount to build the checksum
> > it seems so I guess AF_XDP needs to push down the csum_offset?
> > In the SKB case its less interesting I think becuase the
> > stack is already handling it.
> 
> I need access to your lab :-p

I can likely try to at least prototype for some nic.

> 
> Regarding the signature, csum_start + csum_offset maybe? As we have in
> skb?
> 
> Although, from a quick grepping, I see some of the drivers have only
> a fixed set of tx checksum configurations they support:
> 
> switch (skb->csum_offset) {
> case offsetof(struct tcphdr, check):
> 	tx_desc->flags |= DO_TCP_IP_TX_CSUM_AT_FIXED_OFFSET;
> 	break;
> default:
> 	/* sw fallback */
> }

Yeah because they might not want or no how to do other protocols.

> 
> So maybe that's another argument in favor of not doing a generic
> layer and just expose whatever HW has in a non-portable way...
> (otoh, still accepting csum_offset+start + doing that switch inside
> is probably an ok common interface)

That is what I was thinking.

> 
> > > have people around that want to use af_xdp but need multibuf + tx
> > > offloads, so I was hoping to at least have a path for more tx offloads
> > > after we're done with tx timestamp "offload"..
> > > 
> > > > It doesn't have to be only two use cases.
> > > > I completely agree with Kuba that:
> > > >  - L4 csum
> > > >  - segmentation
> > > >  - time reporting
> > > > are universal HW NIC features and we need to have an api
> > > > that exposes these features in programmable way to bpf progs in the kernel
> > > > and through af_xdp to user space.
> > > > I mainly suggest addressing them one by one and look
> > > > for common code bits and api similarities later.
> > > 
> > > Ack, let me see if I can fit tx csum into the picture. I still feel
> > > like we need these dev-bound tracing programs if we want to trigger
> > > kfuncs safely, but maybe we can simplify further..
> > 
> > Its not clear to me how you get to a dev specific attach here
> > without complicating the hot path more. I think we need to
> > really be careful to not make hotpath more complex. Will
> > follow along for sure to see what gets created.
> 
> Agreed. I've yet to test it with some real HW (still in the process of
> trying to get back my lab configuration which was changed recently) :-(

:( Another question would be is there a use case to have netdev
specific programs? That might drive the need to attach to a
specific netdev. I can't think of a use case off-hand I guess
Toke might have something in mind based on his reply earlier.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-27 23:33                                 ` John Fastabend
@ 2023-06-27 23:50                                   ` Alexei Starovoitov
  0 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov @ 2023-06-27 23:50 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stanislav Fomichev, Jakub Kicinski, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On Tue, Jun 27, 2023 at 4:33 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Yeah I think so and then carry a couple different object files
> for the environment around. We do this already for some things.
> Its not ideal but it works. I think a good end goal would be
>
>  int bpf_devtx_request_timestamp(...)
>  {
>         set_ts = dlsym( dl_handle, request-timestamp);
>         return set_ts(...)
>  }
>
> Then we could at attach time take that dlsym and rewrite it.

Sounds like we need polymorphic kfuncs.
Same kfunc name called by bpf prog, but implementation would
change depending on {attach_btf_id, prog_type, etc}.
The existing bpf_xdp_metadata_rx_hash is almost that.
Except it's driver specific.
We should probably generalize this mechanism.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-26 17:00             ` Stanislav Fomichev
@ 2023-06-28  8:09               ` Magnus Karlsson
  2023-06-28 18:49                 ` Stanislav Fomichev
  0 siblings, 1 reply; 72+ messages in thread
From: Magnus Karlsson @ 2023-06-28  8:09 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Björn Töpel, Karlsson, Magnus, xdp-hints

On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> >
> >
> >
> > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> > > <jbrouer@redhat.com> wrote:
> > >>
> > >>
> > >>
> > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> > >>>>
> > >>>>
> > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> > >>>>
> > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> > >>>>> is no way currently to populate skb metadata.
> > >>>>>
> > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> > >>>>> (same as in RX case).
> > >>>>
> > >>>>    From looking at the code, this introduces a socket option for this TX
> > >>>> metadata length (tx_metadata_len).
> > >>>> This implies the same fixed TX metadata size is used for all packets.
> > >>>> Maybe describe this in patch desc.
> > >>>
> > >>> I was planning to do a proper documentation page once we settle on all
> > >>> the details (similar to the one we have for rx).
> > >>>
> > >>>> What is the plan for dealing with cases that doesn't populate same/full
> > >>>> TX metadata size ?
> > >>>
> > >>> Do we need to support that? I was assuming that the TX layout would be
> > >>> fixed between the userspace and BPF.
> > >>
> > >> I hope you don't mean fixed layout, as the whole point is adding
> > >> flexibility and extensibility.
> > >
> > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> > > At least fixed max size of the metadata. The userspace and the bpf
> > > prog can then use this fixed space to implement some flexibility
> > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> > > If we were to make the metalen vary per packet, we'd have to signal
> > > its size per packet. Probably not worth it?
> >
> > Existing XDP metadata implementation also expand in a fixed/limited
> > sized memory area, but communicate size per packet in this area (also
> > for validation purposes).  BUT for AF_XDP we don't have room for another
> > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> >
> >
> > >
> > >>> If every packet would have a different metadata length, it seems like
> > >>> a nightmare to parse?
> > >>>
> > >>
> > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> > >> see [1] and [2].
> > >>
> > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> > >> and reframe the question, what is your *plan* for dealing with different
> > >> *types* of TX metadata.  The different struct *types* will of-cause have
> > >> different sizes, but that is okay as long as they fit into the maximum
> > >> size set by this new socket option XDP_TX_METADATA_LEN.
> > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> > >> for metadata, but we need a plan for handling more than one type and
> > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> > >> data ("leftover" since last time this mem was used).
> > >
> > > Yeah, I think the above correctly catches my expectation here. Some
> > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> > > offloaded to the bpf program via btf_id/tlv/etc.
> > >
> > > Regarding leftover metadata: can we assume the userspace will take
> > > care of setting it up?
> > >
> > >> With this kfunc approach, then things in-principle, becomes a contract
> > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> > >> components can as illustrated here [1]+[2] can coordinate based on local
> > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> > >> selftests examples don't use this and instead have a very static
> > >> approach (that people will copy-paste).
> > >>
> > >> An unsolved problem with TX-hook is that it can also get packets from
> > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> > >> requesting the timestamp? (imagine metadata size happen to match)
> > >
> > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> > > maybe check that the meta_len is the one that's expected.
> > > Will that be enough to handle XDP_REDIRECT?
> >
> > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> > activation of TX hardware hints is too weak and not enough.  This is an
> > implicit API for BPF-programmers to understand and can lead to implicit
> > activation.
> >
> > Think about what will happen for your AF_XDP send use-case.  For
> > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> > is fixed even if not used (and can contain garbage), it can by accident
> > create hard-to-debug situations.  As discussed with Magnus+Maryam
> > before, we found it was practical (and faster than mem zero) to extend
> > AF_XDP descriptor (see struct xdp_desc) with some flags to
> > indicate/communicate this frame comes with TX metadata hints.
>
> What is that "if not used" situation? Can the metadata itself have
> is_used bit? The userspace has to initialize at least that bit.
> We can definitely add that extra "has_metadata" bit to the descriptor,
> but I'm trying to understand whether we can do without it.

To me, this "has_metadata" bit in the descriptor is just an
optimization. If it is 0, then there is no need to go and check the
metadata field and you save some performance. Regardless of this bit,
you need some way to say "is_used" for each metadata entry (at least
when the number of metadata entries is >1). Three options come to mind
each with their pros and cons.

#1: Let each metadata entry have an invalid state. Not possible for
every metadata and requires the user/kernel to go scan through every
entry for every packet.

#2: Have a field of bits at the start of the metadata section (closest
to packet data) that signifies if a metadata entry is valid or not. If
there are N metadata entries in the metadata area, then N bits in this
field would be used to signify if the corresponding metadata is used
or not. Only requires the user/kernel to scan the valid entries plus
one access for the "is_used" bits.

#3: Have N bits in the AF_XDP descriptor options field instead of the
N bits in the metadata area of #2. Faster but would consume many
precious bits in the fixed descriptor and cap the number of metadata
entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
multi-buffer work, and 15 for some future use. Depends on how daring
we are.

The "has_metadata" bit suggestion can be combined with 1 or 2.
Approach 3 is just a fine grained extension of the idea itself.

IMO, the best approach unfortunately depends on the metadata itself.
If it is rarely valid, you want something like the "has_metadata" bit.
If it is nearly always valid and used, approach #1 (if possible for
the metadata) should be the fastest. The decision also depends on the
number of metadata entries you have per packet. Sorry that I do not
have a good answer. My feeling is that we need something like #1 or
#2, or maybe both, then if needed we can add the "has_metadata" bit or
bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
a combo) in the eBPF program itself? Would provide us with the
flexibility, if possible.

>
> > >>
> > >> BPF-prog API bpf_core_type_id_local:
> > >>    - [1]
> > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
> > >>
> > >> Userspace API btf__find_by_name_kind:
> > >>    - [2]
> > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
> > >>
> > >
> >
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-28  8:09               ` Magnus Karlsson
@ 2023-06-28 18:49                 ` Stanislav Fomichev
  2023-06-29  6:15                   ` Magnus Karlsson
  2023-06-29 11:30                   ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 2 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-28 18:49 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Björn Töpel, Karlsson, Magnus, xdp-hints

On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> > >
> > >
> > >
> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> > > > <jbrouer@redhat.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> > > >>>>
> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> > > >>>>> is no way currently to populate skb metadata.
> > > >>>>>
> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> > > >>>>> (same as in RX case).
> > > >>>>
> > > >>>>    From looking at the code, this introduces a socket option for this TX
> > > >>>> metadata length (tx_metadata_len).
> > > >>>> This implies the same fixed TX metadata size is used for all packets.
> > > >>>> Maybe describe this in patch desc.
> > > >>>
> > > >>> I was planning to do a proper documentation page once we settle on all
> > > >>> the details (similar to the one we have for rx).
> > > >>>
> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
> > > >>>> TX metadata size ?
> > > >>>
> > > >>> Do we need to support that? I was assuming that the TX layout would be
> > > >>> fixed between the userspace and BPF.
> > > >>
> > > >> I hope you don't mean fixed layout, as the whole point is adding
> > > >> flexibility and extensibility.
> > > >
> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> > > > At least fixed max size of the metadata. The userspace and the bpf
> > > > prog can then use this fixed space to implement some flexibility
> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> > > > If we were to make the metalen vary per packet, we'd have to signal
> > > > its size per packet. Probably not worth it?
> > >
> > > Existing XDP metadata implementation also expand in a fixed/limited
> > > sized memory area, but communicate size per packet in this area (also
> > > for validation purposes).  BUT for AF_XDP we don't have room for another
> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> > >
> > >
> > > >
> > > >>> If every packet would have a different metadata length, it seems like
> > > >>> a nightmare to parse?
> > > >>>
> > > >>
> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> > > >> see [1] and [2].
> > > >>
> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> > > >> and reframe the question, what is your *plan* for dealing with different
> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
> > > >> different sizes, but that is okay as long as they fit into the maximum
> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> > > >> for metadata, but we need a plan for handling more than one type and
> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> > > >> data ("leftover" since last time this mem was used).
> > > >
> > > > Yeah, I think the above correctly catches my expectation here. Some
> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> > > > offloaded to the bpf program via btf_id/tlv/etc.
> > > >
> > > > Regarding leftover metadata: can we assume the userspace will take
> > > > care of setting it up?
> > > >
> > > >> With this kfunc approach, then things in-principle, becomes a contract
> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> > > >> components can as illustrated here [1]+[2] can coordinate based on local
> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> > > >> selftests examples don't use this and instead have a very static
> > > >> approach (that people will copy-paste).
> > > >>
> > > >> An unsolved problem with TX-hook is that it can also get packets from
> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> > > >> requesting the timestamp? (imagine metadata size happen to match)
> > > >
> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> > > > maybe check that the meta_len is the one that's expected.
> > > > Will that be enough to handle XDP_REDIRECT?
> > >
> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> > > activation of TX hardware hints is too weak and not enough.  This is an
> > > implicit API for BPF-programmers to understand and can lead to implicit
> > > activation.
> > >
> > > Think about what will happen for your AF_XDP send use-case.  For
> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> > > is fixed even if not used (and can contain garbage), it can by accident
> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
> > > before, we found it was practical (and faster than mem zero) to extend
> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
> > > indicate/communicate this frame comes with TX metadata hints.
> >
> > What is that "if not used" situation? Can the metadata itself have
> > is_used bit? The userspace has to initialize at least that bit.
> > We can definitely add that extra "has_metadata" bit to the descriptor,
> > but I'm trying to understand whether we can do without it.
>
> To me, this "has_metadata" bit in the descriptor is just an
> optimization. If it is 0, then there is no need to go and check the
> metadata field and you save some performance. Regardless of this bit,
> you need some way to say "is_used" for each metadata entry (at least
> when the number of metadata entries is >1). Three options come to mind
> each with their pros and cons.
>
> #1: Let each metadata entry have an invalid state. Not possible for
> every metadata and requires the user/kernel to go scan through every
> entry for every packet.
>
> #2: Have a field of bits at the start of the metadata section (closest
> to packet data) that signifies if a metadata entry is valid or not. If
> there are N metadata entries in the metadata area, then N bits in this
> field would be used to signify if the corresponding metadata is used
> or not. Only requires the user/kernel to scan the valid entries plus
> one access for the "is_used" bits.
>
> #3: Have N bits in the AF_XDP descriptor options field instead of the
> N bits in the metadata area of #2. Faster but would consume many
> precious bits in the fixed descriptor and cap the number of metadata
> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
> multi-buffer work, and 15 for some future use. Depends on how daring
> we are.
>
> The "has_metadata" bit suggestion can be combined with 1 or 2.
> Approach 3 is just a fine grained extension of the idea itself.
>
> IMO, the best approach unfortunately depends on the metadata itself.
> If it is rarely valid, you want something like the "has_metadata" bit.
> If it is nearly always valid and used, approach #1 (if possible for
> the metadata) should be the fastest. The decision also depends on the
> number of metadata entries you have per packet. Sorry that I do not
> have a good answer. My feeling is that we need something like #1 or
> #2, or maybe both, then if needed we can add the "has_metadata" bit or
> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
> a combo) in the eBPF program itself? Would provide us with the
> flexibility, if possible.

Here is my take on it, lmk if I'm missing something:

af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
plan to use metadata on tx.
This essentially requires allocating a fixed headroom to carry the metadata.
af_xdp machinery exports this fixed len into the bpf programs somehow
(devtx_frame.meta_len in this series).
Then it's up to the userspace and bpf program to agree on the layout.
If not every packet is expected to carry the metadata, there might be
some bitmask in the metadata area to indicate that.

Iow, the metadata isn't interpreted by the kernel. It's up to the prog
to interpret it and call appropriate kfunc to enable some offload.

Jesper raises a valid point with "what about redirected packets?". But
I'm not sure we need to care? Presumably the programs that do
xdp_redirect will have to conform to the same metadata layout?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-27 21:43                             ` John Fastabend
  2023-06-27 22:56                               ` Stanislav Fomichev
@ 2023-06-28 18:52                               ` Jakub Kicinski
  2023-06-29 11:43                                 ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 72+ messages in thread
From: Jakub Kicinski @ 2023-06-28 18:52 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stanislav Fomichev, Alexei Starovoitov, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

On Tue, 27 Jun 2023 14:43:57 -0700 John Fastabend wrote:
> What I think would be the most straight-forward thing and most flexible
> is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
> and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
> corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
> don't spend any cycles building the metadata thing or have to even
> worry about read kfuncs. The BPF program has read access to any
> fields they need. And with the skb, xdp pointer we have the context
> that created the descriptor and generate meaningful metrics.

Sorry but this is not going to happen without my nack. DPDK was a much
cleaner bifurcation point than trying to write datapath drivers in BPF.
Users having to learn how to render descriptors for all the NICs
and queue formats out there is not reasonable. Isovalent hired
a lot of former driver developers so you may feel like it's a good
idea, as a middleware provider. But for the rest of us the matrix
of HW x queue format x people writing BPF is too large. If we can
write some poor man's DPDK / common BPF driver library to be selected
at linking time - we can as well provide a generic interface in
the kernel itself. Again, we never merged explicit DPDK support, 
your idea is strictly worse.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-28 18:49                 ` Stanislav Fomichev
@ 2023-06-29  6:15                   ` Magnus Karlsson
  2023-06-29 11:30                   ` [xdp-hints] " Toke Høiland-Jørgensen
  1 sibling, 0 replies; 72+ messages in thread
From: Magnus Karlsson @ 2023-06-29  6:15 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Björn Töpel, Karlsson, Magnus, xdp-hints

On Wed, 28 Jun 2023 at 20:49, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> > > <jbrouer@redhat.com> wrote:
> > > >
> > > >
> > > >
> > > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> > > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> > > > > <jbrouer@redhat.com> wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> > > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> > > > >>>>
> > > > >>>>
> > > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> > > > >>>>
> > > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> > > > >>>>> is no way currently to populate skb metadata.
> > > > >>>>>
> > > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> > > > >>>>> (same as in RX case).
> > > > >>>>
> > > > >>>>    From looking at the code, this introduces a socket option for this TX
> > > > >>>> metadata length (tx_metadata_len).
> > > > >>>> This implies the same fixed TX metadata size is used for all packets.
> > > > >>>> Maybe describe this in patch desc.
> > > > >>>
> > > > >>> I was planning to do a proper documentation page once we settle on all
> > > > >>> the details (similar to the one we have for rx).
> > > > >>>
> > > > >>>> What is the plan for dealing with cases that doesn't populate same/full
> > > > >>>> TX metadata size ?
> > > > >>>
> > > > >>> Do we need to support that? I was assuming that the TX layout would be
> > > > >>> fixed between the userspace and BPF.
> > > > >>
> > > > >> I hope you don't mean fixed layout, as the whole point is adding
> > > > >> flexibility and extensibility.
> > > > >
> > > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> > > > > At least fixed max size of the metadata. The userspace and the bpf
> > > > > prog can then use this fixed space to implement some flexibility
> > > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> > > > > If we were to make the metalen vary per packet, we'd have to signal
> > > > > its size per packet. Probably not worth it?
> > > >
> > > > Existing XDP metadata implementation also expand in a fixed/limited
> > > > sized memory area, but communicate size per packet in this area (also
> > > > for validation purposes).  BUT for AF_XDP we don't have room for another
> > > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> > > >
> > > >
> > > > >
> > > > >>> If every packet would have a different metadata length, it seems like
> > > > >>> a nightmare to parse?
> > > > >>>
> > > > >>
> > > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> > > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> > > > >> see [1] and [2].
> > > > >>
> > > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> > > > >> and reframe the question, what is your *plan* for dealing with different
> > > > >> *types* of TX metadata.  The different struct *types* will of-cause have
> > > > >> different sizes, but that is okay as long as they fit into the maximum
> > > > >> size set by this new socket option XDP_TX_METADATA_LEN.
> > > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> > > > >> for metadata, but we need a plan for handling more than one type and
> > > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> > > > >> data ("leftover" since last time this mem was used).
> > > > >
> > > > > Yeah, I think the above correctly catches my expectation here. Some
> > > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> > > > > offloaded to the bpf program via btf_id/tlv/etc.
> > > > >
> > > > > Regarding leftover metadata: can we assume the userspace will take
> > > > > care of setting it up?
> > > > >
> > > > >> With this kfunc approach, then things in-principle, becomes a contract
> > > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> > > > >> components can as illustrated here [1]+[2] can coordinate based on local
> > > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> > > > >> selftests examples don't use this and instead have a very static
> > > > >> approach (that people will copy-paste).
> > > > >>
> > > > >> An unsolved problem with TX-hook is that it can also get packets from
> > > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> > > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> > > > >> requesting the timestamp? (imagine metadata size happen to match)
> > > > >
> > > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> > > > > maybe check that the meta_len is the one that's expected.
> > > > > Will that be enough to handle XDP_REDIRECT?
> > > >
> > > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> > > > activation of TX hardware hints is too weak and not enough.  This is an
> > > > implicit API for BPF-programmers to understand and can lead to implicit
> > > > activation.
> > > >
> > > > Think about what will happen for your AF_XDP send use-case.  For
> > > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> > > > is fixed even if not used (and can contain garbage), it can by accident
> > > > create hard-to-debug situations.  As discussed with Magnus+Maryam
> > > > before, we found it was practical (and faster than mem zero) to extend
> > > > AF_XDP descriptor (see struct xdp_desc) with some flags to
> > > > indicate/communicate this frame comes with TX metadata hints.
> > >
> > > What is that "if not used" situation? Can the metadata itself have
> > > is_used bit? The userspace has to initialize at least that bit.
> > > We can definitely add that extra "has_metadata" bit to the descriptor,
> > > but I'm trying to understand whether we can do without it.
> >
> > To me, this "has_metadata" bit in the descriptor is just an
> > optimization. If it is 0, then there is no need to go and check the
> > metadata field and you save some performance. Regardless of this bit,
> > you need some way to say "is_used" for each metadata entry (at least
> > when the number of metadata entries is >1). Three options come to mind
> > each with their pros and cons.
> >
> > #1: Let each metadata entry have an invalid state. Not possible for
> > every metadata and requires the user/kernel to go scan through every
> > entry for every packet.
> >
> > #2: Have a field of bits at the start of the metadata section (closest
> > to packet data) that signifies if a metadata entry is valid or not. If
> > there are N metadata entries in the metadata area, then N bits in this
> > field would be used to signify if the corresponding metadata is used
> > or not. Only requires the user/kernel to scan the valid entries plus
> > one access for the "is_used" bits.
> >
> > #3: Have N bits in the AF_XDP descriptor options field instead of the
> > N bits in the metadata area of #2. Faster but would consume many
> > precious bits in the fixed descriptor and cap the number of metadata
> > entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
> > multi-buffer work, and 15 for some future use. Depends on how daring
> > we are.
> >
> > The "has_metadata" bit suggestion can be combined with 1 or 2.
> > Approach 3 is just a fine grained extension of the idea itself.
> >
> > IMO, the best approach unfortunately depends on the metadata itself.
> > If it is rarely valid, you want something like the "has_metadata" bit.
> > If it is nearly always valid and used, approach #1 (if possible for
> > the metadata) should be the fastest. The decision also depends on the
> > number of metadata entries you have per packet. Sorry that I do not
> > have a good answer. My feeling is that we need something like #1 or
> > #2, or maybe both, then if needed we can add the "has_metadata" bit or
> > bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
> > a combo) in the eBPF program itself? Would provide us with the
> > flexibility, if possible.
>
> Here is my take on it, lmk if I'm missing something:
>
> af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
> plan to use metadata on tx.
> This essentially requires allocating a fixed headroom to carry the metadata.
> af_xdp machinery exports this fixed len into the bpf programs somehow
> (devtx_frame.meta_len in this series).
> Then it's up to the userspace and bpf program to agree on the layout.
> If not every packet is expected to carry the metadata, there might be
> some bitmask in the metadata area to indicate that.
>
> Iow, the metadata isn't interpreted by the kernel. It's up to the prog
> to interpret it and call appropriate kfunc to enable some offload.

Sounds good. This flexibility is needed.

> Jesper raises a valid point with "what about redirected packets?". But
> I'm not sure we need to care? Presumably the programs that do
> xdp_redirect will have to conform to the same metadata layout?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-28 18:49                 ` Stanislav Fomichev
  2023-06-29  6:15                   ` Magnus Karlsson
@ 2023-06-29 11:30                   ` Toke Høiland-Jørgensen
  2023-06-29 11:48                     ` Magnus Karlsson
  1 sibling, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 11:30 UTC (permalink / raw)
  To: Stanislav Fomichev, Magnus Karlsson
  Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Björn Töpel, Karlsson, Magnus, xdp-hints

Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
>>
>> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
>> >
>> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
>> > <jbrouer@redhat.com> wrote:
>> > >
>> > >
>> > >
>> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
>> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
>> > > > <jbrouer@redhat.com> wrote:
>> > > >>
>> > > >>
>> > > >>
>> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
>> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>> > > >>>>
>> > > >>>>
>> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>> > > >>>>
>> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
>> > > >>>>> is no way currently to populate skb metadata.
>> > > >>>>>
>> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>> > > >>>>> (same as in RX case).
>> > > >>>>
>> > > >>>>    From looking at the code, this introduces a socket option for this TX
>> > > >>>> metadata length (tx_metadata_len).
>> > > >>>> This implies the same fixed TX metadata size is used for all packets.
>> > > >>>> Maybe describe this in patch desc.
>> > > >>>
>> > > >>> I was planning to do a proper documentation page once we settle on all
>> > > >>> the details (similar to the one we have for rx).
>> > > >>>
>> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
>> > > >>>> TX metadata size ?
>> > > >>>
>> > > >>> Do we need to support that? I was assuming that the TX layout would be
>> > > >>> fixed between the userspace and BPF.
>> > > >>
>> > > >> I hope you don't mean fixed layout, as the whole point is adding
>> > > >> flexibility and extensibility.
>> > > >
>> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
>> > > > At least fixed max size of the metadata. The userspace and the bpf
>> > > > prog can then use this fixed space to implement some flexibility
>> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
>> > > > If we were to make the metalen vary per packet, we'd have to signal
>> > > > its size per packet. Probably not worth it?
>> > >
>> > > Existing XDP metadata implementation also expand in a fixed/limited
>> > > sized memory area, but communicate size per packet in this area (also
>> > > for validation purposes).  BUT for AF_XDP we don't have room for another
>> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
>> > >
>> > >
>> > > >
>> > > >>> If every packet would have a different metadata length, it seems like
>> > > >>> a nightmare to parse?
>> > > >>>
>> > > >>
>> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
>> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
>> > > >> see [1] and [2].
>> > > >>
>> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
>> > > >> and reframe the question, what is your *plan* for dealing with different
>> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
>> > > >> different sizes, but that is okay as long as they fit into the maximum
>> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
>> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
>> > > >> for metadata, but we need a plan for handling more than one type and
>> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
>> > > >> data ("leftover" since last time this mem was used).
>> > > >
>> > > > Yeah, I think the above correctly catches my expectation here. Some
>> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
>> > > > offloaded to the bpf program via btf_id/tlv/etc.
>> > > >
>> > > > Regarding leftover metadata: can we assume the userspace will take
>> > > > care of setting it up?
>> > > >
>> > > >> With this kfunc approach, then things in-principle, becomes a contract
>> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
>> > > >> components can as illustrated here [1]+[2] can coordinate based on local
>> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
>> > > >> selftests examples don't use this and instead have a very static
>> > > >> approach (that people will copy-paste).
>> > > >>
>> > > >> An unsolved problem with TX-hook is that it can also get packets from
>> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
>> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
>> > > >> requesting the timestamp? (imagine metadata size happen to match)
>> > > >
>> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
>> > > > maybe check that the meta_len is the one that's expected.
>> > > > Will that be enough to handle XDP_REDIRECT?
>> > >
>> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
>> > > activation of TX hardware hints is too weak and not enough.  This is an
>> > > implicit API for BPF-programmers to understand and can lead to implicit
>> > > activation.
>> > >
>> > > Think about what will happen for your AF_XDP send use-case.  For
>> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
>> > > is fixed even if not used (and can contain garbage), it can by accident
>> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
>> > > before, we found it was practical (and faster than mem zero) to extend
>> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
>> > > indicate/communicate this frame comes with TX metadata hints.
>> >
>> > What is that "if not used" situation? Can the metadata itself have
>> > is_used bit? The userspace has to initialize at least that bit.
>> > We can definitely add that extra "has_metadata" bit to the descriptor,
>> > but I'm trying to understand whether we can do without it.
>>
>> To me, this "has_metadata" bit in the descriptor is just an
>> optimization. If it is 0, then there is no need to go and check the
>> metadata field and you save some performance. Regardless of this bit,
>> you need some way to say "is_used" for each metadata entry (at least
>> when the number of metadata entries is >1). Three options come to mind
>> each with their pros and cons.
>>
>> #1: Let each metadata entry have an invalid state. Not possible for
>> every metadata and requires the user/kernel to go scan through every
>> entry for every packet.
>>
>> #2: Have a field of bits at the start of the metadata section (closest
>> to packet data) that signifies if a metadata entry is valid or not. If
>> there are N metadata entries in the metadata area, then N bits in this
>> field would be used to signify if the corresponding metadata is used
>> or not. Only requires the user/kernel to scan the valid entries plus
>> one access for the "is_used" bits.
>>
>> #3: Have N bits in the AF_XDP descriptor options field instead of the
>> N bits in the metadata area of #2. Faster but would consume many
>> precious bits in the fixed descriptor and cap the number of metadata
>> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
>> multi-buffer work, and 15 for some future use. Depends on how daring
>> we are.
>>
>> The "has_metadata" bit suggestion can be combined with 1 or 2.
>> Approach 3 is just a fine grained extension of the idea itself.
>>
>> IMO, the best approach unfortunately depends on the metadata itself.
>> If it is rarely valid, you want something like the "has_metadata" bit.
>> If it is nearly always valid and used, approach #1 (if possible for
>> the metadata) should be the fastest. The decision also depends on the
>> number of metadata entries you have per packet. Sorry that I do not
>> have a good answer. My feeling is that we need something like #1 or
>> #2, or maybe both, then if needed we can add the "has_metadata" bit or
>> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
>> a combo) in the eBPF program itself? Would provide us with the
>> flexibility, if possible.
>
> Here is my take on it, lmk if I'm missing something:
>
> af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
> plan to use metadata on tx.
> This essentially requires allocating a fixed headroom to carry the metadata.
> af_xdp machinery exports this fixed len into the bpf programs somehow
> (devtx_frame.meta_len in this series).
> Then it's up to the userspace and bpf program to agree on the layout.
> If not every packet is expected to carry the metadata, there might be
> some bitmask in the metadata area to indicate that.
>
> Iow, the metadata isn't interpreted by the kernel. It's up to the prog
> to interpret it and call appropriate kfunc to enable some offload.

The reason for the flag on RX is mostly performance: there's a
substantial performance hit from reading the metadata area because it's
not cache-hot; we want to avoid that when no metadata is in use. Putting
the flag inside the metadata area itself doesn't work for this, because
then you incur the cache miss just to read the flag.

This effect is probably most pronounced on RX (because the packet is
coming in off the NIC, so very unlikely that the memory has been touched
before), but I see no reason it could not also occur on TX (it'll mostly
depend on data alignment I guess?). So I think having a separate "TX
metadata" flag in the descriptor is probably worth it for the
performance gains?

> Jesper raises a valid point with "what about redirected packets?". But
> I'm not sure we need to care? Presumably the programs that do
> xdp_redirect will have to conform to the same metadata layout?

I don't think they have to do anything different semantically (agreeing
on the data format in the metadata area will have to be by some out of
band mechanism); but the performance angle is definitely valid here as
well...

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-28 18:52                               ` Jakub Kicinski
@ 2023-06-29 11:43                                 ` Toke Høiland-Jørgensen
  2023-06-30 18:54                                   ` Stanislav Fomichev
  2023-07-01  0:52                                   ` John Fastabend
  0 siblings, 2 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 11:43 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend
  Cc: Stanislav Fomichev, Alexei Starovoitov, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 27 Jun 2023 14:43:57 -0700 John Fastabend wrote:
>> What I think would be the most straight-forward thing and most flexible
>> is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
>> and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
>> corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
>> don't spend any cycles building the metadata thing or have to even
>> worry about read kfuncs. The BPF program has read access to any
>> fields they need. And with the skb, xdp pointer we have the context
>> that created the descriptor and generate meaningful metrics.
>
> Sorry but this is not going to happen without my nack. DPDK was a much
> cleaner bifurcation point than trying to write datapath drivers in BPF.
> Users having to learn how to render descriptors for all the NICs
> and queue formats out there is not reasonable. Isovalent hired
> a lot of former driver developers so you may feel like it's a good
> idea, as a middleware provider. But for the rest of us the matrix
> of HW x queue format x people writing BPF is too large. If we can
> write some poor man's DPDK / common BPF driver library to be selected
> at linking time - we can as well provide a generic interface in
> the kernel itself. Again, we never merged explicit DPDK support, 
> your idea is strictly worse.

I agree: we're writing an operating system kernel here. The *whole
point* of an operating system is to provide an abstraction over
different types of hardware and provide a common API so users don't have
to deal with the hardware details.

I feel like there's some tension between "BPF as a dataplane API" and
"BPF as a kernel extension language" here, especially as the BPF
subsystem has grown more features in the latter direction. In my mind,
XDP is still very much a dataplane API; in fact that's one of the main
selling points wrt DPDK: you can get high performance networking but
still take advantage of the kernel drivers and other abstractions that
the kernel provides. If you're going for raw performance and the ability
to twiddle every tiny detail of the hardware, DPDK fills that niche
quite nicely (and also shows us the pains of going that route).

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-29 11:30                   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2023-06-29 11:48                     ` Magnus Karlsson
  2023-06-29 12:01                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Magnus Karlsson @ 2023-06-29 11:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> >>
> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
> >> >
> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> >> > <jbrouer@redhat.com> wrote:
> >> > >
> >> > >
> >> > >
> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> >> > > > <jbrouer@redhat.com> wrote:
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >> > > >>>>
> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> >> > > >>>>> is no way currently to populate skb metadata.
> >> > > >>>>>
> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >> > > >>>>> (same as in RX case).
> >> > > >>>>
> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
> >> > > >>>> metadata length (tx_metadata_len).
> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
> >> > > >>>> Maybe describe this in patch desc.
> >> > > >>>
> >> > > >>> I was planning to do a proper documentation page once we settle on all
> >> > > >>> the details (similar to the one we have for rx).
> >> > > >>>
> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
> >> > > >>>> TX metadata size ?
> >> > > >>>
> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
> >> > > >>> fixed between the userspace and BPF.
> >> > > >>
> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
> >> > > >> flexibility and extensibility.
> >> > > >
> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> >> > > > At least fixed max size of the metadata. The userspace and the bpf
> >> > > > prog can then use this fixed space to implement some flexibility
> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> >> > > > If we were to make the metalen vary per packet, we'd have to signal
> >> > > > its size per packet. Probably not worth it?
> >> > >
> >> > > Existing XDP metadata implementation also expand in a fixed/limited
> >> > > sized memory area, but communicate size per packet in this area (also
> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> >> > >
> >> > >
> >> > > >
> >> > > >>> If every packet would have a different metadata length, it seems like
> >> > > >>> a nightmare to parse?
> >> > > >>>
> >> > > >>
> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> >> > > >> see [1] and [2].
> >> > > >>
> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> >> > > >> and reframe the question, what is your *plan* for dealing with different
> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
> >> > > >> different sizes, but that is okay as long as they fit into the maximum
> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> >> > > >> for metadata, but we need a plan for handling more than one type and
> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> >> > > >> data ("leftover" since last time this mem was used).
> >> > > >
> >> > > > Yeah, I think the above correctly catches my expectation here. Some
> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
> >> > > >
> >> > > > Regarding leftover metadata: can we assume the userspace will take
> >> > > > care of setting it up?
> >> > > >
> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> >> > > >> selftests examples don't use this and instead have a very static
> >> > > >> approach (that people will copy-paste).
> >> > > >>
> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
> >> > > >
> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> >> > > > maybe check that the meta_len is the one that's expected.
> >> > > > Will that be enough to handle XDP_REDIRECT?
> >> > >
> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> >> > > activation of TX hardware hints is too weak and not enough.  This is an
> >> > > implicit API for BPF-programmers to understand and can lead to implicit
> >> > > activation.
> >> > >
> >> > > Think about what will happen for your AF_XDP send use-case.  For
> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> >> > > is fixed even if not used (and can contain garbage), it can by accident
> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
> >> > > before, we found it was practical (and faster than mem zero) to extend
> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
> >> > > indicate/communicate this frame comes with TX metadata hints.
> >> >
> >> > What is that "if not used" situation? Can the metadata itself have
> >> > is_used bit? The userspace has to initialize at least that bit.
> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
> >> > but I'm trying to understand whether we can do without it.
> >>
> >> To me, this "has_metadata" bit in the descriptor is just an
> >> optimization. If it is 0, then there is no need to go and check the
> >> metadata field and you save some performance. Regardless of this bit,
> >> you need some way to say "is_used" for each metadata entry (at least
> >> when the number of metadata entries is >1). Three options come to mind
> >> each with their pros and cons.
> >>
> >> #1: Let each metadata entry have an invalid state. Not possible for
> >> every metadata and requires the user/kernel to go scan through every
> >> entry for every packet.
> >>
> >> #2: Have a field of bits at the start of the metadata section (closest
> >> to packet data) that signifies if a metadata entry is valid or not. If
> >> there are N metadata entries in the metadata area, then N bits in this
> >> field would be used to signify if the corresponding metadata is used
> >> or not. Only requires the user/kernel to scan the valid entries plus
> >> one access for the "is_used" bits.
> >>
> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
> >> N bits in the metadata area of #2. Faster but would consume many
> >> precious bits in the fixed descriptor and cap the number of metadata
> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
> >> multi-buffer work, and 15 for some future use. Depends on how daring
> >> we are.
> >>
> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
> >> Approach 3 is just a fine grained extension of the idea itself.
> >>
> >> IMO, the best approach unfortunately depends on the metadata itself.
> >> If it is rarely valid, you want something like the "has_metadata" bit.
> >> If it is nearly always valid and used, approach #1 (if possible for
> >> the metadata) should be the fastest. The decision also depends on the
> >> number of metadata entries you have per packet. Sorry that I do not
> >> have a good answer. My feeling is that we need something like #1 or
> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
> >> a combo) in the eBPF program itself? Would provide us with the
> >> flexibility, if possible.
> >
> > Here is my take on it, lmk if I'm missing something:
> >
> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
> > plan to use metadata on tx.
> > This essentially requires allocating a fixed headroom to carry the metadata.
> > af_xdp machinery exports this fixed len into the bpf programs somehow
> > (devtx_frame.meta_len in this series).
> > Then it's up to the userspace and bpf program to agree on the layout.
> > If not every packet is expected to carry the metadata, there might be
> > some bitmask in the metadata area to indicate that.
> >
> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
> > to interpret it and call appropriate kfunc to enable some offload.
>
> The reason for the flag on RX is mostly performance: there's a
> substantial performance hit from reading the metadata area because it's
> not cache-hot; we want to avoid that when no metadata is in use. Putting
> the flag inside the metadata area itself doesn't work for this, because
> then you incur the cache miss just to read the flag.

Not necessarily. Let us say that the flag is 4 bytes. Increase the
start address of the packet buffer with 4 and the flags field will be
on the same cache line as the first 60 bytes of the packet data
(assuming a 64 byte cache line size and the flags field is closest to
the start of the packet data). As long as you write something in those
first 60 bytes of packet data that cache line will be brought in and
will likely be in the cache when you access the bits in the metadata
field. The trick works similarly for Rx by setting the umem headroom
accordingly. But you are correct in that dedicating a bit in the
descriptor will make sure it is always hot, while the trick above is
dependent on the app wanting to read or write the first cache line
worth of packet data.

> This effect is probably most pronounced on RX (because the packet is
> coming in off the NIC, so very unlikely that the memory has been touched
> before), but I see no reason it could not also occur on TX (it'll mostly
> depend on data alignment I guess?). So I think having a separate "TX
> metadata" flag in the descriptor is probably worth it for the
> performance gains?
>
> > Jesper raises a valid point with "what about redirected packets?". But
> > I'm not sure we need to care? Presumably the programs that do
> > xdp_redirect will have to conform to the same metadata layout?
>
> I don't think they have to do anything different semantically (agreeing
> on the data format in the metadata area will have to be by some out of
> band mechanism); but the performance angle is definitely valid here as
> well...
>
> -Toke
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-29 11:48                     ` Magnus Karlsson
@ 2023-06-29 12:01                       ` Toke Høiland-Jørgensen
  2023-06-29 16:21                         ` Stanislav Fomichev
  2023-06-30  6:22                         ` Magnus Karlsson
  0 siblings, 2 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 12:01 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
>> > <magnus.karlsson@gmail.com> wrote:
>> >>
>> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
>> >> >
>> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
>> >> > <jbrouer@redhat.com> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
>> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
>> >> > > > <jbrouer@redhat.com> wrote:
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
>> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>> >> > > >>>>
>> >> > > >>>>
>> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>> >> > > >>>>
>> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
>> >> > > >>>>> is no way currently to populate skb metadata.
>> >> > > >>>>>
>> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>> >> > > >>>>> (same as in RX case).
>> >> > > >>>>
>> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
>> >> > > >>>> metadata length (tx_metadata_len).
>> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
>> >> > > >>>> Maybe describe this in patch desc.
>> >> > > >>>
>> >> > > >>> I was planning to do a proper documentation page once we settle on all
>> >> > > >>> the details (similar to the one we have for rx).
>> >> > > >>>
>> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
>> >> > > >>>> TX metadata size ?
>> >> > > >>>
>> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
>> >> > > >>> fixed between the userspace and BPF.
>> >> > > >>
>> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
>> >> > > >> flexibility and extensibility.
>> >> > > >
>> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
>> >> > > > At least fixed max size of the metadata. The userspace and the bpf
>> >> > > > prog can then use this fixed space to implement some flexibility
>> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
>> >> > > > If we were to make the metalen vary per packet, we'd have to signal
>> >> > > > its size per packet. Probably not worth it?
>> >> > >
>> >> > > Existing XDP metadata implementation also expand in a fixed/limited
>> >> > > sized memory area, but communicate size per packet in this area (also
>> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
>> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
>> >> > >
>> >> > >
>> >> > > >
>> >> > > >>> If every packet would have a different metadata length, it seems like
>> >> > > >>> a nightmare to parse?
>> >> > > >>>
>> >> > > >>
>> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
>> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
>> >> > > >> see [1] and [2].
>> >> > > >>
>> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
>> >> > > >> and reframe the question, what is your *plan* for dealing with different
>> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
>> >> > > >> different sizes, but that is okay as long as they fit into the maximum
>> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
>> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
>> >> > > >> for metadata, but we need a plan for handling more than one type and
>> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
>> >> > > >> data ("leftover" since last time this mem was used).
>> >> > > >
>> >> > > > Yeah, I think the above correctly catches my expectation here. Some
>> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
>> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
>> >> > > >
>> >> > > > Regarding leftover metadata: can we assume the userspace will take
>> >> > > > care of setting it up?
>> >> > > >
>> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
>> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
>> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
>> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
>> >> > > >> selftests examples don't use this and instead have a very static
>> >> > > >> approach (that people will copy-paste).
>> >> > > >>
>> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
>> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
>> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
>> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
>> >> > > >
>> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
>> >> > > > maybe check that the meta_len is the one that's expected.
>> >> > > > Will that be enough to handle XDP_REDIRECT?
>> >> > >
>> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
>> >> > > activation of TX hardware hints is too weak and not enough.  This is an
>> >> > > implicit API for BPF-programmers to understand and can lead to implicit
>> >> > > activation.
>> >> > >
>> >> > > Think about what will happen for your AF_XDP send use-case.  For
>> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
>> >> > > is fixed even if not used (and can contain garbage), it can by accident
>> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
>> >> > > before, we found it was practical (and faster than mem zero) to extend
>> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
>> >> > > indicate/communicate this frame comes with TX metadata hints.
>> >> >
>> >> > What is that "if not used" situation? Can the metadata itself have
>> >> > is_used bit? The userspace has to initialize at least that bit.
>> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
>> >> > but I'm trying to understand whether we can do without it.
>> >>
>> >> To me, this "has_metadata" bit in the descriptor is just an
>> >> optimization. If it is 0, then there is no need to go and check the
>> >> metadata field and you save some performance. Regardless of this bit,
>> >> you need some way to say "is_used" for each metadata entry (at least
>> >> when the number of metadata entries is >1). Three options come to mind
>> >> each with their pros and cons.
>> >>
>> >> #1: Let each metadata entry have an invalid state. Not possible for
>> >> every metadata and requires the user/kernel to go scan through every
>> >> entry for every packet.
>> >>
>> >> #2: Have a field of bits at the start of the metadata section (closest
>> >> to packet data) that signifies if a metadata entry is valid or not. If
>> >> there are N metadata entries in the metadata area, then N bits in this
>> >> field would be used to signify if the corresponding metadata is used
>> >> or not. Only requires the user/kernel to scan the valid entries plus
>> >> one access for the "is_used" bits.
>> >>
>> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
>> >> N bits in the metadata area of #2. Faster but would consume many
>> >> precious bits in the fixed descriptor and cap the number of metadata
>> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
>> >> multi-buffer work, and 15 for some future use. Depends on how daring
>> >> we are.
>> >>
>> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
>> >> Approach 3 is just a fine grained extension of the idea itself.
>> >>
>> >> IMO, the best approach unfortunately depends on the metadata itself.
>> >> If it is rarely valid, you want something like the "has_metadata" bit.
>> >> If it is nearly always valid and used, approach #1 (if possible for
>> >> the metadata) should be the fastest. The decision also depends on the
>> >> number of metadata entries you have per packet. Sorry that I do not
>> >> have a good answer. My feeling is that we need something like #1 or
>> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
>> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
>> >> a combo) in the eBPF program itself? Would provide us with the
>> >> flexibility, if possible.
>> >
>> > Here is my take on it, lmk if I'm missing something:
>> >
>> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
>> > plan to use metadata on tx.
>> > This essentially requires allocating a fixed headroom to carry the metadata.
>> > af_xdp machinery exports this fixed len into the bpf programs somehow
>> > (devtx_frame.meta_len in this series).
>> > Then it's up to the userspace and bpf program to agree on the layout.
>> > If not every packet is expected to carry the metadata, there might be
>> > some bitmask in the metadata area to indicate that.
>> >
>> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
>> > to interpret it and call appropriate kfunc to enable some offload.
>>
>> The reason for the flag on RX is mostly performance: there's a
>> substantial performance hit from reading the metadata area because it's
>> not cache-hot; we want to avoid that when no metadata is in use. Putting
>> the flag inside the metadata area itself doesn't work for this, because
>> then you incur the cache miss just to read the flag.
>
> Not necessarily. Let us say that the flag is 4 bytes. Increase the
> start address of the packet buffer with 4 and the flags field will be
> on the same cache line as the first 60 bytes of the packet data
> (assuming a 64 byte cache line size and the flags field is closest to
> the start of the packet data). As long as you write something in those
> first 60 bytes of packet data that cache line will be brought in and
> will likely be in the cache when you access the bits in the metadata
> field. The trick works similarly for Rx by setting the umem headroom
> accordingly.

Yeah, a trick like that was what I was alluding to with the "could" in
this bit:

>> but I see no reason it could not also occur on TX (it'll mostly
>> depend on data alignment I guess?).

right below the text you quoted ;)

> But you are correct in that dedicating a bit in the descriptor will
> make sure it is always hot, while the trick above is dependent on the
> app wanting to read or write the first cache line worth of packet
> data.

Exactly; which is why I think it's worth the flag bit :)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-29 12:01                       ` Toke Høiland-Jørgensen
@ 2023-06-29 16:21                         ` Stanislav Fomichev
  2023-06-29 20:58                           ` Toke Høiland-Jørgensen
  2023-06-30  6:22                         ` Magnus Karlsson
  1 sibling, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-29 16:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Magnus Karlsson, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

On Thu, Jun 29, 2023 at 5:01 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>
> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
> >> > <magnus.karlsson@gmail.com> wrote:
> >> >>
> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
> >> >> >
> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> >> >> > <jbrouer@redhat.com> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> >> >> > > > <jbrouer@redhat.com> wrote:
> >> >> > > >>
> >> >> > > >>
> >> >> > > >>
> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >> >> > > >>>>
> >> >> > > >>>>
> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >> >> > > >>>>
> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> >> >> > > >>>>> is no way currently to populate skb metadata.
> >> >> > > >>>>>
> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >> >> > > >>>>> (same as in RX case).
> >> >> > > >>>>
> >> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
> >> >> > > >>>> metadata length (tx_metadata_len).
> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
> >> >> > > >>>> Maybe describe this in patch desc.
> >> >> > > >>>
> >> >> > > >>> I was planning to do a proper documentation page once we settle on all
> >> >> > > >>> the details (similar to the one we have for rx).
> >> >> > > >>>
> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
> >> >> > > >>>> TX metadata size ?
> >> >> > > >>>
> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
> >> >> > > >>> fixed between the userspace and BPF.
> >> >> > > >>
> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
> >> >> > > >> flexibility and extensibility.
> >> >> > > >
> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf
> >> >> > > > prog can then use this fixed space to implement some flexibility
> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal
> >> >> > > > its size per packet. Probably not worth it?
> >> >> > >
> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited
> >> >> > > sized memory area, but communicate size per packet in this area (also
> >> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> >> >> > >
> >> >> > >
> >> >> > > >
> >> >> > > >>> If every packet would have a different metadata length, it seems like
> >> >> > > >>> a nightmare to parse?
> >> >> > > >>>
> >> >> > > >>
> >> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> >> >> > > >> see [1] and [2].
> >> >> > > >>
> >> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> >> >> > > >> and reframe the question, what is your *plan* for dealing with different
> >> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum
> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> >> >> > > >> for metadata, but we need a plan for handling more than one type and
> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> >> >> > > >> data ("leftover" since last time this mem was used).
> >> >> > > >
> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some
> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
> >> >> > > >
> >> >> > > > Regarding leftover metadata: can we assume the userspace will take
> >> >> > > > care of setting it up?
> >> >> > > >
> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
> >> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> >> >> > > >> selftests examples don't use this and instead have a very static
> >> >> > > >> approach (that people will copy-paste).
> >> >> > > >>
> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
> >> >> > > >
> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> >> >> > > > maybe check that the meta_len is the one that's expected.
> >> >> > > > Will that be enough to handle XDP_REDIRECT?
> >> >> > >
> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> >> >> > > activation of TX hardware hints is too weak and not enough.  This is an
> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit
> >> >> > > activation.
> >> >> > >
> >> >> > > Think about what will happen for your AF_XDP send use-case.  For
> >> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> >> >> > > is fixed even if not used (and can contain garbage), it can by accident
> >> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
> >> >> > > before, we found it was practical (and faster than mem zero) to extend
> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
> >> >> > > indicate/communicate this frame comes with TX metadata hints.
> >> >> >
> >> >> > What is that "if not used" situation? Can the metadata itself have
> >> >> > is_used bit? The userspace has to initialize at least that bit.
> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
> >> >> > but I'm trying to understand whether we can do without it.
> >> >>
> >> >> To me, this "has_metadata" bit in the descriptor is just an
> >> >> optimization. If it is 0, then there is no need to go and check the
> >> >> metadata field and you save some performance. Regardless of this bit,
> >> >> you need some way to say "is_used" for each metadata entry (at least
> >> >> when the number of metadata entries is >1). Three options come to mind
> >> >> each with their pros and cons.
> >> >>
> >> >> #1: Let each metadata entry have an invalid state. Not possible for
> >> >> every metadata and requires the user/kernel to go scan through every
> >> >> entry for every packet.
> >> >>
> >> >> #2: Have a field of bits at the start of the metadata section (closest
> >> >> to packet data) that signifies if a metadata entry is valid or not. If
> >> >> there are N metadata entries in the metadata area, then N bits in this
> >> >> field would be used to signify if the corresponding metadata is used
> >> >> or not. Only requires the user/kernel to scan the valid entries plus
> >> >> one access for the "is_used" bits.
> >> >>
> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
> >> >> N bits in the metadata area of #2. Faster but would consume many
> >> >> precious bits in the fixed descriptor and cap the number of metadata
> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
> >> >> multi-buffer work, and 15 for some future use. Depends on how daring
> >> >> we are.
> >> >>
> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
> >> >> Approach 3 is just a fine grained extension of the idea itself.
> >> >>
> >> >> IMO, the best approach unfortunately depends on the metadata itself.
> >> >> If it is rarely valid, you want something like the "has_metadata" bit.
> >> >> If it is nearly always valid and used, approach #1 (if possible for
> >> >> the metadata) should be the fastest. The decision also depends on the
> >> >> number of metadata entries you have per packet. Sorry that I do not
> >> >> have a good answer. My feeling is that we need something like #1 or
> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
> >> >> a combo) in the eBPF program itself? Would provide us with the
> >> >> flexibility, if possible.
> >> >
> >> > Here is my take on it, lmk if I'm missing something:
> >> >
> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
> >> > plan to use metadata on tx.
> >> > This essentially requires allocating a fixed headroom to carry the metadata.
> >> > af_xdp machinery exports this fixed len into the bpf programs somehow
> >> > (devtx_frame.meta_len in this series).
> >> > Then it's up to the userspace and bpf program to agree on the layout.
> >> > If not every packet is expected to carry the metadata, there might be
> >> > some bitmask in the metadata area to indicate that.
> >> >
> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
> >> > to interpret it and call appropriate kfunc to enable some offload.
> >>
> >> The reason for the flag on RX is mostly performance: there's a
> >> substantial performance hit from reading the metadata area because it's
> >> not cache-hot; we want to avoid that when no metadata is in use. Putting
> >> the flag inside the metadata area itself doesn't work for this, because
> >> then you incur the cache miss just to read the flag.
> >
> > Not necessarily. Let us say that the flag is 4 bytes. Increase the
> > start address of the packet buffer with 4 and the flags field will be
> > on the same cache line as the first 60 bytes of the packet data
> > (assuming a 64 byte cache line size and the flags field is closest to
> > the start of the packet data). As long as you write something in those
> > first 60 bytes of packet data that cache line will be brought in and
> > will likely be in the cache when you access the bits in the metadata
> > field. The trick works similarly for Rx by setting the umem headroom
> > accordingly.
>
> Yeah, a trick like that was what I was alluding to with the "could" in
> this bit:
>
> >> but I see no reason it could not also occur on TX (it'll mostly
> >> depend on data alignment I guess?).
>
> right below the text you quoted ;)
>
> > But you are correct in that dedicating a bit in the descriptor will
> > make sure it is always hot, while the trick above is dependent on the
> > app wanting to read or write the first cache line worth of packet
> > data.
>
> Exactly; which is why I think it's worth the flag bit :)

Ack. Let me add this to the list of things to follow up on. I'm
assuming it's fair to start without the flag and add it later as a
performance optimization?
We have a fair bit of core things we need to agree on first :-D

> -Toke
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-29 16:21                         ` Stanislav Fomichev
@ 2023-06-29 20:58                           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 20:58 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Magnus Karlsson, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Jun 29, 2023 at 5:01 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>
>> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
>> >> > <magnus.karlsson@gmail.com> wrote:
>> >> >>
>> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
>> >> >> >
>> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
>> >> >> > <jbrouer@redhat.com> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
>> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
>> >> >> > > > <jbrouer@redhat.com> wrote:
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
>> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>> >> >> > > >>>>
>> >> >> > > >>>>
>> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>> >> >> > > >>>>
>> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
>> >> >> > > >>>>> is no way currently to populate skb metadata.
>> >> >> > > >>>>>
>> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>> >> >> > > >>>>> (same as in RX case).
>> >> >> > > >>>>
>> >> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
>> >> >> > > >>>> metadata length (tx_metadata_len).
>> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
>> >> >> > > >>>> Maybe describe this in patch desc.
>> >> >> > > >>>
>> >> >> > > >>> I was planning to do a proper documentation page once we settle on all
>> >> >> > > >>> the details (similar to the one we have for rx).
>> >> >> > > >>>
>> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
>> >> >> > > >>>> TX metadata size ?
>> >> >> > > >>>
>> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
>> >> >> > > >>> fixed between the userspace and BPF.
>> >> >> > > >>
>> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
>> >> >> > > >> flexibility and extensibility.
>> >> >> > > >
>> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
>> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf
>> >> >> > > > prog can then use this fixed space to implement some flexibility
>> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
>> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal
>> >> >> > > > its size per packet. Probably not worth it?
>> >> >> > >
>> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited
>> >> >> > > sized memory area, but communicate size per packet in this area (also
>> >> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
>> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
>> >> >> > >
>> >> >> > >
>> >> >> > > >
>> >> >> > > >>> If every packet would have a different metadata length, it seems like
>> >> >> > > >>> a nightmare to parse?
>> >> >> > > >>>
>> >> >> > > >>
>> >> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
>> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
>> >> >> > > >> see [1] and [2].
>> >> >> > > >>
>> >> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
>> >> >> > > >> and reframe the question, what is your *plan* for dealing with different
>> >> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
>> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum
>> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
>> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
>> >> >> > > >> for metadata, but we need a plan for handling more than one type and
>> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
>> >> >> > > >> data ("leftover" since last time this mem was used).
>> >> >> > > >
>> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some
>> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
>> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
>> >> >> > > >
>> >> >> > > > Regarding leftover metadata: can we assume the userspace will take
>> >> >> > > > care of setting it up?
>> >> >> > > >
>> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
>> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
>> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
>> >> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
>> >> >> > > >> selftests examples don't use this and instead have a very static
>> >> >> > > >> approach (that people will copy-paste).
>> >> >> > > >>
>> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
>> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
>> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
>> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
>> >> >> > > >
>> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
>> >> >> > > > maybe check that the meta_len is the one that's expected.
>> >> >> > > > Will that be enough to handle XDP_REDIRECT?
>> >> >> > >
>> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
>> >> >> > > activation of TX hardware hints is too weak and not enough.  This is an
>> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit
>> >> >> > > activation.
>> >> >> > >
>> >> >> > > Think about what will happen for your AF_XDP send use-case.  For
>> >> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
>> >> >> > > is fixed even if not used (and can contain garbage), it can by accident
>> >> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
>> >> >> > > before, we found it was practical (and faster than mem zero) to extend
>> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
>> >> >> > > indicate/communicate this frame comes with TX metadata hints.
>> >> >> >
>> >> >> > What is that "if not used" situation? Can the metadata itself have
>> >> >> > is_used bit? The userspace has to initialize at least that bit.
>> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
>> >> >> > but I'm trying to understand whether we can do without it.
>> >> >>
>> >> >> To me, this "has_metadata" bit in the descriptor is just an
>> >> >> optimization. If it is 0, then there is no need to go and check the
>> >> >> metadata field and you save some performance. Regardless of this bit,
>> >> >> you need some way to say "is_used" for each metadata entry (at least
>> >> >> when the number of metadata entries is >1). Three options come to mind
>> >> >> each with their pros and cons.
>> >> >>
>> >> >> #1: Let each metadata entry have an invalid state. Not possible for
>> >> >> every metadata and requires the user/kernel to go scan through every
>> >> >> entry for every packet.
>> >> >>
>> >> >> #2: Have a field of bits at the start of the metadata section (closest
>> >> >> to packet data) that signifies if a metadata entry is valid or not. If
>> >> >> there are N metadata entries in the metadata area, then N bits in this
>> >> >> field would be used to signify if the corresponding metadata is used
>> >> >> or not. Only requires the user/kernel to scan the valid entries plus
>> >> >> one access for the "is_used" bits.
>> >> >>
>> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
>> >> >> N bits in the metadata area of #2. Faster but would consume many
>> >> >> precious bits in the fixed descriptor and cap the number of metadata
>> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
>> >> >> multi-buffer work, and 15 for some future use. Depends on how daring
>> >> >> we are.
>> >> >>
>> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
>> >> >> Approach 3 is just a fine grained extension of the idea itself.
>> >> >>
>> >> >> IMO, the best approach unfortunately depends on the metadata itself.
>> >> >> If it is rarely valid, you want something like the "has_metadata" bit.
>> >> >> If it is nearly always valid and used, approach #1 (if possible for
>> >> >> the metadata) should be the fastest. The decision also depends on the
>> >> >> number of metadata entries you have per packet. Sorry that I do not
>> >> >> have a good answer. My feeling is that we need something like #1 or
>> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
>> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
>> >> >> a combo) in the eBPF program itself? Would provide us with the
>> >> >> flexibility, if possible.
>> >> >
>> >> > Here is my take on it, lmk if I'm missing something:
>> >> >
>> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
>> >> > plan to use metadata on tx.
>> >> > This essentially requires allocating a fixed headroom to carry the metadata.
>> >> > af_xdp machinery exports this fixed len into the bpf programs somehow
>> >> > (devtx_frame.meta_len in this series).
>> >> > Then it's up to the userspace and bpf program to agree on the layout.
>> >> > If not every packet is expected to carry the metadata, there might be
>> >> > some bitmask in the metadata area to indicate that.
>> >> >
>> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
>> >> > to interpret it and call appropriate kfunc to enable some offload.
>> >>
>> >> The reason for the flag on RX is mostly performance: there's a
>> >> substantial performance hit from reading the metadata area because it's
>> >> not cache-hot; we want to avoid that when no metadata is in use. Putting
>> >> the flag inside the metadata area itself doesn't work for this, because
>> >> then you incur the cache miss just to read the flag.
>> >
>> > Not necessarily. Let us say that the flag is 4 bytes. Increase the
>> > start address of the packet buffer with 4 and the flags field will be
>> > on the same cache line as the first 60 bytes of the packet data
>> > (assuming a 64 byte cache line size and the flags field is closest to
>> > the start of the packet data). As long as you write something in those
>> > first 60 bytes of packet data that cache line will be brought in and
>> > will likely be in the cache when you access the bits in the metadata
>> > field. The trick works similarly for Rx by setting the umem headroom
>> > accordingly.
>>
>> Yeah, a trick like that was what I was alluding to with the "could" in
>> this bit:
>>
>> >> but I see no reason it could not also occur on TX (it'll mostly
>> >> depend on data alignment I guess?).
>>
>> right below the text you quoted ;)
>>
>> > But you are correct in that dedicating a bit in the descriptor will
>> > make sure it is always hot, while the trick above is dependent on the
>> > app wanting to read or write the first cache line worth of packet
>> > data.
>>
>> Exactly; which is why I think it's worth the flag bit :)
>
> Ack. Let me add this to the list of things to follow up on. I'm
> assuming it's fair to start without the flag and add it later as a
> performance optimization?
> We have a fair bit of core things we need to agree on first :-D

Certainly no objection as long as we are doing RFC patches, but I think
we should probably add this before merging something; no reason to
change API more than we have to :)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-29 12:01                       ` Toke Høiland-Jørgensen
  2023-06-29 16:21                         ` Stanislav Fomichev
@ 2023-06-30  6:22                         ` Magnus Karlsson
  2023-06-30  9:19                           ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 72+ messages in thread
From: Magnus Karlsson @ 2023-06-30  6:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

On Thu, 29 Jun 2023 at 14:01, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>
> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
> >> > <magnus.karlsson@gmail.com> wrote:
> >> >>
> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
> >> >> >
> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
> >> >> > <jbrouer@redhat.com> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
> >> >> > > > <jbrouer@redhat.com> wrote:
> >> >> > > >>
> >> >> > > >>
> >> >> > > >>
> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >> >> > > >>>>
> >> >> > > >>>>
> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >> >> > > >>>>
> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
> >> >> > > >>>>> is no way currently to populate skb metadata.
> >> >> > > >>>>>
> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >> >> > > >>>>> (same as in RX case).
> >> >> > > >>>>
> >> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
> >> >> > > >>>> metadata length (tx_metadata_len).
> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
> >> >> > > >>>> Maybe describe this in patch desc.
> >> >> > > >>>
> >> >> > > >>> I was planning to do a proper documentation page once we settle on all
> >> >> > > >>> the details (similar to the one we have for rx).
> >> >> > > >>>
> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
> >> >> > > >>>> TX metadata size ?
> >> >> > > >>>
> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
> >> >> > > >>> fixed between the userspace and BPF.
> >> >> > > >>
> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
> >> >> > > >> flexibility and extensibility.
> >> >> > > >
> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf
> >> >> > > > prog can then use this fixed space to implement some flexibility
> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal
> >> >> > > > its size per packet. Probably not worth it?
> >> >> > >
> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited
> >> >> > > sized memory area, but communicate size per packet in this area (also
> >> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
> >> >> > >
> >> >> > >
> >> >> > > >
> >> >> > > >>> If every packet would have a different metadata length, it seems like
> >> >> > > >>> a nightmare to parse?
> >> >> > > >>>
> >> >> > > >>
> >> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> >> >> > > >> see [1] and [2].
> >> >> > > >>
> >> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
> >> >> > > >> and reframe the question, what is your *plan* for dealing with different
> >> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum
> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
> >> >> > > >> for metadata, but we need a plan for handling more than one type and
> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> >> >> > > >> data ("leftover" since last time this mem was used).
> >> >> > > >
> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some
> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
> >> >> > > >
> >> >> > > > Regarding leftover metadata: can we assume the userspace will take
> >> >> > > > care of setting it up?
> >> >> > > >
> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
> >> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
> >> >> > > >> selftests examples don't use this and instead have a very static
> >> >> > > >> approach (that people will copy-paste).
> >> >> > > >>
> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
> >> >> > > >
> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
> >> >> > > > maybe check that the meta_len is the one that's expected.
> >> >> > > > Will that be enough to handle XDP_REDIRECT?
> >> >> > >
> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
> >> >> > > activation of TX hardware hints is too weak and not enough.  This is an
> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit
> >> >> > > activation.
> >> >> > >
> >> >> > > Think about what will happen for your AF_XDP send use-case.  For
> >> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
> >> >> > > is fixed even if not used (and can contain garbage), it can by accident
> >> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
> >> >> > > before, we found it was practical (and faster than mem zero) to extend
> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
> >> >> > > indicate/communicate this frame comes with TX metadata hints.
> >> >> >
> >> >> > What is that "if not used" situation? Can the metadata itself have
> >> >> > is_used bit? The userspace has to initialize at least that bit.
> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
> >> >> > but I'm trying to understand whether we can do without it.
> >> >>
> >> >> To me, this "has_metadata" bit in the descriptor is just an
> >> >> optimization. If it is 0, then there is no need to go and check the
> >> >> metadata field and you save some performance. Regardless of this bit,
> >> >> you need some way to say "is_used" for each metadata entry (at least
> >> >> when the number of metadata entries is >1). Three options come to mind
> >> >> each with their pros and cons.
> >> >>
> >> >> #1: Let each metadata entry have an invalid state. Not possible for
> >> >> every metadata and requires the user/kernel to go scan through every
> >> >> entry for every packet.
> >> >>
> >> >> #2: Have a field of bits at the start of the metadata section (closest
> >> >> to packet data) that signifies if a metadata entry is valid or not. If
> >> >> there are N metadata entries in the metadata area, then N bits in this
> >> >> field would be used to signify if the corresponding metadata is used
> >> >> or not. Only requires the user/kernel to scan the valid entries plus
> >> >> one access for the "is_used" bits.
> >> >>
> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
> >> >> N bits in the metadata area of #2. Faster but would consume many
> >> >> precious bits in the fixed descriptor and cap the number of metadata
> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
> >> >> multi-buffer work, and 15 for some future use. Depends on how daring
> >> >> we are.
> >> >>
> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
> >> >> Approach 3 is just a fine grained extension of the idea itself.
> >> >>
> >> >> IMO, the best approach unfortunately depends on the metadata itself.
> >> >> If it is rarely valid, you want something like the "has_metadata" bit.
> >> >> If it is nearly always valid and used, approach #1 (if possible for
> >> >> the metadata) should be the fastest. The decision also depends on the
> >> >> number of metadata entries you have per packet. Sorry that I do not
> >> >> have a good answer. My feeling is that we need something like #1 or
> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
> >> >> a combo) in the eBPF program itself? Would provide us with the
> >> >> flexibility, if possible.
> >> >
> >> > Here is my take on it, lmk if I'm missing something:
> >> >
> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
> >> > plan to use metadata on tx.
> >> > This essentially requires allocating a fixed headroom to carry the metadata.
> >> > af_xdp machinery exports this fixed len into the bpf programs somehow
> >> > (devtx_frame.meta_len in this series).
> >> > Then it's up to the userspace and bpf program to agree on the layout.
> >> > If not every packet is expected to carry the metadata, there might be
> >> > some bitmask in the metadata area to indicate that.
> >> >
> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
> >> > to interpret it and call appropriate kfunc to enable some offload.
> >>
> >> The reason for the flag on RX is mostly performance: there's a
> >> substantial performance hit from reading the metadata area because it's
> >> not cache-hot; we want to avoid that when no metadata is in use. Putting
> >> the flag inside the metadata area itself doesn't work for this, because
> >> then you incur the cache miss just to read the flag.
> >
> > Not necessarily. Let us say that the flag is 4 bytes. Increase the
> > start address of the packet buffer with 4 and the flags field will be
> > on the same cache line as the first 60 bytes of the packet data
> > (assuming a 64 byte cache line size and the flags field is closest to
> > the start of the packet data). As long as you write something in those
> > first 60 bytes of packet data that cache line will be brought in and
> > will likely be in the cache when you access the bits in the metadata
> > field. The trick works similarly for Rx by setting the umem headroom
> > accordingly.
>
> Yeah, a trick like that was what I was alluding to with the "could" in
> this bit:
>
> >> but I see no reason it could not also occur on TX (it'll mostly
> >> depend on data alignment I guess?).
>
> right below the text you quoted ;)

Ouch! Sorry Toke. Was a bit too trigger-happy there.

> > But you are correct in that dedicating a bit in the descriptor will
> > make sure it is always hot, while the trick above is dependent on the
> > app wanting to read or write the first cache line worth of packet
> > data.
>
> Exactly; which is why I think it's worth the flag bit :)
>
> -Toke
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
  2023-06-30  6:22                         ` Magnus Karlsson
@ 2023-06-30  9:19                           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-30  9:19 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast,
	daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Björn Töpel, Karlsson, Magnus,
	xdp-hints

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Thu, 29 Jun 2023 at 14:01, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>
>> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson
>> >> > <magnus.karlsson@gmail.com> wrote:
>> >> >>
>> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote:
>> >> >> >
>> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer
>> >> >> > <jbrouer@redhat.com> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote:
>> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer
>> >> >> > > > <jbrouer@redhat.com> wrote:
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >>
>> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote:
>> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>> >> >> > > >>>>
>> >> >> > > >>>>
>> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>> >> >> > > >>>>
>> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there
>> >> >> > > >>>>> is no way currently to populate skb metadata.
>> >> >> > > >>>>>
>> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>> >> >> > > >>>>> (same as in RX case).
>> >> >> > > >>>>
>> >> >> > > >>>>    From looking at the code, this introduces a socket option for this TX
>> >> >> > > >>>> metadata length (tx_metadata_len).
>> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets.
>> >> >> > > >>>> Maybe describe this in patch desc.
>> >> >> > > >>>
>> >> >> > > >>> I was planning to do a proper documentation page once we settle on all
>> >> >> > > >>> the details (similar to the one we have for rx).
>> >> >> > > >>>
>> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full
>> >> >> > > >>>> TX metadata size ?
>> >> >> > > >>>
>> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be
>> >> >> > > >>> fixed between the userspace and BPF.
>> >> >> > > >>
>> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding
>> >> >> > > >> flexibility and extensibility.
>> >> >> > > >
>> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program.
>> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf
>> >> >> > > > prog can then use this fixed space to implement some flexibility
>> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc).
>> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal
>> >> >> > > > its size per packet. Probably not worth it?
>> >> >> > >
>> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited
>> >> >> > > sized memory area, but communicate size per packet in this area (also
>> >> >> > > for validation purposes).  BUT for AF_XDP we don't have room for another
>> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc).
>> >> >> > >
>> >> >> > >
>> >> >> > > >
>> >> >> > > >>> If every packet would have a different metadata length, it seems like
>> >> >> > > >>> a nightmare to parse?
>> >> >> > > >>>
>> >> >> > > >>
>> >> >> > > >> No parsing is really needed.  We can simply use BTF IDs and type cast in
>> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
>> >> >> > > >> see [1] and [2].
>> >> >> > > >>
>> >> >> > > >> It seems we are talking slightly past each-other(?).  Let me rephrase
>> >> >> > > >> and reframe the question, what is your *plan* for dealing with different
>> >> >> > > >> *types* of TX metadata.  The different struct *types* will of-cause have
>> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum
>> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN.
>> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom
>> >> >> > > >> for metadata, but we need a plan for handling more than one type and
>> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
>> >> >> > > >> data ("leftover" since last time this mem was used).
>> >> >> > > >
>> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some
>> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
>> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc.
>> >> >> > > >
>> >> >> > > > Regarding leftover metadata: can we assume the userspace will take
>> >> >> > > > care of setting it up?
>> >> >> > > >
>> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract
>> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace.   These two
>> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local
>> >> >> > > >> BPF-prog BTF IDs.  This approach works as-is today, but patchset
>> >> >> > > >> selftests examples don't use this and instead have a very static
>> >> >> > > >> approach (that people will copy-paste).
>> >> >> > > >>
>> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from
>> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?).  How does the
>> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g.
>> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match)
>> >> >> > > >
>> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus
>> >> >> > > > maybe check that the meta_len is the one that's expected.
>> >> >> > > > Will that be enough to handle XDP_REDIRECT?
>> >> >> > >
>> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate
>> >> >> > > activation of TX hardware hints is too weak and not enough.  This is an
>> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit
>> >> >> > > activation.
>> >> >> > >
>> >> >> > > Think about what will happen for your AF_XDP send use-case.  For
>> >> >> > > performance reasons AF_XDP don't zero out frame memory.  Thus, meta_len
>> >> >> > > is fixed even if not used (and can contain garbage), it can by accident
>> >> >> > > create hard-to-debug situations.  As discussed with Magnus+Maryam
>> >> >> > > before, we found it was practical (and faster than mem zero) to extend
>> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to
>> >> >> > > indicate/communicate this frame comes with TX metadata hints.
>> >> >> >
>> >> >> > What is that "if not used" situation? Can the metadata itself have
>> >> >> > is_used bit? The userspace has to initialize at least that bit.
>> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor,
>> >> >> > but I'm trying to understand whether we can do without it.
>> >> >>
>> >> >> To me, this "has_metadata" bit in the descriptor is just an
>> >> >> optimization. If it is 0, then there is no need to go and check the
>> >> >> metadata field and you save some performance. Regardless of this bit,
>> >> >> you need some way to say "is_used" for each metadata entry (at least
>> >> >> when the number of metadata entries is >1). Three options come to mind
>> >> >> each with their pros and cons.
>> >> >>
>> >> >> #1: Let each metadata entry have an invalid state. Not possible for
>> >> >> every metadata and requires the user/kernel to go scan through every
>> >> >> entry for every packet.
>> >> >>
>> >> >> #2: Have a field of bits at the start of the metadata section (closest
>> >> >> to packet data) that signifies if a metadata entry is valid or not. If
>> >> >> there are N metadata entries in the metadata area, then N bits in this
>> >> >> field would be used to signify if the corresponding metadata is used
>> >> >> or not. Only requires the user/kernel to scan the valid entries plus
>> >> >> one access for the "is_used" bits.
>> >> >>
>> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the
>> >> >> N bits in the metadata area of #2. Faster but would consume many
>> >> >> precious bits in the fixed descriptor and cap the number of metadata
>> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the
>> >> >> multi-buffer work, and 15 for some future use. Depends on how daring
>> >> >> we are.
>> >> >>
>> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2.
>> >> >> Approach 3 is just a fine grained extension of the idea itself.
>> >> >>
>> >> >> IMO, the best approach unfortunately depends on the metadata itself.
>> >> >> If it is rarely valid, you want something like the "has_metadata" bit.
>> >> >> If it is nearly always valid and used, approach #1 (if possible for
>> >> >> the metadata) should be the fastest. The decision also depends on the
>> >> >> number of metadata entries you have per packet. Sorry that I do not
>> >> >> have a good answer. My feeling is that we need something like #1 or
>> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or
>> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or
>> >> >> a combo) in the eBPF program itself? Would provide us with the
>> >> >> flexibility, if possible.
>> >> >
>> >> > Here is my take on it, lmk if I'm missing something:
>> >> >
>> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they
>> >> > plan to use metadata on tx.
>> >> > This essentially requires allocating a fixed headroom to carry the metadata.
>> >> > af_xdp machinery exports this fixed len into the bpf programs somehow
>> >> > (devtx_frame.meta_len in this series).
>> >> > Then it's up to the userspace and bpf program to agree on the layout.
>> >> > If not every packet is expected to carry the metadata, there might be
>> >> > some bitmask in the metadata area to indicate that.
>> >> >
>> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog
>> >> > to interpret it and call appropriate kfunc to enable some offload.
>> >>
>> >> The reason for the flag on RX is mostly performance: there's a
>> >> substantial performance hit from reading the metadata area because it's
>> >> not cache-hot; we want to avoid that when no metadata is in use. Putting
>> >> the flag inside the metadata area itself doesn't work for this, because
>> >> then you incur the cache miss just to read the flag.
>> >
>> > Not necessarily. Let us say that the flag is 4 bytes. Increase the
>> > start address of the packet buffer with 4 and the flags field will be
>> > on the same cache line as the first 60 bytes of the packet data
>> > (assuming a 64 byte cache line size and the flags field is closest to
>> > the start of the packet data). As long as you write something in those
>> > first 60 bytes of packet data that cache line will be brought in and
>> > will likely be in the cache when you access the bits in the metadata
>> > field. The trick works similarly for Rx by setting the umem headroom
>> > accordingly.
>>
>> Yeah, a trick like that was what I was alluding to with the "could" in
>> this bit:
>>
>> >> but I see no reason it could not also occur on TX (it'll mostly
>> >> depend on data alignment I guess?).
>>
>> right below the text you quoted ;)
>
> Ouch! Sorry Toke. Was a bit too trigger-happy there.

Haha, no worries, seems like we're basically in agreement anyway :)

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-29 11:43                                 ` Toke Høiland-Jørgensen
@ 2023-06-30 18:54                                   ` Stanislav Fomichev
  2023-07-01  0:52                                   ` John Fastabend
  1 sibling, 0 replies; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-30 18:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jakub Kicinski, John Fastabend, Alexei Starovoitov,
	Donald Hunter, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, Network Development

On Thu, Jun 29, 2023 at 4:43 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jakub Kicinski <kuba@kernel.org> writes:
>
> > On Tue, 27 Jun 2023 14:43:57 -0700 John Fastabend wrote:
> >> What I think would be the most straight-forward thing and most flexible
> >> is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
> >> and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
> >> corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
> >> don't spend any cycles building the metadata thing or have to even
> >> worry about read kfuncs. The BPF program has read access to any
> >> fields they need. And with the skb, xdp pointer we have the context
> >> that created the descriptor and generate meaningful metrics.
> >
> > Sorry but this is not going to happen without my nack. DPDK was a much
> > cleaner bifurcation point than trying to write datapath drivers in BPF.
> > Users having to learn how to render descriptors for all the NICs
> > and queue formats out there is not reasonable. Isovalent hired
> > a lot of former driver developers so you may feel like it's a good
> > idea, as a middleware provider. But for the rest of us the matrix
> > of HW x queue format x people writing BPF is too large. If we can
> > write some poor man's DPDK / common BPF driver library to be selected
> > at linking time - we can as well provide a generic interface in
> > the kernel itself. Again, we never merged explicit DPDK support,
> > your idea is strictly worse.
>
> I agree: we're writing an operating system kernel here. The *whole
> point* of an operating system is to provide an abstraction over
> different types of hardware and provide a common API so users don't have
> to deal with the hardware details.
>
> I feel like there's some tension between "BPF as a dataplane API" and
> "BPF as a kernel extension language" here, especially as the BPF
> subsystem has grown more features in the latter direction. In my mind,
> XDP is still very much a dataplane API; in fact that's one of the main
> selling points wrt DPDK: you can get high performance networking but
> still take advantage of the kernel drivers and other abstractions that
> the kernel provides. If you're going for raw performance and the ability
> to twiddle every tiny detail of the hardware, DPDK fills that niche
> quite nicely (and also shows us the pains of going that route).

Since the thread has been quiet for a day, here is how I'm planning to proceed:
- remove most of the devtx_frame context (but still keep it for
stashing descriptor pointers and having a common kfunc api)
- keep common kfunc interface for common abstractions
- separate skb/xdp hooks - this is probably a good idea anyway to not
mix them up (we are focusing mostly on xdp here)
- continue using raw fentry for now, let's reconsider later, depending
on where we end up with generic apis vs non-generic ones
- add tx checksum to show how this tx-dev-bound framework can be
extended (and show similarities between the timestamp and checksum)

Iow, I'll largely keep the same approach but will try to expose raw
skb/xdp_frame + add tx-csum. Let's reconvene once I send out v3. Thank
you all for the valuable feedback!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-06-29 11:43                                 ` Toke Høiland-Jørgensen
  2023-06-30 18:54                                   ` Stanislav Fomichev
@ 2023-07-01  0:52                                   ` John Fastabend
  2023-07-01  3:11                                     ` Jakub Kicinski
  1 sibling, 1 reply; 72+ messages in thread
From: John Fastabend @ 2023-07-01  0:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jakub Kicinski, John Fastabend
  Cc: Stanislav Fomichev, Alexei Starovoitov, Donald Hunter, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Network Development

Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> 
> > On Tue, 27 Jun 2023 14:43:57 -0700 John Fastabend wrote:
> >> What I think would be the most straight-forward thing and most flexible
> >> is to create a <drvname>_devtx_submit_skb(<drivname>descriptor, sk_buff)
> >> and <drvname>_devtx_submit_xdp(<drvname>descriptor, xdp_frame) and then
> >> corresponding calls for <drvname>_devtx_complete_{skb|xdp}() Then you
> >> don't spend any cycles building the metadata thing or have to even
> >> worry about read kfuncs. The BPF program has read access to any
> >> fields they need. And with the skb, xdp pointer we have the context
> >> that created the descriptor and generate meaningful metrics.
> >
> > Sorry but this is not going to happen without my nack. DPDK was a much
> > cleaner bifurcation point than trying to write datapath drivers in BPF.
> > Users having to learn how to render descriptors for all the NICs
> > and queue formats out there is not reasonable. Isovalent hired

I would expect BPF/driver experts would write the libraries for the
datapath API that the network/switch developer is going to use. I would
even put the BPF programs in kernel and ship them with the release
if that helps.

We have different visions on who the BPF user is that writes XDP
programs I think.

> > a lot of former driver developers so you may feel like it's a good
> > idea, as a middleware provider. But for the rest of us the matrix
> > of HW x queue format x people writing BPF is too large. If we can

Its nice though that we have good coverage for XDP so the matrix
is big. Even with kfuncs though we need someone to write support.
My thought is its just a question of if they write it in BPF
or in C code as a reader kfunc. I suspect for these advanced features
its only a subset at least upfront. Either way BPF or C you are
stuck finding someone to write that code.

> > write some poor man's DPDK / common BPF driver library to be selected
> > at linking time - we can as well provide a generic interface in
> > the kernel itself. Again, we never merged explicit DPDK support, 
> > your idea is strictly worse.
> 
> I agree: we're writing an operating system kernel here. The *whole
> point* of an operating system is to provide an abstraction over
> different types of hardware and provide a common API so users don't have
> to deal with the hardware details.

And just to be clear what we sacrifice then is forwards/backwards
portability. If its a kernel kfunc we need to add a kfunc for
every field we want to read and it will only be available then.
Further, it will need some general agreement that its useful for
it to be added. A hardware vendor wont be able to add some arbitrary
field and get access to it. So we lose this by doing kfuncs.

Its pushing complexity into the kernel that we maintain in kernel
when we could push the complexity into BPF and maintain as user
space code and BPF codes. Its a choice to make I think.

Also abstraction can cost cycles. Here we have to prepare the
structure and call kfunc. The kfunc can be inlined if folks
do the work. It may be small cost but not free.

> 
> I feel like there's some tension between "BPF as a dataplane API" and
> "BPF as a kernel extension language" here, especially as the BPF

Agree. I'm obviously not maximizing for ease of use for the dataplane
API as BPF. IMO though even with the kfunc abstraction its niche work
writing low level datapath code that requires exposing a user
API higher up the stack. With a DSL (P4, ...) for example you could 
abstract away the complexity and then compile down into these
details. Or if you like tables an Openflow style table interface
would provide a table API.

> subsystem has grown more features in the latter direction. In my mind,
> XDP is still very much a dataplane API; in fact that's one of the main
> selling points wrt DPDK: you can get high performance networking but
> still take advantage of the kernel drivers and other abstractions that

I think we agree on the goal a fast datapath for the nic.

> the kernel provides. If you're going for raw performance and the ability
> to twiddle every tiny detail of the hardware, DPDK fills that niche
> quite nicely (and also shows us the pains of going that route).

Summary on my side is we minimize kernel complexity
by raw descriptor reads, we don't need to know what we
want to read in the future and we need folks who understand
the hardware regardless of where the code lives in BPF
or C. C certainly helps the picking what kfunc to use
but we also have BTF that solves this struct/offset problem
for non-networking use cases already.

> 
> -Toke
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-07-01  0:52                                   ` John Fastabend
@ 2023-07-01  3:11                                     ` Jakub Kicinski
  2023-07-03 18:30                                       ` John Fastabend
  0 siblings, 1 reply; 72+ messages in thread
From: Jakub Kicinski @ 2023-07-01  3:11 UTC (permalink / raw)
  To: John Fastabend
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Alexei Starovoitov, Donald Hunter, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Network Development

On Fri, 30 Jun 2023 17:52:05 -0700 John Fastabend wrote:
> Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <kuba@kernel.org> writes:
> > > Sorry but this is not going to happen without my nack. DPDK was a much
> > > cleaner bifurcation point than trying to write datapath drivers in BPF.
> > > Users having to learn how to render descriptors for all the NICs
> > > and queue formats out there is not reasonable. Isovalent hired  
> 
> I would expect BPF/driver experts would write the libraries for the
> datapath API that the network/switch developer is going to use. I would
> even put the BPF programs in kernel and ship them with the release
> if that helps.
> 
> We have different visions on who the BPF user is that writes XDP
> programs I think.

Yes, crucially. What I've seen talking to engineers working on TC/XDP
BPF at Meta (and I may not be dealing with experts, Martin would have
a broader view) is that they don't understand basics like s/g or
details of checksums.

I don't think it is reasonable to call you, Maxim, Nik and co. "users".
We're risking building system so complex normal people will _need_ an
overlay on top to make it work.

> > > a lot of former driver developers so you may feel like it's a good
> > > idea, as a middleware provider. But for the rest of us the matrix
> > > of HW x queue format x people writing BPF is too large. If we can  
> 
> Its nice though that we have good coverage for XDP so the matrix
> is big. Even with kfuncs though we need someone to write support.
> My thought is its just a question of if they write it in BPF
> or in C code as a reader kfunc. I suspect for these advanced features
> its only a subset at least upfront. Either way BPF or C you are
> stuck finding someone to write that code.

Right, but kernel is a central point where it can be written, reviewed,
cross-optimized and stored.

> > > write some poor man's DPDK / common BPF driver library to be selected
> > > at linking time - we can as well provide a generic interface in
> > > the kernel itself. Again, we never merged explicit DPDK support, 
> > > your idea is strictly worse.  
> > 
> > I agree: we're writing an operating system kernel here. The *whole
> > point* of an operating system is to provide an abstraction over
> > different types of hardware and provide a common API so users don't have
> > to deal with the hardware details.  
> 
> And just to be clear what we sacrifice then is forwards/backwards
> portability.

Forward compatibility is also the favorite word of HW vendors when 
they create proprietary interfaces. I think it's incorrect to call
cutting functionality out of a project forward compatibility.
If functionality is moved the surface of compatibility is different.

> If its a kernel kfunc we need to add a kfunc for
> every field we want to read and it will only be available then.
> Further, it will need some general agreement that its useful for
> it to be added. A hardware vendor wont be able to add some arbitrary
> field and get access to it. So we lose this by doing kfuncs.

We both know how easy it is to come up with useful HW, so I'm guessing
this is rhetorical.

> Its pushing complexity into the kernel that we maintain in kernel
> when we could push the complexity into BPF and maintain as user
> space code and BPF codes. Its a choice to make I think.

Right, and I believe having the code in the kernel, appropriately
integrated with the drivers is beneficial. The main argument against 
it is that in certain environments kernels are old. But that's a very
destructive argument.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-07-01  3:11                                     ` Jakub Kicinski
@ 2023-07-03 18:30                                       ` John Fastabend
  2023-07-03 19:33                                         ` Jakub Kicinski
  0 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2023-07-03 18:30 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Alexei Starovoitov, Donald Hunter, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Network Development

Jakub Kicinski wrote:
> On Fri, 30 Jun 2023 17:52:05 -0700 John Fastabend wrote:
> > Toke Høiland-Jørgensen wrote:
> > > Jakub Kicinski <kuba@kernel.org> writes:
> > > > Sorry but this is not going to happen without my nack. DPDK was a much
> > > > cleaner bifurcation point than trying to write datapath drivers in BPF.
> > > > Users having to learn how to render descriptors for all the NICs
> > > > and queue formats out there is not reasonable. Isovalent hired  
> > 
> > I would expect BPF/driver experts would write the libraries for the
> > datapath API that the network/switch developer is going to use. I would
> > even put the BPF programs in kernel and ship them with the release
> > if that helps.
> > 
> > We have different visions on who the BPF user is that writes XDP
> > programs I think.
> 
> Yes, crucially. What I've seen talking to engineers working on TC/XDP
> BPF at Meta (and I may not be dealing with experts, Martin would have
> a broader view) is that they don't understand basics like s/g or
> details of checksums.

Interesting data point. But these same engineers will want to get
access to the checksum, but don't understand it? Seems if your
going to start reading/writing descriptors even through kfuncs
we need to get some docs/notes on how to use them correctly then.
We certainly wont put guardrails on the read/writes for performance
reasons.

> 
> I don't think it is reasonable to call you, Maxim, Nik and co. "users".
> We're risking building system so complex normal people will _need_ an
> overlay on top to make it work.

I consider us users. We write networking CNI and observability/sec
tooling on top of BPF. Most of what we create is driven by customer
environments and performance. Maybe not typical users I guess, but
also Meta users are not typical and have their own set of constraints
and insights.

> 
> > > > a lot of former driver developers so you may feel like it's a good
> > > > idea, as a middleware provider. But for the rest of us the matrix
> > > > of HW x queue format x people writing BPF is too large. If we can  
> > 
> > Its nice though that we have good coverage for XDP so the matrix
> > is big. Even with kfuncs though we need someone to write support.
> > My thought is its just a question of if they write it in BPF
> > or in C code as a reader kfunc. I suspect for these advanced features
> > its only a subset at least upfront. Either way BPF or C you are
> > stuck finding someone to write that code.
> 
> Right, but kernel is a central point where it can be written, reviewed,
> cross-optimized and stored.

We can check BPF codes into the kernel.

> 
> > > > write some poor man's DPDK / common BPF driver library to be selected
> > > > at linking time - we can as well provide a generic interface in
> > > > the kernel itself. Again, we never merged explicit DPDK support, 
> > > > your idea is strictly worse.  
> > > 
> > > I agree: we're writing an operating system kernel here. The *whole
> > > point* of an operating system is to provide an abstraction over
> > > different types of hardware and provide a common API so users don't have
> > > to deal with the hardware details.  
> > 
> > And just to be clear what we sacrifice then is forwards/backwards
> > portability.
> 
> Forward compatibility is also the favorite word of HW vendors when 
> they create proprietary interfaces. I think it's incorrect to call
> cutting functionality out of a project forward compatibility.
> If functionality is moved the surface of compatibility is different.

Sure a bit of an abuse of terminology.

> 
> > If its a kernel kfunc we need to add a kfunc for
> > every field we want to read and it will only be available then.
> > Further, it will need some general agreement that its useful for
> > it to be added. A hardware vendor wont be able to add some arbitrary
> > field and get access to it. So we lose this by doing kfuncs.
> 
> We both know how easy it is to come up with useful HW, so I'm guessing
> this is rhetorical.

It wasn't rhetorical but agree we've been chasing this for years
and outside of environments where you own a very large data center
and sell out VMs or niche spaces its been hard to come up with a
really good general use case.

> 
> > Its pushing complexity into the kernel that we maintain in kernel
> > when we could push the complexity into BPF and maintain as user
> > space code and BPF codes. Its a choice to make I think.
> 
> Right, and I believe having the code in the kernel, appropriately
> integrated with the drivers is beneficial. The main argument against 
> it is that in certain environments kernels are old. But that's a very
> destructive argument.

My main concern here is we forget some kfunc that we need and then
we are stuck. We don't have the luxury of upgrading kernels easily.
It doesn't need to be an either/or discussion if we have a ctx()
call we can drop into BTF over the descriptor and use kfuncs for
the most common things. Other option is to simply write a kfunc
for every field I see that could potentially have some use even
if I don't fully understand it at the moment.

I suspect I am less concerned about raw access because we already
have BTF infra built up around our network observability/sec
solution so we already handle per kernel differences and desc.
just looks like another BTF object we want to read. And we
know what dev and types we are attaching to so we don't have
issues with is this a mlx or intel or etc device.

Also as a more practical concern how do we manage nic specific
things? Have nic spcific kfuncs? Per descriptor tx_flags and
status flags. Other things we need are ptr to skb and access
to the descriptor ring so we can pull stats off the ring. I'm
not arguing it can't be done with kfuncs, but if we go kfunc
route be prepared for a long list of kfuncs and driver specific
ones.

.John

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata
  2023-07-03 18:30                                       ` John Fastabend
@ 2023-07-03 19:33                                         ` Jakub Kicinski
  0 siblings, 0 replies; 72+ messages in thread
From: Jakub Kicinski @ 2023-07-03 19:33 UTC (permalink / raw)
  To: John Fastabend
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Alexei Starovoitov, Donald Hunter, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Network Development

On Mon, 03 Jul 2023 11:30:44 -0700 John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Fri, 30 Jun 2023 17:52:05 -0700 John Fastabend wrote:  
> > > I would expect BPF/driver experts would write the libraries for the
> > > datapath API that the network/switch developer is going to use. I would
> > > even put the BPF programs in kernel and ship them with the release
> > > if that helps.
> > > 
> > > We have different visions on who the BPF user is that writes XDP
> > > programs I think.  
> > 
> > Yes, crucially. What I've seen talking to engineers working on TC/XDP
> > BPF at Meta (and I may not be dealing with experts, Martin would have
> > a broader view) is that they don't understand basics like s/g or
> > details of checksums.  
> 
> Interesting data point. But these same engineers will want to get
> access to the checksum, but don't understand it? Seems if your
> going to start reading/writing descriptors even through kfuncs
> we need to get some docs/notes on how to use them correctly then.
> We certainly wont put guardrails on the read/writes for performance
> reasons.

Dunno about checksum, but it's definitely the same kind of person
that'd want access to timestamps.

> > I don't think it is reasonable to call you, Maxim, Nik and co. "users".
> > We're risking building system so complex normal people will _need_ an
> > overlay on top to make it work.  
> 
> I consider us users. We write networking CNI and observability/sec
> tooling on top of BPF. Most of what we create is driven by customer
> environments and performance. Maybe not typical users I guess, but
> also Meta users are not typical and have their own set of constraints
> and insights.

One thing Meta certainly does (and I think is a large part of success
of BPF) is delegating the development of applications away from the core
kernel team. Meta is different than a smaller company in that it _has_
a kernel team, but the "network application" teams I suspect are fairly
typical.

> > > Its pushing complexity into the kernel that we maintain in kernel
> > > when we could push the complexity into BPF and maintain as user
> > > space code and BPF codes. Its a choice to make I think.  
> > 
> > Right, and I believe having the code in the kernel, appropriately
> > integrated with the drivers is beneficial. The main argument against 
> > it is that in certain environments kernels are old. But that's a very
> > destructive argument.  
> 
> My main concern here is we forget some kfunc that we need and then
> we are stuck. We don't have the luxury of upgrading kernels easily.
> It doesn't need to be an either/or discussion if we have a ctx()
> call we can drop into BTF over the descriptor and use kfuncs for
> the most common things. Other option is to simply write a kfunc
> for every field I see that could potentially have some use even
> if I don't fully understand it at the moment.
> 
> I suspect I am less concerned about raw access because we already
> have BTF infra built up around our network observability/sec
> solution so we already handle per kernel differences and desc.
> just looks like another BTF object we want to read. And we
> know what dev and types we are attaching to so we don't have
> issues with is this a mlx or intel or etc device.
> 
> Also as a more practical concern how do we manage nic specific
> things? 

What are the NIC specific things?

> Have nic spcific kfuncs? Per descriptor tx_flags and
> status flags. Other things we need are ptr to skb and access
> to the descriptor ring so we can pull stats off the ring. I'm
> not arguing it can't be done with kfuncs, but if we go kfunc
> route be prepared for a long list of kfuncs and driver specific
> ones.

IDK why you say that, I gave the base list of offloads in an earlier
email.

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2023-07-03 19:33 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-21 17:02 [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 01/11] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 02/11] bpf: Resolve single typedef when walking structs Stanislav Fomichev
2023-06-22  5:17   ` Alexei Starovoitov
2023-06-22 17:55     ` Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-06-22  9:11   ` Jesper D. Brouer
2023-06-22 17:55     ` Stanislav Fomichev
2023-06-23 10:24       ` Jesper Dangaard Brouer
2023-06-23 17:41         ` Stanislav Fomichev
2023-06-24  9:02           ` Jesper Dangaard Brouer
2023-06-26 17:00             ` Stanislav Fomichev
2023-06-28  8:09               ` Magnus Karlsson
2023-06-28 18:49                 ` Stanislav Fomichev
2023-06-29  6:15                   ` Magnus Karlsson
2023-06-29 11:30                   ` [xdp-hints] " Toke Høiland-Jørgensen
2023-06-29 11:48                     ` Magnus Karlsson
2023-06-29 12:01                       ` Toke Høiland-Jørgensen
2023-06-29 16:21                         ` Stanislav Fomichev
2023-06-29 20:58                           ` Toke Høiland-Jørgensen
2023-06-30  6:22                         ` Magnus Karlsson
2023-06-30  9:19                           ` Toke Høiland-Jørgensen
2023-06-22 15:26   ` Simon Horman
2023-06-22 17:55     ` Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 04/11] bpf: Implement devtx hook points Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 05/11] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
2023-06-22 12:07   ` Jesper D. Brouer
2023-06-22 17:55     ` Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 06/11] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
2023-06-23 23:29   ` Vinicius Costa Gomes
2023-06-26 17:00     ` Stanislav Fomichev
2023-06-26 22:00       ` Vinicius Costa Gomes
2023-06-26 23:29         ` Stanislav Fomichev
2023-06-27  1:38           ` Vinicius Costa Gomes
2023-06-21 17:02 ` [RFC bpf-next v2 07/11] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 08/11] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
2023-06-23 11:12   ` Jesper D. Brouer
2023-06-23 17:40     ` Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 10/11] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev
2023-06-21 17:02 ` [RFC bpf-next v2 11/11] net/mlx5e: Support TX timestamp metadata Stanislav Fomichev
2023-06-22 19:57   ` Alexei Starovoitov
2023-06-22 20:13     ` Stanislav Fomichev
2023-06-22 21:47       ` Alexei Starovoitov
2023-06-22 22:13         ` Stanislav Fomichev
2023-06-23  2:35           ` Alexei Starovoitov
2023-06-23 10:16             ` Maryam Tahhan
2023-06-23 16:32               ` Alexei Starovoitov
2023-06-23 17:47                 ` Maryam Tahhan
2023-06-23 17:24             ` Stanislav Fomichev
2023-06-23 18:57             ` Donald Hunter
2023-06-24  0:25               ` John Fastabend
2023-06-24  2:52                 ` Alexei Starovoitov
2023-06-24 21:38                   ` Jakub Kicinski
2023-06-25  1:12                     ` Stanislav Fomichev
2023-06-26 21:36                       ` Stanislav Fomichev
2023-06-26 22:37                         ` Alexei Starovoitov
2023-06-26 23:29                           ` Stanislav Fomichev
2023-06-27 13:35                             ` Toke Høiland-Jørgensen
2023-06-27 21:43                             ` John Fastabend
2023-06-27 22:56                               ` Stanislav Fomichev
2023-06-27 23:33                                 ` John Fastabend
2023-06-27 23:50                                   ` Alexei Starovoitov
2023-06-28 18:52                               ` Jakub Kicinski
2023-06-29 11:43                                 ` Toke Høiland-Jørgensen
2023-06-30 18:54                                   ` Stanislav Fomichev
2023-07-01  0:52                                   ` John Fastabend
2023-07-01  3:11                                     ` Jakub Kicinski
2023-07-03 18:30                                       ` John Fastabend
2023-07-03 19:33                                         ` Jakub Kicinski
2023-06-22  8:41 ` [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Jesper Dangaard Brouer
2023-06-22 17:55   ` Stanislav Fomichev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.